RF100 Dataset Catalog

Overview

This vignette catalogs 34 object detection datasets from the Roboflow 100 (RF100) benchmark, organized into 6 collections, to help you find the right dataset for your task.

The cataloged datasets span six domains, one per collection: biology, medical, infrared, damage, underwater, and document.

Example: Finding a Photovoltaic Dataset

One of the motivations for this catalog was answering questions like: “Is there a photovoltaic dataset in torchvision?”

library(torchvision)

# Search for solar/photovoltaic datasets
search_rf100("solar")
search_rf100("photovoltaic")

# Result shows:
# - solar_panel in infrared collection
# - solar_panel in damage collection
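
To make the idea of a keyword search concrete, here is a minimal sketch that runs offline on a catalog-like data frame. It is not the implementation of `search_rf100()`: the helper `search_catalog_sketch()` is hypothetical, the toy rows and their descriptions are invented for illustration, and the only assumption about the real catalog is that it has `collection`, `dataset`, and `description` columns.

```r
# Toy stand-in for the catalog; the descriptions are illustrative, not real metadata.
toy_catalog <- data.frame(
  collection  = c("infrared", "damage", "biology"),
  dataset     = c("solar_panel", "solar_panel", "blood_cell"),
  description = c("Thermal images of solar panels",
                  "Defects on photovoltaic solar panels",
                  "Blood cell microscopy"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive match against dataset name or description.
search_catalog_sketch <- function(catalog, keyword) {
  hit <- grepl(keyword, catalog$dataset, ignore.case = TRUE) |
    grepl(keyword, catalog$description, ignore.case = TRUE)
  catalog[hit, , drop = FALSE]
}

solar_hits <- search_catalog_sketch(toy_catalog, "solar")
pv_hits <- search_catalog_sketch(toy_catalog, "photovoltaic")
print(solar_hits[, c("collection", "dataset")])
```

Matching on both the name and the description is a design choice of this sketch: it lets a term like "photovoltaic" find a dataset whose name only says `solar_panel`.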

Complete Catalog

Here’s the complete catalog of the RF100 datasets covered by this vignette:

library(torchvision)
library(knitr)

catalog <- get_rf100_catalog()

# Display key columns
kable(catalog[, c("collection", "dataset", "description", "total_size_mb", "estimated_images")])

Collections

Biology Collection (9 datasets)

Microscopy and biological imaging datasets for research and diagnostics:

search_rf100(collection = "biology")

Available datasets:

Medical Collection (8 datasets)

Medical imaging datasets for clinical and research applications:

search_rf100(collection = "medical")

Available datasets:

Infrared Collection (4 datasets)

Thermal and infrared imaging datasets:

search_rf100(collection = "infrared")

Available datasets:

Damage Collection (3 datasets)

Infrastructure damage and defect detection:

search_rf100(collection = "damage")

Available datasets:

Underwater Collection (4 datasets)

Marine and underwater imaging datasets:

search_rf100(collection = "underwater")

Available datasets:

Document Collection (6 datasets)

Document analysis and OCR datasets:

search_rf100(collection = "document")

Available datasets:

Usage Example

Once you’ve found a dataset, loading it is straightforward:

library(torchvision)

# Search for blood cell dataset
search_rf100("blood")

# Load the dataset
ds <- rf100_biology_collection(
  dataset = "blood_cell",
  split = "train",
  download = TRUE
)

# Inspect a sample
item <- ds[1]
print(item$y$labels)  # Object classes
print(item$y$boxes)   # Bounding boxes

# Visualize with bounding boxes
boxed <- draw_bounding_boxes(item)
tensor_image_browse(boxed)

Dataset Statistics

catalog <- get_rf100_catalog()

# Total size of all datasets
sum(catalog$total_size_mb) / 1024  # In GB

# Datasets by size
catalog[order(-catalog$total_size_mb), c("dataset", "collection", "total_size_mb")]

# Smallest and largest datasets
catalog[which.min(catalog$total_size_mb), ]
catalog[which.max(catalog$total_size_mb), ]

# Average size by collection
aggregate(total_size_mb ~ collection, data = catalog, FUN = mean)
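
The summaries above can be tried without downloading anything by running them on a small stand-in data frame. The sketch below assumes only the `collection`, `dataset`, and `total_size_mb` columns used in this vignette; the dataset names and sizes are made up and are not real RF100 figures.

```r
# Toy stand-in with made-up sizes (in MB); not real RF100 figures.
toy_catalog <- data.frame(
  collection    = c("biology", "biology", "medical", "infrared"),
  dataset       = c("ds_a", "ds_b", "ds_c", "ds_d"),
  total_size_mb = c(10, 30, 512, 472),
  stringsAsFactors = FALSE
)

# Total size in GB (1 GB = 1024 MB)
total_gb <- sum(toy_catalog$total_size_mb) / 1024

# Number of datasets per collection
n_per_collection <- table(toy_catalog$collection)

# Mean size per collection; aggregate() returns one row per collection
mean_size <- aggregate(total_size_mb ~ collection, data = toy_catalog, FUN = mean)

print(total_gb)
print(n_per_collection)
print(mean_size)
```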

Filtering and Exploration

The catalog is a regular data frame, so you can use standard R operations:

# Find small datasets (< 20 MB total)
subset(catalog, total_size_mb < 20)

# Find large datasets (> 200 MB total)
subset(catalog, total_size_mb > 200)

# Find datasets with specific keywords
subset(catalog, grepl("tumor|cancer|disease", description, ignore.case = TRUE))

# Datasets with all three splits
subset(catalog, has_train & has_test & has_valid)
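
These filters also combine naturally. As a sketch, the hypothetical helper `pick_datasets()` below keeps datasets that fit a size budget and ship all three splits, then sorts them from smallest to largest. It runs on a toy data frame with invented names and sizes, assuming only the `total_size_mb`, `has_train`, `has_test`, and `has_valid` columns used above.

```r
# Toy stand-in; names, sizes, and split flags are invented for illustration.
toy_catalog <- data.frame(
  dataset       = c("ds_a", "ds_b", "ds_c", "ds_d"),
  total_size_mb = c(15, 250, 40, 8),
  has_train     = c(TRUE, TRUE, TRUE, TRUE),
  has_test      = c(TRUE, TRUE, FALSE, TRUE),
  has_valid     = c(TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: size budget plus complete train/test/valid splits.
pick_datasets <- function(catalog, max_mb) {
  keep <- catalog$total_size_mb <= max_mb &
    catalog$has_train & catalog$has_test & catalog$has_valid
  out <- catalog[keep, , drop = FALSE]
  out[order(out$total_size_mb), , drop = FALSE]  # smallest first
}

picked <- pick_datasets(toy_catalog, max_mb = 50)
print(picked$dataset)
```

With the real catalog, the same call would be `pick_datasets(catalog, max_mb = 50)`, provided the column names match those assumed here.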

Additional Resources

Citation

If you use RF100 datasets in your research, please cite:

@article{roboflow100,
  title={Roboflow 100: A Rich, Multi-Domain Object Detection Benchmark},
  author={Roboflow},
  journal={arXiv preprint},
  year={2022}
}