Datasets

Every dataset worth knowing about.

Datacrawlr harvests structured metadata from nine sources of open data (research repositories, ML community hubs, government portals) and projects every entry into a single canonical schema, so you can compare datasets like rows in a database.

Discovery

Type two characters. See what we have.

The hero search runs against every indexed dataset and model. Two-line previews per row. Hit enter for the full search page with AI synthesis and faceted filters.

View codebase on GitHub →

datacrawlr.com

Homepage hero with the type-ahead dropdown open showing dataset matches.

Search + synthesis

An AI summary at the top of every result list.

Searches return real datasets ranked by BM25 + popularity. The synthesis card above the grid explains what the matches share, where they differ, and which to start with — with citations back to the actual rows.
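As a rough illustration of what "BM25 + popularity" ranking can look like, here is a minimal sketch. It is not the Datacrawlr implementation: the `text`, `downloads`, and `slug` fields, the log-scaled popularity term, and the `popularity_weight` value are all assumptions made for the example. The BM25 part uses the common Okapi formula with a non-negative IDF.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    if not docs:
        return []
    n = len(docs)
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(query, entries, popularity_weight=0.2):
    """Blend BM25 with log-scaled popularity (assumed blend, hypothetical fields)."""
    docs = [e["text"].lower().split() for e in entries]
    text_scores = bm25_scores(query.lower().split(), docs)
    ranked = []
    for entry, text_score in zip(entries, text_scores):
        if text_score == 0:
            continue  # popularity alone never surfaces a non-matching entry
        final = text_score + popularity_weight * math.log1p(entry["downloads"])
        ranked.append((final, entry["slug"]))
    ranked.sort(reverse=True)
    return [slug for _, slug in ranked]
```

In this sketch popularity only breaks ties and nudges order among entries that already match the query text; an entry with zero text relevance is dropped no matter how popular it is.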

View search implementation on GitHub →

datacrawlr.com/search?q=medical+imaging

Search results with AI synthesis card and ranked dataset cards.

What we capture

Four metadata layers on every entry.

Every dataset is described by the same fields regardless of where it came from — so cross-source comparisons actually work.

Schema + structure

Field names, types, descriptions, and nullability — parsed from source-platform feeds where available, inferred from sampled rows where not.
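When a source publishes no schema, inferring one from sampled rows might look like the sketch below. The type names (`integer`, `number`, `string`, and so on) and the output shape are assumptions for illustration, not the fields Datacrawlr actually stores.

```python
def infer_type(values):
    """Pick the narrowest type that fits every non-null sampled value."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"
    if all(isinstance(v, bool) for v in non_null):
        return "boolean"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in non_null):
        return "integer"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
        return "number"
    if all(isinstance(v, str) for v in non_null):
        return "string"
    return "mixed"


def infer_schema(rows):
    """Infer field name, type, and nullability from a list of sampled row dicts."""
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)  # keep first-seen column order
    schema = []
    for name in names:
        values = [row.get(name) for row in rows]  # a missing key counts as null
        schema.append({
            "name": name,
            "type": infer_type(values),
            "nullable": any(v is None for v in values),
        })
    return schema
```

A sample can only show that nulls exist, never that they cannot occur, so a `nullable: False` result here means "no nulls seen in the sample" rather than a guarantee.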

License + use terms

Normalized license category (permissive / copyleft / non-commercial / restrictive / proprietary / public domain), commercial-use signal, and a risk pill.
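One way to picture the normalization step is a lookup from license identifiers to the categories above. The sketch below is an assumption throughout: the mapping table, the choice to place CC-BY-ND under "restrictive", the `low` / `medium` / `high` risk levels, and the handling of unrecognized licenses are illustrative and not taken from the Datacrawlr codebase. The keys are lowercased SPDX identifiers.

```python
# Hypothetical mapping from lowercased SPDX identifiers to catalog categories.
CATEGORY_BY_SPDX = {
    "mit": "permissive",
    "apache-2.0": "permissive",
    "bsd-3-clause": "permissive",
    "cc-by-4.0": "permissive",
    "gpl-3.0-only": "copyleft",
    "cc-by-sa-4.0": "copyleft",
    "odbl-1.0": "copyleft",
    "cc-by-nc-4.0": "non-commercial",
    "cc-by-nc-sa-4.0": "non-commercial",
    "cc-by-nd-4.0": "restrictive",
    "cc0-1.0": "public domain",
    "pddl-1.0": "public domain",
}

# Assumed risk levels for the pill; the real levels are not documented here.
RISK_BY_CATEGORY = {
    "permissive": "low",
    "public domain": "low",
    "copyleft": "medium",
    "non-commercial": "high",
    "restrictive": "high",
    "proprietary": "high",
}

# True = allowed, False = not allowed, None = check the terms yourself.
COMMERCIAL_USE = {
    "permissive": True,
    "public domain": True,
    "copyleft": True,
    "non-commercial": False,
    "restrictive": None,
    "proprietary": None,
}


def classify_license(raw):
    """Normalize a raw license string into category, commercial signal, and risk."""
    category = CATEGORY_BY_SPDX.get(raw.strip().lower())
    if category is None:
        # Unrecognized licenses are treated as high risk rather than guessed.
        return {"category": None, "commercial_use": None, "risk": "high"}
    return {
        "category": category,
        "commercial_use": COMMERCIAL_USE[category],
        "risk": RISK_BY_CATEGORY[category],
    }
```

Defaulting unknown licenses to high risk keeps the failure mode conservative: a missing mapping produces a warning, not a false all-clear.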

Lineage + provenance

Derived-from / similar-to / benchmarked-against edges between datasets, plus the models that report training on them.
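The three edge types can be modeled as a small typed graph. The following is a minimal sketch under assumed names (`LineageGraph`, `add_edge`, `ancestors`); it shows one plausible way to walk derived-from edges and says nothing about how Datacrawlr stores lineage internally.

```python
from collections import defaultdict

EDGE_TYPES = {"derived-from", "similar-to", "benchmarked-against"}


class LineageGraph:
    """Directed graph of dataset slugs with typed edges (hypothetical structure)."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add_edge(self, source, edge_type, target):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self._edges[source].append((edge_type, target))

    def neighbors(self, slug, edge_type):
        """Return targets reachable from slug over one edge of the given type."""
        return [t for et, t in self._edges.get(slug, []) if et == edge_type]

    def ancestors(self, slug):
        """Follow derived-from edges transitively; cycles are visited only once."""
        seen = {slug}
        stack = [slug]
        order = []
        while stack:
            current = stack.pop()
            for parent in self.neighbors(current, "derived-from"):
                if parent not in seen:
                    seen.add(parent)
                    order.append(parent)
                    stack.append(parent)
        return order
```

Walking ancestors this way is useful for license checks: a dataset derived from a non-commercial parent may inherit that restriction even if its own license field says otherwise.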

Citations + attribution

BibTeX-quality citation strings when the source publishes them, creator credentials with ORCID when known.

Detail page

The dataset page is the work product.

Six tabs cover everything you need to make a build/skip decision: overview with AI summary, schema viewer, source attribution, lineage graph, citations, and a per-dataset discussion thread.

License risk pill flags non-commercial data before you build on it by accident.
Lineage tab links derived-from / similar-to / benchmarked-against datasets.
Models trained on the dataset surface as a rail in the overview.

View datasets code on GitHub →

datacrawlr.com/datasets/<slug>

Dataset detail page with schema, sources, and the models-trained-on rail.

Nine dataset sources, official APIs only

Every connector hits the source's published API or a structured open feed. No scraping behind authentication.

HuggingFace Datasets

Largest open hub for ML datasets — community uploads and benchmarks.

huggingface.co/datasets

OpenML

Research-grade tabular ML datasets with rich schema metadata.

openml.org

Zenodo

CERN-operated open research repository: datasets from across the sciences, each with a DOI.

zenodo.org

Kaggle

Competition datasets and community contributions.

kaggle.com/datasets

CKAN portals

Government open-data catalogs: data.gov, EU Open Data, and others.

data.gov +

figshare

Research outputs with DOIs and persistent storage.

figshare.com

Harvard Dataverse

Federated academic repository network.

dataverse.harvard.edu

GitHub (datasets-as-repos)

Project repos shipping CSV/JSONL/Parquet alongside training code.

github.com

Schema.org / DCAT

Structured metadata across the open web — institutional + long tail.

schema.org/Dataset

Compliance posture

Index, don't mirror.

Our architecture is the compliance story. Four commitments that apply to every entry in the catalog.

API-first ingestion

Every connector hits the source's official API or a structured open feed (DCAT, OAI-PMH, schema.org). We never scrape behind authentication or paywalls.
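To make the structured-feed case concrete, here is a sketch that projects a schema.org `Dataset` JSON-LD document into a single record shape. The `CanonicalEntry` class and its fields are hypothetical stand-ins for the canonical schema, which is not specified on this page; the JSON-LD keys it reads (`@type`, `name`, `description`, `license`, `url`) are standard schema.org properties.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CanonicalEntry:
    """Hypothetical canonical record; the real schema likely has more fields."""
    source: str
    title: str
    description: str
    license: Optional[str]
    landing_url: Optional[str]


def from_schema_org(raw_json, source):
    """Project a schema.org Dataset JSON-LD string into a CanonicalEntry."""
    doc = json.loads(raw_json)
    if doc.get("@type") != "Dataset":
        raise ValueError("not a schema.org Dataset")
    license_value = doc.get("license")
    # schema.org allows license to be a URL string or a CreativeWork object.
    if isinstance(license_value, dict):
        license_value = license_value.get("url") or license_value.get("name")
    return CanonicalEntry(
        source=source,
        title=doc.get("name", ""),
        description=doc.get("description", ""),
        license=license_value,
        landing_url=doc.get("url"),  # link back to the original host
    )
```

Only descriptive metadata is read; the record keeps a link to the original host and never touches the dataset's files.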

Metadata only — no mirroring

We index what a dataset or model is — not the bytes themselves. Every page links to the original host; the host stays the source of truth.

License-aware by default

Each entry is tagged with its license category and use terms. License risk badges surface non-commercial and restrictive terms before you build on them.

Robots.txt + rate limits respected

Identified User-Agent. Backoff on errors. Crawl-delay honored. Where a source publishes quotas, we stay well under them.
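These rules can be sketched with Python's standard `urllib.robotparser`. The user-agent string, the default delay, and the exponential backoff schedule below are assumptions for illustration; Datacrawlr's actual values are not stated here.

```python
import urllib.robotparser

# Hypothetical identified User-Agent with a contact URL.
USER_AGENT = "example-indexer/0.1 (+https://example.org/bot)"


def load_rules(robots_txt):
    """Parse robots.txt content that has already been fetched as text."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, default=1.0):
    """Seconds to wait between requests: Crawl-delay if published, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def backoff_delays(base, retries, cap=60.0):
    """Exponential backoff schedule for retrying after errors, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]
```

A connector would check `parser.can_fetch(USER_AGENT, url)` before each request, sleep for `polite_delay(parser)` between requests, and step through `backoff_delays(...)` when the source returns errors.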

Honest about what we are

Datacrawlr indexes metadata, not bytes. Every dataset page links back to the original host; we never mirror, cache, or redistribute the underlying data. License classifications are a starting point for discovery — not legal advice. Verify on the source before commercial use.

Find the right dataset without bookmarks.

Explore the code on GitHub to see how it works.

View GitHub Repo →

See model coverage
