Datasets
Datacrawlr harvests structured metadata from nine sources of open data — research repositories, ML community hubs, government portals — and projects every entry into a single canonical schema. You compare them like rows in a database.
Discovery
The hero search runs against every indexed dataset and model. Two-line previews per row. Hit enter for the full search page with AI synthesis and faceted filters.
Search + synthesis
Searches return real datasets ranked by BM25 + popularity. The synthesis card above the grid explains what the matches share, where they differ, and which to start with — with citations back to the actual rows.
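A minimal sketch of what "BM25 + popularity" ranking could look like. The text above does not say how the two signals are combined, so the additive blend, the `popularity_weight` value, the `k1`/`b` defaults, and the `title`/`description`/`downloads` field names are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n or 1.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(query, rows, popularity_weight=0.2):
    """Rank rows by BM25 text score plus a damped popularity bonus.

    `rows` are dicts with hypothetical keys: title, description, downloads.
    """
    docs = [(r["title"] + " " + r["description"]).lower().split() for r in rows]
    text_scores = bm25_scores(query.lower().split(), docs)
    ranked = []
    for row, text_score in zip(rows, text_scores):
        if text_score == 0:
            continue  # popularity alone never surfaces a non-matching row
        # log1p keeps very popular datasets from drowning out relevance.
        final = text_score + popularity_weight * math.log1p(row["downloads"])
        ranked.append((final, row["title"]))
    ranked.sort(reverse=True)
    return [title for _, title in ranked]
```

In this sketch popularity only reorders rows that already match the query; a row with no lexical match is dropped however popular it is.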
What we capture
Every dataset is described by the same fields regardless of where it came from — so cross-source comparisons actually work.
Field names, types, descriptions, and nullability — parsed from source-platform feeds where available, inferred from sampled rows where not.
Normalized license category (permissive / copyleft / non-commercial / restrictive / proprietary / public domain), commercial-use signal, and a risk pill.
Derived-from / similar-to / benchmarked-against edges between datasets, plus the models that report training on them.
BibTeX-quality citation strings when the source publishes them, plus creator credentials with ORCID when known.
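One way to picture the shared fields listed above is as a single record type. The following is a sketch only: every class, attribute, and function name is hypothetical, and the mapping from license category to risk level is an illustrative assumption, not the catalog's actual rule.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class LicenseCategory(Enum):
    PERMISSIVE = "permissive"
    COPYLEFT = "copyleft"
    NON_COMMERCIAL = "non-commercial"
    RESTRICTIVE = "restrictive"
    PROPRIETARY = "proprietary"
    PUBLIC_DOMAIN = "public domain"


@dataclass
class SchemaField:
    name: str
    type: str
    description: str = ""
    nullable: bool = True
    inferred: bool = False  # True when derived from sampled rows, not a feed


@dataclass
class Creator:
    name: str
    orcid: Optional[str] = None


@dataclass
class DatasetRecord:
    source: str
    source_url: str  # link back to the original host
    title: str
    fields: list[SchemaField] = field(default_factory=list)
    license: Optional[LicenseCategory] = None
    commercial_use: Optional[bool] = None
    derived_from: list[str] = field(default_factory=list)
    similar_to: list[str] = field(default_factory=list)
    benchmarked_against: list[str] = field(default_factory=list)
    trained_models: list[str] = field(default_factory=list)
    citation: Optional[str] = None
    creators: list[Creator] = field(default_factory=list)


def risk_level(category):
    """Assumed mapping from license category to a coarse risk label."""
    if category in (LicenseCategory.PERMISSIVE, LicenseCategory.PUBLIC_DOMAIN):
        return "low"
    if category is LicenseCategory.COPYLEFT:
        return "medium"
    return "high"  # non-commercial, restrictive, proprietary, or unknown


def shared_columns(a, b):
    """Names of fields that two records share with the same type."""
    common = {(f.name, f.type) for f in a.fields} & {(f.name, f.type) for f in b.fields}
    return sorted(name for name, _ in common)
```

Because both records use the same shape, a comparison such as `shared_columns(a, b)` works regardless of which source each record came from.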
Detail page
Six tabs cover everything you need to make a build/skip decision: overview with AI summary, schema viewer, source attribution, lineage graph, citations, and a per-dataset discussion thread.
Every connector hits the source's published API or a structured open feed. No scraping behind authentication.
Largest open hub for ML datasets — community uploads and benchmarks.
Research-grade tabular ML datasets with rich schema metadata.
CERN open research repository — peer-reviewed datasets with DOIs.
Competition datasets and community contributions.
Government open-data catalogs: data.gov, EU Open Data, and others.
Research outputs with DOIs and persistent storage.
Federated academic repository network.
Project repos shipping CSV/JSONL/Parquet alongside training code.
Structured metadata across the open web — institutional + long tail.
Compliance posture
Our architecture is the compliance story. Four commitments that apply to every entry in the catalog.
Every connector hits the source's official API or a structured open feed (DCAT, OAI-PMH, schema.org). We never scrape behind authentication or paywalls.
We index what a dataset or model is — not the bytes themselves. Every page links to the original host; the host stays the source of truth.
Each entry is tagged with its license category and use terms. License risk badges surface non-commercial and restrictive terms before you build on them.
Identified User-Agent. Backoff on errors. Crawl-delay honored. Where a source publishes quotas, we stay well under them.
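The polite-crawling commitment above (identified User-Agent, backoff on errors, Crawl-delay honored) can be sketched with the Python standard library. This is an illustration under assumptions, not the connectors' actual code: the `USER_AGENT` string, the function names, the retry count, and the delay values are all hypothetical.

```python
import time
import urllib.error
import urllib.request
from urllib import robotparser

# Hypothetical identity string; a real bot would publish its own contact URL.
USER_AGENT = "ExampleCatalogBot/0.1 (+https://example.org/bot)"


def crawl_delay_seconds(robots_txt, default=1.0):
    """Read the Crawl-delay that applies to USER_AGENT from robots.txt text."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def http_get(url):
    """Plain GET that always sends the identifying User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def fetch_with_backoff(url, fetch=http_get, retries=4, base_delay=1.0,
                       sleep=time.sleep):
    """Retry a failing fetch, doubling the wait after each error."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except (urllib.error.URLError, TimeoutError):
            if attempt == retries - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)
```

The `fetch` and `sleep` parameters are injectable so the retry logic can be exercised without touching the network or actually waiting.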
Honest about what we are
Datacrawlr indexes metadata, not bytes. Every dataset page links back to the original host; we never mirror, cache, or redistribute the underlying data. License classifications are a starting point for discovery — not legal advice. Verify on the source before commercial use.
Open the catalog and start finding what you actually need.