About Datacrawlr

An index for every dataset that matters.

Datacrawlr is a metadata layer for the open data ecosystem. We index dataset metadata from across the public internet — what's where, what it contains, what you can do with it — and make it searchable. We don't host data. We tell you where to find it.

What is Datacrawlr?

The open dataset ecosystem is fragmented. HuggingFace knows about HuggingFace, Kaggle knows about Kaggle, every government portal knows about itself, and academic repositories live in their own worlds. An ML engineer trying to pick the right training or evaluation data spends hours bouncing between sites that don't talk to each other — and still finishes with the uneasy feeling that something better exists somewhere they didn't look.

Datacrawlr harvests structured metadata from those platforms through their official APIs, normalizes it into a single schema, enriches it with task classifications, license analysis, and semantic embeddings, and exposes the result as a searchable catalog. Every dataset entry on Datacrawlr is described by the same fields — name, modality, license, size, schema, creators — regardless of which source it came from.
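To make the single-schema idea concrete, here is a minimal Python sketch of a canonical record and a per-source field mapping. The source names (source_a, source_b), the raw keys, and the normalize helper are hypothetical and chosen only for illustration; the actual schema and connectors are not shown on this page.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DatasetRecord:
    """One catalog entry, using the fields named in the text above."""
    name: str
    source: str                      # platform the metadata came from
    modality: Optional[str] = None   # e.g. "text", "tabular", "image"
    license: Optional[str] = None
    size_bytes: Optional[int] = None
    schema: dict[str, str] = field(default_factory=dict)  # column -> type
    creators: list[str] = field(default_factory=list)


# Hypothetical mappings: canonical field -> key in the raw payload.
# Real source payloads will differ; these names are illustrative only.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "source_a": {"name": "title", "license": "license_id",
                 "size_bytes": "bytes", "creators": "authors"},
    "source_b": {"name": "dataset_name", "license": "licence",
                 "size_bytes": "size", "creators": "owners"},
}


def normalize(source: str, raw: dict[str, Any]) -> DatasetRecord:
    """Map a raw payload from `source` onto the canonical record."""
    mapping = FIELD_MAPS[source]
    # Copy only the keys the payload actually contains; the rest keep defaults.
    values = {canon: raw[key] for canon, key in mapping.items() if key in raw}
    return DatasetRecord(source=source, **values)


a = normalize("source_a", {"title": "Example Corpus", "license_id": "cc-by-4.0",
                           "bytes": 1024, "authors": ["A. Author"]})
b = normalize("source_b", {"dataset_name": "Example Table", "licence": "mit"})
```

Both records end up with the same attribute names regardless of how the source spelled them, which is what makes entries from different platforms comparable.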

We index two kinds of things: datasets and models. The dataset side maps what data exists; the model side maps what's been trained on it. Together they give you the full picture for any ML decision — pick the right training corpus, then see every model that's been trained on it, with benchmark scores and license terms side by side.

Datacrawlr is not a hosting service, not a scraper, and not a CDN. We never download or redistribute the datasets themselves. Every page on Datacrawlr links back to the original source; we're the index that points there, not the place that holds the files.

The audience is concrete: ML engineers picking corpora to fine-tune on, researchers hunting for the right benchmark, data-science teams comparing licenses before a commercial build, and AI teams choosing pre-training data. If you've ever thought “there has to be a better dataset for this,” Datacrawlr is where you go to find it.

How it works

Datacrawlr is built around four pipelines that run continuously.

01
Discover
We index from a curated list of sources via their official APIs — HuggingFace, OpenML, Zenodo, Kaggle, CKAN portals, figshare, Harvard Dataverse, GitHub. For sources without APIs, we follow DOI graphs and schema.org Dataset markup to find new datasets. Every source has a refresh cadence and a rate-limit budget.
02
Normalize
Each source returns metadata in its own format. We map each one to a single canonical schema — name, modality, license, size, schema, creators, citations — so every dataset is comparable. Cross-source duplicates (same dataset published on multiple platforms) are detected and merged.
03
Enrich
We add classifications the source doesn't provide: ML task type, refined modality, license risk analysis, completeness scoring, and semantic embeddings for related-dataset discovery. AI-generated summaries explain why you'd choose each dataset and what to watch out for.
04
Serve
A FastAPI service exposes the index through a REST API. The frontend you're using right now reads from it. Keyword search uses BM25 ranking in OpenSearch; related-dataset lookups use vector similarity via pgvector.
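The cross-source duplicate handling described in the Normalize step can be sketched roughly as follows. This assumes a DOI is the strongest match signal and a normalized name is the fallback. Both the matching rule and the "first non-empty value wins" merge policy are assumptions for illustration, not a description of the production pipeline, and the record fields are hypothetical.

```python
import re
from collections import defaultdict


def dedup_key(record: dict) -> str:
    """Prefer a DOI when present; otherwise fall back to a normalized name."""
    doi = record.get("doi")
    if doi:
        return "doi:" + doi.lower()
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", record["name"].lower()).strip("-")
    return "name:" + slug


def merge_duplicates(records: list[dict]) -> list[dict]:
    """Group records by key and merge each group into a single entry."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        groups[dedup_key(rec)].append(rec)

    merged = []
    for group in groups.values():
        # Remember every platform the dataset was seen on.
        entry = {"sources": sorted({r["source"] for r in group})}
        for rec in group:
            for field_name, value in rec.items():
                # First non-empty value wins; later records only fill gaps.
                if field_name != "source" and value and not entry.get(field_name):
                    entry[field_name] = value
        merged.append(entry)
    return merged


records = [
    {"name": "Example Corpus", "source": "source_a", "license": "cc-by-4.0"},
    {"name": "example-corpus", "source": "source_b", "license": None},
    {"name": "Other Set", "source": "source_a", "license": "mit"},
]
catalog = merge_duplicates(records)
```

Here the first two records collapse into one entry listing both sources, while the third stays separate. A key this simple would miss a pair where only one copy carries a DOI, so a real matcher would likely need additional signals.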

The model-dataset graph

Datacrawlr's signature feature is the link between models and datasets. When a model's card declares its training data, we connect them. When you view a dataset, you see which models were trained on it. When you view a model, you see what it was trained on. This is the layer that makes Datacrawlr more than a directory — it's an index of ML provenance.
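One way to picture this link is a small bidirectional index, sketched below in Python. The class name, method names, and identifiers are hypothetical, and the sketch assumes the training datasets have already been extracted from the model card as a list of IDs.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Bidirectional index between models and the datasets they declare."""

    def __init__(self) -> None:
        self._datasets_by_model: dict[str, set[str]] = defaultdict(set)
        self._models_by_dataset: dict[str, set[str]] = defaultdict(set)

    def add_model_card(self, model_id: str, declared_datasets: list[str]) -> None:
        """Record each declared dataset in both directions."""
        for dataset_id in declared_datasets:
            self._datasets_by_model[model_id].add(dataset_id)
            self._models_by_dataset[dataset_id].add(model_id)

    def trained_on(self, model_id: str) -> list[str]:
        """Datasets a model declares as training data."""
        return sorted(self._datasets_by_model.get(model_id, set()))

    def models_for(self, dataset_id: str) -> list[str]:
        """Models that declare this dataset as training data."""
        return sorted(self._models_by_dataset.get(dataset_id, set()))


graph = ProvenanceGraph()
graph.add_model_card("org/model-small", ["corpus-a", "corpus-b"])
graph.add_model_card("org/model-large", ["corpus-a"])
```

Storing both directions up front keeps the dataset page ("which models used this?") and the model page ("what was this trained on?") equally cheap to answer.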

Sources

We focus on official APIs from established repositories. Every source listed below is a place where dataset publishers themselves choose to put their work.

HuggingFace Datasets

The largest open hub for ML datasets — fine-tuning corpora, benchmarks, and community uploads.

huggingface.co/datasets

OpenML

Research-grade catalog of tabular ML datasets with rich schema metadata and benchmarking tasks.

openml.org

Zenodo

CERN-operated open research repository hosting research datasets with DOIs across scientific domains.

zenodo.org

Kaggle

Competition datasets, community contributions, and corporate releases curated by Kaggle.

kaggle.com/datasets

CKAN open data portals

Government and institutional open-data catalogs — public spending, climate, transit, health, statistics.

data.gov, EU Open Data, data.gov.uk, and more

figshare

General-purpose research outputs platform — datasets, figures, posters, with DOIs and persistent storage.

figshare.com

Harvard Dataverse

Federated academic repository network — strong in social science, replication packages, and survey data.

dataverse.harvard.edu

GitHub (datasets-as-repos)

Project repositories that ship CSV/JSONL/Parquet datasets alongside training code and documentation.

github.com

Schema.org / DCAT discovery

Structured metadata published as schema.org Dataset markup and DCAT feeds — institutional and long-tail sources.

the open web

HuggingFace Models

Largest open registry of model weights — checkpoints, configs, and benchmark scores for the open-weights ecosystem.

huggingface.co/models

OpenRouter

Unified pricing + provider catalog for commercial-API models; the canonical source for context-window and per-token costs.

openrouter.ai

If you maintain a dataset repository with an open API and want Datacrawlr to index it, we'd love to add you — there's a link to open an issue at the bottom of this page.

How we stay compliant

Datacrawlr is a metadata index, not a content host. Our compliance posture isn't an afterthought — it's the entire architecture.

API-first.
Every connector uses the official API of its source. We never scrape behind authentication, paywalls, or rate-limit barriers. If a source doesn't offer an API, we look for structured metadata they publish openly (schema.org markup, OAI-PMH, DCAT) instead.
Robots.txt respected.
For the small set of sources we discover via web crawling (schema.org/DCAT pages), we respect robots.txt and crawl-delay directives. We identify ourselves with a clear User-Agent and a contact URL.
Rate limits honored.
Every source has a documented rate limit, and we stay well under it. We use exponential backoff on errors and never retry past the source's published quotas.
Metadata only — no mirroring.
We index metadata, not data. We never download, cache, or redistribute the actual dataset payloads. Every dataset page on Datacrawlr links to the original host.
Attribution preserved.
Every dataset includes its license, creators, and source platform. Citation strings (BibTeX or otherwise) are surfaced when available.
Takedown workflow.
If you're the maintainer of a dataset and want it removed from our index, you can request removal through a documented process. We respond within 5 business days.
No private datasets.
We do not index gated, authenticated, or private datasets — even if our API key would technically allow it. Public listings only.
Public data only.
We index only what is publicly accessible to anonymous users. Datasets requiring institutional credentials, paid subscriptions, or special permissions are not indexed.
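Two of the rules above, respecting robots.txt and backing off within a fixed retry budget, can be sketched with the Python standard library. The User-Agent string, the sample robots.txt, and the numeric limits are hypothetical; the backoff uses the common "full jitter" variant, which is an assumption rather than a statement about the production crawler.

```python
import random
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleIndexBot"  # hypothetical; not the real User-Agent string


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def backoff_delays(base: float, cap: float, max_retries: int,
                   floor: float = 0.0, rng=random.random) -> list[float]:
    """Exponential backoff with full jitter and a hard retry budget.

    `floor` is a minimum wait, for example a Crawl-delay value.
    """
    delays = []
    for attempt in range(max_retries):
        # The ceiling doubles on each attempt until it reaches the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick a random wait up to the ceiling, never below floor.
        delays.append(max(floor, ceiling * rng()))
    return delays


SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = load_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch(USER_AGENT, "https://example.org/datasets/1")
blocked = rules.can_fetch(USER_AGENT, "https://example.org/private/1")
# crawl_delay() returns None when the directive is absent.
floor = rules.crawl_delay(USER_AGENT) or 0
schedule = backoff_delays(base=1.0, cap=8.0, max_retries=5, floor=floor)
```

Because the schedule has a fixed length, the caller stops after max_retries attempts instead of retrying indefinitely, which is one simple way to stay inside a source's quota.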

Datacrawlr's license classifications, recommendations, and AI-generated summaries are aids for discovery, not legal advice. Always verify a dataset's terms on its original source before commercial use.

Mission

Open machine learning runs on open data. But the open data ecosystem is fragmented — HuggingFace knows about HuggingFace, Kaggle knows about Kaggle, every government portal knows about itself. There's no single place to ask “what's the best dataset for [task] under [license]?”

The result: ML engineers default to whatever's most discoverable, not whatever's best. Important datasets sit unused on niche repositories. Researchers reinvent benchmarks because they couldn't find the existing one. Models get trained on whatever was easy to find rather than what was right for the problem.

Datacrawlr's mission is to make the entire open dataset ecosystem visible and comparable — so the question “which dataset should I use?” has a real answer. We're a discovery layer for the open data web, the way Google was a discovery layer for the open web.

Who builds this

Datacrawlr is built and maintained by a small team with a data architecture background and an active interest in applied ML for scientific domains. We use Datacrawlr ourselves when building models — every feature on this site exists because we wanted it for our own work and couldn't find anything that already did the job. If a feature feels missing, it's because we haven't needed it yet — open an issue and tell us why.

Built API-first · Maintained continuously

Get in touch

Found a dataset that should be indexed? A bug? A new source we should add? Open an issue on GitHub or email us.

GitHub · Email · Request a takedown
