About Datacrawlr
Datacrawlr is a metadata layer for the open data ecosystem. We index dataset metadata from across the public internet — what's where, what it contains, what you can do with it — and make it searchable. We don't host data. We tell you where to find it.
The open dataset ecosystem is fragmented. HuggingFace knows about HuggingFace, Kaggle knows about Kaggle, every government portal knows about itself, and academic repositories live in their own worlds. An ML engineer trying to pick the right training or evaluation data spends hours bouncing between sites that don't talk to each other — and still finishes with the uneasy feeling that something better exists somewhere they didn't look.
Datacrawlr harvests structured metadata from those platforms through their official APIs, normalizes it into a single schema, enriches it with task classifications, license analysis, and semantic embeddings, and exposes the result as a searchable catalog. Every dataset entry on Datacrawlr is described by the same fields — name, modality, license, size, schema, creators — regardless of which source it came from.
We index two kinds of things: datasets and models. The dataset side maps what data exists; the model side maps what has been trained on it. Together they support a common workflow: pick a training corpus, then see the models whose cards declare it as training data, with benchmark scores and license terms side by side.
Datacrawlr is not a hosting service, not a scraper, and not a CDN. We never download or redistribute the datasets themselves. Every page on Datacrawlr links back to the original source; we're the index that points there, not the place that holds the files.
The audience is concrete: ML engineers picking corpora to fine-tune on, researchers hunting for the right benchmark, data-science teams comparing licenses before a commercial build, and AI teams choosing pre-training data. If you've ever thought “there has to be a better dataset for this,” Datacrawlr is the place to look.
Datacrawlr is built around four pipelines that run continuously.
We index from a curated list of sources via their official APIs — HuggingFace, OpenML, Zenodo, Kaggle, CKAN portals, figshare, Harvard Dataverse, GitHub. For sources without APIs, we follow DOI graphs and schema.org Dataset markup to find new datasets. Every source has a refresh cadence and a rate-limit budget.
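As a rough illustration of a per-source refresh cadence, the sketch below picks the sources that are due for another harvest. The source names, field names (`cadence`, `last_run`), and intervals are assumptions made for the example, not Datacrawlr's actual configuration.

```python
from datetime import datetime, timedelta

def due_sources(sources, now):
    """Return names of sources whose refresh interval has elapsed.

    Each source is a dict with hypothetical keys:
      name     - label for the source
      cadence  - timedelta between harvests
      last_run - datetime of the last harvest, or None if never harvested
    """
    due = []
    for source in sources:
        last_run = source["last_run"]
        if last_run is None or now - last_run >= source["cadence"]:
            due.append(source["name"])
    return due

# Illustrative values only.
now = datetime(2024, 1, 2, 12, 0)
sources = [
    {"name": "source_a", "cadence": timedelta(hours=6), "last_run": datetime(2024, 1, 2, 9, 0)},
    {"name": "source_b", "cadence": timedelta(days=1), "last_run": datetime(2024, 1, 1, 11, 0)},
    {"name": "source_c", "cadence": timedelta(days=7), "last_run": None},
]
print(due_sources(sources, now))
```

A scheduler built this way would pair each due source with its rate-limit budget before issuing any requests.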
Each source returns metadata in its own format. We map each one to a single canonical schema — name, modality, license, size, schema, creators, citations — so every dataset is comparable. Cross-source duplicates (same dataset published on multiple platforms) are detected and merged.
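The mapping and merge step might look roughly like the sketch below. The source labels, raw field names, and the DOI value are all made up for illustration; the real connectors and matching rules are not described on this page. Duplicates are matched here on DOI when one exists and on a normalized name otherwise, which is one simple heuristic among many.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DatasetRecord:
    """Small subset of a canonical schema (fields chosen for the example)."""
    name: str
    license: Optional[str] = None
    doi: Optional[str] = None
    sources: List[str] = field(default_factory=list)

# Hypothetical per-source field maps: canonical field -> raw field name.
FIELD_MAPS = {
    "source_a": {"name": "id", "license": "license_tag", "doi": "doi"},
    "source_b": {"name": "title", "license": "rights", "doi": "identifier"},
}

def normalize(source: str, raw: dict) -> DatasetRecord:
    """Map one raw payload onto the canonical record."""
    fmap = FIELD_MAPS[source]
    doi = (raw.get(fmap["doi"]) or "").strip().lower() or None
    return DatasetRecord(
        name=raw[fmap["name"]].strip(),
        license=raw.get(fmap["license"]),
        doi=doi,
        sources=[source],
    )

def dedupe_key(rec: DatasetRecord) -> str:
    """Prefer the DOI; fall back to the name with punctuation and case removed."""
    if rec.doi:
        return "doi:" + rec.doi
    return "name:" + re.sub(r"[^a-z0-9]+", "", rec.name.lower())

def merge(records: List[DatasetRecord]) -> List[DatasetRecord]:
    """Merge records sharing a key, filling gaps and unioning sources."""
    merged = {}
    for rec in records:
        key = dedupe_key(rec)
        if key not in merged:
            merged[key] = rec
            continue
        kept = merged[key]
        kept.license = kept.license or rec.license
        kept.sources.extend(s for s in rec.sources if s not in kept.sources)
    return list(merged.values())

# Made-up payloads describing the same dataset on two platforms.
a = normalize("source_a", {"id": "Iris Data", "license_tag": None, "doi": "10.1234/ABC"})
b = normalize("source_b", {"title": "iris-data", "rights": "CC-BY-4.0", "identifier": "10.1234/abc"})
catalog = merge([a, b])
```

Note that this heuristic will miss a duplicate when only one copy carries a DOI; a production matcher would need additional signals.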
We add classifications the source doesn't provide: ML task type, refined modality, license risk analysis, completeness scoring, and semantic embeddings for related-dataset discovery. AI-generated summaries explain why you'd choose each dataset and what to watch out for.
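One plausible reading of completeness scoring is the share of canonical fields that a record actually fills in. The sketch below assumes equal weight per field and a dict-shaped entry; both are assumptions for illustration, and the real scoring may weight fields differently.

```python
# Canonical field names taken from the schema description above.
CANONICAL_FIELDS = ("name", "modality", "license", "size", "schema", "creators", "citations")

# Values treated as "missing" in this sketch.
EMPTY_VALUES = (None, "", [], {})

def completeness(entry: dict) -> float:
    """Fraction of canonical fields with a non-empty value (equal weights assumed)."""
    filled = sum(1 for name in CANONICAL_FIELDS if entry.get(name) not in EMPTY_VALUES)
    return filled / len(CANONICAL_FIELDS)

sparse = {"name": "example", "license": "MIT", "creators": []}
full = {name: "x" for name in CANONICAL_FIELDS}
print(round(completeness(sparse), 2), completeness(full))
```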
A FastAPI service exposes the index through a REST API. The frontend you're using right now reads from it. Keyword search uses BM25 ranking in OpenSearch; related-dataset lookups use vector similarity via pgvector.
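OpenSearch computes BM25 internally, so nothing like the following runs in the service itself. The sketch only shows, on toy data, what the ranking function does: rare query terms count for more, and long documents are penalized. It uses the common Lucene-style formula with `k1 = 1.2` and `b = 0.75`; the whitespace tokenizer and sample descriptions are assumptions for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with BM25 (naive whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "image classification benchmark",
    "tabular climate statistics",
    "climate image archive of satellite image tiles",
]
scores = bm25_scores("climate statistics", docs)
```

Here the second description matches both query terms and ranks first, the third matches one term, and the first matches none.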
Datacrawlr's signature feature is the link between models and datasets. When a model's card declares its training data, we connect them. When you view a dataset, you see which models were trained on it. When you view a model, you see what it was trained on. This is the layer that makes Datacrawlr more than a directory — it's an index of ML provenance.
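The two-way lookup described above can be sketched as a pair of indexes built from declared training data. The card shape (`model_id`, `datasets`) and the identifiers are hypothetical; models that declare nothing simply end up with no links.

```python
from collections import defaultdict

def build_links(model_cards):
    """Build model -> datasets and dataset -> models lookups from declared data."""
    trained_on = {}
    used_by = defaultdict(set)
    for card in model_cards:
        declared = set(card.get("datasets") or [])
        trained_on[card["model_id"]] = declared
        for dataset_id in declared:
            used_by[dataset_id].add(card["model_id"])
    return trained_on, dict(used_by)

# Illustrative cards only.
cards = [
    {"model_id": "model_1", "datasets": ["dataset_a", "dataset_b"]},
    {"model_id": "model_2", "datasets": ["dataset_b"]},
    {"model_id": "model_3"},  # no declared training data
]
trained_on, used_by = build_links(cards)
```

Because links come only from what a card declares, the absence of a link means "not declared", not "not used".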
We focus on official APIs from established repositories. Every source listed below is a place where dataset publishers themselves choose to put their work.
The largest open hub for ML datasets: fine-tuning corpora, benchmarks, and community uploads. (huggingface.co/datasets)
Research-grade catalog of tabular ML datasets with rich schema metadata and benchmarking tasks. (openml.org)
CERN-operated open research repository: peer-reviewed datasets with DOIs across scientific domains. (zenodo.org)
Competition datasets, community contributions, and corporate releases curated by Kaggle. (kaggle.com/datasets)
Government and institutional open-data catalogs: public spending, climate, transit, health, statistics. (data.gov, EU Open Data, data.gov.uk, +)
General-purpose research outputs platform: datasets, figures, posters, with DOIs and persistent storage. (figshare.com)
Federated academic repository network, strong in social science, replication packages, and survey data. (dataverse.harvard.edu)
Project repositories that ship CSV/JSONL/Parquet datasets alongside training code and documentation. (github.com)
Structured metadata published as schema.org Dataset markup and DCAT feeds: institutional and long-tail sources. (the open web)
Largest open registry of model weights: checkpoints, configs, and benchmark scores for the open-weights ecosystem. (huggingface.co/models)
Unified pricing and provider catalog for commercial-API models; the canonical source for context-window and per-token costs. (openrouter.ai)
If you maintain a dataset repository with an open API and want Datacrawlr to index it, we'd love to add you. There's a link to open an issue at the bottom of this page.
Datacrawlr is a metadata index, not a content host. Our compliance posture isn't an afterthought — it's the entire architecture.
Every connector uses the official API of its source. We never scrape behind authentication, paywalls, or rate-limit barriers. If a source doesn't offer an API, we look for structured metadata they publish openly (schema.org markup, OAI-PMH, DCAT) instead.
For the small set of sources we discover via web crawling (schema.org/DCAT pages), we respect robots.txt and crawl-delay directives. We identify ourselves with a clear User-Agent and a contact URL.
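A minimal sketch of that check, using Python's standard `urllib.robotparser`. The User-Agent string, contact URL, and one-second fallback delay are placeholders, not Datacrawlr's actual values; the robots.txt content is parsed from a string here so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identity: a crawler name plus a contact URL.
USER_AGENT = "ExampleMetadataBot/0.1 (+https://example.org/bot)"

def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def fetch_plan(parser: RobotFileParser, url: str, default_delay: float = 1.0):
    """Return seconds to wait before fetching url, or None if fetching is disallowed."""
    if not parser.can_fetch(USER_AGENT, url):
        return None
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_delay

rules = load_rules("User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n")
allowed = fetch_plan(rules, "https://example.org/datasets/1")
blocked = fetch_plan(rules, "https://example.org/private/report")
```

In this example the first URL may be fetched after the advertised five-second delay, and the second is skipped entirely.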
Every source has a documented rate limit, and we stay well under it. We use exponential backoff on errors and never retry past the source's published quotas.
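The retry policy can be sketched as exponential backoff that also draws from a request budget, so retries stop once the budget is spent. The budget shape, retry count, base delay, and the choice of `ConnectionError` as the retryable error are assumptions for illustration.

```python
import random
import time

class QuotaExceeded(Exception):
    """Raised when the request budget for a source is used up."""

def call_with_backoff(fetch, budget, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying transient errors with exponential backoff and jitter.

    budget is a dict like {"remaining": 100}; every attempt, including
    retries, consumes one unit, so retries never exceed the budget.
    """
    for attempt in range(max_retries + 1):
        if budget["remaining"] <= 0:
            raise QuotaExceeded("request budget exhausted")
        budget["remaining"] -= 1
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries:
                raise
            delay = base_delay * 2 ** attempt
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))

# Simulated source that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

delays = []
budget = {"remaining": 5}
result = call_with_backoff(flaky_fetch, budget, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast to run; in real use the default `time.sleep` would apply the delays.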
We index metadata, not data. We never download, cache, or redistribute the actual dataset payloads. Every dataset page on Datacrawlr links to the original host.
Every dataset includes its license, creators, and source platform. Citation strings (BibTeX or otherwise) are surfaced when available.
If you're the maintainer of a dataset and want it removed from our index, you can request removal through a documented process. We respond within 5 business days.
We do not index gated, authenticated, or private datasets — even if our API key would technically allow it. Public listings only.
We index only what is publicly accessible to anonymous users. Datasets requiring institutional credentials, paid subscriptions, or special permissions are not indexed.
Datacrawlr's license classifications, recommendations, and AI-generated summaries are aids for discovery, not legal advice. Always verify a dataset's terms on its original source before commercial use.
Open machine learning runs on open data. But the open data ecosystem is fragmented — HuggingFace knows about HuggingFace, Kaggle knows about Kaggle, every government portal knows about itself. There's no single place to ask “what's the best dataset for [task] under [license]?”
The result: ML engineers default to whatever's most discoverable, not whatever's best. Important datasets sit unused on niche repositories. Researchers reinvent benchmarks because they couldn't find the existing one. Models get trained on whatever was easy to find rather than what was right for the problem.
Datacrawlr's mission is to make the entire open dataset ecosystem visible and comparable — so the question “which dataset should I use?” has a real answer. We're a discovery layer for the open data web, the way Google was a discovery layer for the open web.
Datacrawlr is built and maintained by a small team with a data architecture background and an active interest in applied ML for scientific domains. We use Datacrawlr ourselves when building models — every feature on this site exists because we wanted it for our own work and couldn't find anything that already did the job. If a feature feels missing, it's because we haven't needed it yet — open an issue and tell us why.
Found a dataset that should be indexed? A bug? A new source we should add? Open an issue on GitHub or email us.