Datacrawlr indexes every dataset and every open or commercial model worth knowing about — schemas, licenses, benchmark scores, pricing, and the link between the two. Metadata only. We point you to the source.
The problem
HuggingFace knows about HuggingFace, Kaggle knows about Kaggle, every government portal knows about itself. There's no single place to ask which dataset should I train on, and which model should I use it with? So engineers default to whatever's most discoverable, not whatever's right for the problem.
Datacrawlr is the discovery layer that closes the loop — datasets and the models trained on them, in one searchable index.
Search
Type-ahead suggestions while you type. Full-text search across every indexed entry. An AI synthesis card at the top of every result page explains what the matches share and what to actually pick — with citations back to the underlying datasets.
The model-dataset graph
When a model card declares its training data, we connect them. Open a dataset and see which models were trained on it. Open a model and see what it learned from. This is the layer that turns Datacrawlr from a directory into an index of ML provenance.
Models directory
Open-weights and commercial. Benchmark scores normalized across the Open LLM Leaderboard, vendor reports, and Chatbot Arena. License risk pills, per-token pricing, context windows — and a leaderboard surface for every benchmark we track.
Explore
Modality, ML task, source, license, freshness — and now models. The Explore dashboard slices the index by what you care about, so you can answer 'what's actually in here?' without typing a query.
Every connector uses an official API or structured feed — never scraping behind auth.
Largest open hub for ML datasets — community uploads and benchmarks.
Research-grade tabular ML datasets with rich schema metadata.
CERN open research repository — peer-reviewed datasets with DOIs.
Competition datasets and community contributions.
Government open-data catalogs — data.gov, EU Open Data, +.
Research outputs with DOIs and persistent storage.
Federated academic repository network.
Project repos shipping CSV/JSONL/Parquet alongside training code.
Structured metadata across the open web — institutional + long tail.
Largest open registry of model weights, configs, and cards.
Unified pricing + provider catalog for commercial-API models.
Compliance posture
Our architecture is the compliance story. Four commitments that apply to every entry in the catalog.
Every connector hits the source's official API or a structured open feed (DCAT, OAI-PMH, schema.org). We never scrape behind authentication or paywalls.
We index what a dataset or model is — not the bytes themselves. Every page links to the original host; the host stays the source of truth.
Each entry is tagged with its license category and use terms. License risk badges surface non-commercial and restrictive terms before you build on them.
Identified User-Agent. Backoff on errors. Crawl-delay honored. Where a source publishes quotas, we stay well under them.
Open the catalog and start finding what you actually need.