New · Models directory + benchmarks

The dataset and model intelligence layer.

Datacrawlr indexes every dataset and every open or commercial model worth knowing about — schemas, licenses, benchmark scores, pricing, and the link between the two. Metadata only. We point you to the source.

View GitHub Repo →
See features

datacrawlr.com/datasets/hf-meta-llama-llama-4-maverick

Datacrawlr — a dataset detail page showing lineage and the models trained on the dataset

Datasets indexed: 9

The problem

The open ML ecosystem is fragmented.

HuggingFace knows about HuggingFace, Kaggle knows about Kaggle, and every government portal knows about itself. There's no single place to ask: which dataset should I train on, and which model should I use with it? So engineers default to whatever's most discoverable, not whatever's right for the problem.

Datacrawlr is the discovery layer that closes the loop — datasets and the models trained on them, in one searchable index.

Semantic search with AI synthesis.

Type-ahead suggestions as you type. Full-text search across every indexed entry. An AI synthesis card at the top of every results page explains what the matches have in common and which one to pick, with citations back to the underlying datasets.

BM25 over OpenSearch with vector similarity for related entries.
Filters by modality, license, source, and freshness.
Free-tier results respect the same ranking as paid ones.
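
As a rough illustration of how the lexical half of such a query might be assembled, the sketch below builds an OpenSearch query-DSL body: a `multi_match` clause (scored with BM25, OpenSearch's default similarity) plus non-scoring `term` filters. The field names (`title`, `description`, `modality`, `license`, `source`) and the helper `build_search_body` are assumptions for illustration, not Datacrawlr's actual schema, and the vector-similarity step for related entries is not shown.

```python
from typing import Any, Optional


def build_search_body(
    text: str,
    modality: Optional[str] = None,
    license_id: Optional[str] = None,
    source: Optional[str] = None,
    size: int = 20,
) -> dict[str, Any]:
    """Build a request body: BM25-scored text match plus exact-value filters."""
    # Filters live in the bool query's `filter` context, so they narrow the
    # result set without changing the relevance score.
    filters = [
        {"term": {field: value}}
        for field, value in (
            ("modality", modality),
            ("license", license_id),
            ("source", source),
        )
        if value is not None
    ]
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            # Hypothetical fields; "^2" boosts title matches.
                            "fields": ["title^2", "description"],
                        }
                    }
                ],
                "filter": filters,
            }
        },
    }


# Example: the 'medical imaging' search, restricted to image datasets.
body = build_search_body("medical imaging", modality="image", license_id="cc-by-4.0")
```

The resulting dictionary could be passed as the body of a search request; keeping filters out of the scoring clause is one way to keep ranking identical regardless of which filters a user applies.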

View search code on GitHub →

datacrawlr.com/search?q=medical+imaging

Search results for 'medical imaging' showing an AI synthesis card and ranked dataset cards.

The model-dataset graph

See what was trained on what.

When a model card declares its training data, we connect them. Open a dataset and see which models were trained on it. Open a model and see what it learned from. This is the layer that turns Datacrawlr from a directory into an index of ML provenance.

Trained-on / fine-tuned-on / evaluated-on, with confidence per link.
Bidirectional: every dataset surfaces its downstream models.
Lineage edges between related datasets where source platforms expose them.
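
The link model described above can be sketched as a small in-memory structure. This is a minimal illustration only: the class names, the relation strings, the 0-to-1 confidence scale, and the example slugs are assumptions, and the real index presumably lives in a graph database rather than in Python dictionaries.

```python
from collections import defaultdict
from dataclasses import dataclass

RELATIONS = {"trained_on", "fine_tuned_on", "evaluated_on"}


@dataclass(frozen=True)
class Link:
    model: str
    dataset: str
    relation: str      # one of RELATIONS
    confidence: float  # assumed scale: 0.0 (weak evidence) to 1.0 (declared)


class ProvenanceGraph:
    def __init__(self) -> None:
        self._by_dataset: dict[str, list[Link]] = defaultdict(list)
        self._by_model: dict[str, list[Link]] = defaultdict(list)

    def add(self, link: Link) -> None:
        if link.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {link.relation}")
        if not 0.0 <= link.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        # Index the same link from both ends so lookups work in either direction.
        self._by_dataset[link.dataset].append(link)
        self._by_model[link.model].append(link)

    def models_for(self, dataset: str, min_confidence: float = 0.0) -> list[Link]:
        """Downstream models of a dataset, strongest links first."""
        links = [
            link
            for link in self._by_dataset.get(dataset, [])
            if link.confidence >= min_confidence
        ]
        return sorted(links, key=lambda link: link.confidence, reverse=True)

    def datasets_for(self, model: str) -> list[Link]:
        """Everything a model declares it was trained, tuned, or evaluated on."""
        return list(self._by_model.get(model, []))


# Hypothetical slugs, for illustration only.
graph = ProvenanceGraph()
graph.add(Link("example-model-a", "example-dataset", "trained_on", 0.9))
graph.add(Link("example-model-b", "example-dataset", "evaluated_on", 0.6))
downstream = [link.model for link in graph.models_for("example-dataset")]
```

Storing each link once and indexing it under both endpoints is what makes the view bidirectional: opening a dataset and opening a model read the same edges.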

View graph database code on GitHub →

datacrawlr.com/datasets/<slug>

Dataset detail page with overview, schema, and the 'Models trained on this' rail.

Models directory

Every model worth knowing about.

Open-weights and commercial. Benchmark scores normalized across the Open LLM Leaderboard, vendor reports, and Chatbot Arena. License risk pills, per-token pricing, context windows — and a leaderboard surface for every benchmark we track.

Filter by access type, organization, parameters, license, and modality.
Composite score blends MMLU-Pro, GPQA, HumanEval, MATH, and IFEval.
Side-by-side comparison of up to four models.
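
The page does not state how the five benchmarks are weighted, so the sketch below simply assumes an equal-weight mean of scores already normalized to a 0-100 scale, and returns no score when too few benchmarks are reported. The function name, the `min_benchmarks` threshold, and the example numbers are illustrative assumptions, not Datacrawlr's formula or real results.

```python
from statistics import fmean
from typing import Mapping, Optional

BENCHMARKS = ("MMLU-Pro", "GPQA", "HumanEval", "MATH", "IFEval")


def composite_score(
    scores: Mapping[str, float], min_benchmarks: int = 3
) -> Optional[float]:
    """Equal-weight mean over the tracked benchmarks that a model reports."""
    present = [scores[name] for name in BENCHMARKS if name in scores]
    # Refuse to blend when coverage is too thin to be comparable.
    if len(present) < min_benchmarks:
        return None
    return round(fmean(present), 1)


# Made-up scores for a hypothetical model, not real benchmark results.
example = composite_score({"MMLU-Pro": 60.0, "GPQA": 40.0, "HumanEval": 80.0})
sparse = composite_score({"GPQA": 40.0})
```

Here `example` is 60.0 and `sparse` is `None`. A weighted blend would only need a per-benchmark weight table in place of the plain mean.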

View models code on GitHub →

datacrawlr.com/models/<slug>

Model detail page showing benchmark bars, license risk pill, and pricing rows.

Explore

Every angle on the catalog.

Modality, ML task, source, license, freshness — and now models. The Explore dashboard slices the index by what you care about, so you can answer 'what's actually in here?' without typing a query.

Donut breakdowns by modality and license type.
Per-source health and refresh cadence.
Trending + freshness rails reflect the last index pass.

View dashboard code on GitHub →

datacrawlr.com/explore

Explore dashboard with stats, domain grid, and trending rail.

Where we index from

Every connector uses an official API or structured feed — never scraping behind auth.

HuggingFace Datasets

Largest open hub for ML datasets — community uploads and benchmarks.

huggingface.co/datasets

OpenML

Research-grade tabular ML datasets with rich schema metadata.

openml.org

Zenodo

CERN open research repository — peer-reviewed datasets with DOIs.

zenodo.org

Kaggle

Competition datasets and community contributions.

kaggle.com/datasets

CKAN portals

Government open-data catalogs: data.gov, EU Open Data, and others.

data.gov and other portals

figshare

Research outputs with DOIs and persistent storage.

figshare.com

Harvard Dataverse

Federated academic repository network.

dataverse.harvard.edu

GitHub (datasets-as-repos)

Project repos shipping CSV/JSONL/Parquet alongside training code.

github.com

Schema.org / DCAT

Structured metadata across the open web — institutional + long tail.

schema.org/Dataset

HuggingFace Models

Largest open registry of model weights, configs, and cards.

huggingface.co/models

OpenRouter

Unified pricing + provider catalog for commercial-API models.

openrouter.ai

Compliance posture

Index, don't mirror.

Our architecture is the compliance story. Four commitments that apply to every entry in the catalog.

API-first ingestion

Every connector hits the source's official API or a structured open feed (DCAT, OAI-PMH, schema.org). We never scrape behind authentication or paywalls.

Metadata only — no mirroring

We index what a dataset or model is — not the bytes themselves. Every page links to the original host; the host stays the source of truth.
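
One way to picture metadata-only indexing is a record built from a schema.org `Dataset` JSON-LD document that keeps descriptive fields and a link to the host, and deliberately ignores the download entries. The `CatalogEntry` type, the helper name, and the example document are assumptions for illustration; `name`, `url`, `license`, `description`, and `distribution` are standard schema.org `Dataset` properties.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    source_url: str  # link back to the original host
    license: Optional[str]
    description: Optional[str]


def entry_from_jsonld(raw: str) -> CatalogEntry:
    """Build a metadata-only record from a schema.org Dataset document."""
    doc = json.loads(raw)
    if doc.get("@type") != "Dataset":
        raise ValueError("expected a schema.org Dataset")
    license_value = doc.get("license")
    # schema.org allows `license` to be a URL string or a CreativeWork object.
    if isinstance(license_value, dict):
        license_value = license_value.get("url") or license_value.get("name")
    # `distribution` (the download links) is intentionally not read:
    # the record describes the dataset, it does not copy or fetch the bytes.
    return CatalogEntry(
        name=doc["name"],
        source_url=doc["url"],
        license=license_value,
        description=doc.get("description"),
    )


# Hypothetical document, for illustration only.
RAW = json.dumps({
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Demo dataset",
    "url": "https://example.org/datasets/demo",
    "license": {"@type": "CreativeWork", "url": "https://creativecommons.org/licenses/by/4.0/"},
    "distribution": [{"@type": "DataDownload", "contentUrl": "https://example.org/demo.csv"}],
})
entry = entry_from_jsonld(RAW)
```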

License-aware by default

Each entry is tagged with its license category and use terms. License risk badges surface non-commercial and restrictive terms before you build on them.
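
A license badge of this kind could be driven by a simple lookup from license identifier to category. The sketch below uses real SPDX identifiers, but the category names, the choice of which categories count as restrictive, and the treatment of unrecognized licenses as `unknown` are assumptions, not Datacrawlr's actual taxonomy.

```python
from typing import Optional

# SPDX identifier -> hypothetical risk category.
LICENSE_CATEGORY = {
    "MIT": "permissive",
    "Apache-2.0": "permissive",
    "CC-BY-4.0": "attribution",
    "CC-BY-SA-4.0": "share-alike",
    "GPL-3.0-only": "copyleft",
    "CC-BY-NC-4.0": "non-commercial",
}

# Categories assumed to deserve a warning badge before commercial use.
FLAGGED = {"non-commercial", "copyleft", "share-alike", "unknown"}


def license_badge(spdx_id: Optional[str]) -> tuple[str, bool]:
    """Return (category, needs_warning) for a license identifier."""
    # Missing or unrecognized licenses are treated as unknown, and flagged,
    # rather than silently assumed to be safe.
    category = LICENSE_CATEGORY.get(spdx_id or "", "unknown")
    return category, category in FLAGGED


permissive = license_badge("Apache-2.0")
restricted = license_badge("CC-BY-NC-4.0")
missing = license_badge(None)
```

Defaulting to `unknown` is the conservative choice: an entry with no declared license gets a warning instead of a clean badge.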

Robots.txt + rate limits respected

Identified User-Agent. Backoff on errors. Crawl-delay honored. Where a source publishes quotas, we stay well under them.
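
These commitments map onto standard-library tools. The sketch below uses Python's `urllib.robotparser` to check whether a path may be fetched and to read `Crawl-delay`, then doubles the wait on each retry up to a cap. The user-agent string, the default and maximum delays, and the sample robots.txt are hypothetical; this is not Datacrawlr's connector code.

```python
import urllib.robotparser

USER_AGENT = "ExampleIndexBot"  # hypothetical identified User-Agent


def load_policy(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def request_delay(
    parser: urllib.robotparser.RobotFileParser,
    attempt: int = 0,
    default_delay: float = 1.0,
    max_delay: float = 60.0,
) -> float:
    """Seconds to wait before the next request; doubles after each failed attempt."""
    crawl_delay = parser.crawl_delay(USER_AGENT)
    # Honor the site's Crawl-delay when present, else fall back to a default.
    base = float(crawl_delay) if crawl_delay is not None else default_delay
    return min(base * (2 ** attempt), max_delay)


ROBOTS = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

policy = load_policy(ROBOTS)
allowed = policy.can_fetch(USER_AGENT, "https://example.org/datasets/1")
blocked = policy.can_fetch(USER_AGENT, "https://example.org/private/x")
first_wait = request_delay(policy)              # no failures yet
retry_wait = request_delay(policy, attempt=3)   # after three failed attempts
```

With this sample file, the public path is allowed, the private path is refused, and the wait grows from 2 seconds to 16 seconds after three failures.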

Stop bookmarking datasets and models.

Explore the code on GitHub to see how it works.

View GitHub Repo →
Read about the index
