Models

Every model worth knowing about.

Open-weights from HuggingFace, commercial APIs from OpenRouter and the major vendors, benchmark scores from the Open LLM Leaderboard and Chatbot Arena. One catalog, the same shape for every row.

A directory that respects how you actually pick.

Sort by composite score, parameters, newest, popularity, cheapest output cost, or largest context window. Eight quick-filter chips cover the common slices, including frontier open-weights, free tiers, multimodal, code-focused, and latest releases.
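As an illustration of the six sort options, here is a minimal Python sketch. The ModelRow fields, the SORTS table, and the rule that rows with a missing value sink to the bottom are assumptions made for the example, not the catalog's actual schema or behavior.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ModelRow:
    # Hypothetical row shape; field names are illustrative only.
    slug: str
    composite: Optional[float] = None      # 0-100 composite score
    params_b: Optional[float] = None       # parameters, in billions
    released: Optional[date] = None
    popularity: Optional[int] = None       # e.g. download count
    output_cost: Optional[float] = None    # USD per million output tokens
    context_window: Optional[int] = None   # tokens


# Sort name -> (key function, descending?)
SORTS = {
    "composite": (lambda m: m.composite, True),
    "parameters": (lambda m: m.params_b, True),
    "newest": (lambda m: m.released, True),
    "popularity": (lambda m: m.popularity, True),
    "cheapest_output": (lambda m: m.output_cost, False),
    "largest_context": (lambda m: m.context_window, True),
}


def sort_models(rows, sort):
    """Order rows by the chosen sort; rows missing that value go last."""
    key, descending = SORTS[sort]
    known = [r for r in rows if key(r) is not None]
    unknown = [r for r in rows if key(r) is None]
    return sorted(known, key=key, reverse=descending) + unknown
```

Keeping unknown values at the end, rather than treating them as zero, avoids ranking a model with no published price as the cheapest.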

URL-synced filter strip — share a view by pasting the link.
Hover-only 'Compare' button on every card.
Composite score weighted across MMLU-Pro, GPQA, HumanEval, MATH, IFEval, and Arena ELO.
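One way a weighted composite like this could be computed is sketched below. The equal weights, the ELO range used for rescaling, and the choice to renormalize over whichever benchmarks a model actually has are all assumptions for illustration; the real weights are not stated here.

```python
# Hypothetical weights: equal weighting is an assumption, not the published formula.
WEIGHTS = {
    "mmlu_pro": 1.0,
    "gpqa": 1.0,
    "humaneval": 1.0,
    "math": 1.0,
    "ifeval": 1.0,
    "arena_elo": 1.0,
}

# Assumed ELO window used to map ratings onto a 0-100 scale.
ELO_RANGE = (800.0, 1400.0)


def normalize(name, raw):
    """Map a raw score onto 0-100 so different formats can be averaged."""
    if name == "arena_elo":
        lo, hi = ELO_RANGE
        scaled = (raw - lo) / (hi - lo) * 100.0
    else:
        scaled = raw  # percentage-style scores are already on 0-100
    return max(0.0, min(100.0, scaled))


def composite_score(scores):
    """Weighted mean over the benchmarks present; None if there are none."""
    total = 0.0
    weight_sum = 0.0
    for name, raw in scores.items():
        if name not in WEIGHTS:
            continue
        total += WEIGHTS[name] * normalize(name, raw)
        weight_sum += WEIGHTS[name]
    return total / weight_sum if weight_sum else None
```

Renormalizing over available benchmarks keeps a model with a partial scorecard comparable, at the cost of hiding how many benchmarks back the number, so a real page would likely show that count alongside the score.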

View model directory code on GitHub →

datacrawlr.com/models

Models directory list page sorted by composite score, with filter rail and result cards.

Benchmarks

Normalized, sourced, and honest about the limits.

Each benchmark page links the methodology, surfaces score format (% accuracy / pass@1 / ELO), and flags the source (Open LLM Leaderboard / Artificial Analysis / Vendor-reported / Paper-reported). The trust callout reminds you that public benchmarks are a signal, not the only signal.
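The score formats and source labels above suggest a record shape along these lines. This is a sketch only: the class and field names are hypothetical, and the label format is invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class ScoreFormat(Enum):
    ACCURACY_PCT = "% accuracy"
    PASS_AT_1 = "pass@1"
    ELO = "ELO"


class ScoreSource(Enum):
    OPEN_LLM_LEADERBOARD = "Open LLM Leaderboard"
    ARTIFICIAL_ANALYSIS = "Artificial Analysis"
    VENDOR_REPORTED = "Vendor-reported"
    PAPER_REPORTED = "Paper-reported"


@dataclass(frozen=True)
class BenchmarkScore:
    # Hypothetical record; one instance per (model, benchmark) pair.
    benchmark: str
    value: float
    fmt: ScoreFormat
    source: ScoreSource
    methodology_url: str

    @property
    def vendor_reported(self):
        """True when the score was reported by the vendor and should be flagged."""
        return self.source is ScoreSource.VENDOR_REPORTED

    def label(self):
        """Human-readable score with its format and source attached."""
        if self.fmt is ScoreFormat.ELO:
            shown = f"{self.value:.0f} ELO"
        elif self.fmt is ScoreFormat.PASS_AT_1:
            shown = f"{self.value:.1f}% pass@1"
        else:
            shown = f"{self.value:.1f}% accuracy"
        return f"{shown} ({self.source.value})"
```

Carrying the source on every score, rather than on the benchmark, lets one benchmark mix leaderboard and vendor numbers without losing track of which is which.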

View leaderboard code on GitHub →

datacrawlr.com/models/<slug>

Model detail Benchmarks tab with composite header and per-benchmark cards.

What we track per model

Four axes that actually drive decisions.

Benchmark transparency

Twelve benchmarks tracked, with their source labelled on every score. Vendor-reported scores are flagged so you know what to trust.

Pricing without lookup spreadsheets

Per-million-token input/output cost on commercial APIs, plus a calculator that compares your usage mix against the cheapest alternatives.
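The arithmetic behind such a calculator is simple enough to sketch. The Price class and function names below are hypothetical, and any prices passed in are the caller's own inputs; nothing here reflects real vendor pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    model: str
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


def usage_cost(price, input_tokens, output_tokens):
    """Cost in USD for a given mix of input and output tokens."""
    return (
        (input_tokens / 1_000_000) * price.input_per_m
        + (output_tokens / 1_000_000) * price.output_per_m
    )


def cheapest_alternatives(prices, input_tokens, output_tokens, limit=3):
    """Return (model, cost) pairs for the cheapest options, lowest cost first."""
    costed = [(p.model, usage_cost(p, input_tokens, output_tokens)) for p in prices]
    return sorted(costed, key=lambda pair: pair[1])[:limit]
```

Because the ranking depends on the input/output mix, a model with a low input price can lose to one with a low output price once the workload becomes generation-heavy.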

Side-by-side comparison

Stack up to four models across identity, architecture, benchmarks, pricing, and provenance. Highest score wins each benchmark row; cheapest output cost wins the pricing row.

License risk you can scan

Every license is classified into a risk band (low / medium / high) with commercial-use, attribution, redistribution, and modification flags.
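A classification like this can be expressed as a small rule over the four flags. The rule below is an assumed example, not the catalog's actual policy: no commercial use maps to high, restricted redistribution or modification maps to medium, and an attribution requirement alone stays low.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseFlags:
    commercial_use: bool        # may the model be used commercially?
    attribution_required: bool  # must the licensor be credited?
    redistribution: bool        # may the weights be redistributed?
    modification: bool          # may the weights be modified or fine-tuned?


def risk_band(flags):
    """Map license flags to 'low', 'medium', or 'high' (illustrative rule)."""
    if not flags.commercial_use:
        return "high"
    if not flags.redistribution or not flags.modification:
        return "medium"
    return "low"
```

Under this rule a permissive license that only asks for attribution lands in the low band, while a non-commercial license lands in high regardless of its other terms.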

Provenance

The link to the data.

Every model's training-data declarations are indexed. Open a model and see the datasets it was trained or fine-tuned on. Open a dataset and see which models picked it up. That bidirectional link is the layer that turns this into provenance, not just a listing.

Trained-on / fine-tuned-on / evaluated-on, with per-link confidence.
Cross-page navigation in both directions.
Inferred links are explicitly labelled as such — never silently surfaced.
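The bidirectional link described above can be sketched as one edge stored under two lookups. This is an in-memory illustration with hypothetical names; it says nothing about how the production graph is actually stored.

```python
from collections import defaultdict
from dataclasses import dataclass

RELATIONS = {"trained_on", "fine_tuned_on", "evaluated_on"}


@dataclass(frozen=True)
class Link:
    model: str
    dataset: str
    relation: str      # one of RELATIONS
    confidence: float  # 0.0 to 1.0, per link
    inferred: bool     # True when not declared by the model's authors


class ProvenanceIndex:
    """Stores each link once per direction so both pages can look it up."""

    def __init__(self):
        self._by_model = defaultdict(list)
        self._by_dataset = defaultdict(list)

    def add(self, link):
        if link.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {link.relation}")
        if not 0.0 <= link.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        self._by_model[link.model].append(link)
        self._by_dataset[link.dataset].append(link)

    def datasets_for(self, model):
        return list(self._by_model.get(model, []))

    def models_for(self, dataset):
        return list(self._by_dataset.get(dataset, []))


def describe(link):
    """Render a link with its inferred/declared status always visible."""
    tag = "inferred" if link.inferred else "declared"
    return f"{link.model} {link.relation} {link.dataset} ({tag}, confidence {link.confidence:.2f})"
```

Putting the inferred flag on the link itself means any view that renders a link has to decide how to show it, which is what keeps inferred links from being surfaced silently.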

View graph database code on GitHub →

datacrawlr.com/datasets/<slug>

Dataset detail page showing the 'Models trained on this dataset' rail.

Compare

Four models, every dimension.

Add models from the directory or the detail page. The comparison view stacks Identity, Architecture, Benchmarks, Pricing, and Provenance — with the winning value in each row highlighted by a left-border accent and an AI summary card up top.

Cheapest output cost wins the pricing row.
Highest score wins each benchmark row, with per-cell mini bars.
AI insight card up top summarizes what's actually different.
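The per-row winner rules can be sketched as a single helper with a direction switch. The function name and the tie handling (all tied models are returned) are assumptions made for the example.

```python
def row_winners(values, lower_is_better=False):
    """Return the model ids holding the best value in one comparison row.

    `values` maps model id -> number or None; None means "no data" and
    never wins. Ties return every tied model.
    """
    known = {model: v for model, v in values.items() if v is not None}
    if not known:
        return []
    best = min(known.values()) if lower_is_better else max(known.values())
    return [model for model, v in known.items() if v == best]
```

A benchmark row would call row_winners(scores), while the pricing row would call row_winners(output_costs, lower_is_better=True).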

View model comparison implementation on GitHub →

datacrawlr.com/models/compare?ids=<a>,<b>

Compare view with two models side by side across themed sections.
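A shareable compare link of the ?ids=<a>,<b> form could be read and written roughly as follows. This sketch uses only Python's standard urllib.parse; the helper names, the de-duplication step, and the base URL constant are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

MAX_COMPARE = 4  # the compare view stacks up to four models
BASE_URL = "https://datacrawlr.com/models/compare"  # assumed scheme and host


def parse_compare_ids(url):
    """Extract model ids from ?ids=a,b, dropping blanks and duplicates."""
    query = parse_qs(urlsplit(url).query)
    raw = ",".join(query.get("ids", []))
    ids = []
    for part in raw.split(","):
        slug = part.strip()
        if slug and slug not in ids:
            ids.append(slug)
    return ids[:MAX_COMPARE]


def build_compare_url(ids):
    """Build the shareable link; commas are left unescaped for readability."""
    query = urlencode({"ids": ",".join(ids[:MAX_COMPARE])}, safe=",")
    return f"{BASE_URL}?{query}"
```

Keeping the whole selection in the query string means the link itself is the saved view, with no server-side state to expire.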

Two model sources, official feeds only

HuggingFace for open weights, OpenRouter for the commercial pricing + provider catalog.

HuggingFace Models

Largest open registry of model weights, configs, and cards.

huggingface.co/models

OpenRouter

Unified pricing + provider catalog for commercial-API models.

openrouter.ai

Compliance posture

Index, don't mirror.

Our architecture is the compliance story. Four commitments that apply to every entry in the catalog.

API-first ingestion

Every connector hits the source's official API or a structured open feed (DCAT, OAI-PMH, schema.org). We never scrape behind authentication or paywalls.

Metadata only — no mirroring

We index what a dataset or model is — not the bytes themselves. Every page links to the original host; the host stays the source of truth.

License-aware by default

Each entry is tagged with its license category and use terms. License risk badges surface non-commercial and restrictive terms before you build on them.

Robots.txt + rate limits respected

Identified User-Agent. Backoff on errors. Crawl-delay honored. Where a source publishes quotas, we stay well under them.
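As a sketch of what honoring robots.txt and backing off can look like, the example below uses Python's standard urllib.robotparser. The user-agent string, the default delay, and the backoff constants are hypothetical values, not Datacrawlr's actual crawler settings.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-indexer"  # hypothetical identified user agent


def polite_delay(robots_txt, url, default_delay=1.0):
    """Return seconds to wait before fetching `url`, or None if disallowed.

    `robots_txt` is the already-downloaded text of the site's robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(USER_AGENT, url):
        return None
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_delay


def backoff_seconds(attempt, base=1.0, cap=60.0):
    """Exponential backoff: base * 2**attempt, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))
```

A fetch loop would skip any URL for which polite_delay returns None, sleep for the returned delay otherwise, and sleep for backoff_seconds(attempt) after each failed request before retrying.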

Benchmarks aren't the whole story

Public benchmarks are subject to contamination, methodology drift, and overfitting. For production decisions we recommend running a private holdout evaluation on prompts that match your actual workload. Datacrawlr is the place you start that decision — not the place it ends.

Pick the right weights without spreadsheets.

Explore the code on GitHub to see how it works.

View GitHub Repo →

See dataset coverage
