Benchmarks

Six public benchmarks. Real numbers.

We run ckem on the benchmarks the retrieval community already agrees on — multi-hop reasoning, multi-step composition, reasoning-intensive search, long-document QA, legal contract retrieval, and financial QA. Same corpus, same queries, same judgement files as everyone else.

Two of those — LegalBench-RAG and FinanceBench — live on their own pages because the corpora and the pitch are domain-specific.


General · multi-hop reasoning

MultiHop-RAG

609 news-corpus questions that each require composing evidence from multiple documents: inference, comparison, temporal, and null-answer queries.

ckem vs. best published

Hit@10: ckem 98.7% vs. best published 74.7%
Hit@4: ckem 92.7% vs. best published 66.2%
MRR: ckem 80.3 vs. best published 58.6
MAP: ckem 60.6 vs. best published 47.9

What this tells us: when the answer requires composing evidence across documents, the typed graph keeps the right passages within reach instead of burying them under near-duplicates.
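For readers who want to check figures like these against their own runs, the metrics can be sketched in a few lines. This is a minimal illustration, not ckem's evaluation script: the function names are hypothetical, the cut-off of 10 for MRR and MAP is an assumption, and the MAP normalisation shown is one common convention that may differ from the benchmark's official scorer. Scores here are on a 0 to 1 scale; the figures on this page appear to be the same quantities multiplied by 100.

```python
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant passage appears in the top k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant passage within the top k, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Mean of precision values at each relevant hit in the top k.

    Normalised by min(len(relevant), k); other conventions exist.
    """
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


# Toy example: two gold passages, found at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(hit_at_k(ranked, relevant, 1), hit_at_k(ranked, relevant, 4))
print(reciprocal_rank(ranked, relevant), average_precision(ranked, relevant))
```

The benchmark-level numbers are the mean of these per-query values over all queries.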

Ongoing work: MultiHop-RAG's corpus is short, single-topic news articles — easier territory than the long, multi-topic contracts and filings the domain benchmarks below cover. Hit@1 at 67% leaves the most room: when two articles cover overlapping sub-claims, the second-best passage is often also correct, which the typed graph could resolve by following derived_from edges. A learned reranker over the top-30 candidates is the next lever; a news-domain LoRA on top of the Qwen3 encoder is queued behind it.

General · multi-step composition

MuSiQue-Ans (dev)

2–4 hop questions built so the bridging entity has to be recovered before the answer can be, designed to resist shortcut answering from a single passage.

ckem vs. MiniLM baseline

Hit@1: ckem 86.6% vs. MiniLM baseline 48%
Hit@5: ckem 97.9% vs. MiniLM baseline 60%
Retrieval F1: ckem 50.9 vs. MiniLM baseline 30

What this tells us: the gain over a strong single-vector encoder isn't in Hit@5 — both methods can find the right passage given five chances. It's in Hit@1: the right passage ranked first.

Ongoing work: Retrieval F1 at 50.9 is well above baseline but still leaves room. The remaining failures are chains where the bridging entity is lexically dissimilar from the question, for example “the composer of the score Spielberg used for his shark film.” Sub-question decomposition (split the hop, retrieve each half, intersect) and a small graph-walk over typed entity edges before the second-hop encode are the two interventions on the bench.

General · reasoning-intensive retrieval

BRIGHT (biology)

A deliberately hard retrieval benchmark: queries where lexical and semantic similarity are not enough, and the relevant passage requires reasoning to identify. We run the biology subset.

ckem vs. best published

nDCG@10: ckem 23.2 vs. best published 17.6

What this tells us: BRIGHT is the benchmark that punishes pure embedding-similarity approaches. Numbers across the field are low here — beating the strongest published baseline by 5+ points is meaningful.
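As a reference point for the nDCG@10 figure, here is a minimal sketch of the metric. The function names are hypothetical and this is not ckem's scoring code; it uses linear gain, which coincides with the exponential-gain variant when relevance labels are binary. The result is on a 0 to 1 scale, so a reported 23.2 would correspond to 0.232.

```python
import math
from typing import Dict, Sequence


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked: Sequence[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of an ideal ranking."""
    gains = [relevance.get(doc, 0.0) for doc in ranked]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Toy example: two relevant passages, retrieved at ranks 2 and 3.
relevance = {"d2": 1.0, "d9": 1.0}
print(round(ndcg_at_k(["d5", "d2", "d9"], relevance), 3))  # about 0.693
print(ndcg_at_k(["d2", "d9", "d5"], relevance))            # ideal order scores 1.0
```

Because the discount is logarithmic, moving a relevant passage from rank 2 to rank 1 matters far more than moving it from rank 9 to rank 8, which is why low scores across the field still separate systems meaningfully.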

Ongoing work: BRIGHT is the toughest of our public benchmarks; the queries are designed to defeat embedding similarity. The 23.2 nDCG@10 here is from a stronger encoder configuration than the default ckem stack; folding those gains into the main pipeline (so the rest of the table moves with it) is in flight. The other 11 BRIGHT sub-tasks (Earth Science, Economics, StackExchange, etc.) are queued. A reasoning-aware reranker, which passes the candidate and the original query back through the encoder with chain-of-thought, is the next algorithmic step.

On the bench · next sprint

Upcoming benchmarks

Runs queued for the next benchmark sprint. Numbers post here when they land.

General · long-document QA

LongRAG

Upcoming

QA over long retrieval units (full Wikipedia documents instead of short passages), designed to test whether a retrieval system keeps the relevant document on top when units are large.

4–8K-token retrieval units are exactly the regime where averaging-into-noise hurts flat retrieval most. Ingest sprint is mid-flight; we expect this to move the most as the chunking strategy stabilises.
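The averaging effect can be illustrated with a deliberately idealised toy model. The assumptions are ours, not a description of ckem or of any real encoder: a long unit covers 16 topics, each topic is an orthogonal unit vector, and the unit's single embedding is their mean. Under those assumptions, the similarity between a one-topic query and the whole unit falls to 1/sqrt(16).

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def one_hot(index: int, dim: int) -> List[float]:
    """Unit vector standing in for a single, fully distinct topic."""
    return [1.0 if j == index else 0.0 for j in range(dim)]


num_topics = 16  # hypothetical number of distinct topics in one long unit
topics = [one_hot(i, num_topics) for i in range(num_topics)]
query = topics[0]                # the query is about exactly one topic
unit_vector = mean_pool(topics)  # one flat embedding for the whole unit

print(cosine(query, topics[0]))    # the matching short passage: 1.0
print(cosine(query, unit_vector))  # the averaged long unit: 0.25
```

Real embeddings are neither orthogonal nor simple means, so the effect is less clean in practice, but the direction is the same: the more unrelated material a single vector has to summarise, the weaker its match to any one question.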

General · grounded visual QA

CiteVQA

Upcoming

Visual question answering with span-level citations into the source page. Every answer has to point at the exact figure, caption, or table cell that supports it — a retrieval problem dressed as a VQA problem.

On the bench: extending the typed graph across page-image regions so a figure caption, the figure itself, and the paragraph that references it stay linked through ingest. Numbers post when the page-region encoder is wired through.

Domain benchmarks

LegalBench-RAG and FinanceBench live on their own pages.

The corpora are domain-specific — long contracts in one case, full 10-K filings in the other — and the pitch is different enough that we keep the numbers next to the surrounding context.

Legal

Live

LegalBench-RAG →

776 query-answer pairs over PrivacyQA, CUAD, MAUD, and ContractNLI.

Finance

Live

FinanceBench →

150 questions over 84 10-K filings (PatronusAI).

Upcoming domain benchmarks

Corpora and judgement files in evaluation. Pages and numbers post when the runs land.

Healthcare

Upcoming

MedQA-RAG

Clinical-guideline retrieval over UpToDate-style references and discharge notes. Span-level evidence requirement.

Pharma & life sciences

Upcoming

ClinTrialBench

Retrieval over trial protocols, FDA labels, and dossiers; inclusion / exclusion criteria with effective-date provenance.

Engineering & compliance

Upcoming

SpecBench

QA over engineering specs, SOPs, and audit trails — long, multi-section technical documents with cross-references.

Public records

Upcoming

CaseLaw-RAG

Statute and case-law retrieval with citation chains. Supersession and amendment modelled as first-class edges.

Methodology

1. Public benchmark, untouched. We use the upstream corpus and judgement files verbatim. No re-labelling, no question filtering, no held-out subsets we picked ourselves.
2. Same encoder where it matters. When we compare against a single-vector baseline, both sides use the same open-source embedding model — the variable is ckem's retrieval layer, not the encoder.
3. Held-out queries. No benchmark query is in the training or tuning loop.
4. Reproducible by us. Scripts and configs available for shared review during a pilot. Underlying corpora are shared subject to their original licensing.
5. Ongoing work disclosed. Each benchmark notes the open metrics and the specific interventions in flight — fine-tuned encoders, learned rerankers, query decomposition, graph-aware traversal. Where ckem ties a baseline or a metric sits below where we want it, we say so.
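The held-out requirement in point 3 is the kind of claim that can be checked mechanically. A minimal sketch follows, assuming queries are available as plain strings; the function names and the normalisation rule (lowercase, collapsed whitespace) are illustrative assumptions rather than the checks ckem actually runs, and exact matching will not catch paraphrased leakage.

```python
import re
from typing import Iterable, Set


def normalise(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", query.strip().lower())


def leaked_queries(benchmark: Iterable[str], training: Iterable[str]) -> Set[str]:
    """Return benchmark queries that also appear in the training or tuning set."""
    seen_in_training = {normalise(q) for q in training}
    return {q for q in benchmark if normalise(q) in seen_in_training}


# Toy example with made-up queries.
benchmark = ["Who wrote the report?", "Which filing came first?"]
training = ["who  wrote the report?", "What changed in the contract?"]
print(leaked_queries(benchmark, training))  # {'Who wrote the report?'}
```

An empty result is necessary but not sufficient evidence of a clean split; near-duplicate detection would be a natural next step.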

Bench against your own corpus.

Bring 50–200 documents and a labeled query set. We'll run ckem against your current retrieval and share the comparison.

See the legal pitch →