ckem

Benchmarks

Six public benchmarks. Real numbers.

We run ckem on the benchmarks the retrieval community already agrees on — multi-hop reasoning, multi-step composition, reasoning-intensive search, long-document QA, legal contract retrieval, and financial QA. Same corpus, same queries, same judgement files as everyone else.

Two of those — LegalBench-RAG and FinanceBench — live on their own pages because the corpora and the pitch are domain-specific.

General · multi-hop reasoning

MultiHop-RAG

Questions over a corpus of 609 news articles, each requiring evidence composed from multiple documents: inference, comparison, temporal, and null-answer queries.

ckem vs. best published
  • Hit@10: ckem 98.7% vs. best published 74.7%
  • Hit@4: ckem 92.7% vs. best published 66.2%
  • MRR: ckem 80.3 vs. best published 58.6
  • MAP: ckem 60.6 vs. best published 47.9

What this tells us: when the answer requires composing evidence across documents, the typed graph keeps the right passages within reach instead of burying them under near-duplicates.

Ongoing work: MultiHop-RAG's corpus is short, single-topic news articles — easier territory than the long, multi-topic contracts and filings the domain benchmarks below cover. Hit@1 at 67% leaves the most room: when two articles cover overlapping sub-claims, the second-best passage is often also correct, which the typed graph could resolve by following derived_from edges. A learned reranker over the top-30 candidates is the next lever; a news-domain LoRA on top of the Qwen3 encoder is queued behind it.
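For readers who want to sanity-check figures like these on their own runs, here is a minimal sketch of how Hit@k, MRR, and MAP are conventionally computed for one query. It uses the textbook definitions, which may differ in detail from the benchmark's own evaluation script; the function names and the toy ranking are illustrative assumptions, not part of ckem.

```python
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant passage appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@rank, taken at each rank holding a relevant passage."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy query (hypothetical IDs): two relevant passages, found at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(hit_at_k(ranked, relevant, 1))        # 0.0
print(hit_at_k(ranked, relevant, 4))        # 1.0
print(reciprocal_rank(ranked, relevant))    # 0.5
print(average_precision(ranked, relevant))  # 0.5
```

The reported Hit@k, MRR, and MAP are these per-query values averaged over all queries, with MRR and MAP scaled to 0–100.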

General · multi-step composition

MuSiQue-Ans (dev)

2–4 hop questions built so the bridging entity has to be recovered before the answer can be, designed to resist shortcut answering from a single passage.

ckem vs. MiniLM baseline
  • Hit@1: ckem 86.6% vs. MiniLM baseline 48%
  • Hit@5: ckem 97.9% vs. MiniLM baseline 60%
  • Retrieval F1: ckem 50.9 vs. MiniLM baseline 30

What this tells us: the gain over a strong single-vector encoder isn't in Hit@5 — both methods can find the right passage given five chances. It's in Hit@1: the right passage ranked first.
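The page does not spell out how Retrieval F1 is defined, so the sketch below assumes the common set-level reading: the harmonic mean of precision and recall of the retrieved passages against the gold supporting passages for a question. The function name and passage IDs are hypothetical.

```python
from typing import Iterable


def retrieval_f1(retrieved: Iterable[str], gold: Iterable[str]) -> float:
    """Set-level F1 between retrieved passages and gold supporting passages."""
    retrieved_set, gold_set = set(retrieved), set(gold)
    true_positives = len(retrieved_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy 2-hop question: one of the two gold passages is among three retrieved.
score = retrieval_f1(["p1", "p4", "p9"], ["p1", "p2"])
print(round(score, 3))  # 0.4
```

Under this reading, a system can score well on Hit@k and still lose F1 points by missing one hop of the chain, which is why the two metrics are reported side by side.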

Ongoing work: Retrieval F1 at 50.9 is well above the baseline's 30 but still leaves room. The remaining failures are chains where the bridging entity is lexically dissimilar from the question, for example “the composer of the score Spielberg used for his shark film.” Sub-question decomposition (split the hop, retrieve each half, intersect) and a small graph-walk over typed entity edges before the second-hop encode are the two interventions on the bench.
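To make "split the hop, retrieve each half, intersect" concrete, here is a toy sketch under stated assumptions: the sub-questions are written by hand (in practice a model would produce them), and the retriever is a simple token-overlap scorer standing in for a real encoder. None of these names come from ckem; they are illustrative only.

```python
import re
from typing import Dict, List


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: Dict[str, str], k: int) -> Dict[str, int]:
    """Toy retriever: score is the shared-token count; returns top-k hits."""
    q = tokens(query)
    scores = {doc_id: len(q & tokens(text)) for doc_id, text in corpus.items()}
    top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return {doc_id: score for doc_id, score in top if score > 0}


def decompose_and_intersect(sub_questions: List[str],
                            corpus: Dict[str, str], k: int = 2) -> List[str]:
    """Retrieve per sub-question, keep passages every hop agrees on."""
    per_hop = [retrieve(q, corpus, k) for q in sub_questions]
    if not per_hop:
        return []
    shared = set.intersection(*(set(hits) for hits in per_hop))
    # Fall back to the union when no passage satisfies every hop.
    candidates = shared or set().union(*per_hop)

    def combined(doc_id: str) -> int:
        return sum(hits.get(doc_id, 0) for hits in per_hop)

    return sorted(candidates, key=lambda d: (-combined(d), d))


corpus = {
    "jaws": "Jaws is a 1975 shark film directed by Steven Spielberg "
            "with a score by John Williams.",
    "williams": "John Williams is a composer known for film scores.",
    "et": "E.T. is a 1982 film directed by Steven Spielberg.",
    "sharks": "Sharks are cartilaginous fish.",
}
sub_questions = [
    "Which shark film did Spielberg direct?",
    "Who composed the score for that film?",
]
print(decompose_and_intersect(sub_questions, corpus, k=2))  # ['jaws']
```

The intersection surfaces the one passage that satisfies both hops, even though neither sub-question mentions the bridging entity by name.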

General · reasoning-intensive retrieval

BRIGHT (biology)

A deliberately hard retrieval benchmark: queries where lexical and semantic similarity are not enough, and identifying the relevant passage requires reasoning. We run the biology subset.

ckem vs. best published
  • nDCG@10: ckem 23.2 vs. best published 17.6

What this tells us: BRIGHT is the benchmark that punishes pure embedding-similarity approaches. Numbers across the field are low here — beating the strongest published baseline by 5+ points is meaningful.
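For reference, a minimal sketch of nDCG@10 for a single query, using the linear-gain formulation (with binary relevance labels it coincides with the exponential-gain variant). This is the standard definition, not ckem's or the benchmark's evaluation code, and the document IDs are made up.

```python
import math
from typing import Dict, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked: Sequence[str], relevance: Dict[str, float],
              k: int = 10) -> float:
    """DCG of the top-k ranking, normalised by the best achievable DCG."""
    gains = [relevance.get(doc, 0.0) for doc in ranked[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Two relevant documents, retrieved at ranks 1 and 3.
relevance = {"d1": 1.0, "d2": 1.0}
print(round(ndcg_at_k(["d2", "d5", "d1"], relevance), 4))  # 0.9197
print(ndcg_at_k(["d1", "d2", "d5"], relevance))            # 1.0
```

Scores like the 23.2 above are this per-query value averaged over all queries and scaled to 0–100.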

Ongoing work: BRIGHT is the toughest of our public benchmarks; the queries are designed to defeat embedding similarity. The 23.2 nDCG@10 here is from a stronger encoder configuration than the default ckem stack; folding those gains into the main pipeline (so the rest of the table moves with it) is in flight. The other 11 BRIGHT sub-tasks (Earth Science, Economics, StackExchange, etc.) are queued; a reasoning-aware reranker, which passes the candidate and the original query back through the encoder with chain-of-thought, is the next algorithmic step.

On the bench · next sprint

Upcoming benchmarks

Runs queued for the next benchmark sprint. Numbers post here when they land.

General · long-document QA

LongRAG

Upcoming

QA over long retrieval units (full Wikipedia documents instead of short passages), designed to test whether a retrieval system keeps the relevant document on top when units are large.

4–8K-token retrieval units are exactly the regime where averaging-into-noise hurts flat retrieval most. Ingest sprint is mid-flight; we expect this to move the most as the chunking strategy stabilises.

General · grounded visual QA

CiteVQA

Upcoming

Visual question answering with span-level citations into the source page. Every answer has to point at the exact figure, caption, or table cell that supports it — a retrieval problem dressed as a VQA problem.

On the bench: extending the typed graph across page-image regions so a figure caption, the figure itself, and the paragraph that references it stay linked through ingest. Numbers post when the page-region encoder is wired through.

Domain benchmarks

LegalBench-RAG and FinanceBench live on their own pages.

The corpora are domain-specific — long contracts in one case, full 10-K filings in the other — and the pitch is different enough that we keep the numbers next to the surrounding context.

Upcoming domain benchmarks

Corpora and judgement files in evaluation. Pages and numbers post when the runs land.

Healthcare
Upcoming
MedQA-RAG

Clinical-guideline retrieval over UpToDate-style references and discharge notes. Span-level evidence requirement.

Pharma & life sciences
Upcoming
ClinTrialBench

Retrieval over trial protocols, FDA labels, and dossiers; inclusion / exclusion criteria with effective-date provenance.

Engineering & compliance
Upcoming
SpecBench

QA over engineering specs, SOPs, and audit trails — long, multi-section technical documents with cross-references.

Public records
Upcoming
CaseLaw-RAG

Statute and case-law retrieval with citation chains. Supersession and amendment modelled as first-class edges.

Methodology

  1. Public benchmark, untouched. We use the upstream corpus and judgement files verbatim. No re-labelling, no question filtering, no held-out subsets we picked ourselves.
  2. Same encoder where it matters. When we compare against a single-vector baseline, both sides use the same open-source embedding model; the variable is ckem's retrieval layer, not the encoder.
  3. Held-out queries. No benchmark query is in the training or tuning loop.
  4. Reproducible by us. Scripts and configs available for shared review during a pilot. Underlying corpora are shared subject to their original licensing.
  5. Ongoing work disclosed. Each benchmark notes the open metrics and the specific interventions in flight: fine-tuned encoders, learned rerankers, query decomposition, graph-aware traversal. Where ckem ties a baseline or a metric sits below where we want it, we say so.
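One way a held-out guarantee like the one in the list above can be spot-checked is to fingerprint queries after light normalisation and look for collisions between the benchmark set and the training or tuning set. The sketch below is an assumption about how such a check might look, not a description of ckem's tooling; it only catches exact matches after normalisation, not paraphrases.

```python
import hashlib
import re
from typing import Iterable, List


def fingerprint(query: str) -> str:
    """Hash of the query with case, punctuation, and extra spaces removed."""
    no_punct = re.sub(r"[^\w\s]", "", query.lower())
    normalized = re.sub(r"\s+", " ", no_punct).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_queries(benchmark_queries: Iterable[str],
                   training_queries: Iterable[str]) -> List[str]:
    """Benchmark queries whose fingerprint also occurs in the training set."""
    seen = {fingerprint(q) for q in training_queries}
    return [q for q in benchmark_queries if fingerprint(q) in seen]


# Hypothetical queries: the first benchmark query leaks despite formatting.
benchmark = ["Who wrote the report?", "Which filing cites the merger?"]
training = ["who  wrote the report", "What is the notice period?"]
print(leaked_queries(benchmark, training))  # ['Who wrote the report?']
```

An empty result is necessary but not sufficient evidence that the benchmark is held out; near-duplicate detection would need a fuzzier comparison.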

Bench against your own corpus.

Bring 50–200 documents and a labeled query set. We'll run ckem against your current retrieval and share the comparison.
