Solutions · Finance

Grounded QA over 10-Ks, not vibes.

A 10-K runs 200–400 pages. The number that answers the question is in one paragraph or one row of one table — usually footnoted, often restated since the prior year, sometimes only implicit in a segment breakdown. A flat-vector retrieval pipeline averages those distinctions away.

ckem returns the span with its source — the filing, the section, the table, the page — so the model that writes the answer can't quietly hallucinate around a missing citation.

See the numbers →

The problem

Why financial QA is a retrieval problem first.

Tables and prose say different things.
Total revenue lives in the consolidated income statement. Segment revenue lives in the MD&A. Revenue recognition policy lives in Note 1. A question about “cloud revenue” needs the segment table and the policy note — not the totals row.
Year-over-year restatements.
The same line item gets restated in the next year's 10-K. Supersession is real; ckem models it as a first-class edge so the current filing surfaces by default and prior statements stay queryable with their original context.
Footnotes carry the answer.
The number in the table is the GAAP number; the footnote explains a one-time charge that flips the comparison. Fine-grained ranking keeps the footnote retrievable instead of averaging it into the surrounding 80-page section.
The citation is the audit trail.
A financial analyst answering on top of an LLM can't ship a number without a pointer to where it came from. ckem returns the passage and its provenance — the merge history, the parent document, the section — so the citation falls out of retrieval, not out of post-hoc reconciliation.

What ckem does

Built for filings, not for product pages.

Span-level retrieval. Indexed at the passage level, with table rows treated as first-class units. The matched span carries its section heading and parent filing.
Provenance on every answer. Each retrieval result resolves to filing → section → span. Every auto-merge writes derived_from edges; originals stay retrievable for audit.
Supersession edges. Newer filings supersede prior-period statements, and queries default to the current filing. Pin a fiscal year and get the corpus as it stood at that filing.
Your filings stay yours. Local embeddings by default. Self-host via Docker, run in your own AWS account with the included Terraform, or managed. Same code path; you pick who operates it.
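
The provenance chain and supersession edges described above can be sketched in a few lines. This is an illustrative model under stated assumptions, not ckem's API: the `Span` class, the `resolve` and `cite` helpers, and the Acme figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """A retrieved passage with the provenance chain filing -> section -> span."""
    span_id: str
    issuer: str
    fiscal_year: int                  # fiscal year of the filing the span came from
    section: str
    text: str
    supersedes: Optional[str] = None  # span_id of the prior-period statement, if any


def resolve(spans: list[Span], as_of_year: Optional[int] = None) -> list[Span]:
    """Drop spans that are superseded by another span visible at `as_of_year`.

    With no year pinned, every span is visible, so only the latest statement
    of each line item survives. Pinning a year hides later filings, so the
    statement that was current at that time comes back.
    """
    visible = [s for s in spans if as_of_year is None or s.fiscal_year <= as_of_year]
    superseded = {s.supersedes for s in visible if s.supersedes is not None}
    return [s for s in visible if s.span_id not in superseded]


def cite(span: Span) -> str:
    """Render the provenance chain as a citation string."""
    return f"{span.issuer} 10-K FY{span.fiscal_year} > {span.section} > {span.span_id}"


# Hypothetical corpus: the FY2023 filing restates a figure first reported in FY2022.
corpus = [
    Span("acme-2022-seg-3", "Acme", 2022, "MD&A / Segment results",
         "Cloud revenue: $410M"),
    Span("acme-2023-seg-3", "Acme", 2023, "MD&A / Segment results",
         "Cloud revenue (FY2022, restated): $395M", supersedes="acme-2022-seg-3"),
]

current = resolve(corpus)                      # default view: the restated figure
as_filed_2022 = resolve(corpus, as_of_year=2022)  # pinned view: the original figure

print([cite(s) for s in current])
print([cite(s) for s in as_filed_2022])
```

In this sketch the original span is never deleted. It is only filtered out of the default view, which is what keeps the prior statement queryable for audit.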

Measured on FinanceBench

The benchmark, not a demo.

FinanceBench is a 150-question evaluation set assembled by Patronus AI over 84 public 10-K filings. It covers domain-relevant, novel-generated, and metrics-generated questions, each with a ground-truth answer and a cited evidence span. The full write-up is on arXiv (2311.11944) and the dataset lives on Hugging Face. We run the upstream questions and judgement files verbatim.

ckem · retrieval (150 questions)

Hit@1	67.3%
Hit@5	95.3%
Hit@10	98.7%

ckem vs. published baselines

Metric	ckem	Best published
Recall@5	95.3%	81.6%
Hit@10	98.7%	80.5%

Recall@5 baseline: hybrid retrieval + neural reranker on FinanceBench documents (arXiv:2603.16877). Hit@10 baseline: VisionRAG's reported Accuracy@10 on FinanceBench. Text-only single-vector stores in the original paper sit about 30 percentage points lower than the VisionRAG line.

ckem Hit@10 by question class

Domain-relevant	100.0%
Novel-generated	100.0%
Metrics-generated	96.0%

What each question class tests

Class	What it tests
Domain-relevant	Direct lookup of a stated fact in the filing.
Novel-generated	Synthesis across sections; no single span fully answers.
Metrics-generated	Numbers from one or more tables, then arithmetic.

Reading the numbers: These are retrieval metrics. Hit@K measures whether the ground-truth 10-K filing lands in the top-K results. Domain-relevant and novel-generated questions retrieve perfectly at K=10; metrics-generated questions (cross-table arithmetic) sit at 96.0% because the gold evidence is a specific table cell that embeds close to neighbouring filings from the same issuer.
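
For readers who want the metric pinned down, here is a minimal sketch of Hit@K as defined above. The function name `hit_at_k`, the filing IDs, and the three-question toy data are illustrative assumptions. They are not FinanceBench data and not ckem's evaluation harness.

```python
def hit_at_k(ranked: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of questions whose gold filing appears in the top-k results."""
    if not gold or len(ranked) != len(gold):
        raise ValueError("need one ranked list per gold label, and at least one question")
    hits = sum(1 for results, g in zip(ranked, gold) if g in results[:k])
    return hits / len(gold)


# Toy data: ranked filing IDs returned for each of three questions.
ranked = [
    ["acme-2023", "acme-2022", "globex-2023"],  # gold at rank 1
    ["acme-2022", "acme-2023", "globex-2023"],  # gold at rank 2: same issuer, wrong year first
    ["globex-2022", "acme-2021", "acme-2020"],  # gold never retrieved
]
gold = ["acme-2023", "acme-2023", "globex-2023"]

for k in (1, 2, 3):
    print(f"Hit@{k} = {hit_at_k(ranked, gold, k):.3f}")
```

The second toy question shows the failure mode described above: a filing from the same issuer but a different year takes rank 1, so the question counts as a miss at K=1 and a hit at K=2.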

Ongoing work: End-to-end answer accuracy depends on the reader sitting on top of retrieval, which isn't the variable we're testing here. The retrieval gap left to close is on the metrics-generated subset: filings from the same company across years embed close together, and the same table cell from an adjacent year or quarter can outrank the right one. Two workstreams target this: issuer-and-period-aware query expansion (HyDE-style decomposition that conditions on company + filing date), and a financial-tables LoRA on top of the Qwen3 encoder, fine-tuned to separate table cells from prose. A cross-encoder reranker over the top-30 candidates is the third lever.
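
The reranking lever can be sketched generically: take the first-stage candidates, rescore the top 30 against the query with a pairwise scorer, and re-sort. Everything here is an assumption for illustration. The `rerank` helper and the passages are hypothetical, and `token_overlap` is a toy stand-in for a real cross-encoder model, which would score the query and passage jointly.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 30) -> list[str]:
    """Rescore the first `top_n` candidates against the query and re-sort them.

    Candidates beyond `top_n` keep their first-stage order and stay below
    the reranked head.
    """
    head, tail = candidates[:top_n], candidates[top_n:]
    # sorted() is stable, so ties keep their first-stage order.
    head = sorted(head, key=lambda passage: score(query, passage), reverse=True)
    return head + tail


def token_overlap(query: str, passage: str) -> float:
    """Toy scorer: share of query tokens that also appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


query = "cloud revenue fy2023"
first_stage = [
    "cloud revenue fy2022 segment table",     # right line item, wrong year
    "cloud revenue fy2023 segment table",     # the passage we want
    "total revenue fy2023 income statement",  # right year, wrong line item
]

reranked = rerank(query, first_stage, token_overlap)
print(reranked[0])
```

Capping the rescored set at `top_n` bounds the cost of the expensive scorer, since each candidate needs its own forward pass through the model.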

In practice

Where teams put ckem in their finance stack.

Analyst copilots over filings.
An equity analyst asks a question, the agent answers with the span and the cited filing. The analyst clicks through to the source paragraph — not a generic page, the exact span ckem returned.
Diligence over a portfolio.
Ask 150 standardised questions across 84 filings and you get a tractable spreadsheet — provided the retrieval layer doesn't lose the answers in the corpus. FinanceBench is the head of that distribution.
Compliance and disclosure checks.
“Did the company disclose this risk factor in the prior year's 10-K?” The supersession graph makes year-over-year comparison a first-class operation, not a regex.

Run ckem on your filings.

Bring a folder of 10-Ks (or any analyst-facing corpus) and a labeled question set. We'll run ckem against your current retrieval and walk through the graph together over MCP.

See all benchmarks →
