ckem

Solutions · Finance

Grounded QA over 10-Ks, not vibes.

A 10-K runs 200–400 pages. The number that answers the question is in one paragraph or one row of one table — usually footnoted, often restated since the prior year, sometimes only implicit in a segment breakdown. A flat-vector retrieval pipeline averages those distinctions away.

ckem returns the span with its source — the filing, the section, the table, the page — so the model that writes the answer can't quietly hallucinate around a missing citation.

The problem

Why financial QA is a retrieval problem first.

  • Tables and prose say different things.

    Total revenue lives in the consolidated income statement. Segment revenue lives in the MD&A. Revenue recognition policy lives in Note 1. A question about “cloud revenue” needs the segment table and the policy note — not the totals row.

  • Year-over-year restatements.

    The same line item gets restated in the next year's 10-K. Supersession is real; ckem models it as a first-class edge so the current filing surfaces by default and prior statements stay queryable with their original context.

  • Footnotes carry the answer.

    The number in the table is the GAAP number; the footnote explains a one-time charge that flips the comparison. Fine-grained ranking keeps the footnote retrievable instead of averaging it into the surrounding 80-page section.

  • The citation is the audit trail.

    A financial analyst answering on top of an LLM can't ship a number without a pointer to where it came from. ckem returns the passage and its provenance — the merge history, the parent document, the section — so the citation falls out of retrieval, not out of post-hoc reconciliation.

What ckem does

Built for filings, not for product pages.

  • Span-level retrieval. Indexed at the passage level, with table rows treated as first-class units. The matched span carries its section heading and parent filing.
  • Provenance on every answer. Each retrieval result resolves to filing → section → span. Every auto-merge writes derived_from edges; originals stay retrievable for audit.
  • Supersession edges. Restated figures supersede prior-period statements, and the graph defaults queries to the current filing. Pin a fiscal year and get the corpus as it stood at that filing.
  • Your filings stay yours. Local embeddings by default. Self-host via Docker, run in your own AWS account with the included Terraform, or use the managed option. Same code path; you pick who operates it.
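
One way to picture the provenance and supersession behaviour listed above is a small in-memory graph. This is a minimal sketch under stated assumptions, not ckem's API: the names `Span`, `FilingGraph`, `resolve` and `provenance` are hypothetical, and the issuer and figures are toy data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Span:
    """A retrievable passage or table row plus where it came from (hypothetical shape)."""
    span_id: str
    filing: str
    fiscal_year: int  # fiscal year of the filing that contains the span
    section: str
    text: str


@dataclass
class FilingGraph:
    """Toy graph: spans keyed by ID, plus 'superseded by' edges between them."""
    spans: Dict[str, Span] = field(default_factory=dict)
    superseded_by: Dict[str, str] = field(default_factory=dict)  # old ID -> new ID

    def add(self, span: Span, supersedes: Optional[str] = None) -> None:
        self.spans[span.span_id] = span
        if supersedes is not None:
            self.superseded_by[supersedes] = span.span_id

    def resolve(self, span_id: str, as_of_year: Optional[int] = None) -> Span:
        """Follow supersession edges to the newest span, stopping at a pinned fiscal year."""
        current = self.spans[span_id]
        while current.span_id in self.superseded_by:
            newer = self.spans[self.superseded_by[current.span_id]]
            if as_of_year is not None and newer.fiscal_year > as_of_year:
                break  # the restatement was filed after the pinned year
            current = newer
        return current

    def provenance(self, span_id: str) -> str:
        """Render the filing > section > span chain for a citation."""
        s = self.spans[span_id]
        return f"{s.filing} > {s.section} > {s.span_id}"


# Toy data: a FY2022 revenue row that the FY2023 filing restates.
graph = FilingGraph()
graph.add(Span("rev-fy2022", "EXAMPLE 10-K FY2022", 2022,
               "Consolidated income statement", "Revenue FY2022: 100"))
graph.add(Span("rev-fy2022-restated", "EXAMPLE 10-K FY2023", 2023,
               "Consolidated income statement", "Revenue FY2022 (restated): 97"),
          supersedes="rev-fy2022")

latest = graph.resolve("rev-fy2022")                   # default: current filing
pinned = graph.resolve("rev-fy2022", as_of_year=2022)  # corpus as it stood in FY2022
print(graph.provenance(latest.span_id))
print(graph.provenance(pinned.span_id))
```

The original span is never deleted in this sketch; the restatement only adds an edge, which is what keeps the prior statement queryable with its own context.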

Measured on FinanceBench

The benchmark, not a demo.

FinanceBench is a 150-question evaluation set assembled by Patronus AI over 84 public 10-K filings. It covers domain-relevant, novel-generated, and metrics-generated questions, each with a ground-truth answer and a cited evidence span. The full write-up is on arXiv (2311.11944) and the dataset lives on Hugging Face. We run the upstream questions and judgement files verbatim.

ckem · retrieval (150 questions)
  • Hit@1: 67.3%
  • Hit@5: 95.3%
  • Hit@10: 98.7%
ckem vs. published baselines
  • Recall@5: ckem 95.3% vs. best published 81.6%
  • Hit@10: ckem 98.7% vs. best published 80.5%

Recall@5 baseline: hybrid retrieval + neural reranker on FinanceBench documents (arXiv:2603.16877). Hit@10 baseline: VisionRAG's reported Accuracy@10 on FinanceBench. Text-only single-vector stores in the original paper sit ~30pp lower than the VisionRAG line.

ckem Hit@10 by question class
  • Domain-relevant: 100.0%
  • Novel-generated: 100.0%
  • Metrics-generated: 96.0%
What each question class tests
  • Domain-relevant: direct lookup of a stated fact in the filing.
  • Novel-generated: synthesis across sections, where no single span fully answers.
  • Metrics-generated: numbers from one or more tables, then arithmetic.

Reading the numbers: These are retrieval metrics. Hit@K measures whether the ground-truth 10-K filing lands in the top-K results. Domain-relevant and novel-generated questions retrieve perfectly; metrics-generated questions (cross-table arithmetic) sit at 96% because the gold evidence is a specific table cell that embeds close to neighbouring filings from the same issuer.
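
For readers who want the metric spelled out, Hit@K as defined above can be sketched in a few lines. The function name `hit_at_k` and the filing IDs are illustrative assumptions; this is not the upstream FinanceBench scoring code.

```python
from typing import Sequence


def hit_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of questions whose gold filing ID appears in the top-k results."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for results, g in zip(ranked, gold) if g in results[:k])
    return hits / len(gold)


# Toy data: three questions, each with a ranked list of retrieved filing IDs.
ranked = [
    ["A_2022", "A_2021", "B_2022"],  # gold filing at rank 1
    ["B_2021", "B_2022", "C_2022"],  # gold filing at rank 2 (adjacent year ranks first)
    ["C_2020", "C_2021", "A_2022"],  # gold filing not retrieved
]
gold = ["A_2022", "B_2022", "C_2022"]

for k in (1, 2, 3):
    print(f"Hit@{k} = {hit_at_k(ranked, gold, k):.3f}")
```

On 150 questions, the reported 67.3%, 95.3% and 98.7% correspond to 101, 143 and 148 hits. If the three question classes hold 50 questions each (an assumption here, not stated on this page), 96.0% on metrics-generated is 48 of 50, and 50 + 50 + 48 = 148 matches the overall Hit@10.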

Ongoing work: End-to-end answer accuracy depends on the reader sitting on top of retrieval, which isn't the variable we're testing here. The retrieval gap left to close is on the metrics-generated subset: filings from the same company across years embed close together, so the right table cell can be outranked by the same line item from an adjacent year or quarter. Two workstreams target this: issuer-and-period-aware query expansion (HyDE-style decomposition that conditions on company + filing date), and a financial-tables LoRA on top of the Qwen3 encoder, fine-tuned to separate table cells from prose. A cross-encoder reranker over the top-30 candidates is the third lever.
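
The adjacent-year failure mode is easy to reproduce on toy data. The sketch below is a deliberately simple stand-in: a metadata bonus for matching issuer and fiscal year. It is not HyDE-style expansion, a LoRA, or a cross-encoder, and it is not ckem's implementation. `Candidate`, `extract_year`, `rerank`, the bonus value and the scores are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    issuer: str
    fiscal_year: int
    score: float  # similarity from a first-stage retriever (toy values below)


def extract_year(query: str) -> Optional[int]:
    """Pull a four-digit 20xx year, optionally prefixed with FY, out of the query."""
    match = re.search(r"\b(?:FY)?(20\d{2})\b", query, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def rerank(query: str, candidates: List[Candidate], issuer: str,
           bonus: float = 0.2) -> List[Candidate]:
    """Re-sort candidates, adding a bonus for a matching issuer and a matching year."""
    year = extract_year(query)

    def adjusted(c: Candidate) -> float:
        s = c.score
        if c.issuer == issuer:
            s += bonus
        if year is not None and c.fiscal_year == year:
            s += bonus
        return s

    return sorted(candidates, key=adjusted, reverse=True)


# Toy data: the prior-year filing narrowly outranks the right one before reranking.
candidates = [
    Candidate("EXAMPLE", 2021, 0.82),
    Candidate("EXAMPLE", 2022, 0.80),
    Candidate("OTHER", 2022, 0.78),
]
ranked = rerank("What was capex in FY2022?", candidates, issuer="EXAMPLE")
for c in ranked:
    print(c.issuer, c.fiscal_year)
```

A fixed additive bonus is the crudest possible form of period awareness; it only illustrates why conditioning on company and filing date can separate candidates that raw similarity scores place within a few hundredths of each other.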

In practice

Where teams put ckem in their finance stack.

  • Analyst copilots over filings.

    An equity analyst asks a question, the agent answers with the span and the cited filing. The analyst clicks through to the source paragraph — not a generic page, the exact span ckem returned.

  • Diligence over a portfolio.

    Ask 150 standardised questions across 84 filings and you get a tractable spreadsheet — provided the retrieval layer doesn't lose the answers in the corpus. FinanceBench is the head of that distribution.

  • Compliance and disclosure checks.

    “Did the company disclose this risk factor in the prior year's 10-K?” The supersession graph makes year-over-year comparison a first-class operation, not a regex.

Run ckem on your filings.

Bring a folder of 10-Ks (or any analyst-facing corpus) and a labeled question set. We'll run ckem against your current retrieval and walk through the graph together over MCP.