Finding a Needle in 4,000 Hours of Legal Video
AIntropy Engineering · April 2026 · 10 min read
Summary
- 4,000+ hours of legal proceedings: court hearings, government meetings, public testimony across multiple states
- Two retrieval modes on one stack: Quick (6.9s, 90% accuracy) and Deep (agentic multi-pass, 94% accuracy)
- Core stack: vision-enhanced hybrid search + knowledge graph + BM25/kNN blended retrieval
- Every answer attributed to exact video timestamps with three-layer source verification
- Tested on 928 real production queries with stable latency and no accuracy collapse
- Architecture generalizes beyond legal: built around query intent, not specific domains
The problem
Legal language is precise. Evidence must be citable. And queries range from simple ("what was voted on?") to research-grade ("show me all property rights arguments across hearings and find related precedents").
A standard RAG pipeline cannot handle this. The data is multimodal (video + audio + transcripts), the queries are diverse, and every answer must trace back to an exact timestamp. We needed a different architecture.
The core retrieval stack
Both Quick and Deep modes share the same retrieval pipeline. The difference is what happens after retrieval.
Every segment gets enriched before search. A Vision Language Model analyzes keyframes: who is speaking, what is displayed, what visual references matter. Audio analysis extracts prosody, silence patterns, overlapping speech. Both enrich the transcript before embedding.
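The enrichment step above can be sketched as a simple data structure. This is a minimal illustration, not the production schema: the `Segment` class, its field names, and `embedding_text` are hypothetical, standing in for whatever the real pipeline uses to merge VLM and audio output with the transcript before embedding.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    video_id: str
    start_s: float          # timestamp of segment start, in seconds
    end_s: float            # timestamp of segment end, in seconds
    transcript: str
    vision_notes: list = field(default_factory=list)  # VLM keyframe analysis
    audio_notes: list = field(default_factory=list)   # prosody, silences, overlap

    def embedding_text(self) -> str:
        # Concatenate transcript with both enrichment channels so the
        # embedding captures visual and audio context, not just the words.
        parts = [self.transcript, *self.vision_notes, *self.audio_notes]
        return "\n".join(p for p in parts if p)
```

The point of the pattern: enrichment happens once, at ingest, so search-time latency pays nothing for it.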
Then two searches run in parallel. Hybrid Search (BM25 + kNN) over transcripts catches both exact legal terminology and semantic matches. Knowledge Graph over NebulaGraph handles relational queries ("which judges appeared together").
Why both? BM25 catches "statute 18:6-4" exactly. kNN catches "taking of private property" matching "eminent domain." The knowledge graph catches "judges who presided over similar cases." No single method handles all three.
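A minimal sketch of how BM25 and kNN scores can be blended after running in parallel. The BM25 formula is standard; the `blend` function and its min-max normalization are illustrative assumptions, since each backend returns scores on a different scale and the post says nothing about the production fusion method.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # Minimal BM25 over pre-tokenized docs (each doc is a list of tokens).
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = [0.0] * N
    for t in query_terms:
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        for i, d in enumerate(docs):
            tf = d.count(t)
            scores[i] += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(d) / avgdl))
    return scores

def blend(bm25, knn, alpha=0.5):
    # Min-max normalize each score list to [0, 1], then convex-combine,
    # so neither backend's raw scale dominates the ranking.
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    return [alpha * a + (1 - alpha) * c for a, c in zip(norm(bm25), norm(knn))]
```

With `alpha` tuned per query type, exact-match queries like statute numbers lean on BM25 while paraphrase queries lean on the embedding side.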
Two modes, one stack
Quick Mode
One search pass. LLM receives results from hybrid search + knowledge graph, synthesizes a single answer, attributes it using TF-IDF verification. Fast and predictable.
Deep Mode (Agentic)
The LLM enters a reasoning loop with a stateful Scratchpad. On each iteration it can re-search with refined queries, explore graph edges, or read cached results. Early iterations search broadly; later ones drill into specifics. Budget: 5 iterations max.
Key design choice: the Scratchpad. It is a stateful cache. The LLM does not re-search the same query twice. It reads cached results and explores new angles. This prevents redundant work and keeps latency predictable.
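The loop-plus-cache design can be sketched as follows. The hooks `propose_query` and `synthesize` are hypothetical stand-ins for the LLM calls; only the Scratchpad-as-cache behavior and the 5-iteration budget come from the description above.

```python
class Scratchpad:
    """Stateful cache: the same query is never executed twice."""
    def __init__(self, search_fn):
        self._search = search_fn
        self._cache = {}

    def search(self, query):
        # Cache hit -> return stored results instead of re-searching.
        if query not in self._cache:
            self._cache[query] = self._search(query)
        return self._cache[query]

def deep_mode(question, propose_query, synthesize, search_fn, max_iters=5):
    pad = Scratchpad(search_fn)
    evidence = []
    for i in range(max_iters):
        # Hypothetical LLM hook: broad queries early, specific ones later.
        q = propose_query(question, evidence, iteration=i)
        if q is None:  # the model decides it has enough context
            break
        evidence.extend(pad.search(q))
    return synthesize(question, evidence)
```

Because repeated queries hit the cache, the cost of each extra iteration is bounded by genuinely new searches, which is what keeps latency predictable.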
Source attribution
In legal work, every answer must be verifiable, so we attribute every claim through three layers of verification, with TF-IDF term overlap as the final check against the cited transcript segments.
Why TF-IDF, not a neural re-ranker? Explainability. If TF-IDF fails, we know why: terms do not overlap. With a neural model, you are debugging a black box. In legal contexts, that is a liability.
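A minimal sketch of TF-IDF-based attribution under these assumptions: the answer is scored against each candidate transcript segment by IDF-weighted term overlap, and the top-scoring segment supplies the cited timestamp. The tokenizer and the exact IDF variant are illustrative choices, not the production implementation.

```python
import math
import re
from collections import Counter

def tfidf_overlap(answer, segments):
    # Score each transcript segment by TF-IDF-weighted overlap with the
    # answer text; the argmax segment backs the citation. When a score is
    # near zero we know exactly why: the terms do not overlap.
    tok = lambda s: re.findall(r"[a-z0-9:\-]+", s.lower())
    docs = [tok(s) for s in segments]
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(N / df[t]) + 1 for t in df}  # rarer term -> higher weight
    ans = Counter(tok(answer))
    return [sum(ans[t] * idf[t] for t in set(d)) for d in docs]
```

The failure mode is transparent by construction: if no segment scores above a threshold, the answer contains terms the transcripts do not, and that is an auditable finding rather than an opaque re-ranker decision.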
Results
| Metric | Quick | Deep (Golden) | Deep (Production) |
|---|---|---|---|
| Dataset size | 50 queries | 50 queries | 928 queries |
| Accuracy | 90.0% | 94.0% | 89.1% |
| Mean latency | 6.9s | 17.9s | ~10s |
| p95 latency | 10.5s | 64.8s | 18.3s |
| Excellent (Grade 3) | 38% | 81.4% | 71.4% |
| Adequate (Grade 2) | 41% | 14.0% | 25.7% |
| Poor (Grade 1) | 21% | 4.7% | 2.9% |
The tradeoff: Quick is fast and predictable; Deep reaches the highest accuracy but with variable latency. Deep on production data (928 real queries) shows stable performance: p95 = 18.3s, no timeouts, 89.1% accuracy.
What we learned
One search often beats multiple. Quick mode has the highest semantic similarity (0.624). For many queries, the first hybrid search + KG query delivers the best context. Iteration adds reasoning but does not always improve retrieval.
Excellence concentrates in Deep mode. 71-81% excellent answers vs 38% in Quick. For research-grade legal work, Deep mode is the clear choice.
Production data is more forgiving. 928 real queries show stable Deep mode performance. Curated golden-set questions sometimes trigger complex reasoning that times out. Real-world queries rarely do.
The architecture generalizes. Nothing here is specific to legal. Point it at government meetings, corporate proceedings, legislative records, or medical depositions. The retrieval pipeline adapts because it is built around query intent, not domain knowledge.