Finding a Needle in 4,000 Hours of Legal Video
AIntropy Engineering · April 2026 · 10 min read
Summary
- 4,000+ hours of legal proceedings: court hearings, government meetings, public testimony across multiple states
- Two retrieval modes on one stack: Quick (6.9s, 90% accuracy) and Deep (agentic multi-pass, 94% accuracy)
- Core stack: vision-enhanced hybrid search + knowledge graph + BM25/kNN blended retrieval
- Every answer attributed to exact video timestamps with three-layer source verification
- Tested on 928 real production queries with stable latency and no accuracy collapse
- Architecture generalizes beyond legal: built around query intent, not specific domains
The problem
Legal language is precise. Evidence must be citable. And queries range from simple ("what was voted on?") to research-grade ("show me all property rights arguments across hearings and find related precedents").
A standard RAG pipeline cannot handle this. The data is multimodal (video + audio + transcripts), the queries are diverse, and every answer must trace back to an exact timestamp. We needed a different architecture.
The core retrieval stack
Both Quick and Deep modes share the same retrieval pipeline. The difference is what happens after retrieval.
Every segment gets enriched before search. A Vision Language Model analyzes keyframes: who is speaking, what is displayed, what visual references matter. Audio analysis extracts prosody, silence patterns, overlapping speech. Both enrich the transcript before embedding.
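The enrichment step above can be sketched as a simple data structure. This is a minimal illustration, not the production schema: the `Segment` class, its field names, and `embedding_text` are hypothetical, standing in for whatever the real pipeline uses to merge VLM and audio output with the transcript before embedding.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    video_id: str
    start_s: float          # timestamp of segment start, in seconds
    end_s: float            # timestamp of segment end, in seconds
    transcript: str
    vision_notes: list = field(default_factory=list)  # VLM keyframe analysis
    audio_notes: list = field(default_factory=list)   # prosody, silences, overlap

    def embedding_text(self) -> str:
        # Concatenate transcript with both enrichment channels so the
        # embedding captures visual and audio context, not just the words.
        parts = [self.transcript, *self.vision_notes, *self.audio_notes]
        return "\n".join(p for p in parts if p)
```

The point of the pattern: enrichment happens once, at ingest, so search-time latency pays nothing for it.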
Then two searches run in parallel. Hybrid Search (BM25 + kNN) over transcripts catches both exact legal terminology and semantic matches. Knowledge Graph over NebulaGraph handles relational queries ("which judges appeared together").
Why both? BM25 catches "statute 18:6-4" exactly. kNN catches "taking of private property" matching "eminent domain." The knowledge graph catches "judges who presided over similar cases." No single method handles all three.
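A minimal sketch of how BM25 and kNN scores can be blended after running in parallel. The BM25 formula is standard; the `blend` function and its min-max normalization are illustrative assumptions, since each backend returns scores on a different scale and the post says nothing about the production fusion method.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # Minimal BM25 over pre-tokenized docs (each doc is a list of tokens).
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = [0.0] * N
    for t in query_terms:
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        for i, d in enumerate(docs):
            tf = d.count(t)
            scores[i] += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(d) / avgdl))
    return scores

def blend(bm25, knn, alpha=0.5):
    # Min-max normalize each score list to [0, 1], then convex-combine,
    # so neither backend's raw scale dominates the ranking.
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    return [alpha * a + (1 - alpha) * c for a, c in zip(norm(bm25), norm(knn))]
```

With `alpha` tuned per query type, exact-match queries like statute numbers lean on BM25 while paraphrase queries lean on the embedding side.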
Two modes, one stack
Quick Mode
One search pass. LLM receives results from hybrid search + knowledge graph, synthesizes a single answer, attributes it using TF-IDF verification. Fast and predictable.
Deep Mode (Agentic)
The LLM enters a reasoning loop with a stateful Scratchpad. On each iteration it can re-search with refined queries, explore graph edges, or read cached results. Early iterations search broadly; later ones drill into specifics. Budget: 5 iterations max.
Key design choice: the Scratchpad. It is a stateful cache. The LLM does not re-search the same query twice. It reads cached results and explores new angles. This prevents redundant work and keeps latency predictable.
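The loop-plus-cache design can be sketched as follows. The hooks `propose_query` and `synthesize` are hypothetical stand-ins for the LLM calls; only the Scratchpad-as-cache behavior and the 5-iteration budget come from the description above.

```python
class Scratchpad:
    """Stateful cache: the same query is never executed twice."""
    def __init__(self, search_fn):
        self._search = search_fn
        self._cache = {}

    def search(self, query):
        # Cache hit -> return stored results instead of re-searching.
        if query not in self._cache:
            self._cache[query] = self._search(query)
        return self._cache[query]

def deep_mode(question, propose_query, synthesize, search_fn, max_iters=5):
    pad = Scratchpad(search_fn)
    evidence = []
    for i in range(max_iters):
        # Hypothetical LLM hook: broad queries early, specific ones later.
        q = propose_query(question, evidence, iteration=i)
        if q is None:  # the model decides it has enough context
            break
        evidence.extend(pad.search(q))
    return synthesize(question, evidence)
```

Because repeated queries hit the cache, the cost of each extra iteration is bounded by genuinely new searches, which is what keeps latency predictable.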
Source attribution
In legal work, every answer must be verifiable, so we attribute every claim through three layers of verification, with TF-IDF term overlap as the final check against the cited transcript segments.
Why TF-IDF, not a neural re-ranker? Explainability. If TF-IDF fails, we know why: terms do not overlap. With a neural model, you are debugging a black box. In legal contexts, that is a liability.
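A minimal sketch of TF-IDF-based attribution under these assumptions: the answer is scored against each candidate transcript segment by IDF-weighted term overlap, and the top-scoring segment supplies the cited timestamp. The tokenizer and the exact IDF variant are illustrative choices, not the production implementation.

```python
import math
import re
from collections import Counter

def tfidf_overlap(answer, segments):
    # Score each transcript segment by TF-IDF-weighted overlap with the
    # answer text; the argmax segment backs the citation. When a score is
    # near zero we know exactly why: the terms do not overlap.
    tok = lambda s: re.findall(r"[a-z0-9:\-]+", s.lower())
    docs = [tok(s) for s in segments]
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(N / df[t]) + 1 for t in df}  # rarer term -> higher weight
    ans = Counter(tok(answer))
    return [sum(ans[t] * idf[t] for t in set(d)) for d in docs]
```

The failure mode is transparent by construction: if no segment scores above a threshold, the answer contains terms the transcripts do not, and that is an auditable finding rather than an opaque re-ranker decision.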
Results
| Metric | Quick | Deep (Golden) | Deep (Production) |
|---|---|---|---|
| Dataset size | 50 queries | 50 queries | 928 queries |
| Accuracy | 90.0% | 94.0% | 89.1% |
| Mean latency | 6.9s | 17.9s | ~10s |
| p95 latency | 10.5s | 64.8s | 18.3s |
| Excellent (Grade 3) | 38% | 81.4% | 71.4% |
| Adequate (Grade 2) | 41% | 14.0% | 25.7% |
| Poor (Grade 1) | 21% | 4.7% | 2.9% |
The tradeoff: Quick is fast and predictable; Deep reaches the highest accuracy but with variable latency. Deep on production data (928 real queries) shows stable performance: p95 = 18.3s, no timeouts, 89.1% accuracy.
What we learned
One search often beats multiple. Quick mode has the highest semantic similarity (0.624). For many queries, the first hybrid search + KG query delivers the best context. Iteration adds reasoning but does not always improve retrieval.
Excellence concentrates in Deep mode. 71-81% excellent answers vs 38% in Quick. For research-grade legal work, Deep mode is the clear choice.
Production data is more forgiving. 928 real queries show stable Deep mode performance. Curated golden-set questions sometimes trigger complex reasoning that times out. Real-world queries rarely do.
The architecture generalizes. Nothing here is specific to legal. Point it at government meetings, corporate proceedings, legislative records, or medical depositions. The retrieval pipeline adapts because it is built around query intent, not domain knowledge.