Engineering

Finding a Needle in 4,000 Hours of Legal Video

AIntropy Engineering  ·  April 2026  ·  10 min read

The problem

Legal language is precise. Evidence must be citable. And queries range from simple ("what was voted on?") to research-grade ("show me all property rights arguments across hearings and find related precedents").

A standard RAG pipeline cannot handle this. The data is multimodal (video + audio + transcripts), the queries are diverse, and every answer must trace back to an exact timestamp. We needed a different architecture.

The core retrieval stack

Both Quick and Deep modes share the same retrieval pipeline. The difference is what happens after retrieval.

How retrieval works
[Diagram: video segments (4,000+ hours) → enrichment (Vision Language Model + audio analysis) → hybrid search (BM25 for exact terms, kNN for semantic match) in parallel with a knowledge graph of people, cases, and orgs (NebulaGraph, nGQL) → LLM reasoning in Quick or Deep mode → cited answer with exact timestamps]

Every segment gets enriched before search. A Vision Language Model analyzes keyframes: who is speaking, what is displayed, what visual references matter. Audio analysis extracts prosody, silence patterns, overlapping speech. Both enrich the transcript before embedding.
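The enrichment step amounts to folding the model outputs into the text that gets embedded. A minimal sketch, assuming hypothetical field names (`transcript`, `embedding_text`) and annotation shapes rather than the production schema:

```python
def enrich_segment(segment: dict, vlm_notes: list[str], audio_notes: list[str]) -> dict:
    """Fold VLM and audio annotations into the transcript text before embedding.

    The annotation lists stand in for whatever the Vision Language Model and
    audio analysis actually emit; only the merge idea is taken from the post.
    """
    enriched = dict(segment)  # keep the original segment untouched
    parts = [segment["transcript"]]
    if vlm_notes:
        parts.append("Visual context: " + "; ".join(vlm_notes))
    if audio_notes:
        parts.append("Audio context: " + "; ".join(audio_notes))
    enriched["embedding_text"] = "\n".join(parts)
    return enriched

seg = {"id": "seg-001", "start": "01:22:05", "transcript": "The motion carries."}
out = enrich_segment(
    seg,
    ["Judge Alvarez speaking", "exhibit 4 on screen"],
    ["long pause before ruling"],
)
```

The point of merging before embedding is that a query like "what was shown on screen during the ruling" can match on visual context even when the spoken transcript never mentions it.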

Then two searches run in parallel. Hybrid Search (BM25 + kNN) over transcripts catches both exact legal terminology and semantic matches. Knowledge Graph over NebulaGraph handles relational queries ("which judges appeared together").
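The parallel dispatch can be sketched with stand-in search functions. The nGQL query is hypothetical (the post does not show the real graph schema), and the stubs stand in for the OpenSearch and NebulaGraph clients:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical nGQL for a relational query ("which judges appeared together");
# node and edge names are illustrative, not the production schema.
NGQL_COAPPEARANCE = (
    "MATCH (j1:judge)-[:presided]->(h:hearing)<-[:presided]-(j2:judge) "
    "RETURN j1.name, j2.name;"
)

def hybrid_search(query: str) -> list[dict]:
    # Stand-in for the BM25 + kNN search over enriched transcripts.
    return [{"source": "transcript", "query": query}]

def graph_search(ngql: str) -> list[dict]:
    # Stand-in for a NebulaGraph session executing nGQL.
    return [{"source": "graph", "ngql": ngql}]

def retrieve(query: str) -> list[dict]:
    """Run transcript search and graph search in parallel, then pool results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text = pool.submit(hybrid_search, query)
        graph = pool.submit(graph_search, NGQL_COAPPEARANCE)
        return text.result() + graph.result()
```

Running the two legs concurrently means the retrieval stage costs roughly the slower of the two calls rather than their sum.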

Why both? BM25 catches "statute 18:6-4" exactly. kNN catches "taking of private property" matching "eminent domain." The knowledge graph catches "judges who presided over similar cases." No single method handles all three.
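The post does not say how the BM25 and kNN result lists are merged; reciprocal rank fusion (RRF) is one common, score-free way to do it, sketched here under that assumption:

```python
def rrf_fuse(bm25_ids: list[str], knn_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked ID lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank + 1) per document, so a document
    ranked by both retrievers outscores one ranked highly by only one.
    k = 60 is the conventional default from the RRF literature.
    """
    scores: dict[str, float] = {}
    for ranking in (bm25_ids, knn_ids):
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a segment that BM25 ranks second for "statute 18:6-4" and kNN ranks first for "eminent domain" ends up ahead of segments found by only one retriever.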

Two modes, one stack

Quick mode vs Deep mode
[Diagram: Quick mode runs one parallel search, then a single LLM synthesis step, answering in 6.9 s mean. Deep (agentic) mode runs the same parallel search, then a reasoning loop of up to 5 iterations that refines the query, re-searches, reads the scratchpad, and explores the graph, reaching 94% accuracy]

Quick Mode

For factual queries. "What was voted on?" "Which attorneys testified?"
Mean latency: 6.9 s
Accuracy: 90%
p95 latency: 10.5 s

One search pass. LLM receives results from hybrid search + knowledge graph, synthesizes a single answer, attributes it using TF-IDF verification. Fast and predictable.

Deep Mode (Agentic)

For research queries. "Find all property rights arguments and show related precedents."
Accuracy: 94%
Excellent grade: 81%
Avg iterations: 2-3

LLM enters a reasoning loop with a stateful Scratchpad. Each iteration it can re-search with refined queries, explore graph edges, or read cached results. Early iterations search broad, later ones drill specific. Budget: 5 iterations max.

Key design choice: the Scratchpad. It is a stateful cache. The LLM does not re-search the same query twice. It reads cached results and explores new angles. This prevents redundant work and keeps latency predictable.
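The loop-plus-Scratchpad design can be sketched as below. `search`, `plan_next`, and `synthesize` are hypothetical callables standing in for the real search stack and LLM calls; the only behaviors taken from the post are the 5-iteration budget and the never-search-the-same-query-twice cache:

```python
class Scratchpad:
    """Stateful cache of search results keyed by query string."""

    def __init__(self) -> None:
        self._cache: dict[str, list] = {}

    def lookup(self, query: str):
        return self._cache.get(query)

    def store(self, query: str, results: list) -> None:
        self._cache[query] = results

def deep_answer(question, search, plan_next, synthesize, max_iters=5):
    """Agentic loop: search, refine, repeat, never re-running a cached query."""
    pad = Scratchpad()
    query = question
    for _ in range(max_iters):
        if pad.lookup(query) is None:     # skip redundant searches
            pad.store(query, search(query))
        query = plan_next(question, pad)  # LLM refines the query, or stops
        if query is None:
            break
    return synthesize(question, pad)
```

If the planner emits a query it has already tried, the iteration costs only a cache read, which is what keeps Deep mode's latency bounded even when the LLM circles back.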

Source attribution

In legal work, every answer must be verifiable. We use three layers of attribution:

Three-layer source attribution
Layer 1: LLM citation parsing. Inline markers such as [S1] or [S2, S3] are extracted from the answer as an explicit signal.
Layer 2: TF-IDF verification. Similarity above 0.35 marks a citation as primary; explainable, not a black box.
Layer 3: Score weighting. OpenSearch rank provides a boost; high-ranked results are treated as more trustworthy.

Result: each citation carries a source type, a category (primary or supporting), a relevance score, and an exact timestamp.

Why TF-IDF, not a neural re-ranker? Explainability. If TF-IDF fails, we know why: terms do not overlap. With a neural model, you are debugging a black box. In legal contexts, that is a liability.
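A minimal, pure-Python sketch of the TF-IDF verification layer: cosine similarity between an answer sentence and a candidate source, with the 0.35 threshold from the post. The tokenizer and IDF weighting here are illustrative simplifications, not the production implementation:

```python
import math
import re
from collections import Counter

def _tokens(text: str) -> list[str]:
    # Crude tokenizer; keeps colons so terms like "18:6" survive.
    return re.findall(r"[a-z0-9:]+", text.lower())

def tfidf_cosine(answer_sentence: str, source_text: str, corpus: list[str]) -> float:
    """Cosine similarity between TF-IDF vectors, with IDF fit on a small corpus."""
    docs = [_tokens(d) for d in corpus]
    n = len(docs)
    idf = {t: math.log(n / (1 + sum(t in d for d in docs))) + 1.0
           for d in docs for t in d}

    def vec(text: str) -> dict[str, float]:
        tf = Counter(_tokens(text))
        return {t: c * idf.get(t, 1.0) for t, c in tf.items()}

    a, b = vec(answer_sentence), vec(source_text)
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_primary(answer_sentence: str, source_text: str,
               corpus: list[str], threshold: float = 0.35) -> bool:
    """Layer-2 check: similarity above the threshold marks a primary citation."""
    return tfidf_cosine(answer_sentence, source_text, corpus) > threshold
```

When this check fails, the overlapping (or missing) terms are right there in the vectors, which is exactly the debuggability argument the post makes.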

Results

Metric               | Quick      | Deep (Golden) | Deep (Production)
Dataset size         | 50 queries | 50 queries    | 928 queries
Accuracy             | 90.0%      | 94.0%         | 89.1%
Mean latency         | 6.9s       | 17.9s         | ~10s
p95 latency          | 10.5s      | 64.8s         | 18.3s
Excellent (Grade 3)  | 38%        | 81.4%         | 71.4%
Adequate (Grade 2)   | 41%        | 14.0%         | 25.7%
Poor (Grade 1)       | 21%        | 4.7%          | 2.9%

The tradeoff: Quick is fast and predictable. Deep reaches the highest accuracy, but with variable latency. On production data (928 real queries), Deep shows stable performance: p95 = 18.3 s, no timeouts, 89.1% accuracy.

What we learned

One search often beats multiple. Quick mode has the highest semantic similarity (0.624). For many queries, the first hybrid search + KG query delivers the best context. Iteration adds reasoning but does not always improve retrieval.

Excellence concentrates in Deep mode. 71-81% excellent answers vs 38% in Quick. For research-grade legal work, Deep mode is the clear choice.

Production data is more forgiving. 928 real queries show stable Deep mode performance. Curated golden-set questions sometimes trigger complex reasoning that times out. Real-world queries rarely do.

The architecture generalizes. Nothing here is specific to legal. Point it at government meetings, corporate proceedings, legislative records, or medical depositions. The retrieval pipeline adapts because it is built around query intent, not domain knowledge.

Underlying technology may be protected by one or more patents pending with the USPTO.

© 2026 AIntropy AI. All rights reserved.