The data landscape no one talks about
New Jersey has one of the most comprehensive open data portals in the United States. Over 85 million records spanning pensions, corrections, public utilities, education, healthcare, transportation, and more. Maintained by 23 government agencies. Stored across 789 data sources in spreadsheets, PDFs, JSON files, images, ZIP archives, and other formats.
It is all publicly available. It has always been publicly available. And yet, if you ask any frontier AI model a question that requires connecting data from two of those agencies (say, comparing pension liability data to education budget allocations), you will get one of three responses: a confident hallucination, an admission of ignorance, or a redirection to "please consult official sources."
The data was always there. The AI was just blind to it.
This is not a New Jersey problem. It is the default state of institutional data everywhere. Governments, hospitals, enterprises — all of them sitting on massive, fragmented data landscapes that no AI system has ever been able to navigate. Not because the data is bad. Because the systems were built in isolation, and AI was never trained to bridge them.
Why frontier models cannot help
When people say "frontier AI," they mean the best models available: GPT-4o, Gemini Ultra, Claude 3.5. These are remarkable systems. They can write code, summarize legal documents, reason through complex problems, and carry on nuanced conversations. But they share a fundamental constraint that never gets talked about.
They were trained on the internet. And your data — your organization's data, your government's data — was not on it.
NJ Open Data contains 23,000+ PDFs packed with embedded images, tables, charts, and scanned documents. It contains spreadsheets with inconsistent schemas across agencies. It contains ZIP archives full of files in formats that overlap but do not align. No frontier model was trained on any of this. And even if you feed this data to them through retrieval-augmented generation, traditional RAG pipelines fail in exactly the scenarios that matter: the ones where the answer requires understanding multiple formats across multiple sources simultaneously.
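To make that failure mode concrete, here is a minimal sketch of a conventional retrieve-then-answer loop. The chunks, the toy keyword-overlap scoring, and the file names are illustrative stand-ins, not anything from Kurious or the NJ portal; the structural point is that a flat top-k retriever returns isolated passages from individual sources and has no mechanism for joining records across them.

```python
# Minimal sketch of a conventional RAG retrieval step: score chunks from a
# flat index, take the top-k, and hand them to a model as context.
# Scoring here is toy keyword overlap; real systems use vector embeddings,
# but the structural limitation is the same.

from collections import Counter

# A flat index of text chunks, each extracted from a single source document.
CHUNKS = [
    ("pension_report.pdf",  "Terminal leave benefit balances by employee, FY2023."),
    ("payroll_extract.csv", "Township payroll records: employee id, township, salary."),
    ("hr_filings.zip",      "HR filings: employment status, leave accrual policies."),
]

def score(query: str, text: str) -> int:
    """Toy relevance score: number of query words that appear in the chunk."""
    query_words = Counter(query.lower().split())
    chunk_words = set(text.lower().split())
    return sum(count for word, count in query_words.items() if word in chunk_words)

def retrieve(query: str, k: int = 2):
    """Return the k highest-scoring chunks, each from one source in isolation."""
    return sorted(CHUNKS, key=lambda c: score(query, c[1]), reverse=True)[:k]

query = "Which townships have the most employees with terminal leave benefits?"
for source, text in retrieve(query):
    print(source, "->", text)

# The retriever hands back isolated passages. Nothing in this loop joins the
# payroll table to the benefit disclosures by employee or township, so the
# model downstream is left to guess the cross-source relationship.
```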
We tested this ourselves. Take a question like: "Which New Jersey townships have the most employees with terminal leave benefits?" Answering it correctly requires crossing HR filings, payroll records, and benefit disclosures across agencies whose systems were never designed to talk to each other. We ran the question through GPT-4o, Gemini Ultra, and Claude 3.5. Every one of them failed: not with a clear error, but with a confident, plausible, incorrect answer.
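For contrast, here is roughly what answering that question actually demands, written out as an explicit multi-dataset join. The file names and column names are hypothetical, not the real NJ agency schemas; the point is that the answer lives in the relationships between datasets, which no single retrieved passage contains.

```python
# Hedged sketch of the cross-source aggregation the question requires.
# File names and columns below are hypothetical placeholders, not real NJ schemas.

import pandas as pd

payroll  = pd.read_csv("payroll_records.csv")      # employee_id, township, ...
hr       = pd.read_csv("hr_filings.csv")           # employee_id, employment_status, ...
benefits = pd.read_csv("benefit_disclosures.csv")  # employee_id, benefit_type, ...

# Keep only the rows that disclose a terminal leave benefit.
terminal_leave = benefits[benefits["benefit_type"] == "terminal_leave"]

# Join payroll to HR filings to benefit disclosures, then count distinct
# employees with terminal leave per township.
answer = (
    payroll
    .merge(hr, on="employee_id")
    .merge(terminal_leave, on="employee_id")
    .groupby("township")["employee_id"]
    .nunique()
    .sort_values(ascending=False)
    .head(10)
)
print(answer)
```

A system that only ever sees one source at a time cannot perform this kind of aggregation, which is why plausible-but-wrong answers are the typical result.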
Introducing Kurious
We built Kurious to solve exactly this problem. Not as a wrapper around an existing model. Not as a prompt-engineering trick. As a fundamentally different approach to how AI reads and reasons over heterogeneous, real-world data.
What it looks like in practice
We deployed Kurious on NJ Open Data. The entire portal — 85 million records, 789 data sources, 23 agencies, 8+ formats — indexed and queryable in plain language.
Ask a question and the thinking steps are visible. The sources are cited. And the answer is correct, verified against ground truth across multiple agency datasets. No hallucination. No hedging. No "please consult official sources."
The numbers
Why this matters beyond New Jersey
New Jersey is not a special case. It is a representative case. Every state government, every hospital network, every enterprise with more than a decade of data history is sitting on the same problem: massive, fragmented, heterogeneous data that no AI system has ever been able to navigate.
The opportunity is not in building a better chatbot. It is in making the data that already exists — that organizations have spent decades and billions collecting — finally useful to the AI systems being deployed on top of it.
That is what Kurious does. And NJ Open Data is just the first proof point.
Try it yourself
Ask anything about NJ Open Data. Live and ungated.
Open Kurious →