Why LLMs Can't Predict Legal Outcomes

A language model generates text that looks like an answer. It does not calculate probabilities based on historical claim geometries. Confusing the two is a fast way to misprice your reserves.

  • By David H. Silver
  • Head of AI
  • 28 November 2025
  • 5 min read

TL;DR: Large language models are text generators, not forecasting engines. To forecast claim outcomes reliably, use generative AI to extract facts from case files, then feed those facts into separate mathematical models trained exclusively on resolved historical data.

Ask a large language model to evaluate a severe bodily injury claim, and it will give you a number. It will write a highly convincing paragraph justifying that outcome, citing the plaintiff's age, the venue history, and the severity of the medical records. Claims executives seeing this often assume the machine has analyzed the data and calculated a mathematical forecast. It has not. The model has simply generated a sequence of text tokens that statistically resembles the kind of text humans write when discussing bodily injury claims. It is playing an elaborate game of autocomplete.

Language models are autoregressive text generators. They predict the next token from the tokens already in the context window. They do not maintain an explicit, inspectable mathematical representation of claim value, venue volatility, or historical settlement distributions. When a model outputs a specific settlement figure, it is not drawing that value from a calibrated regression on resolved claims, and it is not evaluating the geometry of past verdicts. It is producing the figure that best fits the surrounding text, drawn from patterns in generalized internet text, and it prioritizes linguistic fluency over factual accuracy. This distinction is critical because the insurance industry is currently facing unprecedented reserve volatility driven by social inflation and third-party litigation funding. You cannot manage that severity with a text generator that guesses what a settlement looks like.

A true forecasting system must know what it does not know. In machine learning, we call this calibration. If a model says an escalation event has a 70 percent chance of happening, then across a large sample of claims given that score, roughly 70 percent should actually escalate. Large language models are not trained for this kind of calibration. They are trained to be helpful and fluent, and they present wild guesses with the same syntactic confidence as established facts. In litigation forecasting, an uncalibrated point estimate is worse than useless. It actively misleads the claims professional trying to set an accurate reserve on day one, creating a false sense of precision that collapses during settlement negotiations.
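Calibration can be checked directly once outcomes are known. The sketch below is a minimal illustration, not a description of any production system: it groups forecasts into probability bins and compares the average forecast in each bin with the observed frequency. The function names and the sample data are invented for the example.

```python
from collections import defaultdict

def calibration_table(predicted, occurred, n_bins=10):
    """Group forecasts into probability bins and compare the mean
    forecast in each bin with the observed event frequency."""
    bins = defaultdict(list)
    for p, y in zip(predicted, occurred):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        rows.append((mean_p, freq, len(pairs)))
    return rows

def expected_calibration_error(rows):
    """Count-weighted average gap between forecast and observed frequency."""
    total = sum(n for _, _, n in rows)
    return sum(n * abs(mean_p - freq) for mean_p, freq, n in rows) / total

# Made-up data: ten claims forecast at 0.7, seven of which escalated.
predicted = [0.7] * 10
occurred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
rows = calibration_table(predicted, occurred)
ece = expected_calibration_error(rows)
```

A well-calibrated forecaster produces rows where the first two values are close in every bin. A large gap in any bin means the stated probabilities cannot be taken at face value.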

The Architecture of Honest Forecasting

To predict a legal outcome accurately, you have to separate the reading from the math. We use a strictly neuro-symbolic approach. Generative AI is exceptionally good at the reading part. A typical case file contains thousands of pages of unstructured data. You have messy pleadings, dense medical records, disjointed correspondence, and handwritten notes. We use language models strictly as perception engines to parse this text. They read the files and extract the entities, the specific injuries, the venue details, the plaintiff characteristics, and the legal theories. They structure the unstructured mess into a rigid schema. That is where the language model's job definitively ends.
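One way to picture the "rigid schema" is a typed record that the language model's output must satisfy before anything downstream sees it. This is a sketch under assumptions: the field names are hypothetical, and the language model is assumed to return JSON. Anything missing or mistyped is rejected rather than guessed.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ExtractedClaim:
    # All field names here are hypothetical illustrations.
    claim_id: str
    venue: str
    plaintiff_age: int
    injuries: List[str]
    legal_theories: List[str]

EXPECTED_TYPES = {
    "claim_id": str,
    "venue": str,
    "plaintiff_age": int,
    "injuries": list,
    "legal_theories": list,
}

def parse_extraction(raw_json: str) -> ExtractedClaim:
    """Validate the language model's JSON output against the schema."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("extraction must be a JSON object")
    unknown = set(data) - set(EXPECTED_TYPES)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    for name, expected in EXPECTED_TYPES.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"field {name} should be {expected.__name__}")
    return ExtractedClaim(**data)
```

The point of the strict validation is that the forecasting stage only ever receives structured fields, never free text.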

The actual prediction requires a different architecture entirely. We take the structured variables extracted by the language model and feed them into separate mathematical machine-learning models. These are geometric models trained exclusively on large numbers of resolved cases with known outcomes. They learn the statistical relationships between specific claim attributes and final financial resolutions. Because these models operate on structured numerical and categorical data rather than free text, they calculate probabilities grounded in historical outcomes. They do not hallucinate because they do not generate language. When we look at a claim, we are looking at its coordinates relative to thousands of previously resolved cases. The models compute distances and distributions in that high-dimensional space, and the prediction emerges from this geometry: it shows how similar historical claims behaved under comparable constraints.
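The simplest model with this geometric flavor is a nearest-neighbor lookup over resolved claims. The sketch below uses it purely as a stand-in, since the text does not specify the actual model family. The feature layout, claim identifiers, and settlement amounts are all made up, and the features are assumed to be already scaled to comparable ranges.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_comparables(query, resolved, k=3):
    """Return the k resolved claims closest to the query.
    `resolved` is a list of (claim_id, features, settlement) tuples."""
    ranked = sorted(resolved, key=lambda row: distance(query, row[1]))
    return ranked[:k]

def empirical_quantile(values, q):
    """Nearest-rank quantile of a list of numbers, for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Hypothetical features: (injury severity, venue index, plaintiff age / 100)
resolved = [
    ("H-001", (0.8, 0.6, 0.45), 410_000),
    ("H-002", (0.2, 0.3, 0.30), 35_000),
    ("H-003", (0.7, 0.7, 0.50), 380_000),
    ("H-004", (0.9, 0.5, 0.40), 520_000),
    ("H-005", (0.1, 0.2, 0.65), 20_000),
]
query = (0.8, 0.6, 0.5)
comparables = nearest_comparables(query, resolved, k=3)
settlements = [amount for _, _, amount in comparables]
median = empirical_quantile(settlements, 0.5)
```

Here the three closest resolved claims are H-001, H-003, and H-004, and the median of their settlements is 410,000. The returned claim identifiers double as the comparable cases behind the number.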

The output of this pipeline is never a single point guess. It is a calibrated settlement range. We use conformal prediction techniques to establish bounds that accurately reflect the mathematical uncertainty of the specific case. If the case is highly volatile due to a difficult venue or complex, conflicting medical reports, the range widens to reflect that reality. The system also calculates a specific escalation probability and identifies the exact historical comparable cases driving the forecast. Every number, every probability, and every comparable case is traceable directly back to the source documents the generative model initially read. This creates an honest audit trail that a claims professional can actually trust.
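For readers unfamiliar with conformal prediction, here is a minimal sketch of its simplest variant, split conformal prediction with absolute residuals. It assumes a point forecast already exists and that a held-out set of resolved claims is available for calibration. The numbers are invented, and the function names are illustrative.

```python
import math

def conformal_radius(calibration_residuals, alpha=0.1):
    """Split conformal prediction: return the residual quantile that
    gives roughly (1 - alpha) coverage on exchangeable data."""
    n = len(calibration_residuals)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_residuals)[rank - 1]

def settlement_range(point_forecast, radius):
    """Symmetric range around the point forecast, floored at zero."""
    return max(0.0, point_forecast - radius), point_forecast + radius

# Made-up absolute errors |actual - forecast| on held-out resolved claims.
residuals = [5_000, 12_000, 8_000, 30_000, 15_000,
             9_000, 22_000, 11_000, 18_000]
radius = conformal_radius(residuals, alpha=0.25)
low, high = settlement_range(100_000, radius)
```

With nine calibration residuals and alpha set to 0.25, the rank is 8, the radius is 22,000, and the range is 78,000 to 122,000. This plain version gives every claim the same width. Making the range widen for volatile cases requires an adaptive variant, for example scaling each residual by a per-claim estimate of difficulty before taking the quantile.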

Defending Against Reserve Volatility

This structural separation between reading and predicting fundamentally changes how an insurance carrier handles litigation. When a new claim arrives, the system reads the initial files and generates a baseline reserve delta against the current manual reserve. Claims managers see exactly which factors are pushing the financial exposure up or down. If a plaintiff attorney has a history of dragging out specific types of premises liability claims to maximize fees, the geometric model detects that pattern from the historical data and flags the escalation probability immediately. Carriers can then allocate defense spend proportionally to the actual risk rather than the perceived risk.
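The two calculations in this paragraph are simple arithmetic once a calibrated range exists. The sketch below is one possible reading, not a definition taken from any real system: it treats the reserve delta as the gap between the manual reserve and the nearest edge of the forecast range, and it weights defense spend by escalation probability times the upper end of the range. Both choices are assumptions.

```python
def reserve_delta(manual_reserve, forecast_low, forecast_high):
    """Signed gap between the manual reserve and the forecast range.
    Returns 0 when the reserve already sits inside the range."""
    if manual_reserve < forecast_low:
        return forecast_low - manual_reserve    # under-reserved, positive
    if manual_reserve > forecast_high:
        return forecast_high - manual_reserve   # over-reserved, negative
    return 0

def allocate_defense_budget(budget, claims):
    """Split a budget in proportion to escalation probability times the
    upper end of each forecast range. `claims` is a list of
    (claim_id, escalation_probability, forecast_high) tuples."""
    weights = {cid: p * high for cid, p, high in claims}
    total = sum(weights.values())
    if total == 0:
        return {cid: 0.0 for cid in weights}
    return {cid: budget * w / total for cid, w in weights.items()}

# Made-up example values.
delta = reserve_delta(50_000, 78_000, 122_000)
shares = allocate_defense_budget(
    10_000, [("A", 0.5, 200_000), ("B", 0.25, 400_000)]
)
```

In this example the manual reserve sits 28,000 below the forecast range, and the two claims receive equal shares because their probability-weighted exposures are the same.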

The current litigation environment demands this level of precision. Social inflation and third-party litigation funding exploit the asymmetry of information between plaintiffs and defense desks. Plaintiff firms use aggregated data to push for higher settlements across entire portfolios of cases. Defense desks traditionally rely on the individual gut instinct of the adjuster handling the file. A mathematically grounded forecast levels this imbalance. When you enter a negotiation armed with a calibrated settlement range and the historical comparables that justify it, you negotiate from hard data. You detect the cases likely to escalate into nuclear verdicts months before the plaintiff ever sends a formal demand letter.

Expecting a text generator to calculate a liability forecast is a category error that will cost you money. Language models are built to talk, but mathematical models are built to measure.
