Calibration Monitoring in Production

A single point prediction for an insurance settlement is a lie. We monitor the boundaries of our predictions, so that a 30 percent probability of nuclear escalation means roughly 30 out of every 100 such cases will actually explode.

By Tal Knafo
CTO
4 July 2026

TL;DR — Machine learning in claims requires tracking calibration, not just accuracy. We monitor input distribution drift from our generative pipelines and measure predicted probabilities against actual resolutions over multi-year horizons using survival analysis on open claims.

A single point prediction for an insurance settlement is a lie. When a machine learning model tells a claims adjuster that a severe bodily injury case will settle for exactly $450,000, it projects false certainty onto a highly volatile process. Litigation outcomes depend on human variables, jurisdiction shifts, and third-party funding dynamics. We build Canotera to calculate a calibrated settlement range and a specific escalation probability. The engineering challenge is proving to a carrier that a 30 percent probability of nuclear escalation means roughly 30 out of every 100 such cases will actually explode.

Calibration is the difference between a model you test in a lab and a model you trust with a balance sheet. When a system outputs an 80 percent confidence interval for a settlement range, about 80 percent of the resolved claims should fall inside those bounds, within sampling error. If 95 percent land inside the range, the model is underconfident: its ranges are wider than they need to be, which makes them too loose for setting a precise initial reserve. If only 50 percent fall inside the range, the model is overconfident, exposing the carrier to massive reserve volatility. We monitor this coverage ratio continuously.
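The coverage check described above can be sketched in a few lines. This is a minimal illustration, not Canotera's implementation: the function names, the 5-point tolerance band, and the toy figures are all assumptions made for the example.

```python
import numpy as np

def interval_coverage(lower, upper, actual):
    """Fraction of resolved claims whose settlement fell inside the predicted range."""
    lower, upper, actual = (np.asarray(x, dtype=float) for x in (lower, upper, actual))
    inside = (actual >= lower) & (actual <= upper)
    return float(inside.mean())

def coverage_status(coverage, nominal=0.80, tolerance=0.05):
    """Compare observed coverage with the nominal level (tolerance is an assumed band)."""
    if coverage > nominal + tolerance:
        return "underconfident"  # ranges wider than they need to be
    if coverage < nominal - tolerance:
        return "overconfident"   # ranges too narrow
    return "calibrated"

# Toy data: five resolved claims with predicted 80 percent ranges (hypothetical values).
lower = [100_000, 200_000, 50_000, 300_000, 150_000]
upper = [250_000, 450_000, 120_000, 600_000, 320_000]
actual = [180_000, 470_000, 90_000, 410_000, 200_000]

cov = interval_coverage(lower, upper, actual)
print(cov, coverage_status(cov))
```

On a real portfolio the tolerance band would need to reflect sample size, since a handful of resolved claims cannot distinguish 80 percent coverage from 70 percent.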

The Extraction and Prediction Divide

Monitoring machine learning in the casualty insurance domain is fundamentally different from monitoring a recommendation engine. A retail model learns whether it was right within milliseconds, while a casualty claim can stay in litigation for three to five years. If you wait for ground truth to update your monitoring dashboards, your model will drift into irrelevance before you register a single error. We solve this by physically separating the machine reading pipeline from the mathematical prediction pipeline.

The first phase of our system uses generative AI to read the case file. These files contain thousands of pages of pleadings, medical records, and correspondence. The generative models do not predict the outcome. They extract facts and structure the unstructured text into a deterministic schema. Production monitoring starts here, at the ingestion layer. We monitor the distribution of the extracted features daily. If a specific jurisdiction suddenly shows a spike in letters of protection or traumatic brain injury diagnoses, our system flags a distribution shift. We track the embeddings of the incoming documents to detect when plaintiff firms change their arguing style. By isolating the reading phase, we catch structural data drift years before those cases reach a settlement.
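One common way to quantify a shift in an extracted feature is the Population Stability Index (PSI). The article does not say which drift statistic Canotera uses, so PSI is a stand-in here, and the function name, the counts, and the 0.25 alert threshold (a widely quoted rule of thumb) are assumptions for illustration.

```python
import numpy as np

def population_stability_index(baseline_counts, current_counts, eps=1e-6):
    """PSI between two categorical or pre-binned distributions given as counts."""
    b = np.asarray(baseline_counts, dtype=float)
    c = np.asarray(current_counts, dtype=float)
    # Convert counts to proportions; clip to avoid log(0) for empty categories.
    b = np.clip(b / b.sum(), eps, None)
    c = np.clip(c / c.sum(), eps, None)
    return float(np.sum((c - b) * np.log(c / b)))

# Hypothetical feature: share of files in one jurisdiction with a letter of protection.
baseline = [900, 100]  # [no letter, letter] in the baseline window
today = [700, 300]     # same categories in the current window

psi = population_stability_index(baseline, today)
if psi > 0.25:  # assumed alert threshold
    print(f"distribution shift flagged, PSI={psi:.3f}")
```

The same function can run per feature and per jurisdiction each day; continuous features would need to be binned on the baseline window first.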

The second phase is the prediction. We use separate geometric machine learning models, trained on millions of resolved cases, to produce the settlement ranges and reserve deltas. Because we separate extraction from prediction, we isolate the source of any calibration error. If a prediction looks anomalous, we trace the mathematical drivers directly back to the specific page and paragraph the generative model extracted. Traceability is not a compliance afterthought. It is a core requirement for debugging production systems handling sensitive records.

Censored Data and Ground Truth

To measure calibration on the mathematical models, we calculate Expected Calibration Error across the portfolio. We group predictions into bins based on their calculated escalation probabilities. For the 20 percent probability bin, we observe the actual escalation rate. The absolute difference between the predicted rate and the observed rate is the calibration error for that bin, and Expected Calibration Error is the average of those per-bin errors, weighted by the number of claims in each bin.

Calculating this on insurance claims requires adjusting for right-censored data. A significant portion of any carrier portfolio consists of open claims, and you cannot ignore them in your monitoring metrics. If you only evaluate calibration on claims that settle quickly, you bias your monitoring pipeline toward simple, low-severity cases. The complex cases driven by social inflation take the longest to resolve. We apply survival analysis to our monitoring pipelines to account for this. We track the aging of the open inventory against our initial predictions. If the model predicted a low escalation probability for a cohort, but those cases cross the two-year mark in litigation at twice the historical rate, the model is drifting. We detect this signal before the carrier writes a settlement check.
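A censoring-aware version of this calculation can be sketched by replacing each bin's raw observed rate with a Kaplan-Meier estimate of the probability that the event occurs by a fixed horizon. This is one possible approach under stated assumptions, not a description of Canotera's pipeline: it assumes censoring is non-informative, ignores competing risks, and uses hypothetical names and a 24-month horizon.

```python
import numpy as np

def km_event_prob(durations, events, horizon):
    """Kaplan-Meier estimate of P(event occurs by `horizon`).

    durations: months until the event, or until last observation if censored.
    events: True if the event was observed, False if the claim is censored.
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    survival = 1.0
    # Walk the distinct event times up to the horizon in increasing order.
    for t in np.unique(durations[events & (durations <= horizon)]):
        at_risk = np.sum(durations >= t)
        occurred = np.sum((durations == t) & events)
        survival *= 1.0 - occurred / at_risk
    return 1.0 - survival

def censored_ece(pred, durations, events, horizon, n_bins=5):
    """Weighted average of |mean predicted probability - KM observed rate| per bin."""
    pred = np.asarray(pred, dtype=float)
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    bins = np.minimum((pred * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        observed = km_event_prob(durations[mask], events[mask], horizon)
        ece += mask.mean() * abs(pred[mask].mean() - observed)
    return float(ece)

# Four claims: two escalated (months 5 and 12), one still open at month 8,
# one still open at month 30 without escalating.
print(km_event_prob([5, 8, 12, 30], [1, 0, 1, 0], horizon=24))  # prints 0.625
```

The claim censored at month 8 still counts in the risk set at month 5 and then drops out, which is how open inventory contributes information instead of being discarded.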

The Architecture Tradeoff

This multi-layered approach to monitoring forces us to make specific engineering tradeoffs. We sacrifice the theoretical simplicity of an end-to-end deep learning system. An end-to-end model that reads raw text and spits out a dollar amount is far harder to calibrate and to debug in a production enterprise environment, because an error cannot be attributed to either the reading step or the prediction step. Our split architecture increases latency during the initial ingestion phase. Processing thousands of pages through generative extraction and then mapping the structured output into a high-dimensional geometric space takes compute time. We accept that latency. A claims floor needs an accurate, traceable reserve delta on day one, not in fifty milliseconds.

This architecture also dictates how we design our API and handle onboarding. When a new carrier integrates with Canotera, we do not just map their historical claims data into our system. We pipe their raw, historical document payloads through our generative ingestion layer to establish a baseline distribution for their specific book of business. Our API endpoints are structured to return not just the settlement ranges and comparable cases, but the exact confidence intervals and feature weights driving them. This allows the carrier actuarial teams to ingest our calibration metrics directly into their own oversight dashboards.

We build for a reality where data drift in litigation is an intentional strategy by plaintiff attorneys to maximize payouts. They continuously probe for new ways to anchor damages. We build our pipelines to expect this adversarial shift by monitoring the raw text distributions, the extracted feature schemas, and the geometric distances between the new claims and the resolved training set.
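The last of these signals, distance between new claims and the resolved training set, can be illustrated with a nearest-neighbor check. The article does not specify the metric or the threshold, so Euclidean distance, the 99th-percentile cutoff, and the function names below are assumptions chosen for the sketch.

```python
import numpy as np

def nearest_distances(queries, reference, exclude_self=False):
    """Euclidean distance from each query vector to its nearest reference vector."""
    q = np.asarray(queries, dtype=float)
    r = np.asarray(reference, dtype=float)
    d = np.linalg.norm(q[:, None, :] - r[None, :, :], axis=-1)
    if exclude_self:
        # Only valid when queries and reference are the same set.
        np.fill_diagonal(d, np.inf)
    return d.min(axis=1)

def novelty_flags(new_vectors, train_vectors, quantile=0.99):
    """Flag new claims that sit farther from the training set than is typical within it."""
    baseline = nearest_distances(train_vectors, train_vectors, exclude_self=True)
    threshold = np.quantile(baseline, quantile)
    return nearest_distances(new_vectors, train_vectors) > threshold

# Synthetic stand-in for claim feature vectors (hypothetical data).
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 8))
new = np.vstack([train[0] + 0.01, np.full(8, 25.0)])
print(novelty_flags(new, train))
```

The pairwise distance matrix is fine at this scale; a training set of millions of cases would call for an approximate nearest-neighbor index instead.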

An uncalibrated model is a financial liability wrapped in a clean user interface. When carriers negotiate from data instead of gut instinct, that data must reflect reality across the entire probability curve. By rigorously monitoring calibration and tracing every driver back to the source documents, we ensure that the initial reserve is grounded in mathematics rather than optimism.

Accuracy tells you what happened yesterday; calibration tells you how to act tomorrow.
