Ingesting Thousands of Pages Per Claim Without Losing Signal

A claim file is a chaotic data swamp of pleadings, medical records, and emails. Extracting the structural reality of a case from this mess requires treating ingestion as an engineering discipline, not a generic text-parsing task.

By Tal Knafo
CTO
24 February 2026
5 min read

TL;DR — Feeding raw claim documents into a generic LLM destroys context. True signal extraction requires decoupling the generative reading phase from the mathematical prediction phase, ensuring every data point traces back to the exact page of the source file.

A standard litigated claim file arrives on a desk as a 4,000-page PDF containing twenty years of medical history, handwritten adjuster notes, and repetitive plaintiff pleadings designed to obscure the core facts. Human teams attempt to skim this mass of unstructured paper to set an initial reserve. In doing so, they inevitably miss the single line buried in a deposition transcript where a plaintiff mentions a prior neck surgery. The default industry engineering response is to dump the entire PDF into a generic language model prompt. That approach fails immediately. Context windows overflow. The model loses the thread of the chronological timeline and hallucinates critical dates. When building Canotera, we established a strict architectural boundary to solve this specific failure mode: generative AI reads the case file, but it never predicts the outcome.

Ingestion begins with a deterministic parsing layer. We avoid relying on raw optical character recognition. Standard OCR captures individual text characters but destroys document geometry, which is fatal for claims processing. A medical bill is inherently a table. A legal pleading is a numbered list. Our ingestion pipeline preserves this spatial layout before any language model touches the text. We convert the raw documents into structured data payloads that maintain the strict hierarchical relationship between a date of service, a specific medical procedure, and the billed amount. We route correspondence through a natural language processing pipeline while sending medical billing codes through a rigid tabular extractor. This requires distinct parsing strategies for different file formats and continuous monitoring of document drift.
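To make the idea of a structured payload concrete, here is a minimal sketch of a tabular extractor and a type-based router. It assumes the layout-preserving step has already produced table rows as lists of cell strings; the names `BillingLine`, `extract_billing_lines`, and `route`, the three-column row shape, and the document-type labels are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class BillingLine:
    """One row of a medical bill, kept together as a single typed record."""
    date_of_service: date
    procedure_code: str
    billed_amount: Decimal
    page: int


def extract_billing_lines(rows: list[list[str]], page: int) -> list[BillingLine]:
    """Turn layout-preserved table rows into typed records.

    Assumes each data row is [ISO date, procedure code, amount]. Rows that do
    not parse (headers, totals, noise) are skipped in this sketch.
    """
    lines = []
    for row in rows:
        if len(row) != 3:
            continue
        raw_date, code, raw_amount = (cell.strip() for cell in row)
        try:
            lines.append(BillingLine(
                date_of_service=date.fromisoformat(raw_date),
                procedure_code=code,
                billed_amount=Decimal(raw_amount.replace("$", "").replace(",", "")),
                page=page,
            ))
        except (ValueError, ArithmeticError):
            # Bad date (ValueError) or bad amount (decimal.InvalidOperation).
            continue
    return lines


def route(doc_type: str) -> str:
    """Pick a parsing strategy by document type (hypothetical labels)."""
    strategies = {"medical_bill": "tabular", "pleading": "numbered_list"}
    return strategies.get(doc_type, "nlp")
```

Keeping the date, the procedure code, and the amount in one record is the point: once they are separated into loose text, nothing downstream can reliably put them back together.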

Processing 4,000 pages of text requires managing state across severe token limits. We implement semantic chunking to handle the volume. We split the document into overlapping contextual windows, ensuring we do not arbitrarily sever a medical diagnosis from its corresponding treatment date. We ask the generative AI to find the date of the accident, the jurisdiction, the specific injuries claimed, and the presence of third-party litigation funding indicators. It outputs these facts into a rigid schema. We force the models to output valid JSON matching our specific internal data structures. If the output fails validation, the pipeline automatically retries the extraction with a higher temperature or a smaller document chunk to guarantee a clean payload.
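The chunk-validate-retry loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production pipeline: the field names in `REQUIRED_FIELDS`, the paragraph-based windowing, and the `call_model` callable (a stand-in for whatever model client is used) are all hypothetical, and the retry shown here only shrinks the chunk rather than changing sampling settings.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: field name -> expected JSON type.
REQUIRED_FIELDS = {
    "accident_date": str,
    "jurisdiction": str,
    "injuries": list,
    "litigation_funding_indicator": bool,
}


def overlapping_chunks(paragraphs: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split paragraphs into windows of `size` that share `overlap` paragraphs."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(paragraphs), step):
        chunks.append(paragraphs[start:start + size])
        if start + size >= len(paragraphs):
            break  # the last window already reaches the end of the document
    return chunks


def validate(raw: str) -> Optional[dict]:
    """Return the payload if it is a JSON object with the expected fields and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            return None
    return data


def extract_with_retry(
    chunk: list[str],
    call_model: Callable[[str], str],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call the model, validate the output, and retry on a smaller chunk on failure."""
    for _ in range(max_attempts):
        payload = validate(call_model("\n".join(chunk)))
        if payload is not None:
            return payload
        if len(chunk) > 1:
            # Simplification: retry on the first half only. A fuller version
            # would also queue the second half for its own extraction pass.
            chunk = chunk[: (len(chunk) + 1) // 2]
    return None
```

Returning `None` after the last attempt, rather than a partially valid payload, keeps malformed output from reaching anything downstream.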

Traceability as an Engineering Constraint

We constrain generative AI entirely to the task of extraction. It reads the parsed text to identify entities and structure the narrative. If the model encounters an ambiguous statement, it flags the ambiguity rather than guessing. Traceability dictates our entire database architecture. Claims professionals correctly reject black boxes. They need to defend their reserve decisions to actuaries and underwriters. When our system extracts a material fact, it stores the exact bounding box coordinates of the source text. The API delivers the structured data alongside a direct pointer to the original document: the page and the bounding box of the source text. If the platform identifies an escalation driver like a traumatic brain injury diagnosis, the user interface highlights the exact sentence on the specific page of the medical file. This keeps hallucinations from poisoning downstream models and allows humans to verify the machine's work instantly.
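A provenance-carrying fact record might look something like the sketch below. The class names, field names, and coordinate convention (`x0, y0` as one corner and `x1, y1` as the opposite corner, in page units) are assumptions made for illustration; the article does not specify the actual schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class BoundingBox:
    """Location of the source text: a 1-based page plus a rectangle on it."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

    def __post_init__(self) -> None:
        if self.page < 1 or self.x0 >= self.x1 or self.y0 >= self.y1:
            raise ValueError("bounding box needs a positive page and a non-empty area")


@dataclass(frozen=True)
class ExtractedFact:
    """A single extracted fact that cannot exist without a source location."""
    name: str
    value: Optional[str]      # None when the model could not resolve the fact
    source: BoundingBox       # required: there is no default
    ambiguous: bool = False   # set when the source statement is unclear


def to_payload(facts: list[ExtractedFact]) -> list[dict]:
    """Serialize facts, including their nested source pointers, for an API response."""
    return [asdict(fact) for fact in facts]
```

Making `source` a required field is a deliberate design choice in this sketch: a fact with no location on a page cannot be constructed at all, so it can never be passed along as if it were verified.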

Processing thousands of pages of highly sensitive personal health information requires absolute data isolation. We deploy our ingestion pipeline within ephemeral, single-tenant containers. Customer data never crosses tenant boundaries, and we strictly ensure no customer claims data is used to train the underlying foundation models. This architecture introduces a deliberate latency tradeoff. Spinning up isolated instances and processing dense medical files takes several minutes per claim. We accept this delay. Claims organizations operate on cycles of days and weeks, not milliseconds. They value strict security and high precision over sub-second response times. We built asynchronous API endpoints with webhook callbacks so carrier systems can submit a raw file, move on to other tasks, and receive a secure notification when the payload is ready.
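The submit-then-notify pattern can be sketched with the standard library alone. Everything named here is hypothetical: the in-memory `JOBS` store stands in for a real queue, the function names are invented, and signing the webhook body with HMAC-SHA256 is one common way to make a notification verifiable, assumed here rather than taken from the platform's actual design.

```python
import hashlib
import hmac
import json
import uuid

# In-memory stand-in for a job queue; a real service would persist this.
JOBS: dict[str, dict] = {}


def submit_claim_file(file_name: str, callback_url: str) -> dict:
    """Accept a file and return immediately; processing happens later."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {
        "status": "queued",
        "file_name": file_name,
        "callback_url": callback_url,
    }
    return {"job_id": job_id, "status": "queued"}


def build_webhook(job_id: str, secret: bytes) -> tuple[bytes, str]:
    """Mark the job ready and build the notification body plus its signature."""
    JOBS[job_id]["status"] = "ready"
    body = json.dumps({"job_id": job_id, "status": "ready"}, sort_keys=True).encode()
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_webhook(body: bytes, signature: str, secret: bytes) -> bool:
    """Receiver-side check that the body was signed with the shared secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The caller gets a job identifier back in milliseconds even though the extraction itself takes minutes, and the receiving system can reject any callback whose signature does not match.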

Mathematical Prediction Requires Clean Geometry

Once the generative extraction finishes, the resulting structured schema becomes the input for a completely separate forecasting system. This is where we apply mathematical and geometric machine-learning models. These models do not process English text. They process numbers and vectors. They are trained entirely on massive historical datasets of resolved cases with known financial outcomes. They map the extracted features of the current open claim against the multi-dimensional space of closed claims. We represent the active claim as a vector in a high-dimensional space where the axes represent variables like jurisdiction hostility, injury severity, and the track record of the plaintiff's counsel. The model calculates the mathematical distance between the active claim and historical resolved claims to find true comparables.
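As a rough illustration of distance-based comparables, the sketch below ranks resolved claims by Euclidean distance to the active claim. The choice of Euclidean distance is an assumption (the text says only "mathematical distance"), as are the three feature axes, their 0-to-1 scaling, and the function name `nearest_comparables`.

```python
import math


def nearest_comparables(
    active: list[float],
    resolved: dict[str, list[float]],
    k: int = 3,
) -> list[tuple[str, float]]:
    """Return the k resolved claims closest to the active claim, nearest first.

    Vectors are assumed to be pre-scaled so that no single axis dominates
    the distance; math.dist raises ValueError if dimensions differ.
    """
    scored = [(claim_id, math.dist(active, vector)) for claim_id, vector in resolved.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical axes: [jurisdiction hostility, injury severity, counsel track record]
active_claim = [0.8, 0.6, 0.4]
resolved_claims = {
    "A": [0.8, 0.6, 0.5],
    "B": [0.1, 0.2, 0.9],
    "C": [0.6, 0.6, 0.4],
}
comparables = nearest_comparables(active_claim, resolved_claims, k=2)
```

Here `comparables` ranks claim A first and claim C second, while claim B, which differs sharply on every axis, is excluded.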

This strict separation of concerns produces mathematically calibrated outputs. The forecasting models calculate a settlement range based on historical precedent rather than generating an arbitrary point guess. They calculate the explicit mathematical probability of a nuclear verdict or protracted litigation. Because they retrieve comparable resolved cases based on geometric similarity, the adjuster can see exactly how similar venues treated identical injury profiles. The API returns a reserve delta, comparing the mathematical forecast against the carrier's current booked reserve. Every dollar in that forecasted range is tied explicitly to the facts extracted during the ingestion phase, maintaining an unbroken chain of logic from the final prediction back to the raw PDF.
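To show how a range and a reserve delta relate to a set of comparables, here is a deliberately simple sketch. It is not the forecasting model: it assumes the range is the interquartile range of comparable outcomes, the nuclear-verdict probability is the share of comparables at or above an assumed $10 million cutoff, and the reserve delta is the median forecast minus the booked reserve. All names and the threshold are hypothetical.

```python
import statistics
from dataclasses import dataclass

NUCLEAR_THRESHOLD = 10_000_000  # assumed cutoff for a "nuclear" outcome, in dollars


@dataclass(frozen=True)
class Forecast:
    low: float
    median: float
    high: float
    nuclear_probability: float
    reserve_delta: float  # positive means the booked reserve looks too low


def forecast_from_comparables(outcomes: list[float], booked_reserve: float) -> Forecast:
    """Summarize comparable outcomes as a range and compare against the reserve."""
    if len(outcomes) < 2:
        raise ValueError("need at least two comparable outcomes")
    # Quartile cut points of the comparable outcomes: 25th, 50th, 75th percentile.
    low, median, high = statistics.quantiles(outcomes, n=4, method="inclusive")
    nuclear_share = sum(1 for o in outcomes if o >= NUCLEAR_THRESHOLD) / len(outcomes)
    return Forecast(low, median, high, nuclear_share, median - booked_reserve)


forecast = forecast_from_comparables(
    [100_000, 200_000, 300_000, 400_000, 12_000_000],
    booked_reserve=250_000,
)
```

With these five illustrative outcomes the range runs from 200,000 to 400,000 around a median of 300,000, one comparable in five crosses the assumed threshold, and the reserve delta is 50,000 above the booked figure.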

We engineered this architecture to address market realities. Social inflation and third-party litigation funding inject severe volatility into claims portfolios. Carriers historically reacted to this volatility by adjusting reserves upward late in the life cycle of a claim, often just weeks before a trial date. By ingesting the entire historical file on day one and mapping it against resolved case models, we surface escalation risks months early. Integrating this system requires a pragmatic approach to onboarding. Customers push claim files to our ingestion API, and we return the structured schema and the forecasted ranges without requiring a massive internal IT overhaul. The platform gives claims managers the concrete data they need to allocate defense spend aggressively on high-risk files and settle low-risk files quickly. You cannot price the risk until you parse the reality.
