Context Normalization

The process of converting heterogeneous source data into a uniform, agent-readable format.

Last updated: 2026-04-12

Overview

An agent is only as good as the context it can reason over. Raw data sources — filings, transcripts, PDFs, databases, news articles — each arrive in different formats, schemas, and quality levels. The normalization layer is the infrastructure that converts all of it into something the model can actually use.

At Fintool, all financial data flows into one of three canonical formats:

  • Markdown — for narrative content (SEC filings, earnings transcripts, news articles)
  • CSV / markdown tables — for structured numerical data (financials, segment metrics, comparisons)
  • JSON metadata — for searchability (ticker, date, document type, fiscal period)
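To make the third format concrete, here is a minimal sketch of what a per-document metadata record could look like. Only ticker, date, document type, and fiscal period are named above; the exact field names and nesting are illustrative assumptions, not Fintool's actual schema.

```python
import json

# Hypothetical meta.json record. Field names and structure are
# illustrative; only the four attributes (ticker, date, document type,
# fiscal period) come from the description above.
meta = {
    "ticker": "AAPL",
    "date": "2023-11-03",          # publication date, ISO 8601
    "document_type": "10-K",
    "fiscal_period": {
        "label": "FY2023",
        "start": "2022-09-25",     # absolute range, not a relative label
        "end": "2023-09-30",
    },
}

print(json.dumps(meta, indent=2))
```

Storing the fiscal period as an absolute start/end range (rather than a label like "Q1") is what makes period-scoped retrieval possible, as discussed under fiscal period normalization below.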

LLMs reason well over markdown tables but struggle with raw HTML <table> markup and unformatted CSV dumps. Normalization converts everything into the formats the model handles best.
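The HTML-to-markdown conversion can be sketched with the standard library alone. This is a deliberately minimal version for well-formed tables; the merged headers and footnote markers discussed below need far more machinery. The parser class and sample table are illustrative, not Fintool's implementation.

```python
from html.parser import HTMLParser

class TableToMarkdown(HTMLParser):
    """Flatten a simple <table> into rows of cell text.

    Minimal sketch: assumes one header row and no merged cells."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell, self.in_cell = [], [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell, self.cell = True, []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.row.append("".join(self.cell).strip())
            self.in_cell = False
        elif tag == "tr":
            self.rows.append(self.row)
            self.row = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell.append(data)

def table_to_markdown(html: str) -> str:
    parser = TableToMarkdown()
    parser.feed(html)
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

html = ("<table><tr><th>Segment</th><th>Revenue</th></tr>"
        "<tr><td>iPhone</td><td>$43.8B</td></tr></table>")
print(table_to_markdown(html))
```

The output is a pipe-delimited markdown table with a header separator row, which the model can read directly as columnar data.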

Key Points

  • Chunking strategy matters: different documents chunk differently
    • 10-K filings → by regulatory section (Item 1, 1A, 7, 8…)
    • Earnings transcripts → by speaker turn (CEO remarks, CFO, individual analyst Q&A)
    • Press releases → usually one chunk
    • News → paragraph-level
  • Metadata enables retrieval: every document gets a meta.json with ticker, date, document type, and fiscal period. Without it, retrieval degrades to keyword search over a haystack
  • Fiscal period normalization is critical: “Q1 2024” is ambiguous — Apple’s Q1 is Oct–Dec 2023, Microsoft’s is Jul–Sep 2023, calendar Q1 is Jan–Mar 2024. All period references must normalize to absolute date ranges
  • Table extraction is hard: financial tables have merged header cells, footnote markers, parentheses for negatives, mixed units. Fintool scores every extracted table; below 90% confidence, it’s flagged and excluded from agent context
  • SEC filings are adversarial: they are designed for legal compliance, not machine reading. Multi-page tables repeat their headers, footnotes nest inside footnotes, and XBRL tags are often wrong
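The fiscal period normalization above can be sketched as a lookup from ticker to fiscal-year-end month, then simple month arithmetic to an absolute date range. The fiscal-year-end table here is hardcoded for illustration; in practice that data would come from filings. The sketch also assumes whole calendar months, whereas Apple's 52/53-week fiscal calendar shifts real quarter boundaries by a few days.

```python
import calendar
from datetime import date

# Hypothetical lookup table; real fiscal-year-end months come from filings.
# "CAL" stands in for a company on a standard calendar fiscal year.
FISCAL_YEAR_END_MONTH = {"AAPL": 9, "MSFT": 6, "CAL": 12}

def fiscal_quarter_range(ticker: str, fiscal_year: int, quarter: int) -> tuple[date, date]:
    """Resolve 'Q{quarter} {fiscal_year}' to an absolute (start, end) date range."""
    fye = FISCAL_YEAR_END_MONTH[ticker]
    # Absolute month index (year*12 + month-1) of the fiscal year's first
    # month: the month after the previous fiscal year-end.
    start_idx = (fiscal_year - 1) * 12 + fye + (quarter - 1) * 3
    end_idx = start_idx + 2
    start = date(start_idx // 12, start_idx % 12 + 1, 1)
    end_y, end_m = end_idx // 12, end_idx % 12 + 1
    end = date(end_y, end_m, calendar.monthrange(end_y, end_m)[1])
    return start, end

print(fiscal_quarter_range("AAPL", 2024, 1))  # Oct 1 - Dec 31, 2023
print(fiscal_quarter_range("MSFT", 2024, 1))  # Jul 1 - Sep 30, 2023
print(fiscal_quarter_range("CAL", 2024, 1))   # Jan 1 - Mar 31, 2024
```

The three calls reproduce the "Q1 2024" ambiguity from the bullet above: the same label resolves to three different date ranges depending on the company's fiscal calendar, which is why only the absolute range belongs in the agent's context.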

Connections

  • agent-sandbox — clean context is injected into the agent’s sandbox environment
  • agent-evaluation — evaluation catches normalization failures before they reach users
  • data-warehouse — similar principle: raw data must be cleaned and structured before it’s useful for analysis

Sources