The token cost of unstructured content at scale
TL;DR: Every token a large language model processes costs money, and that cost compounds at enterprise scale. Unstructured content - PDFs, wiki pages, flat HTML - inflates context windows with noise, degrades retrieval precision, and forces models to process far more tokens than necessary to answer a question. Metadata-rich, well-structured content reduces inference costs by shrinking context windows and improving retrieval signal. At enterprise query volumes, the difference is a meaningful budget line. This article frames the efficiency argument - and is honest about where real production data is still needed to quantify it precisely.
The cost nobody is tracking yet
Most enterprise AI cost conversations focus on the LLM itself - which model provider, which tier, which context window size. These are real decisions. But they are downstream of a more fundamental variable: how efficiently is the content being prepared and retrieved before it reaches the model?
Every token that enters a language model's context window costs money. Input tokens cost less than output tokens, but at the query volumes of a production enterprise application, they add up. A support ticket automation system might process about 3,550 tokens per ticket - around 3,150 input, 400 output - and handle thousands of tickets daily. The math on token waste is straightforward: if your retrieval layer returns four to eight documents when one or two would answer the question, you are paying for two to eight times as much retrieved context as necessary on every query.
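The arithmetic behind that multiple can be sketched in a few lines. This is a minimal illustration, assuming every retrieved document contributes roughly the same number of tokens; the function name `context_multiple` is hypothetical and introduced only for this example.

```python
def context_multiple(docs_returned: int, docs_needed: int) -> float:
    """Ratio of retrieved-context tokens paid for to tokens actually needed.

    Assumes, for illustration only, that every document is about the same
    size, so the token ratio equals the document ratio.
    """
    if docs_needed <= 0 or docs_returned < docs_needed:
        raise ValueError("expected docs_returned >= docs_needed > 0")
    return docs_returned / docs_needed


# Ranges from the text: four to eight documents returned, one or two needed.
low = context_multiple(docs_returned=4, docs_needed=2)
high = context_multiple(docs_returned=8, docs_needed=1)
print(f"Retrieved context is {low:.0f}x to {high:.0f}x the size it needs to be")
```

The multiple applies only to the retrieved portion of the prompt. The system prompt and the user's question are paid for either way, so the effect on the total bill is somewhat smaller than the raw ratio.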
Industry analysis from 2026 suggests that retrieval-augmented generation (RAG) pipelines routinely pass far more context than models actually use - and that tightening the retrieval layer can cut input tokens by more than half with no loss in answer quality. The primary variable that determines retrieval precision is metadata quality. Better metadata means better retrieval. Better retrieval means fewer tokens. Fewer tokens mean lower cost.
How unstructured content bloats the context window
When a RAG pipeline retrieves content to answer a query, it uses embedding similarity to find the most semantically relevant chunks from the knowledge base. With well-structured, metadata-rich content, this works reliably: the retrieval system finds the right chunks, filters by applicability, and returns a tight, relevant set.
With unstructured content - PDFs, web pages, wikis, pasted documents - the process breaks down in predictable ways:
- Poor chunking. Unstructured documents do not have clean semantic boundaries. Chunks often split mid-sentence, mid-table, or mid-procedure. The retrieval system compensates by returning more chunks, hoping the relevant information is somewhere in them.
- No applicability metadata. Without conditions specifying which product, version, or audience a piece of content applies to, the retrieval system cannot filter by context. It returns all potentially relevant content - much of which may not apply to the specific query.
- Version ambiguity. A knowledge base containing multiple generations of the same procedure, with nothing to mark which one is current, cannot be reliably filtered by version. The retrieval system may return outdated content alongside current content.
- Noise from layout artefacts. PDFs converted to text carry page headers, footers, table fragments, and layout artefacts that inflate token count without adding semantic value.
Each of these problems adds tokens. Tokens that the model must process. Tokens that cost money at inference time.
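To get a feel for the layout-noise problem, the sketch below strips lines that repeat across pages and compares rough token estimates before and after. It assumes the rule of thumb of about four characters per token rather than a real tokenizer. The helper names and the sample pages are hypothetical, and the exact-match heuristic is deliberately crude.

```python
import math
from collections import Counter


def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / 4)


def strip_repeated_lines(pages: list[str], min_repeats: int = 2) -> list[str]:
    """Drop lines that recur across pages, such as running headers and footers.

    A line is treated as layout residue if its exact text appears on at least
    `min_repeats` different pages. This can also remove content that is
    legitimately repeated, so review the results before relying on it.
    """
    page_lines = [[line.strip() for line in page.splitlines()] for page in pages]
    seen_on = Counter(line for lines in page_lines for line in set(lines) if line)
    return [
        "\n".join(line for line in lines if line and seen_on[line] < min_repeats)
        for lines in page_lines
    ]


# Hypothetical two-page extract with a running header and footer.
pages = [
    "ACME Pump Manual\nClose the inlet valve before servicing.\nConfidential - do not distribute",
    "ACME Pump Manual\nReplace the seal every 500 hours.\nConfidential - do not distribute",
]
before = sum(estimate_tokens(p) for p in pages)
after = sum(estimate_tokens(p) for p in strip_repeated_lines(pages))
print(f"Estimated tokens: {before} before, {after} after stripping repeated lines")
```

Cleaning of this kind treats a symptom. It does nothing for chunk boundaries, applicability, or version ambiguity, which is why the rest of this article focuses on content that is structured at the source.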
What structured content does differently
Structured content from a governed component content management system (CCMS) reduces token waste at multiple points in the RAG pipeline:
- Clean chunks. Component-based authoring produces content objects with semantic boundaries already defined - a warning is a warning, a procedure step is a procedure step. Chunks align with topic boundaries rather than arbitrary text windows.
- Applicability filtering. Author-it's conditional publishing logic encodes which content applies to which product, version, region, or audience. AION, Author-it's structured JSON publishing format, exports these conditions as structured metadata fields that the retrieval system can use to filter before the model sees anything.
- Governance state filtering. Draft, archived, and superseded content can be excluded from retrieval entirely based on governance state metadata. The model only sees approved, current content.
- No layout noise. AION output contains content, not presentation. No page numbers, no running headers, no orphaned table cells. What the model receives is the semantic content, nothing else.
The cumulative effect is a retrieval layer that returns fewer, better chunks - smaller context windows, lower token counts, and faster response times, without sacrificing answer quality. For high-volume enterprise applications, these improvements translate directly to operating cost.
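The filter-then-rank pattern described above can be sketched as follows. The field names (`products`, `versions`, `state`) and the `select_context` helper are assumptions made for illustration, not the actual AION schema or any Author-it API, and the similarity scores are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    products: set[str]  # hypothetical applicability field
    versions: set[str]  # hypothetical applicability field
    state: str          # hypothetical governance field, e.g. "approved" or "draft"
    score: float        # similarity to the query, assumed computed elsewhere


def select_context(chunks, product, version, top_k=2):
    """Filter on metadata first, then keep the top_k most similar chunks."""
    eligible = [
        c for c in chunks
        if c.state == "approved" and product in c.products and version in c.versions
    ]
    return sorted(eligible, key=lambda c: c.score, reverse=True)[:top_k]


# Hypothetical knowledge base: four chunks that all look similar to the query.
chunks = [
    Chunk("Reset procedure for X200, v3.", {"X200"}, {"3.0"}, "approved", 0.91),
    Chunk("Reset procedure for X200, v2.", {"X200"}, {"2.0"}, "superseded", 0.93),
    Chunk("Reset procedure for X100.", {"X100"}, {"3.0"}, "approved", 0.90),
    Chunk("Draft note on X200 reset.", {"X200"}, {"3.0"}, "draft", 0.88),
]
for chunk in select_context(chunks, product="X200", version="3.0"):
    print(chunk.text)
```

In this example similarity alone would rank the superseded v2 procedure first. Filtering on metadata before ranking leaves a single approved, applicable chunk, so the model receives one chunk instead of four.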
The sustainability dimension
Token efficiency is not only a cost argument. Data centre energy consumption by AI inference is a growing consideration for enterprise sustainability reporting. Token waste at scale translates directly to energy waste: larger context windows require more compute, more time, and more energy per query.
For organisations with Scope 3 emissions targets or supply chain sustainability commitments, the energy footprint of AI inference workloads will increasingly be a reportable line. Structuring content to minimise token waste is a genuine lever on that footprint. It is not the primary argument for structured content - but it is a real one, and it resonates with CIO and CFO audiences looking beyond unit economics to total cost of ownership.
What this article cannot yet prove
This argument is directionally sound and grounded in well-established principles of retrieval-augmented generation. But to be transparent: the specific cost savings achievable by moving from unstructured to structured AION content in a production enterprise RAG deployment are not yet quantified with real customer data.
The variables that determine actual cost savings are: query volume (the number of LLM queries made against the knowledge base daily), average context window size before and after structuring, retrieval precision improvement from metadata filtering, and the specific model and pricing tier in use.
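As a starting point for your own modelling, these variables can be combined in a small calculation. This is a sketch under stated assumptions: the function names are hypothetical, every input value below is a placeholder rather than measured data, and the effect of retrieval precision is represented only through the difference between `tokens_before` and `tokens_after`.

```python
def daily_input_cost(queries_per_day: int, context_tokens: int,
                     price_per_million_input: float) -> float:
    """Daily spend on input tokens, in whatever currency the price is quoted in."""
    return queries_per_day * context_tokens * price_per_million_input / 1_000_000


def annual_saving(queries_per_day: int, tokens_before: int, tokens_after: int,
                  price_per_million_input: float, days_per_year: int = 365) -> float:
    """Annual difference in input-token spend before and after structuring."""
    before = daily_input_cost(queries_per_day, tokens_before, price_per_million_input)
    after = daily_input_cost(queries_per_day, tokens_after, price_per_million_input)
    return (before - after) * days_per_year


# Placeholder inputs only; substitute figures from your own deployment.
saving = annual_saving(
    queries_per_day=10_000,
    tokens_before=6_000,
    tokens_after=2_500,
    price_per_million_input=3.00,
)
print(f"Modelled annual saving on input tokens: {saving:,.2f}")
```

The model covers input tokens only. Output tokens, caching discounts, and latency effects are left out, so treat the result as a first-order estimate to be replaced with measured values.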
When Author-it customers share production data from AION-powered deployments, we will update this article with real numbers. Until then, the efficiency argument is presented as a framework for your own cost modelling - not as a guaranteed reduction. If you are running a production RAG deployment and willing to share token efficiency data, we would love to hear from you.
Starting the cost conversation
For CIOs and CTOs building the business case for content infrastructure investment, the token efficiency argument reframes the question. The question is no longer only "how much does the CCMS cost?" but also "what is the ongoing operating cost of running AI on our current content, compared to running it on governed structured content?"
That calculation requires production data to be precise. But the directional answer is clear: unstructured content imposes a persistent tax on every AI query your organisation runs. Structured, metadata-rich content reduces that tax. At enterprise scale, the difference is worth modelling before the next AI infrastructure contract is signed.
Token Cost FAQ
Q: What is a token in the context of large language models?
A: A token is roughly four characters of text (or three-quarters of a word in English). Language models process input and generate output in tokens, and API usage is typically priced per million tokens. A typical enterprise AI query with retrieved context might consume between 1,000 and 10,000 tokens depending on how much context is retrieved. At high query volumes, token efficiency becomes a significant operating cost variable.
Q: What is a RAG pipeline and why does it affect token costs?
A: RAG stands for retrieval-augmented generation. It is the pattern used by most enterprise AI systems to ground a language model in specific knowledge base content: the system retrieves relevant content chunks and passes them to the model as context alongside the query. The number and size of those chunks is the primary driver of input token consumption. Retrieval systems that return more chunks than necessary - because the content is poorly structured or lacks filtering metadata - inflate token costs without improving answer quality.
Q: How much can token costs be reduced with better-structured content?
A: Industry analysis suggests that tightening the RAG retrieval layer - returning fewer, better-targeted chunks - can cut input tokens by more than half with no loss in answer quality. The primary lever is metadata quality: content with applicability conditions, governance state, and topic classification can be filtered precisely before reaching the model. The exact savings depend on current content quality, query volume, and the specific model tier in use. Real production data from AION deployments will be published as it becomes available.
Q: What is AION and how does it reduce token waste?
A: AION is Author-it's structured JSON publishing format, released in 2026.R1. It exports content as a hierarchical object with metadata fields for content type, applicability conditions, governance state, and structural relationships. In a RAG pipeline, these metadata fields enable the retrieval system to filter content before passing it to the model - returning only approved, applicable, version-current content. This produces smaller, more relevant context windows and lower per-query token costs.
Q: Does structured content also improve AI answer quality, or just reduce cost?
A: Both. Structured content reduces token waste by enabling tighter retrieval, which also improves answer quality by reducing the amount of irrelevant content in the model's context window. Fewer irrelevant chunks mean less noise for the model to reason through, fewer false retrieval hits, and lower risk of the model being confused by conflicting or inapplicable content. The cost argument and the quality argument point in the same direction.
Q: Is there a sustainability case for structured content in AI?
A: Yes, though it is secondary to the cost and quality cases. LLM inference consumes data centre compute, which consumes energy. Token waste translates directly to additional compute cycles and energy consumption per query. For organisations with Scope 3 emissions targets or supply chain sustainability commitments, the energy footprint of AI inference workloads is an emerging consideration. Reducing token waste through better-structured content is a modest but real lever on that footprint.


