Article

How content formats have always chased the reader

Read time: 7 min

Why it matters: AI cannot reliably use PDFs or plain HTML - the format gap is a content strategy problem.

Who it's for: Documentation leaders, content architects, and AI project teams evaluating content readiness.

TL;DR: Every major content format emerged to serve a new consumer. PDF gave humans a reliable screen document. HTML gave browsers and search engines something to parse. XML and JSON gave machines a common language for data exchange. AION - Author-it's structured JSON output, released in 2026 - gives large language models a semantically rich, version-anchored content object. This pattern has repeated every decade. Author-it was designed for each transition, and this one most deliberately.

That shift is the whole thesis behind the AI content foundation, and AION is the format that delivers it. See how AION works.

The pattern that nobody names

There is a version of technology history that focuses on invention - the breakthroughs, the patents, the founding myths. It is incomplete. Most technology transitions happen not because something new becomes possible, but because something old stops working. A new consumer arrives. The old format cannot serve them. A new format emerges.

Publishing formats follow this pattern without exception. Each major transition - from print to PDF, from PDF to HTML, from HTML to machine-readable XML and JSON - happened because a new consumer appeared that the existing format could not adequately serve. And each time, the organisations that grasped the shift early had a structural advantage over those that scrambled to retrofit.

We are at one of those transitions now. The consumer is the large language model. The format that serves it is not PDF, HTML, or generic JSON. It is structured, governed, semantically rich content - and most organisations don't have it.

Figure: Content format evolution timeline from PDF (1993) to HTML, XML/JSON, and AION (2026), each format emerging to serve a new type of consumer.

PDF: built for the human on a screen

Before PDF, documents were designed for printers. PostScript - the dominant page description language of the 1980s - existed to tell a printer exactly how a page should look: fonts, positions, line spacing. It worked perfectly for its intended consumer.

Then organisations started distributing documents digitally. They emailed files, uploaded them to early FTP servers, posted them to the first websites. PostScript was a poor fit for this. In practice, files were tied to the printer and fonts they had been prepared for, so pages rendered differently on different machines. What the author saw was not what the recipient got.

Adobe solved this in 1993 with the Portable Document Format. PDF was built for one specific consumer: the human reader, on a screen, needing a document that looked exactly as intended regardless of operating system, font installation, or printer. It was a massive success. PDF became the default format for formal documents - technical manuals, regulatory submissions, contracts, user guides.

But PDF had a flaw baked into its design. It optimised for visual fidelity, not machine readability. A PDF is a visual description of a page. Text is stored as positioned glyphs, not semantic content. Tables exist as layout objects, not structured data. Metadata is sparse and unreliable. For humans reading on screen, none of this mattered. For machines trying to extract meaning, it was a serious problem - one that only became visible when a new consumer arrived.
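To make the "positioned glyphs" point concrete, here is a minimal sketch in Python. It assumes an extractor has already returned text fragments with page coordinates; the fragment values and the `naive_reading_order` helper are invented for illustration and do not come from any particular PDF library.

```python
from collections import defaultdict

# Hypothetical fragments as an extractor might report them: (x, y, text),
# with y measured upward from the bottom of the page. The left column is
# a procedure step; the right column is a cell from a specification table.
fragments = [
    (300, 700, "Torque"),
    (72, 700, "Step 1: Remove"),
    (72, 686, "the cover."),
    (300, 686, "12 Nm"),
]


def naive_reading_order(frags):
    """Group fragments by baseline, then read each line left to right."""
    lines = defaultdict(list)
    for x, y, text in frags:
        lines[y].append((x, text))
    ordered = []
    for y in sorted(lines, reverse=True):  # top of the page first
        ordered.append(" ".join(text for _, text in sorted(lines[y])))
    return ordered


print(naive_reading_order(fragments))
# ['Step 1: Remove Torque', 'the cover. 12 Nm']
```

The positions are all correct, yet the step and the table cell are interleaved. Nothing in the fragments says which text belongs to the procedure and which to the table, so the consumer has to guess.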

Figure: PDF delivering fragmented layout noise to an LLM versus an AION object delivering typed, hierarchy-preserving, variable-resolved content with modification history.

HTML: built for the browser and the search engine

The web changed how content was consumed - not because HTML was a better reading experience than PDF (early web pages were often ugly by print standards), but because HTML introduced something PDF could not offer: a format that machines could parse for meaning.

HTML uses tags to mark what things are: headings, paragraphs, lists, links. Those tags are instructions to the browser, but they are also signals to other machines - including search engine crawlers. When Google arrived in the late 1990s, it could read HTML and infer what a page was about. A document with an H1 heading, structured paragraphs, and descriptive anchor text was far more findable than a PDF. Structure gave the search engine something to reason about.
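As a rough illustration of what tags give a crawler, the sketch below uses Python's standard `html.parser` module to pull headings and anchor text out of a small page. The `OutlineParser` class and the sample page are invented for this example; real crawlers are far more elaborate.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect heading text and link anchor text from an HTML page."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.anchors = []
        self._current = None  # tag we are currently capturing text for
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in {"h1", "h2", "h3", "a"}:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if tag == "a":
                self.anchors.append(text)
            else:
                self.headings.append((tag, text))
            self._current = None


# Invented sample page.
page = """
<h1>Replacing the filter</h1>
<p>Turn off the pump before you begin.</p>
<h2>Tools required</h2>
<p>See the <a href="/torque">torque specifications</a>.</p>
"""

parser = OutlineParser()
parser.feed(page)
print(parser.headings)  # [('h1', 'Replacing the filter'), ('h2', 'Tools required')]
print(parser.anchors)   # ['torque specifications']
```

A few lines of parsing recover the page's topic, its outline, and what its links claim to point at. The equivalent information in a PDF has to be inferred from font sizes and positions.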

The web added a second new consumer to the publishing equation: the search engine. And content format had to evolve to serve it. Meta tags, canonical URLs, structured HTML, alt text - all of these emerged to make content more readable to machines searching on behalf of humans.

PDF persisted for formal and regulatory documents. HTML became the default for anything that needed to be discovered. Both served their consumers well. Neither was built for what came next.

XML and JSON: built for machines talking to machines

By the 2000s, a new consumer had appeared: software systems that needed to exchange data with other software systems. APIs, enterprise integrations, CRM-to-ERP handoffs, localisation workflows, content management pipelines. The consumer was no longer a human or a search engine - it was a machine reading content in order to act on it.

XML emerged as the standard for structured data exchange. Verbose and rigid, but unambiguous: every element named, typed, and nested within a defined hierarchy. DITA applied XML to technical documentation specifically, creating a standard for tagging topics as tasks, concepts, and references.

JSON arrived later and won on developer preference. More compact than XML, easier to work with in modern languages, and perfectly suited to the web services and APIs that became the connective tissue of enterprise software. Where XML described a document, JSON described an object - a structured data entity with properties, values, and nested relationships.
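The document-versus-object distinction can be sketched with Python's standard library. The XML below is a simplified, invented topic (loosely task-like, not valid DITA), and the mapping to a JSON object is one possible choice rather than a standard conversion.

```python
import json
import xml.etree.ElementTree as ET

# Invented, simplified markup: a typed topic with a title and two steps.
xml_source = """
<topic id="t-101" type="task">
  <title>Replace the filter</title>
  <step>Turn off the pump.</step>
  <step>Remove the cover.</step>
</topic>
""".strip()

root = ET.fromstring(xml_source)

# The same information expressed as an object with properties and a list.
topic = {
    "id": root.get("id"),
    "type": root.get("type"),
    "title": root.findtext("title"),
    "steps": [step.text for step in root.findall("step")],
}

print(json.dumps(topic, indent=2))
```

Both forms carry the same facts: what the topic is, what it is called, and the order of its steps. The JSON version simply maps more directly onto the data structures most modern languages already use.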

The common thread across XML and JSON was identical to HTML: structure as a first-class concern. You were not just encoding content - you were encoding what that content was, how its parts related to each other, and what consuming systems needed to know about it. This is the tradition AION builds on. But AION goes further, because the consumer it is designed for is far more demanding.

AION: built for the large language model

Large language models consume content at a scale and fidelity that no previous machine consumer required. An LLM needs to understand not just what a piece of content says, but what it is - its type, its purpose, its scope of applicability, its version history, and whether it has been through a governance process that makes it trustworthy.

A PDF gives an LLM visual noise: reflow artefacts, table fragments, inconsistent encoding. HTML gives it layout semantics designed for browsers, not reasoning. Even standard JSON gives it content without provenance or governance state.

AION gives the model all of this:

Full content hierarchy - topics, subtopics, the structural relationships between them
Resolved variables - product names, version numbers, region-specific values, already substituted at publish time
Metadata - content type, applicability conditions, audience tags, domain classification
Version and governance state - which version this is, what its approval state is, when it was last reviewed
Authorship provenance - who created it, who reviewed it, what it traces back to in the source library

This is not a reformatting exercise. AION represents a deliberate architectural decision to treat the LLM as a first-class consumer - one with the same requirements for structure, governance, and provenance that regulated industries have always demanded from their content.
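The five categories listed above could be pictured as a single nested object. The Python sketch below is purely illustrative: every field name, value, date, and the `ContentObject` class itself are assumptions made for this example, and none of it is taken from the actual AION schema.

```python
import json
from dataclasses import asdict, dataclass, field
from string import Template


@dataclass
class ContentObject:
    """Hypothetical shape only. These field names are NOT the AION schema."""

    topic_id: str
    title: str
    body: str         # variables already resolved
    metadata: dict    # content type, audience, applicability
    governance: dict  # version, approval state, last review
    provenance: dict  # author, reviewer, source location
    children: list = field(default_factory=list)  # nested ContentObject items


# Invented publish-time variables.
variables = {"product": "ExamplePump", "version": "4.2"}


def resolve(text: str) -> str:
    """Substitute variables so the consumer never sees placeholders."""
    return Template(text).substitute(variables)


governance = {"version": "4.2", "approval_state": "approved", "last_reviewed": "2026-03-01"}
provenance = {"author": "A. Writer", "reviewer": "R. Checker", "source_path": "/Library/Pumps"}

tools = ContentObject(
    topic_id="t-102",
    title="Tools required",
    body=resolve("Use the $product service kit."),
    metadata={"type": "reference", "audience": "technician"},
    governance=governance,
    provenance=provenance,
)

procedure = ContentObject(
    topic_id="t-101",
    title=resolve("Replace the $product filter"),
    body=resolve("Applies to $product $version."),
    metadata={"type": "task", "audience": "technician"},
    governance=governance,
    provenance=provenance,
    children=[tools],
)

print(json.dumps(asdict(procedure), indent=2))
```

The point of the sketch is that hierarchy, resolved text, metadata, governance state, and provenance travel together in one object, so a consuming system does not have to reassemble them from separate sources.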

Figure: AION content object anatomy: five data categories exported to AI - content hierarchy, resolved variables, topic type, modification history, and folder path.

Why most organisations are already behind

The pattern is uncomfortable in retrospect. Every time a new consumer arrived, organisations scrambled to retrofit their content. When search engines appeared, companies had warehouses of PDFs and had to laboriously convert them or accept invisibility. When XML and JSON became the currency of enterprise integration, organisations built extraction layers on top of unstructured documents and hoped for the best.

Most organisations are doing the same thing with AI right now. They are feeding LLMs with PDFs, HTML exports, and poorly structured knowledge bases - then wondering why their AI systems hallucinate, produce outdated information, or fail to distinguish between an approved procedure and an archived draft.

The answer is not better prompt engineering. The answer is better content. Specifically: structured, governed, metadata-rich content that an LLM can trust.

The organisations that grasp this shift now will have the same structural advantage that early web publishers had over those still distributing PDFs in 1997.

This is not a new idea - it is a recurring one

Over 25 years, Author-it has published to print, PDF, HTML5, SCORM, Help portals, and now to AI via AION. The output format has changed with each era. The underlying architecture has not.

Author-it's content is structured from the moment it is created. Topics are typed. Components are reusable. Variables are resolved at publish time. Metadata is applied at authoring time, not retrofitted. Governance is built in - every content object has a version, an approval state, and a review history.

AION does not change what Author-it is. It exposes what Author-it has always been: a content system built for precision, governed by structure, and ready for whatever consumer arrives next. The format changes. The foundation does not.

That foundation is what AION exposes for AI. Explore AION, or take the 90-second Structured Content Challenge to see structured content in action.

Content Formats FAQ

Q: What is AION?

A: AION is Author-it's structured JSON publishing format, released in 2026.R1. It exports content as a hierarchical object that includes the full content structure, resolved variables, metadata, version history, and approval state - everything a large language model needs to understand what a piece of content is, where it came from, and whether it can be trusted. AION is included as standard across all Author-it environments at no additional cost.

Q: Why can't AI just read PDFs or HTML?

A: AI can read PDFs and HTML, but doing so forces the model to reconstruct meaning from presentation. PDFs encode visual layout - paragraphs reflow arbitrarily, tables fragment, metadata is absent or unreliable. HTML carries more structure but is designed for rendering in browsers, not machine reasoning. Neither format natively carries provenance, version state, or relationship information. AION carries all of these as first-class data.

Q: What does AI-ready content mean?

A: AI-ready content is content that a large language model can consume accurately and reliably without additional preprocessing. It is structured (broken into semantic components), governed (every piece has a known version, approval state, and owner), and metadata-rich (topics carry type, audience, applicability, and relationships). Content that lacks these properties forces AI to guess - which introduces hallucinations and inaccuracies.

Q: Is AION the same as JSON-LD or schema.org markup?

A: No. JSON-LD annotates existing web pages with semantic labels for search engines. AION is a publishing output format - it replaces or supplements HTML as the delivery format for machine consumption. AION carries the full content hierarchy, not a metadata overlay on top of an existing document.

Q: How does content format affect RAG performance?

A: Retrieval-augmented generation works by retrieving relevant content chunks and providing them to a language model as context. The quality of retrieval depends directly on how well content is chunked, labelled, and described. Unstructured content produces poor chunks with no metadata. Structured content like AION produces clean, labelled, version-anchored chunks that retrieval systems can rank and filter accurately. Teams using better-structured content report significantly fewer irrelevant retrievals and lower inference token costs.
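A minimal sketch of why labelled chunks help, assuming each chunk carries an approval state and version. The `Chunk` fields, the sample text, and the word-overlap score are all invented for illustration; a real retrieval system would use embeddings or a search index, and the field names are not the AION schema.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    approval_state: str  # hypothetical metadata field
    version: str         # hypothetical metadata field


def score(chunk: Chunk, query: str) -> int:
    """Toy relevance score: number of words shared with the query."""
    return len(set(chunk.text.lower().split()) & set(query.lower().split()))


def retrieve(chunks, query, k=1, approval_state=None):
    """Optionally filter on metadata first, then rank by the toy score."""
    candidates = [
        c for c in chunks
        if approval_state is None or c.approval_state == approval_state
    ]
    return sorted(candidates, key=lambda c: score(c, query), reverse=True)[:k]


chunks = [
    Chunk("Torque the filter cover bolts to 10 Nm", "archived", "3.0"),
    Chunk("Torque the cover bolts to 12 Nm", "approved", "4.2"),
]
query = "filter cover bolts torque"

print(retrieve(chunks, query)[0].text)
# Torque the filter cover bolts to 10 Nm
print(retrieve(chunks, query, approval_state="approved")[0].text)
# Torque the cover bolts to 12 Nm
```

Ranked on text alone, the archived chunk wins because it shares more words with the query. With the metadata filter applied first, only approved content can reach the model.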

Q: Does Author-it still publish PDFs and HTML alongside AION?

A: Yes. Author-it publishes to PDF, HTML5, Word, SCORM, Help portals via Magellan, and AION for AI consumption - all from the same single source. AION is an additive output, not a replacement. Most organisations continue publishing PDFs for regulatory submissions and HTML5 for web-facing documentation while also publishing AION for their AI systems.

Q: What industries benefit most from AION?

A: Any organisation where content accuracy matters and AI is being evaluated or deployed. The clearest early use cases are in manufacturing (product documentation, service manuals, compliance content), software (API docs, release notes, support knowledge bases), and utilities (operational procedures, safety documentation, regulatory filings). In each case, the challenge is the same: getting AI to work from authorised, versioned content rather than a haystack of mixed sources.

Published on: May 12, 2026

Author: Ben Harris, Marketing Lead
