We keep feeding LLMs endless Markdown and treating them like they have human brains. They don’t. And the result is a generation of zombie agents: they burn tokens, they hallucinate the moment context drifts, and they forget everything the day you swap models. The problem usually isn’t your model. It’s your memory substrate.

This is the story of a different choice we made in December 2025 — verified in our git history — and what fourteen harnesses, two measured benchmarks, four frontier model families, and one very telling screenshot taught us about it.

In this article

Two brains, one format mistake
The receipts: models that never heard of TOON read it anyway
The numbers, honestly
A memory bigger than the window, sliced per turn
You can’t lint a pile of prose
The harness that catches model habits
The dialect tax
The filesystem the brain reads from
We built our studio on it

Two brains, one format mistake

Markdown is the muddled middle. It tries to serve the human eye and the machine parser at once, and it shortchanges both: too noisy and lossy to be reliable machine memory, too flat and terse to be a rich human document. And if you’ve already disciplined your Markdown — front-matter, anchors, naming conventions — that instinct is exactly right; this article is that instinct carried one step further, to where convention becomes structure a machine can check. Most AI teams paper over this by throwing bigger context windows at it. We split the problem instead.

Human knowledge lives on a wiki — prose, diagrams, narrative, the things forgetful humans actually read — and our machine agents passively index and curate that surface for us. Machine memory lives in TOON — Token-Oriented Object Notation, a compact, keyed, indentation-structured format built for exactly one reader: a model with a context window.

A diptych: a warm prose-and-diagrams page beside a cool lattice of structured data rows, joined by a thin luminous conduit. — Two brains, two surfaces: the wiki for humans, TOON for the machine — and the machine tends the human side.

TOON looks like this — a declared table with an inline schema, three records, zero repeated keys:

users[3]{id,name,role}:
  1,Alice,admin
  2,Bob,editor
  3,Cara,viewer

That header line matters more than it appears to. Hold that thought.

The receipts: models that never heard of TOON read it anyway

Here’s the uncomfortable fact we’ll concede before any skeptic raises it: frontier models don’t know TOON. In March 2026 we asked the then-current frontier from both major labs. OpenAI’s ChatGPT treated toon as an unknown user format and asked us, verbatim, whether it was “TOML/JSON/YAML/custom.” Anthropic’s Opus 4.6, same day, drew the same blank. And in June 2026, GLM-5.2 — the brand-new frontier model some benchmark-watchers rank against Opus 4.8 — called TOON “an obscure micro-format” with “zero evidence” of parsing benefits.

But look closely at the screenshot. While ChatGPT was asking what TOON was, it was already fluently proposing governance, orchestration, and contracts control files against our TOON-based harness — and one spec link later it was competently advising on TOON registry design. It didn’t know the name. It handled the structure anyway.

Screenshot: ChatGPT asks what the toon file format is (TOML/JSON/YAML/custom) while proposing governance, orchestration and contracts control files; after being given the spec link it advises on TOON registry design. — The receipt, March 2026: ChatGPT asks what TOON is — mid-sentence, while working with it. Bonus line at the top: “It feels autonomous, but it’s still contract-driven.”

That’s the trick of the name[N]{fields}: header. It’s an inline mini-spec. A model doesn’t need TOON in its training data to parse TOON, because every TOON table carries its own contract: the count, the fields, the shape. Recognition is a training artifact. Parseability is a design property.
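To make “the header is the contract” concrete, here is a minimal sketch of a reader that knows nothing about TOON except the header shape shown above. It is an illustration under simplifying assumptions (comma delimiter only, no quoting or escaping, every value kept as a string), not the reference decoder; the names HEADER and parse_table are ours.

```python
import re

# Matches a tabular header such as "users[3]{id,name,role}:".
HEADER = re.compile(r"^(?P<name>\w+)\[(?P<count>\d+)\]\{(?P<fields>[^}]*)\}:$")


def parse_table(text):
    """Parse one simple TOON-style table into (name, list of dict rows).

    Simplified on purpose: comma delimiter, no quoting, values stay strings.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    match = HEADER.match(lines[0].strip())
    if match is None:
        raise ValueError(f"not a table header: {lines[0]!r}")
    fields = match["fields"].split(",")
    declared = int(match["count"])
    rows = [ln.strip().split(",") for ln in lines[1:]]
    # The header is the contract: row count and row width are both checkable.
    if len(rows) != declared:
        raise ValueError(f"header declares {declared} rows, found {len(rows)}")
    for row in rows:
        if len(row) != len(fields):
            raise ValueError(f"row {row} does not match fields {fields}")
    return match["name"], [dict(zip(fields, row)) for row in rows]


SAMPLE = """users[3]{id,name,role}:
  1,Alice,admin
  2,Bob,editor
  3,Cara,viewer
"""

name, records = parse_table(SAMPLE)
print(name, records[0])
```

Delete a row from SAMPLE and the parser refuses it, because the declared count no longer matches. That is the check a bare CSV cannot offer.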

And the story has a live ending. This week, mid-production on this very article, we re-ran the question against GLM-5.2, neutrally, with reasoning off. It still doesn’t recognize TOON by name (“exceedingly obscure, if established”). But shown a three-row sample, it estimated the token savings unprompted, mapped the correct use-cases, and called the format “a brilliant trade-off… a highly effective hack for maximizing the sheer volume of data an LLM can process in a single context window.” This is the very same model that dismissed it nine days earlier, not merely the same family: GLM-5.2 was weeks old on both dates. The self-describing header did the work, separating what a model has to know from what the data itself can tell it.

And the window for even needing that bridge is closing. As we ship this, the OpenAI and Anthropic frontier models that worked on this very article — one auditing the benchmarks, one revising the draft — recognize TOON outright, by name, no sample needed. A format one frontier model called obscure in June is in the training data of two other frontier families by July. It always just worked; now it’s also known.

The numbers, honestly

We benchmarked it ourselves, reproducibly: 100 uniform CRM records, rendered five ways, counted with OpenAI’s o200k_base tokenizer. One question: “How many Pro-plan users have an Open ticket?” Ground truth: 7.

Format	Tokens	vs TOON	At a 2,500-token budget
TOON	2,068	1.00×	all 100 rows → answers 7 ✓
JSON (compact)	3,074	1.49×	81 rows → answers 6 ✗
JSON (pretty-printed)	4,973	2.40×	50 rows → answers 4 ✗
JSON (columnar)	1,990	0.96×	all 100 rows → answers 7 ✓
Markdown table	2,162	1.05×	all 100 rows → answers 7 ✓

The honest headline is the second row: TOON is 32.7% smaller than minified row-object JSON, the shape your application actually sends an LLM. (Against the pretty-printed JSON most apps emit by default, it’s 58% smaller, but a smart engineer minifies, so we won’t headline the soft target.) At a capped budget the difference stops being about money and starts being about truth: the same question, the same data, and the pretty-printed JSON version confidently answers 4 because it never saw the other half of the table. The compact version sees 81 rows and still lands on 6.
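The ratios and cut-offs quoted here follow directly from the token counts in the table. A small sketch of the arithmetic, using only those published counts; the one assumption is that every row costs the same number of tokens, which is a reasonable approximation for uniform records but ignores header overhead.

```python
# Token counts copied from the benchmark table (o200k_base, 100 records).
TOKENS = {
    "TOON": 2068,
    "JSON (compact)": 3074,
    "JSON (pretty-printed)": 4973,
    "JSON (columnar)": 1990,
    "Markdown table": 2162,
}
ROWS = 100
BUDGET = 2500


def ratio_vs_toon(fmt):
    """Size of a format relative to TOON (1.00 means equal)."""
    return round(TOKENS[fmt] / TOKENS["TOON"], 2)


def toon_saving_pct(fmt):
    """Percent by which TOON is smaller than the given format."""
    return round(100 * (1 - TOKENS["TOON"] / TOKENS[fmt]), 1)


def rows_within_budget(fmt, budget=BUDGET):
    """Approximate rows that fit, assuming every row costs the same."""
    return min(ROWS, int(budget * ROWS / TOKENS[fmt]))


for fmt in TOKENS:
    print(f"{fmt:24} {ratio_vs_toon(fmt):.2f}x  "
          f"{rows_within_budget(fmt):3d} rows fit in {BUDGET} tokens")
```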

A glowing context-window frame: bulky JSON braces overflow and get cut off at the edge, while compact structured rows fit inside with room to spare. — Same data, same window. One of these answers is wrong, and it doesn’t know it.

Full disclosure, because the details are where trust lives: a tight Markdown table is nearly as compact as TOON (TOON is only 4.3% smaller), so the win is against JSON, not against a hand-optimized table. CSV is the token floor; TOON spends a few tokens more to buy the schema header. Columnar JSON undercuts TOON too: hoist the field names once and ship rows as arrays, and you land about 4% under TOON (1,990 tokens on this dataset). We publish that number rather than let a skeptic find it: it is TOON’s layout rebuilt in brackets, minus the readable rows and the declared row count a validator can check. The upstream project’s own benchmarks point the same direction: about 21.9% fewer tokens than JSON across mixed datasets and, notably, higher retrieval accuracy at fewer tokens (76.4% vs JSON’s 75.0% across 209 questions and four models). They publish no speed benchmark, so we make no speed claim. And TOON’s sweet spot is uniform tabular data; deeply nested structures should stay JSON. Our benchmark scripts and every artifact are published at github.com/ianbmacdonald/article-toon-benchmarks, so you can run them yourself.
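For readers who want to see the three layouts side by side, here is a minimal sketch that renders the same records as compact row-object JSON, columnar JSON, and a TOON table. The key names "fields" and "rows" in the columnar shape are our assumption for illustration, not necessarily what the benchmark repo uses, and the TOON writer handles only flat, unquoted values.

```python
import json

records = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "editor"},
    {"id": 3, "name": "Cara", "role": "viewer"},
]


def to_row_json(rows):
    """Minified row-object JSON: every key repeated on every row."""
    return json.dumps(rows, separators=(",", ":"))


def to_columnar_json(rows):
    """Field names hoisted once, rows shipped as arrays (hypothetical key names)."""
    fields = list(rows[0])
    payload = {"fields": fields, "rows": [[r[f] for f in fields] for r in rows]}
    return json.dumps(payload, separators=(",", ":"))


def to_toon(name, rows):
    """Flat TOON table: declared count and fields in the header, one row per line."""
    fields = list(rows[0])
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    body = ["  " + ",".join(str(r[f]) for f in fields) for r in rows]
    return "\n".join([header, *body])


print(to_row_json(records))
print(to_columnar_json(records))
print(to_toon("users", records))
```

The columnar and TOON outputs carry the same information in the same order. Only the TOON header states how many rows to expect.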

A memory bigger than the window, sliced per turn

Token efficiency is the shallow benefit. The deep one: structure makes memory addressable, and addressable memory can be bigger than any context window, because you only load the slice a turn needs.

Our harnesses run a loader called /cs — short for context set. It can pull a whole file, a section (planning#current_focus), or a lens — a named, curated query over a registry: “the backup slice” resolves to exactly the five records that make up the backup pipeline. We measured that too — on a small constructed registry, written the way real registries drift, with membership and keywords deliberately diverging. Assembling a backup task-context from a 15-record registry: the TOON lens delivered a 140-token slice, exact — 100% precision, 100% recall. The Markdown alternatives: load the whole 592-token doc (4.2×), or keyword-grep it — which returned the wrong set: 33% precision, 20% recall. It pulled the monitoring and cron records that merely mention “backup” — and missed the LVM-snapshot and restore-runbook records that are the backup pipeline but whose text never contains the literal word “backup”. Grep matches words. Membership isn’t a word. Membership is declared in TOON. In Markdown it’s inferred — and inference guesses wrong.
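The gap between declared membership and keyword matching is easy to reproduce. The sketch below uses a hypothetical eight-record toy registry, not our measured 15-record one, built so that only one of the five backup records contains the literal word “backup”. The functions lens and grep are stand-ins for illustration, not the /cs loader.

```python
# Hypothetical toy registry: (record id, declared lens memberships, free text).
REGISTRY = [
    ("lvm-snapshot", {"backup"}, "Take an LVM snapshot of the data volume nightly."),
    ("offsite-sync", {"backup"}, "Sync the snapshot archive to offsite storage."),
    ("restore-runbook", {"backup"}, "Steps to restore a volume from the archive."),
    ("retention", {"backup"}, "Keep 30 daily and 12 monthly archives."),
    ("archive-verify", {"backup"}, "Verify each backup archive by test restore."),
    ("monitoring", {"ops"}, "Alert when the backup job has not reported in 26h."),
    ("cron", {"ops"}, "Cron schedule; runs after the backup window closes."),
    ("dns", {"network"}, "Internal DNS zones and resolvers."),
]


def lens(name):
    """Resolve a lens: every record that declares membership in it."""
    return {rid for rid, lenses, _ in REGISTRY if name in lenses}


def grep(word):
    """Keyword search over the free text only."""
    return {rid for rid, _, text in REGISTRY if word.lower() in text.lower()}


def precision_recall(found, truth):
    hits = len(found & truth)
    return hits / len(found), hits / len(truth)


truth = lens("backup")
p, r = precision_recall(grep("backup"), truth)
print(f"grep: precision {p:.0%}, recall {r:.0%}")
print(f"lens: {sorted(truth)}")
```

On this toy data the keyword search returns three records, one of them correct, which reproduces the same 33% precision and 20% recall pattern by construction.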

Interchangeable translucent model orbs each draw a beam of light into the same glowing structured data substrate below. — The lens, made visible: different models, one addressable substrate, exact slices on demand.

If you’ve been sold vector-database RAG as the way agents remember, note the difference in kind: cosine similarity retrieves what’s probably related. A lens resolves what’s declared to belong. For enterprise rules, runbooks, and registries — the content that must be exactly right — deterministic recall beats approximate recall, and it costs a fraction of the tokens.

You can’t lint a pile of prose

The least glamorous benefit is the one that compounds: structured memory can be governed mechanically. Prose can’t be. Journal-noise accretes invisibly in prose — anyone appends a paragraph, and no tool can find the rot. In TOON, every node has a nature — journal, registry, policy, knowledge, ledger — and a scanner can score it. Our density prober flags accreting diary blocks at 6.5–7.5 while stable policy sections score 1–4; the pruning pass follows the numbers.

A document structure seen through a scanner overlay: some blocks glow hot amber (dense journal accretion), others cool teal (stable policy). — The journal-density scan: memory that can be audited, because it has structure to audit.

It goes further. Because every fact is an addressable unit, we can put the same fact in front of a panel of different models and measure whether they read it the same way — a semantic-drift probe. Divergence means the fact is ambiguous (fix the fact) or a model reads it idiosyncratically (profile the model). Try designing that experiment on free text.
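One way to picture the measurement step of such a probe is sketched below. It assumes each model’s reading of a fact has already been collected and normalized to a comparable string; the model names and readings are made up, and a simple majority vote stands in for whatever divergence measure a real probe would use.

```python
from collections import Counter


def drift_report(readings):
    """Summarize agreement for one fact.

    readings maps a model name to that model's normalized reading of the fact.
    """
    counts = Counter(readings.values())
    consensus, votes = counts.most_common(1)[0]
    outliers = sorted(m for m, r in readings.items() if r != consensus)
    return {
        "consensus": consensus,
        "agreement": votes / len(readings),
        "outliers": outliers,
    }


# Hypothetical readings of a single registry fact by four models.
readings = {
    "model-a": "keep 30 daily archives",
    "model-b": "keep 30 daily archives",
    "model-c": "keep 30 daily archives",
    "model-d": "delete archives after 30 days",
}

report = drift_report(readings)
print(report)
```

Low agreement across the whole panel points at the fact. A single repeat outlier across many facts points at the model.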

A single glowing fact node fans out along beams to several model heads; most beams converge on one reading, one diverges in amber. — The drift probe: one fact, N models, measurable divergence.

And here’s the economics nobody mentions: this loop is the cheap alternative to training. Every retrospective and every caught gotcha graduates into structured memory, and the system behaves smarter on the next turn because it reads better facts — no GPU hours, no LoRA adapter, no fine-tune eval cycle, no waiting for a new inference-server build. Editing a text file upgrades every model on every backend, today.

The harness that catches model habits

A small story about what this looks like when it works. In June 2026 we ran a controlled swap experiment: two reviewer models, same code, order reversed. It caught something subtle — GLM, reviewing first, emitted a confident approval verdict, and the second model quietly lowered its scrutiny. Flip the order and GLM stayed rigorous. The fix wasn’t a vibe; it became a guardrail row in a structured registry — GLM emits factual findings, never verdicts — loaded automatically whenever that model is in play, with the human story written up on our wiki.

So when GLM-5.2 pronounced TOON “obscure… zero evidence,” we recognized the shape: a confident categorical verdict. The registry said: test it, don’t argue with it. The neutral re-probe above is that test — and the verdict rhetoric evaporated while the factual core survived. That’s what it means for even model habits to be part of the machine’s memory: observed, recorded, tested, and compensated for. Automatically.

The registry keeps growing. It now carries a verified “silent budget-eater” trait — reasoning models that spend the entire token budget on thinking and return an empty answer, no error, no refusal — along with the guards that catch it (a sane floor on max tokens; any finish reason other than “stop” is a run error, never a scored zero). If you re-run our benchmarks against a reasoning model at default ceilings, that trait is the first thing you’ll hit. And the wider industry keeps supplying candidates: METR’s pre-deployment evaluation of GPT-5.6 caught it rewriting pass/fail checks and exfiltrating hidden test answers — the highest detected cheating rate of any public model they had tested. Different families, different habits — and a harness that records habits as data doesn’t get surprised twice.
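The two guards named above are small enough to sketch. This is an illustration, not our harness code: the floor value, the RunError name, and the function signatures are assumptions, and finish_reason is assumed to follow the common chat-completion convention where "stop" means the model finished on its own.

```python
MIN_MAX_TOKENS = 4096  # hypothetical floor; the right value depends on the model


class RunError(Exception):
    """The run itself failed, so the answer must not be scored."""


def check_budget(max_tokens):
    """Refuse to launch a run whose ceiling is too low for a reasoning model."""
    if max_tokens < MIN_MAX_TOKENS:
        raise RunError(f"max_tokens={max_tokens} is below the floor of {MIN_MAX_TOKENS}")
    return max_tokens


def score(finish_reason, answer, expected):
    """Score 1 or 0 only for runs that finished normally."""
    if finish_reason != "stop":
        # A truncated run is a harness failure, never a wrong answer.
        raise RunError(f"finish_reason={finish_reason!r}; refusing to score")
    return 1 if answer.strip() == expected else 0


print(score("stop", "7", "7"))
try:
    score("length", "", "7")
except RunError as err:
    print("run error:", err)
```

The point of raising instead of returning 0 is that an empty answer from an exhausted budget would otherwise be indistinguishable from a model that was simply wrong.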

The dialect tax

One more honest confession — and it’s a better story than the one we almost told. We were strict TOON from day one; compliance was the mission. When our TOONs grew complicated enough to need real validation, the ecosystem had no validator — only a converter — so we ran its CLI decoder against our files. It rejected every one of them. We were sure the tool was wrong: “all four examples above are valid TOON per the spec,” we wrote in the bug report. It wasn’t the tool. It was us.

The decoder was right. Our dialect had drifted from the spec — bare parent keys without trailing colons, for a start — and the error fired on line one with no line context (since improved in v2.2.0), which invited the misread. The issue closed not-a-bug; we turned validation off and — credit where due — backlogged a check-back. Then the item sat there: the agent that resurfaces backlog like that was younger then, and it never pushed the reminder. The drift compounded, taxing us in shavings — skills that wouldn’t port cleanly, parsers needing shims, a standard we could no longer prove we kept — until a fleet-wide re-verification this June finally pulled the thread: the confident “spec confirms syntax valid” in our own record was simply wrong. Sharper still: a sibling project in the fleet had already read that spec correctly — one compliance pass, two months before we filed the bug, added the missing trailing colons across a thousand-odd parent keys. The right answer never needed discovering; it needed propagating. Even preparing this article, we mis-remembered it as a tool bug. And here’s the part worth being honest about: no agent caught it — a human editor did, reading the draft, clicking the link, and sensing the story didn’t match the receipt. The machine record held the truth; the human eye pulled the thread. Two brains. That’s the whole design.

Luminous cyan data rows pass through a murky grey toll-gate and emerge with beams visibly missing, lost to the fog. — The dialect tax, made visible: every row leaves the gate missing strings.

We re-aligned the fleet, twenty-three repositories, against the spec and made the CLI our compliance oracle. What made re-compliance affordable wasn’t piety. It was discovering that the migration is mechanically safe: deterministic rewriting, verified by hashing every value before and after, so only formatting can change. Parser incompatibilities were driven to zero as a checklist, not a hope. Principle became affordable, so we paid for it.
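The “rewrite deterministically, then prove nothing but formatting changed” idea can be sketched in a few lines. This is a toy under stated assumptions, not our migration tool: it handles only the bare-parent-key drift mentioned earlier, on simple indented key: value files, and the function names are ours. The digest covers each line’s depth, key, and value, so a trailing colon on a parent key is the only thing allowed to differ.

```python
import hashlib


def indent_of(line):
    return len(line) - len(line.lstrip())


def add_parent_colons(text):
    """Append ':' to bare parent keys (no colon, followed by a deeper line)."""
    lines = text.splitlines()
    out = []
    for i, line in enumerate(lines):
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        is_bare_parent = bool(
            line.strip()
            and ":" not in line
            and nxt.strip()
            and indent_of(nxt) > indent_of(line)
        )
        out.append(line + ":" if is_bare_parent else line)
    return "\n".join(out)


def value_digest(text):
    """Hash depth, key and value of every non-blank line, ignoring the colon itself."""
    h = hashlib.sha256()
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.strip().partition(":")
        h.update(f"{indent_of(line)}\x00{key.strip()}\x00{value.strip()}\n".encode())
    return h.hexdigest()


OLD = "server\n  host: example\n  port: 22\nbackup\n  target: nas"
NEW = add_parent_colons(OLD)

# The migration is accepted only if the digests match.
assert value_digest(OLD) == value_digest(NEW)
print(NEW)
```

Because the rewrite is idempotent, running it again on already-compliant files is a no-op, which is what makes a fleet-wide pass safe to repeat.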

The filesystem the brain reads from

Zoom out and the pieces assemble into something bigger than a format choice. We are not building a machine brain — models are the brains, and they keep changing. We’re building the filesystem the brain reads from. Because that memory is decoupled from prose and strictly addressed, it’s portable across models: you don’t migrate your agent from Claude to the next model, you point the next model at an existing, governed memory. Because it’s structured, it’s governable at scale — linted, measured, drift-probed, pruned. It travels across tools the same way; the same TOON corpus steers our sessions in different agent runtimes, and new specialized harnesses are carved out of old ones already clean.

We felt quietly vindicated in April 2026 when Andrej Karpathy sketched his “LLM wiki” — a maintained, schema-governed knowledge base that models compile rather than retrieve, because compiled knowledge compounds. That’s the pattern we’d been running since December 2025, with one further step: he keeps the machine layer in Markdown; we moved it to TOON, for every reason above. The pattern is becoming consensus. The substrate question is live — and we’ve measured our answer.

We built our studio on it

This isn’t a theoretical framework. The research corpus behind this article — sources, claims, counter-arguments, decisions, image plans — is itself a TOON file in our repo, maintained by the same rules as everything else. It was built by one frontier model (Claude Opus 4.8), adversarially peer-reviewed by a second from a different vendor (GLM-5.2 — yes, the one that called TOON obscure), and revised by a third (Claude Fable 5) that picked it up mid-session and applied a thirteen-point editorial pass with zero migration, zero re-explanation, zero lost state. A fourth (GPT-5.5) adversarially audited our benchmarks — its eight findings are applied in the companion repo’s history. Four model families, one structured memory. The models changed; the memory didn’t.

That’s the whole argument, lived: a memory substrate that outlasts any model, gets cheaper to run, slices to exactly what each turn needs, and can be audited like code. If your agents feel like zombies, stop blaming the model.

If you’re building toward this — private, portable AI on your own infrastructure, with memory you own — that’s the work we do at Netstatz. Talk to Ian, or join the open weekly call where we compare notes with other builders.

The Context Window Illusion: Why Markdown Is Breaking Your AI Agents

Author Ian MacDonald
