milebits
8 min read · memory · context · agents · architecture

The context window is not your memory

Million-token context windows did not remove the need for memory architecture. They hid the bill for a while. A context window is what the model can see right now. Memory is what it can get back later.

Written by the milebits founders.

The million-token context window is no longer exotic. Claude and Gemini both run million-token windows at standard pricing, and the frontier keeps climbing. That abundance produced a specific kind of architectural shortcut. If the model can read a million tokens, the reasoning goes, you do not need a memory system. You just put everything in the prompt. The whole conversation history, the whole knowledge base, the whole user record, all of it, every turn.

This works in the demo. It works for the first month. Then the bill arrives, the latency creeps up, the model starts missing things that are clearly in the context, and the team discovers that the context window was never a substitute for memory. It was a way to defer building memory until the deferral got expensive.

A context window is what the model can see right now. Memory is what it can get back later. Confusing the two is an architecture decision you make by accident, and it is worth making on purpose instead.

What stuffing the context actually costs

The everything-in-the-prompt pattern has four costs, and they arrive in roughly this order.

Money. You pay for input tokens on every call. If every turn of a conversation re-sends the entire history plus the entire retrieved knowledge base, your token spend per turn grows with the conversation, and the cumulative spend over a long session grows roughly with the square of the number of turns. We covered the shape of this in the agent cost note: long contexts are one of the quiet multipliers on the bill.
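A rough way to see the shape is to count input tokens over a session. The sketch below is arithmetic only, under simplifying assumptions of ours: every turn adds the same number of tokens, the "bounded" variant keeps the last few turns verbatim plus a fixed-size summary, and the cost of writing that summary is ignored. The function names and the numbers are illustrative, not measurements.

```python
def stuffed_input_tokens(turns: int, tokens_per_turn: int, fixed_context: int = 0) -> int:
    """Input tokens billed over a session when every call re-sends
    the fixed context plus the entire history so far."""
    total = 0
    for turn in range(1, turns + 1):
        # On turn N the prompt carries N turns of history.
        total += fixed_context + turn * tokens_per_turn
    return total


def bounded_input_tokens(turns: int, tokens_per_turn: int,
                         window_turns: int, summary_tokens: int) -> int:
    """Input tokens billed when only the last `window_turns` turns are sent
    verbatim and anything older is replaced by a fixed-size summary."""
    total = 0
    for turn in range(1, turns + 1):
        verbatim = min(turn, window_turns) * tokens_per_turn
        # The summary only exists once something has fallen out of the window.
        summary = summary_tokens if turn > window_turns else 0
        total += verbatim + summary
    return total


print(stuffed_input_tokens(50, 500))                                   # 637500
print(bounded_input_tokens(50, 500, window_turns=6, summary_tokens=300))  # 155700
```

With these made-up numbers the stuffed session sends about four times the input tokens of the bounded one, and the gap keeps widening as the session gets longer.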

Latency. Bigger inputs take longer to process. A prompt that has swollen to several hundred thousand tokens adds real time to first token, on every turn, and in anything interactive the user feels it. The context you are carrying for safety is the latency you are paying for forever.

Degradation. Models do not attend uniformly across a very long context. The "lost in the middle" finding, that information in the middle of a large input is recalled less reliably than information at the start or end, has held up across every generation of larger windows. The newer benchmarks are blunter about it. NoLiMa strips the keyword overlap that lets models cheat at retrieval, and under it most long-context models fall below half their short-context accuracy by 32K tokens. Even GPT-4o, one of the strongest performers in that evaluation, drops from 99.3% accuracy at short context to 69.7% at 32K. Chroma's Context Rot study put eighteen current models through the same kind of pressure and found the same shape: bigger context does not mean uniform recall, and accuracy decays as the input grows even on easy tasks. This is counterintuitive in production. The fact is right there in the context, the model still gets it wrong, and the team wastes a day assuming retrieval was broken when the real problem is that the relevant fact was buried in the middle of two hundred thousand tokens of other material.

No persistence. The context window evaporates when the session ends. If your only memory is the context, then your system forgets everything the moment the user closes the tab. Anything you want to know about this user next week has to live somewhere that is not the context window. That somewhere is the memory system you were trying not to build.

Working memory and long-term memory are different things

The distinction teams skip is the one between working memory and long-term memory, and a lot of memory bugs trace back to collapsing the two.

Working memory is what the model needs for the task in front of it right now. The current conversation, the documents relevant to the current question, the immediate state. It belongs in the context window. It is small, current, and task-scoped.

Long-term memory is what the system should be able to recall later. Facts about the user, decisions made in past sessions, the history of the relationship. It does not belong in the context window by default. It belongs in storage, and the right pieces of it get pulled into working memory when they are relevant to the current task.

The mistake is treating the context window as both. When long-term memory lives only in the context, it is expensive, it does not persist, and it crowds out the working memory that the current task actually needs. When you separate them, the context stays lean and the long-term store does the remembering.

What memory architecture actually looks like

This is the same conclusion behind Anthropic's writing on context engineering, where they describe a model's context as a finite "attention budget" that every added token depletes, and argue for finding the smallest set of high-signal tokens rather than the largest set you can fit. A small ecosystem of memory tooling has grown up around exactly this problem (Letta, Zep, Mem0, and the memory features now built into the major model platforms), but the tools are implementations of an architecture you still have to design. Memory is not one thing. It is a few mechanisms, each doing a job.

Retrieval is recall. The long-term store holds the facts, and at the moment they are relevant, you retrieve the few that matter into the context. This is the same machinery as retrieval-augmented generation, pointed at the user's own history instead of a document corpus. The user mentioned last week they are on the enterprise plan; that fact lives in storage and gets retrieved when billing comes up, not carried in every prompt forever.
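As a minimal sketch of that recall step: the store below is a plain list and the scoring is naive word overlap, both assumptions chosen to keep the example self-contained. A real system would more likely use embeddings or a search index. `Fact`, `tokenize`, and `retrieve` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    user_id: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {word.strip(".,;:!?").lower() for word in text.split()}


def retrieve(store: list[Fact], user_id: str, query: str, k: int = 2) -> list[str]:
    """Return up to k facts for this user that share words with the query,
    best match first. Facts belonging to other users are never considered."""
    query_words = tokenize(query)
    scored = []
    for fact in store:
        if fact.user_id != user_id:
            continue
        overlap = len(query_words & tokenize(fact.text))
        if overlap > 0:
            scored.append((overlap, fact.text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


store = [
    Fact("u1", "The user is on the enterprise plan."),
    Fact("u1", "The user prefers email over phone calls."),
    Fact("u2", "The user is on the free plan."),
]

hits = retrieve(store, "u1", "Which plan is this billing question about?")
print(hits)  # ['The user is on the enterprise plan.']
```

Only the one matching fact enters the prompt. The other user's plan never does, and neither does the unrelated preference.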

Summarisation is compression. A long conversation does not need to be carried verbatim. After a stretch of turns, the system writes a compact summary of what happened and carries the summary forward instead of the raw transcript. The model keeps the thread without paying to re-read the whole thing every turn. Done well, the user cannot tell. Done badly, the summary drops the one detail that mattered, which is why what to summarise and what to keep verbatim is a design decision, not a default.
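One way to sketch rolling compaction: keep the last few turns verbatim and fold everything older into a running summary. The `summarize` callable stands in for whatever produces the summary (in practice probably a model call); the placeholder used here is ours and only records how many turns were folded in.

```python
from typing import Callable

Summarizer = Callable[[str, list[str]], str]


def compact(summary: str, turns: list[str], keep_last: int,
            summarize: Summarizer) -> tuple[str, list[str]]:
    """Fold all but the last `keep_last` turns into the summary.
    Returns the new summary and the turns still carried verbatim."""
    keep_last = max(keep_last, 0)
    if len(turns) <= keep_last:
        return summary, turns
    split = len(turns) - keep_last
    old, recent = turns[:split], turns[split:]
    return summarize(summary, old), recent


def placeholder_summarize(previous_summary: str, old_turns: list[str]) -> str:
    """Stand-in summariser: a real one would condense the content itself."""
    note = f"[{len(old_turns)} earlier turns condensed]"
    return f"{previous_summary} {note}".strip()


summary, recent = compact("", ["t1", "t2", "t3", "t4", "t5"],
                          keep_last=2, summarize=placeholder_summarize)
print(summary)  # [3 earlier turns condensed]
print(recent)   # ['t4', 't5']
```

The design decision from the paragraph above lives in two places: how large `keep_last` is, and what the summariser is told it must never drop.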

Structured state is truth. Some things should not live in prose at all. The user's plan tier, their account status, the current step in a workflow: these belong in a database with a schema, queried like any other application state. The model reads them as facts, not as remembered text. Anything you need to be exactly right, every time, should be structured state, because prose memory is approximate and structured state is not.
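A small sketch of what that can look like, using Python's built-in `sqlite3` with an in-memory database. The table name, columns, and allowed plan values are assumptions for illustration, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE account_state (
        user_id        TEXT PRIMARY KEY,
        plan_tier      TEXT NOT NULL
                       CHECK (plan_tier IN ('free', 'pro', 'enterprise')),
        account_status TEXT NOT NULL,
        workflow_step  TEXT
    )
    """
)
conn.execute(
    "INSERT INTO account_state VALUES (?, ?, ?, ?)",
    ("u1", "enterprise", "active", "billing_review"),
)


def state_block(db: sqlite3.Connection, user_id: str) -> str:
    """Render the user's exact state as a short block for the prompt."""
    row = db.execute(
        "SELECT plan_tier, account_status, workflow_step "
        "FROM account_state WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    if row is None:
        return "No account record found."
    plan, status, step = row
    return f"plan_tier: {plan}\naccount_status: {status}\nworkflow_step: {step}"


print(state_block(conn, "u1"))
```

The schema rejects a plan tier that does not exist, and the value the model sees is whatever the row says right now, not whatever an old transcript happened to mention.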

The architecture is choosing, for each kind of information, which of these three it belongs in. That choice is the memory system. The context window is just the small, current working set that these three feed.
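To make "the working set that these three feed" concrete, here is a hedged sketch of assembling one turn's prompt from the three sources. The section labels and the `build_context` function are illustrative choices of ours, not a prescribed prompt format.

```python
def build_context(state: dict[str, str], summary: str, facts: list[str],
                  recent_turns: list[str], question: str) -> str:
    """Assemble the working set for a single turn.
    Empty sources are skipped so the prompt carries nothing it does not need."""
    sections = []
    if state:
        lines = [f"{key}: {value}" for key, value in state.items()]
        sections.append("Account state (authoritative):\n" + "\n".join(lines))
    if summary:
        sections.append("Conversation so far (summary):\n" + summary)
    if facts:
        bullets = "\n".join(f"- {fact}" for fact in facts)
        sections.append("Relevant remembered facts:\n" + bullets)
    if recent_turns:
        sections.append("Recent turns:\n" + "\n".join(recent_turns))
    sections.append("Current question:\n" + question)
    return "\n\n".join(sections)


prompt = build_context(
    state={"plan_tier": "enterprise"},
    summary="User asked about invoices earlier.",
    facts=["Prefers email over phone calls."],
    recent_turns=["user: Can I add seats?"],
    question="How much would ten more seats cost?",
)
print(prompt)
```

Everything in the prompt is there because one of the three mechanisms selected it for this turn, not because it was lying around from an earlier one.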

Designing what to forget

The part teams never plan for is forgetting. A memory system that only ever accumulates becomes slow, expensive, and full of stale facts that actively mislead. The user changed plans; the old plan is still in memory; the agent retrieves the wrong one.

Forgetting is a feature you design. Facts get timestamps and authority levels so the current one wins. Summaries replace the raw history they were made from. Stale state is expired or overwritten, not kept forever out of a vague sense that more memory is better. The same discipline that makes retrieval work, where a deprecated document must not surface as a current answer, applies to a user's own history. What the system remembers has to be governed, or the memory becomes a liability instead of an asset.
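A sketch of "the current one wins", assuming each fact carries a timestamp, an integer authority level, and an optional time-to-live. The field names and the tie-breaking rule (authority first, then recency) are our assumptions; a different system might reasonably rank recency first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MemoryFact:
    key: str
    value: str
    recorded_at: datetime
    authority: int                      # higher wins (assumed scale)
    ttl: Optional[timedelta] = None     # None means the fact never expires


def current_facts(facts: list[MemoryFact], now: datetime) -> dict[str, MemoryFact]:
    """Drop expired facts, then keep one winner per key:
    highest authority first, most recent as the tie-breaker."""
    winners: dict[str, MemoryFact] = {}
    for fact in facts:
        if fact.ttl is not None and fact.recorded_at + fact.ttl <= now:
            continue  # expired: never surfaces again
        held = winners.get(fact.key)
        if held is None or (fact.authority, fact.recorded_at) > (held.authority, held.recorded_at):
            winners[fact.key] = fact
    return winners


facts = [
    MemoryFact("plan_tier", "pro", datetime(2024, 1, 10), authority=2),
    MemoryFact("plan_tier", "enterprise", datetime(2024, 6, 1), authority=2),
    MemoryFact("trial_offer", "20% off", datetime(2024, 1, 1), authority=1,
               ttl=timedelta(days=30)),
]

live = current_facts(facts, now=datetime(2024, 7, 1))
print(live["plan_tier"].value)   # enterprise
print("trial_offer" in live)     # False
```

The old plan is still in the raw list, but it can no longer be retrieved as the answer, and the lapsed offer is gone entirely.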

Memory is a governance surface, not just a performance one

Once a system remembers, memory becomes something an enterprise buyer has to be able to reason about, and the questions are not about latency. If anything that can write to long-term memory is also untrusted, memory can be poisoned: an attacker plants a false fact that the agent later retrieves and acts on as if the user had confirmed it. This is the agent attack surface pointed at the memory store, and it is easy to miss because the write looks like normal use. Memory has to be scoped per tenant and per user with the same isolation you put on any datastore, because a memory that leaks across tenants is a data breach that happens to be phrased helpfully. Users have to be able to see and delete what the system holds about them, which regulation increasingly demands and which a single prose blob makes nearly impossible. And stale facts have to expire, because a memory that confidently recalls last year's plan tier is worse than one that recalls nothing.

This is the practical reason to keep working memory and the three mechanisms separate. Each one is governed differently:

| What | Where it lives | How it's governed |
|---|---|---|
| Working memory | The context window, this turn | Evicted when the task ends. Nothing to retain or leak. |
| Long-term facts | Retrievable store (vectors or rows) | Scoped per tenant and user, timestamped, deletable on request |
| Summaries | Compacted records that replace raw history | Regenerated from source, versioned, supersede what they compress |
| Structured state | A database with a schema | Access-controlled, audited, the single source of truth |

A prose blob that holds all four is the version that cannot answer "what do you know about me, and delete it." The separated version can.
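The scoping, provenance, and deletion requirements above can be sketched as a small store keyed by tenant and user. The class, the source labels, and the policy of rejecting untrusted writes outright (rather than, say, quarantining them for review) are all assumptions for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical allow-list of writers whose facts may enter long-term memory.
TRUSTED_SOURCES = {"billing_system", "user_confirmed"}


@dataclass
class ScopedMemory:
    # (tenant_id, user_id) -> list of (text, source)
    _rows: dict = field(default_factory=dict)

    def write(self, tenant_id: str, user_id: str, text: str, source: str) -> bool:
        """Store a fact only if it comes from a trusted source."""
        if source not in TRUSTED_SOURCES:
            return False
        self._rows.setdefault((tenant_id, user_id), []).append((text, source))
        return True

    def read(self, tenant_id: str, user_id: str) -> list[str]:
        """Facts for exactly this tenant and user, never anyone else's."""
        return [text for text, _ in self._rows.get((tenant_id, user_id), [])]

    def export(self, tenant_id: str, user_id: str) -> list[tuple[str, str]]:
        """Everything held about the user, with where each fact came from."""
        return list(self._rows.get((tenant_id, user_id), []))

    def delete_user(self, tenant_id: str, user_id: str) -> int:
        """Remove all facts for the user and report how many were removed."""
        return len(self._rows.pop((tenant_id, user_id), []))


memory = ScopedMemory()
memory.write("acme", "u1", "On the enterprise plan", "billing_system")
memory.write("acme", "u1", "User is an admin", "web_page_content")  # rejected

print(memory.read("acme", "u1"))    # ['On the enterprise plan']
print(memory.read("globex", "u1"))  # []
```

Because every row has an owner and a source, "what do you know about me" is `export` and "delete it" is `delete_user`, which is the property the single prose blob cannot offer.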

The large context window is a genuine gift. It makes working memory roomier, it makes retrieval more forgiving, it lets you carry more of the current task without splitting it across calls. It is a bad long-term memory and a worse database, and using it as either is a decision that feels free in the demo and bills you in production.

Build the memory system. Let the context window do the job it is good at, which is holding the current task. Let storage do the remembering. The systems that still feel coherent in a long conversation, and still know who the user is next week, are the ones that drew that line on purpose.
