The hardest part of building a useful AI assistant isn’t intelligence. It’s memory.
Every coding agent I have used starts each session blank. You tell it your project uses Bun. You tell it your commit style. You tell it the architecture. Next session, you tell it again. The assistant is stateless by default, and the user pays the cost every time.
The industry response to this problem has been context compaction: summarize older messages, truncate tool output, compress the conversation into fewer tokens so more of it fits in the window. That approach is fundamentally wrong.
Compaction solves the wrong problem
Context compaction treats the conversation as the thing worth preserving. It tries to keep as much of the transcript alive as possible by making it smaller. Truncation, summarization, sliding windows: these are all variations of the same idea. Shrink the conversation so more of it fits.
The problem is not size. The problem is judgment.
Deciding what matters requires understanding the content. Recency-based truncation drops old messages that may contain critical context. Token-based summarization compresses everything uniformly, treating a verbose tool output and a one-line architectural decision as equally important. Sliding windows lose whatever falls off the edge.
These are brute-force methods applied to a judgment problem. They will get it wrong, and there is no algorithmic tweak that fixes that.
A conversation is mostly noise: exploratory reads, failed attempts, debugging detours, redundant confirmations. The facts that actually matter are scattered unpredictably through that noise. Preserving the conversation is the wrong goal.
Acolyte takes a different approach. It extracts the facts and discards the conversation.
Distillation, not compaction
Compaction asks: “how do I fit this conversation into fewer tokens?” Distillation asks: “what knowledge from this conversation is worth keeping?”
The distinction matters. Acolyte runs a distill pipeline after every task. Two background agents process the recent conversation:
The observer reads the last 20 messages and extracts factual statements. These are concrete observations: “the user prefers Bun over Node”, “the project uses ast-grep for code editing”, “the test suite runs with bun run verify”. Each observation is tagged with a scope: session, project, or user.
The reflector runs when observations accumulate past a token threshold. It consolidates multiple observations into a single, denser summary, removing redundancy while preserving critical facts. Old observations are garbage-collected once absorbed.
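As a sketch, the two agents' contracts might look like this. Everything here is illustrative: the type names, the stub extraction, and the 512-token threshold are all invented for the example; in the real pipeline the observer is an LLM call, not a map over strings.

```typescript
// Illustrative types only; not Acolyte's actual API.
type Scope = "session" | "project" | "user";

interface Observation {
  fact: string;   // one concrete, atomic statement
  scope: Scope;   // assigned by the observer during extraction
  tokens: number; // rough size estimate, used by the reflector's threshold
}

// Observer: turn recent messages into tagged observations.
// A real implementation prompts a model; this stub only shows the shape.
function observe(messages: string[]): Observation[] {
  return messages.map((m) => ({
    fact: m,
    scope: "session" as Scope,
    tokens: Math.ceil(m.length / 4), // crude four-chars-per-token estimate
  }));
}

// Reflector: fire only once accumulated observations pass a token
// threshold, then consolidate them into one denser summary.
function shouldReflect(observations: Observation[], threshold = 512): boolean {
  const total = observations.reduce((sum, o) => sum + o.tokens, 0);
  return total > threshold;
}
```

The threshold is the important design choice: reflecting too early throws away detail the next task may need, so consolidation waits until the raw observations are genuinely expensive to carry.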
The conversation is disposable. The facts are durable.
Three tiers of scope
Not every fact belongs everywhere. Acolyte separates memory into three scopes:
- Session — what the assistant is currently working on. Task state, next steps, continuation context.
- Project — architectural decisions, testing conventions, file structure patterns. Keyed to the workspace.
- User — personal preferences, role, tools, workflow habits. Follows the user across projects.
The observer assigns scope during extraction.
Project facts like “this repo uses Conventional Commits” promote to project memory. User facts like “the user prefers terse output” promote to user memory. Session facts remain ephemeral.
Untagged facts are dropped. This is intentional. Strict promotion prevents scope creep and keeps each tier focused.
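In code, strict promotion amounts to a filter that routes tagged facts to their store and silently discards the rest. This is a sketch with hypothetical names; Acolyte's real stores are not shown.

```typescript
// Hypothetical shapes; the real persistence layer is not shown here.
type Scope = "session" | "project" | "user";

interface Fact {
  text: string;
  scope?: Scope; // absent for untagged facts
}

// Route each fact to its scope's store; untagged facts are dropped by design.
function promote(facts: Fact[]): Record<Scope, string[]> {
  const stores: Record<Scope, string[]> = { session: [], project: [], user: [] };
  for (const f of facts) {
    if (f.scope) stores[f.scope].push(f.text);
  }
  return stores;
}
```

So `promote([{ text: "this repo uses Conventional Commits", scope: "project" }, { text: "stray remark" }])` files one project fact and drops the untagged remark, which is exactly the scope-creep guard described above.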
Less upfront configuration
This changes what needs to go in rules files.
Rules files like AGENTS.md, CLAUDE.md, and .cursorrules exist because agents cannot learn. They front-load the context the agent will need later.
The concept is sound. The problem is that these files grow without bound. Every preference, convention, and constraint must be written down upfront or the agent will get it wrong. I have maintained these files myself and the overhead is real.
With scoped memory, the rules file stays small. It holds what is genuinely static: project identity, non-obvious constraints, entry points. Everything else is learned through use. You correct the agent once. The correction persists.
Semantic recall
Storing the right facts is only half the problem. The other half is retrieving the right ones at the right time. Recency is a poor proxy for relevance once memory grows.
Acolyte uses semantic recall. Each record is embedded at write time. At query time, the user’s message is embedded once, and all stored facts are ranked by cosine similarity.
The implementation is intentionally simple. Embeddings are stored as BLOBs in SQLite, and cosine similarity is computed in JavaScript. No vector database, no extensions, no new dependencies. Records without embeddings fall back to recency, and continuation entries always rank first.
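A minimal version of that recall path might look like the sketch below. The record shape and field names are assumptions, as is the float32 layout in the BLOB decode; the ordering rules, though, follow the description above: continuation first, then similarity, then recency for records without embeddings.

```typescript
// Illustrative record shape; field names are assumptions.
interface MemoryRecord {
  text: string;
  embedding?: number[];   // absent when embedding failed or was skipped
  updatedAt: number;      // timestamp, for the recency fallback
  continuation?: boolean; // continuation entries always rank first
}

// SQLite returns embeddings as raw BLOBs; decode assuming float32 layout.
function decodeEmbedding(blob: Uint8Array): number[] {
  return Array.from(new Float32Array(blob.buffer, blob.byteOffset, blob.byteLength / 4));
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na * nb) || 1); // guard against zero vectors
}

// Rank: continuation entries first, then cosine similarity to the query,
// then recency for records that have no embedding.
function rank(records: MemoryRecord[], query: number[]): MemoryRecord[] {
  return [...records].sort((x, y) => {
    if (!!x.continuation !== !!y.continuation) return x.continuation ? -1 : 1;
    const sx = x.embedding ? cosine(x.embedding, query) : -Infinity;
    const sy = y.embedding ? cosine(y.embedding, query) : -Infinity;
    if (sx !== sy) return sy - sx;
    return y.updatedAt - x.updatedAt;
  });
}
```

A linear scan with an in-process sort is entirely adequate at this scale; a vector database only earns its complexity once the fact store is orders of magnitude larger than anything distillation produces.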
When the user says “add a config field for the embedding model,” the system surfaces relevant project knowledge, not a stale debugging note from last week.
The full pipeline
When memory is loaded for a new task, five stages run:
- Ingest — load entries from all sources
- Normalize — trim, deduplicate, estimate tokens
- Select — rank by semantic similarity and pack into budget
- Inject — format into the system prompt
- Commit — distill new facts after the task
Each stage has explicit contracts. The selection strategy is injectable, not hardcoded.
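Under those contracts, the load-time stages compose roughly like this sketch. The type names are invented, token counts are crudely estimated at four characters per token, and the commit stage is omitted because it runs after the task, not at load time. The point of the sketch is the injectable strategy: selection is a plain function parameter, not a baked-in policy.

```typescript
// Invented types; the real stage contracts are richer than this.
interface Entry {
  text: string;
  tokens: number;
  score?: number; // semantic similarity, filled in by an earlier step
}

// The selection strategy is a plain function, so it can be swapped out.
type SelectionStrategy = (entries: Entry[], budget: number) => Entry[];

// Default strategy: greedily pack the highest-scoring entries that fit.
const greedyByScore: SelectionStrategy = (entries, budget) => {
  const sorted = [...entries].sort((a, b) => (b.score ?? 0) - (a.score ?? 0));
  const picked: Entry[] = [];
  let used = 0;
  for (const e of sorted) {
    if (used + e.tokens > budget) continue;
    picked.push(e);
    used += e.tokens;
  }
  return picked;
};

// Ingest -> normalize -> select -> inject, as one function.
function buildContext(
  raw: string[],
  budget: number,
  select: SelectionStrategy = greedyByScore,
): string {
  const entries = [...new Set(raw.map((t) => t.trim()).filter(Boolean))] // dedupe
    .map((text) => ({ text, tokens: Math.ceil(text.length / 4) }));      // estimate
  return select(entries, budget)
    .map((e) => `- ${e.text}`) // inject: format for the system prompt
    .join("\n");
}
```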
Tradeoffs
Distillation is not perfect. It depends on the model correctly identifying what is worth keeping. Important facts can be missed if extraction fails, and over-aggressive consolidation can lose nuance.
The system mitigates this with scoped memory to limit the blast radius of mistakes, reflection thresholds to avoid premature summarization, and graceful fallback to recency when embeddings are unavailable.
But this is still a judgment problem. I would rather have a system that is honest about that than one that pretends compaction solves it.
Why compaction keeps failing
Context compaction conflates two problems: context management (fitting information into the model window) and knowledge persistence (retaining facts across sessions). Compaction tries to solve both by shrinking the conversation. It slightly improves the first and does nothing for the second. The user still starts cold next session.
Distillation separates the problems. Knowledge is extracted into a persistent layer. Context management becomes simpler when the system retrieves only what is relevant to the current task.
Design principles
- Distillation over compaction: extract the facts, discard the conversation.
- Strict scoping: not everything belongs everywhere.
- Relevance over recency: what matters now is not what happened last.
- Graceful degradation: if embeddings fail, recency still works.
The system disappears. The knowledge stays.