When MemPalace showed up with 20k stars and claimed a 34% retrieval improvement through topic-based structural filtering, I realized I had nothing to compare against. Acolyte’s memory system stores observations as individual facts, embeds them at write time, and retrieves them by cosine similarity. Pure cosine over text-embedding-3-small. No keyword matching, no re-ranking. I had no measurement of how much that simplicity costs in retrieval quality.
So I built a benchmark.
Ground truth
Two public datasets provide conversations with labeled retrieval questions. LongMemEval embeds 500 questions in multi-session conversation histories with distractor sessions. LoCoMo provides 10 long conversations with QA pairs pointing to specific dialogue turns as evidence.
Both test the same question: given a pool of stored memories and a natural language query, does the system surface the right ones?
The pipeline
The harness exercises Acolyte’s real production code path. No mocks, no shortcuts.
For each scenario, it creates an isolated SQLite store, writes every conversation turn as a memory record, embeds each one through the actual embedding API, then runs every query through searchMemories. The retrieved results are compared against ground truth using standard information retrieval metrics.
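A minimal sketch of what one scenario run could look like, assuming nothing about Acolyte's actual code. The names `build_store`, `search`, `run_scenario`, and `toy_embed` are hypothetical: `search` stands in for searchMemories, an in-memory SQLite database stands in for the isolated store, and the toy embedder replaces the real embedding API call so the example runs offline.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(turns, embed):
    """Create an isolated in-memory store and write one embedded record per turn."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE memories (id TEXT PRIMARY KEY, content TEXT, embedding TEXT)")
    for memory_id, content in turns:
        vector = json.dumps(list(embed(content)))
        db.execute("INSERT INTO memories VALUES (?, ?, ?)", (memory_id, content, vector))
    db.commit()
    return db


def search(db, query, embed, k):
    """Hypothetical stand-in for searchMemories: rank every record by cosine similarity."""
    query_vector = embed(query)
    rows = db.execute("SELECT id, embedding FROM memories").fetchall()
    scored = [(cosine(query_vector, json.loads(vec)), memory_id) for memory_id, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [memory_id for _, memory_id in scored[:k]]


def run_scenario(turns, queries, embed, k=5):
    """Return the top-k ranked ids for each query, ready to compare against ground truth."""
    db = build_store(turns, embed)
    try:
        return {question: search(db, question, embed, k) for question, _relevant in queries}
    finally:
        db.close()


# Toy embedder (assumption): word counts over a tiny fixed vocabulary.
VOCAB = ["caroline", "support", "group", "melanie", "pottery", "class"]


def toy_embed(text):
    words = text.lower().replace("?", "").replace(".", "").split()
    return [float(words.count(term)) for term in VOCAB]


turns = [
    ("t1", "Caroline went to the support group."),
    ("t2", "Melanie signed up for a pottery class."),
]
queries = [("When did Caroline go to the support group?", {"t1"})]
results = run_scenario(turns, queries, toy_embed, k=1)
```

Each scenario gets its own store, so nothing leaks between conversations, and the ranked ids returned per query are what the metrics are computed from.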
Recall@K measures the fraction of relevant items that appear in the top K results. NDCG@K measures whether the relevant items are ranked near the top, discounting each hit by its position. Both matter. High recall with poor ranking means the system finds the right memories but buries them. Low recall means it misses them entirely.
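The two metrics can be sketched in a few lines. This version assumes binary relevance (a memory is either evidence for the question or it is not) and unique ids in the ranked list. The function names are illustrative, and the harness may define edge cases differently.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for memory_id in ranked_ids[:k] if memory_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: a hit at 0-based position p contributes 1 / log2(p + 2)."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, memory_id in enumerate(ranked_ids[:k])
        if memory_id in relevant_ids
    )
    # Ideal ranking puts every relevant item (up to k of them) at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


ranked = ["m7", "m2", "m9", "m4"]
relevant = {"m2", "m4"}
print(recall_at_k(ranked, relevant, 3))  # 0.5: one of the two relevant memories is in the top 3
print(round(ndcg_at_k(ranked, relevant, 3), 3))  # 0.387: the single hit sits at rank 2, not rank 1
```

The example shows why both numbers are reported: widening K to 4 brings recall to 1.0, but NDCG stays well below 1.0 because the hits are not at the top.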
Two modes
The benchmark runs in two modes. The first stores raw conversation turns as memories. This is the pessimistic test: noisy, unprocessed dialogue. The second stores pre-extracted observations, factual statements distilled from those same conversations. This mirrors what Acolyte’s distiller actually produces.
The comparison matters because it tests the core architecture. If distilled facts retrieve better than raw turns, the distillation step is earning its keep.
Baseline numbers
LoCoMo across all 10 conversations with text-embedding-3-small. No hybrid scoring, no re-ranking. The important comparison is not the exact numbers but the gap between raw turns and distilled observations.
Raw conversation turns (1977 queries, 5882 memories):
R@3: 0.506 NDCG@3: 0.442
R@5: 0.599 NDCG@5: 0.480
R@10: 0.694 NDCG@10: 0.514
Distilled observations (1650 queries, 2541 memories):
R@3: 0.590 NDCG@3: 0.555
R@5: 0.650 NDCG@5: 0.580
R@10: 0.730 NDCG@10: 0.609
What the numbers reveal
Distillation works. Observations beat raw turns at every K, with the biggest gain in ranking quality: NDCG rises by roughly 10 points (0.480 to 0.580 at K=5). Extracted facts embed more precisely and match queries more accurately, and the pool to search is less than half the size (2541 memories versus 5882).
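The gaps can be read straight off the two baseline tables. The snippet below only restates that arithmetic, in percentage points. The two runs use different query counts (1977 versus 1650), so these are differences between aggregate scores, not per-query comparisons.

```python
# Baseline scores copied from the two tables above.
raw = {"R@3": 0.506, "R@5": 0.599, "R@10": 0.694,
       "NDCG@3": 0.442, "NDCG@5": 0.480, "NDCG@10": 0.514}
distilled = {"R@3": 0.590, "R@5": 0.650, "R@10": 0.730,
             "NDCG@3": 0.555, "NDCG@5": 0.580, "NDCG@10": 0.609}

# Gain from distillation in percentage points, per metric.
gains = {metric: round((distilled[metric] - raw[metric]) * 100, 1) for metric in raw}

# Size of the distilled pool relative to the raw pool.
pool_ratio = round(2541 / 5882, 3)

for metric, points in gains.items():
    print(f"{metric}: +{points} points")
print(f"pool ratio: {pool_ratio}")
```

Recall gains shrink as K grows (8.4, 5.1, 3.6 points) while NDCG gains stay near 10 points (11.3, 10.0, 9.5), which is why the improvement shows up mostly as better ranking.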
The baseline numbers also show where pure cosine similarity falls short. At K=5, it misses 40% of relevant memories on raw turns and 35% on distilled observations. The gap between R@3 and R@10 (0.506 versus 0.694 on raw turns) tells us the right memories are often present but buried below the top positions. Ranking is a bigger problem than coverage.
Queries that reference specific names, dates, or events are the hardest. Embeddings capture semantic similarity well but struggle with exact entity matching. A query like “When did Caroline go to the LGBTQ support group?” depends on the embedding space placing that specific entity close to the relevant observation. Sometimes it does. Sometimes it does not.
What the benchmark found
I ran seven experiments. Two shipped, five were dropped.
Hybrid scoring blends cosine similarity with keyword token overlap. The token component catches exact matches that embeddings miss — proper nouns, tool names, file paths. NDCG@5 improved by 2.2%.
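One way to picture the blend is a weighted sum of the two signals. This is a sketch under assumptions, not the shipped formula: the tokenizer, the overlap definition (share of query tokens found in the memory), and the 0.3 weight are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercased word tokens as a set. A deliberately simple, assumed tokenizer."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def token_overlap(query, memory):
    """Share of the query's tokens that also appear in the memory text."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize(memory)) / len(query_tokens)


def hybrid_score(cosine_sim, query, memory, token_weight=0.3):
    """Blend embedding similarity with keyword overlap. token_weight is a made-up value."""
    return (1 - token_weight) * cosine_sim + token_weight * token_overlap(query, memory)


query = "When did Caroline go to the support group?"
with_entity = "Caroline attended the support group in May."
without_entity = "Melanie joined a community meetup."

# Suppose both memories came back with the same cosine similarity of 0.6.
score_a = hybrid_score(0.6, query, with_entity)
score_b = hybrid_score(0.6, query, without_entity)
```

With identical cosine scores, the memory that actually contains "Caroline" and "support group" now ranks first. That tie-breaking on exact tokens is the effect the keyword component is there to provide.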
TF-IDF token weighting makes the keyword component smarter by scoring rare tokens higher than common ones. Another 2.0% NDCG@5 gain on top of hybrid scoring.
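The idea can be sketched by weighting each query token with its inverse document frequency before measuring overlap. The smoothed formula below, log((1 + N) / (1 + df)) + 1, is one common choice and an assumption here, as are the function names. The shipped weighting may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased word tokens as a set. A deliberately simple, assumed tokenizer."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def idf_weights(corpus):
    """Smoothed IDF per token: rare tokens get larger weights than common ones."""
    doc_count = len(corpus)
    doc_freq = Counter()
    for document in corpus:
        doc_freq.update(tokenize(document))  # set of tokens, so each document counts once
    return {
        token: math.log((1 + doc_count) / (1 + freq)) + 1.0
        for token, freq in doc_freq.items()
    }


def weighted_overlap(query, memory, idf, default_idf):
    """IDF-weighted share of query tokens found in the memory text."""
    query_tokens = tokenize(query)
    total = sum(idf.get(token, default_idf) for token in query_tokens)
    if total == 0:
        return 0.0
    memory_tokens = tokenize(memory)
    matched = sum(idf.get(token, default_idf) for token in query_tokens if token in memory_tokens)
    return matched / total


corpus = [
    "Caroline attended the support group.",
    "Melanie went to the pottery class.",
    "The group met at the library.",
]
idf = idf_weights(corpus)
unseen_idf = math.log(1 + len(corpus)) + 1.0  # weight for tokens absent from the corpus

# Each memory matches exactly one of the two query tokens.
rare_match = weighted_overlap("the Caroline", "Caroline smiled", idf, unseen_idf)
common_match = weighted_overlap("the Caroline", "the end", idf, unseen_idf)
```

Under plain token overlap both memories would score 0.5. With IDF weights, matching "Caroline" (in one document) counts for more than matching "the" (in all three), so the entity match wins.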
The rest did not survive. Re-ranking over-promoted keyword matches. Temporal boosting had too many false positives. Contradiction detection could not distinguish stale facts from related-but-valid ones. Topic-based structural filtering — the most anticipated experiment, inspired by MemPalace’s reported 34% gain — hurt results because the derived topics were too coarse. And a larger embedding model gained roughly 5% but at 13x the cost.
The experiments that failed taught me more than the ones that shipped.
Best numbers
Full LoCoMo (1650 queries, 2541 observations) with all shipped improvements:
R@5: 0.705 NDCG@5: 0.651
R@10: 0.764 NDCG@10: 0.673
From pure cosine on raw turns (R@5: 0.599) to hybrid scoring with TF-IDF on distilled observations (R@5: 0.705), that is an absolute improvement of 10.6 percentage points.
The biggest single gain was not an algorithm. It was input quality.
Distilled facts outperform raw conversation turns by a wider margin than any retrieval technique.
What this means
I spent most of the time optimizing retrieval. The two algorithmic improvements that shipped account for about 4% combined. Distillation — the step that runs before retrieval even starts — accounts for 7%.
The same pattern held for topic filtering. MemPalace’s structural filtering works because their topic assignments are high quality. My keyword-derived topics were noise. The mechanism is right. The data quality is the bottleneck.
Every future improvement to Acolyte’s memory follows from this: invest in what the distiller produces, not in how the retrieval searches it. Better observations make every retrieval technique work better. Better retrieval cannot fix bad observations.