Episode Details

Experimental Results from a Self-Improving Retrieval System for Conversational Memory

Published 1 month, 1 week ago

Description

This story was originally published on HackerNoon at: https://hackernoon.com/experimental-results-from-a-self-improving-retrieval-system-for-conversational-memory.
Eighteen retrieval experiments on agent memory: why BM25 dominates, what clustered retrieval-induced forgetting actually does, and the Rust port that shipped.
Check more stories related to tech-stories at: https://hackernoon.com/c/tech-stories. You can also check exclusive content about #agent-memory, #rag, #bm25, #retrieval-systems, #cross-encoder-reranking, #longmemeval, #faiss, #hackernoon-top-story, and more.

This story was written by: @teimurjan. Learn more about this writer by checking @teimurjan's about page, and for more stories, please visit hackernoon.com.

The biology-inspired mutation layer didn't work. A learned MLP adapter and segmentation mutation both produced roughly zero NDCG lift on LongMemEval. The control loop was sound; the perturbations weren't load-bearing.

A recall diagnostic reframed the project: 78% of relevant entries never reached the cross-encoder. Bi-encoder recall was the ceiling, not the mutation layer.

Standard IR wins compounded: 0.95-cosine dedup, BM25 alongside vector search, and cross-encoder reranking together took NDCG@10 from 0.22 to 0.34. BM25 alone beat pretrained embeddings by 76% on this corpus.

Clustered retrieval-induced forgetting (Anderson 1994, ported as far as I can tell for the first time) added +1.9pp NDCG with p=0.0001 on LongMemEval. It regresses on NFCorpus: the mechanism is scoped to single-user long-term conversation memory, not general IR.

Write-time LLM enrichment (gist plus anticipated queries via Haiku) was the biggest single lever: +8.3pp NDCG on covered queries.

A regex-tokenizer fix that BM25 had been missing was worth +1.4pp NDCG on the headline benchmark.

Six independent ablations (reranker swap, BGE bi-encoder, multi-field BM25, field-boosted BM25, late chunking on a GPU, k_deep sweep) all bounced off the same ceiling: BM25 supplies the candidates the reranker is already ranking well. Model-layer swaps are theatre when one component dominates.

The whole stack was ported to Rust: single binary, ratatui TUI, PyO3 plus napi-rs bindings, Claude Code plus Codex CLI plugins. Cross-project search dropped from 6–7s to 1.7s.

Lesson: check the bottleneck before extending the mechanism.
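
For readers unfamiliar with the "0.95-cosine dedup" step mentioned above, here is a minimal sketch of what such a filter might look like. It is not the author's implementation: the function names `cosine` and `dedup_by_cosine`, the greedy keep-first strategy, and the use of plain Python lists for embeddings are all assumptions made for illustration. Only the 0.95 threshold comes from the summary.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_by_cosine(embeddings, threshold=0.95):
    """Hypothetical greedy near-duplicate filter.

    Walks the entries in order and keeps an entry only if its cosine
    similarity to every previously kept entry is below `threshold`.
    Returns the indices of the kept entries.
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

Calling `dedup_by_cosine` on a list of embedding vectors returns the indices of the entries that survive, so the same indices can be used to filter the corresponding memory texts. The pairwise comparison is quadratic in the number of entries; a real system would more likely lean on a vector index for the neighbour lookups.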
