A single-agent persistent memory system for LLMs, modeled on CPU cache architecture. Memories flow from a fast volatile ring buffer through a vector store into a knowledge-graph-backed registry. Retrieval always checks the hottest tier first.
- Architecture overview
- Memory tiers
- Write path
- Retrieval pipeline
- Knowledge graph
- Scoring and ranking
- Background jobs
- Design decisions and trade-offs
- Configuration reference
- Running
- API reference
- Directory layout
┌─────────────────────────────────────────────────────────────────────┐
│ USER MESSAGE │
│ │ │
│ ▼ │
│ RELEVANCE GATE ──reject──▶ dropped (logged in skipped[]) │
│ (heuristic + dup check) │
│ │ pass │
│ ▼ │
│ L0 Active deque(20) ◀── background: judge_importance() updates │
│ │ item.importance on live object │
│ │ eviction (FIFO, head drops when deque is full) │
│ │ importance ≥ 0.5 → write to L1 │
│ │ importance < 0.5 → discard │
│ ▼ │
│ L1 Recent ChromaDB ◀── decay sweep every 5 min │
│ │ │
│ │ cold items (recency < 0.2, age > 5 min) │
│ │ pair user+agent turns │
│ │ embed combined text │
│ │ find_or_create KG node (cosine cluster, threshold=0.70) │
│ │ write to Registry │
│ ▼ │
│ L2 Registry (SQLite) ←─→ L2 KG (SQLite) │
│ raw paired turns cluster centroids │
│ embeddings + metadata co_occurs edges │
│ │
│ HOT BUFFER (in-memory LRU, 20 slots) │
│ ← populated from L1 and L2 hits during retrieval │
│ → checked first on every query │
└─────────────────────────────────────────────────────────────────────┘
- Implementation: memory/l0_active.py, a collections.deque(maxlen=20)
- Lifetime: current process only; lost on restart
- Contents: raw MemoryItem objects (user messages, agent replies, promoted recalls)
- Capacity: 20 items (configurable via L0_MAXLEN)
- Eviction: FIFO; the oldest item drops when the deque is full, triggering _l0_evicted()
Every item enters L0 with importance=0.5. A background Ollama job (judge_importance()) updates the live object's importance score. When the item is later evicted, the final importance value decides promotion vs discard.
Promoted entries from L1 or L2 (source="promoted_l1" / "promoted_l2") are written directly into L0's deque but bypass the evict-to-L1 path — they're cache recalls, not new information, so re-promoting them would create duplicates.
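The eviction behaviour described above can be sketched as follows. This is a minimal illustration, not the code in memory/l0_active.py: the L0Active class, its on_evict callback, and the eviction_decision() helper are hypothetical names, and MemoryItem is reduced to three fields. One detail worth noting is that collections.deque(maxlen=...) discards the head silently, so a wrapper has to capture the head before appending if it wants an eviction callback.

```python
from collections import deque
from dataclasses import dataclass

L0_MAXLEN = 20
L0_PROMOTE_THRESHOLD = 0.5


@dataclass
class MemoryItem:
    # Reduced stand-in for the real MemoryItem dataclass (assumption).
    content: str
    importance: float = 0.5   # neutral default until the background judge updates it
    source: str = "user"


class L0Active:
    """Hypothetical wrapper: a bounded deque that reports evicted items."""

    def __init__(self, on_evict, maxlen=L0_MAXLEN):
        self._items = deque(maxlen=maxlen)
        self._on_evict = on_evict

    def append(self, item):
        # deque(maxlen=...) drops the head without notice, so grab it first.
        evicted = None
        if len(self._items) == self._items.maxlen:
            evicted = self._items[0]
        self._items.append(item)
        if evicted is not None:
            self._on_evict(evicted)


def eviction_decision(item):
    """Return 'skip', 'discard', or 'promote' for an item leaving L0."""
    if item.source.startswith("promoted_"):
        return "skip"        # cache recall, already stored in L1 or L2
    if item.importance < L0_PROMOTE_THRESHOLD:
        return "discard"
    return "promote"         # would be written to L1
```

Because the decision reads item.importance at eviction time, whatever the background judge wrote to the live object in the meantime is what counts.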
- Implementation: memory/l1_recent.py
- Lifetime: survives restarts; persists in data/{agent_id}/l1.chroma/
- Contents: items that survived L0 eviction with importance ≥ 0.5
- Capacity: soft cap at 40 items (L1_MAXLEN); primary drain is the decay sweep
- Search: BGE-large-en-v1.5 embeddings (1024-dim), cosine similarity via ChromaDB
L1 is the "recently relevant" tier — content that's been scored as important enough to keep but hasn't aged into long-term storage yet. The decay sweep migrates cold items to L2 every 5 minutes.
L1 supports two search modes:
- Query-space search (search()): uses the BGE query prefix ("Represent this sentence for searching...") for retrieval
- Document-space search (search_doc_space()): raw document encoder, used by the duplicate gate in the relevance check
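The difference between the two modes comes down to what text is handed to the encoder. The sketch below shows only that text preparation; the helper names query_space_text() and doc_space_text() are hypothetical, and the trailing space after the prefix is an assumption about how the prefix is joined to the query.

```python
import os

# Default taken from the EMBED_QUERY_PREFIX setting; the trailing space is an assumption.
DEFAULT_PREFIX = "Represent this sentence for searching relevant passages: "
EMBED_QUERY_PREFIX = os.environ.get("EMBED_QUERY_PREFIX", DEFAULT_PREFIX)


def query_space_text(text, prefix=EMBED_QUERY_PREFIX):
    """Text to encode for retrieval queries (prefix applied)."""
    return prefix + text


def doc_space_text(text):
    """Text to encode for stored documents and duplicate checks (no prefix)."""
    return text
```

Comparing a new message to stored items in document space keeps both sides of the cosine in the same encoding, which is presumably why the duplicate gate uses it.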
- Implementation: memory/hot_buffer.py
- Lifetime: current process only
- Contents: items recently retrieved from L1 or L2 this session, stored with their embeddings
- Capacity: 20 slots (LRU eviction)
- Hit threshold: cosine ≥ 0.6 to the current query embedding
The hot buffer is the fastest tier — a flat cosine scan over ≤ 20 embeddings. It prevents re-traversing the KG and re-querying the registry for content that was already fetched this session.
Entries are populated by the retrieval pipeline:
- L1 hits → stored with their full embedding so future queries fire as HOT
- L2 hits → row fetched from registry, stored with registry embedding
The hot_nodes set on the buffer tracks which KG node IDs have been fetched this session, allowing the L2 search to skip nodes already in the buffer.
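A minimal sketch of such a buffer, assuming an OrderedDict for LRU ordering and plain Python lists for embeddings. The method signatures are illustrative and may differ from memory/hot_buffer.py; in particular, this sketch does not prune hot_nodes when an entry is evicted, since the source does not say how that case is handled.

```python
import math
from collections import OrderedDict

HOT_BUFFER_SIZE = 20
HOT_BUFFER_HIT_THRESHOLD = 0.6


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class HotBuffer:
    def __init__(self, size=HOT_BUFFER_SIZE):
        self._entries = OrderedDict()   # item_id -> (embedding, payload)
        self._size = size
        self.hot_nodes = set()          # KG node IDs fetched this session

    def put(self, item_id, embedding, payload, kg_node_id=None):
        self._entries[item_id] = (embedding, payload)
        self._entries.move_to_end(item_id)          # mark as most recently used
        if kg_node_id is not None:
            self.hot_nodes.add(kg_node_id)
        while len(self._entries) > self._size:
            self._entries.popitem(last=False)       # evict least recently used

    def get_nearby(self, query_emb, threshold=HOT_BUFFER_HIT_THRESHOLD):
        """Flat cosine scan; returns (item_id, payload, sim) sorted by sim."""
        hits = []
        for item_id, (emb, payload) in list(self._entries.items()):
            sim = cosine(query_emb, emb)
            if sim >= threshold:
                hits.append((item_id, payload, sim))
                self._entries.move_to_end(item_id)  # a hit refreshes recency
        return sorted(hits, key=lambda h: h[2], reverse=True)
```

With at most 20 entries, the linear scan is cheap enough that no index structure is needed.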
L2 has two components that work together:
Registry (memory/registry.py)
- Table: registry (one row per conversation turn pair: user message + agent reply)
- Each row stores: user_content, agent_content, combined embedding (float32 BLOB), importance, timestamps, access_count, kg_node_id
- Embedding is embed("User: {msg}\nAgent: {reply}"), the combined turn embedding
- Retrieved via KG node membership lookup, ranked by 0.5·sub_sim + 0.3·recency + 0.2·importance
Knowledge Graph (memory/l4_kg.py)
- Table: kg_nodes (cluster centroids, one node per topic cluster)
- Table: kg_edges (relations between nodes: co_occurs, derived_from, abstracted_from)
- Table: kg_node_members (registry row → KG node membership)
- Node centroids are running means of the embeddings of all their member turns
- Nodes split when member_count > 20 (re-clustered at threshold + 0.05)
The KG is shared across the agent's lifetime and acts as the index for the registry. Registry rows without a KG node cannot be retrieved by semantic search.
user_msg
│
▼
relevance gate (should_store)
├── too short / all stopwords → drop
├── infra noise (connection errors etc.) → drop
└── doc-space dup in L1 (cosine ≥ 0.92) → drop, bump existing
│ pass
▼
MemoryItem(importance=0.5, source="user")
│
├── appended to L0 deque
└── background job submitted: judge_importance(content)
→ updates item.importance on live L0 object
→ if already evicted to L1, updates there instead
(L0 fills up → head evicted)
│
├── source is "promoted_*" → skip (don't re-promote)
├── importance < L0_PROMOTE_THRESHOLD (0.5) → discard
└── importance ≥ 0.5 → L1.write(item)
└── background: condense(content) → set item.summary
(decay sweep fires every 5 min)
│
└── for each cold L1 item (recency_score < 0.2 AND age > 5 min):
pair user+agent items by timestamp proximity
embed("User: {user}\nAgent: {agent}")
find_or_create_node(emb, threshold=0.70)
→ if existing node centroid cosine ≥ 0.70: update centroid (running mean)
→ else: create new node with label=user_text[:80]
registry.write(row_id, user_content, agent_content, ...)
kg.add_member(node_id, row_id)
if node.member_count > 20: kg.split_node(...)
l1.delete(item_ids)
Key property: nothing on the critical path (the chat response) blocks on an LLM call except the main reply generation. Importance scoring and summary condensing always run off-thread.
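The inline part of the relevance gate can be sketched as a handful of cheap checks. The stopword list, minimum length, and noise regex below are hypothetical placeholders (the source does not list them), and the duplicate check is reduced to a precomputed similarity passed in by the caller rather than a real L1 lookup.

```python
import re

# Hypothetical values: not taken from lifecycle/relevance.py.
STOPWORDS = {"a", "an", "the", "ok", "okay", "yes", "no", "hi", "hello", "thanks"}
MIN_CHARS = 8
INFRA_NOISE = re.compile(r"connection (refused|reset|error)|timed? ?out", re.I)
DUP_THRESHOLD = 0.92   # doc-space cosine against existing L1 items


def should_store(text, nearest_l1_sim=None):
    """Return (store, reason).

    nearest_l1_sim is the best doc-space cosine against L1, or None if L1 is empty.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(text.strip()) < MIN_CHARS or not words:
        return False, "too_short"
    if all(w in STOPWORDS for w in words):
        return False, "all_stopwords"
    if INFRA_NOISE.search(text):
        return False, "infra_noise"
    if nearest_l1_sim is not None and nearest_l1_sim >= DUP_THRESHOLD:
        return False, "duplicate"
    return True, "pass"
```

None of these checks calls an LLM, which is what keeps the gate off the slow path.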
retrieval/pipeline.py:explain_retrieve() — called on every chat turn.
query_emb = embed_query(user_msg) # BGE query prefix applied
1. HOT BUFFER (in-memory cosine scan, ≤ 20 entries)
│ threshold: cosine ≥ 0.6
│ returns entries with sub_sim set to computed cosine
│
2. L0 tail (last 5 items from conversation deque)
│ always shown verbatim — not scored, not filtered
│
3. L1 search (ChromaDB ANN, k=10)
│ filters out items with superseded_by set
│ returns (item, sim, embedding) triples
│
4. L2: KG node search → registry dereference
│ only when L1 produces fewer than k_total=6 items above 0.6 sim
│ kg.search_nodes(query_emb, k=6, exclude=hot_nodes)
│ → top-6 nodes by centroid cosine, skipping hot_nodes
│ graph expansion: expand_with_neighbors(found_ids, query_emb)
│ → also fetch neighbor nodes (via any edge) with sim ≥ 0.6
│ registry.get_members(node_id, query_emb, top_n=2) per node
│ → ranked by 0.5·sub_sim + 0.3·recency + 0.2·importance
│
▼
all_candidates = hot + l1 + l2
sort by fused_score = 0.6·sim + 0.3·recency + 0.1·importance
FILTER: sim < RETRIEVAL_MIN_SIM (0.6) → mark "below_sim_threshold", never inject
DEDUPE + SELECT TOP-6:
hash content, skip duplicates
first 6 non-dropped items → selected=True
SIDE EFFECTS (only for selected items):
L1 hits → hot_buffer.put(entry with embedding)
→ promote summary to L0 as source="promoted_l1"
L2 hits → hot_buffer.put(registry row)
→ registry.bump_access_count()
→ promote "User: … → Agent: …" summary to L0 as source="promoted_l2"
→ hot_nodes.add(kg_node_id)
CO_OCCURS EDGES:
all node_ids where sim ≥ 0.6 (selected or not) form co_occurs pairs
kg.bump_cooccurs(a, b) for every pair above threshold
— captures semantic co-relevance even for nodes crowded out of top-k
Similarity cutoff of 0.6: anything below this threshold is excluded from the prompt entirely. This prevents memories from unrelated domains (e.g., database notes bleeding into a C/C++ discussion) from polluting the context.
Graph expansion: when a KG node is found by cosine search, its graph neighbors (via any edge relation) are also scored. Neighbors that score ≥ 0.6 to the query get their registry rows fetched too. This ensures that once two nodes build up a co_occurs edge (proven co-relevance), a query hitting one will pull in the other.
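The final ranking, similarity floor, and dedupe steps can be sketched as a single pass over the merged candidates. This is an illustration under assumptions: candidates are plain dicts, the "duplicate" label and the select_candidates() name are hypothetical, and SHA-256 stands in for whatever content hash the pipeline uses.

```python
import hashlib

RETRIEVAL_MIN_SIM = 0.6
K_TOTAL = 6


def select_candidates(candidates, k=K_TOTAL, min_sim=RETRIEVAL_MIN_SIM):
    """candidates: dicts with 'content', 'sim', 'recency', 'importance'.

    Annotates each dict in place and returns the selected ones in rank order.
    """
    for c in candidates:
        c["fused"] = 0.6 * c["sim"] + 0.3 * c["recency"] + 0.1 * c["importance"]
    ranked = sorted(candidates, key=lambda c: c["fused"], reverse=True)

    seen, selected = set(), []
    for c in ranked:
        c["selected"], c["dropped"] = False, None
        digest = hashlib.sha256(c["content"].encode("utf-8")).hexdigest()
        if c["sim"] < min_sim:
            c["dropped"] = "below_sim_threshold"   # never injected, whatever its rank
        elif digest in seen:
            c["dropped"] = "duplicate"
        elif len(selected) < k:
            seen.add(digest)
            c["selected"] = True
            selected.append(c)
    return selected
```

Note that the similarity floor is applied to the raw sim, not the fused score, so a very recent or very important item still cannot enter the prompt below 0.6.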
Each node is a topic cluster centroid. When a new paired turn is consolidated from L1:
- Its embedding is compared to all existing node centroids
- If any centroid has cosine ≥ 0.70 → the turn joins that node (centroid updated via running mean)
- If none match → a new node is created, labeled with the first 80 chars of the user message
The 0.70 threshold means nodes are topically tight — "we use Redis for session caching" and "when do we use MongoDB" produce separate nodes despite both being database-related.
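The join-or-create step with a running-mean centroid can be sketched as below. Nodes are plain dicts here rather than SQLite rows, and picking the best-matching centroid when several clear the threshold is an assumption; the source only says the turn joins a node whose centroid has cosine ≥ 0.70.

```python
import math

KG_CLUSTER_THRESHOLD = 0.70


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_or_create_node(nodes, emb, label, threshold=KG_CLUSTER_THRESHOLD):
    """nodes: list of dicts with 'centroid', 'member_count', 'label'.

    Returns the index of the node the embedding was assigned to.
    """
    best_i, best_sim = None, -1.0
    for i, node in enumerate(nodes):
        sim = cosine(node["centroid"], emb)
        if sim > best_sim:
            best_i, best_sim = i, sim

    if best_i is not None and best_sim >= threshold:
        node = nodes[best_i]
        n = node["member_count"]
        # Running mean: new_centroid = (old_centroid * n + emb) / (n + 1)
        node["centroid"] = [(c * n + e) / (n + 1) for c, e in zip(node["centroid"], emb)]
        node["member_count"] = n + 1
        return best_i

    nodes.append({"centroid": list(emb), "member_count": 1, "label": label[:80]})
    return len(nodes) - 1
```

The running mean lets the centroid be updated from the stored centroid and member count alone, without re-reading the embeddings of existing members.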
co_occurs — The primary and most semantically meaningful edge type. Created when two nodes both have cosine ≥ 0.6 to the same retrieval query in the same batch. This is a proven signal: both nodes were genuinely relevant to the same question.
- Weight incremented with each co-retrieval event
- Edge is undirected (canonical ordering by node ID)
- Used by graph expansion to discover related nodes
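A sketch of how one retrieval batch might turn into co_occurs weight updates. The record_cooccurrence() helper and the in-memory Counter are hypothetical stand-ins for the SQLite-backed edge table; sorting the node IDs before pairing gives the canonical (smaller ID first) ordering that keeps the edge undirected.

```python
from collections import Counter
from itertools import combinations

CO_OCCURS_MIN_SIM = 0.6


def record_cooccurrence(edge_weights, scored_nodes, min_sim=CO_OCCURS_MIN_SIM):
    """scored_nodes: {node_id: sim to the current query}.

    edge_weights: Counter keyed by (a, b) with a < b.
    """
    relevant = sorted(n for n, sim in scored_nodes.items() if sim >= min_sim)
    # combinations() over a sorted list yields each unordered pair once, smaller ID first.
    for a, b in combinations(relevant, 2):
        edge_weights[(a, b)] += 1
    return edge_weights
```

Because the filter is on similarity rather than on selection, a node crowded out of the top-k still contributes to the edge weights.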
derived_from — Created by add_node() (the old path used by abstract_l2). Points from a newer node to an older similar node. Not used in normal consolidation flow.
abstracted_from — Created by POST /api/abstract/{agent}. Synthetic concept nodes (LLM-summarized labels) point to their constituent cluster nodes. Manual trigger only.
BGE-large-en-v1.5 produces high cosine similarity (0.60–0.70) between any two short tech questions regardless of topic. "When do we use MongoDB?" and "Is Python the fastest language?" score ~0.67. A fixed cosine threshold would create nonsensical edges between unrelated topics. Only co_occurs edges — which require actual retrieval co-occurrence — carry reliable semantic meaning.
When a node's member_count exceeds KG_NODE_MEMBER_CAP (20), split_node() re-clusters its members at KG_CLUSTER_THRESHOLD + 0.05. If two or more sub-clusters form, new nodes replace the old one and registry rows are reassigned.
Fused score (used for final candidate ranking):
score = 0.6 × sim + 0.3 × recency + 0.1 × importance
Recency score (exponential decay):
recency = exp(-age_days / half_life)
half_life: L1=7d, L2=30d
Importance score (0.0–1.0):
- Default: 0.5 on entry to L0
- Updated by judge_importance() (Ollama) in background
- High (0.7–1.0): facts, decisions, rules, technical choices, deadlines
- Medium (0.4–0.6): useful context, questions with domain content
- Low (0.0–0.3): small talk, filler — discarded at L0 eviction
Registry member ranking (used within a single KG node):
score = 0.5 × sub_sim + 0.3 × recency + 0.2 × importance
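The formulas above translate directly into a few lines of Python. The function names recency_score() and fused_score() appear in retrieval/scorer.py, but the signatures here are assumptions, and member_score() is a hypothetical name for the registry ranking. One observation: with the decay written as exp(-age_days / half_life), the score at age = half_life is e^-1 ≈ 0.37 rather than 0.5, so the parameter behaves as an e-folding time. Under that formula an L1 item (7d) crosses the 0.2 cold threshold after roughly 7 × ln 5 ≈ 11.3 days.

```python
import math

W_SEMANTIC, W_RECENCY, W_IMPORTANCE = 0.6, 0.3, 0.1
HALF_LIFE_DAYS = {"l1": 7.0, "l2": 30.0}   # values from the recency formula above


def recency_score(age_days, tier):
    """Exponential decay exactly as written in the formula above."""
    return math.exp(-age_days / HALF_LIFE_DAYS[tier])


def fused_score(sim, recency, importance):
    """Final candidate ranking across hot buffer, L1, and L2."""
    return W_SEMANTIC * sim + W_RECENCY * recency + W_IMPORTANCE * importance


def member_score(sub_sim, recency, importance):
    """Ranking of registry rows inside a single KG node."""
    return 0.5 * sub_sim + 0.3 * recency + 0.2 * importance
```

The two rankings differ mainly in how much importance counts: twice as much inside a node as in the final fusion.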
The concurrency.py module manages a ThreadPoolExecutor(max_workers=4) with named job tracking and per-agent RLocks.
| Job | Trigger | Does |
|---|---|---|
| importance-judge:{id} | Every remember() call | Calls judge_importance(content) via Ollama; updates item.importance on the live L0 object, or in L1 if already evicted |
| l1-finalize:{agent}:{id} | Every L0→L1 promotion | Calls condense(content) via Ollama; sets item.summary on the L1 entry |
| enforce-caps:{agent} | Every 20th turn | Runs evictor.enforce_l1(), which evicts the oldest low-importance L1 items if the count cap is hit |
| Decay sweep | threading.Timer, every 300s | Calls move_decayed_l1_to_registry() for cold L1 items |
Lock discipline: locks are held only around store writes, never during LLM calls. This prevents a slow Ollama response from blocking the chat turn.
Race condition (importance judge vs L0 eviction): if the deque fills before the importance judge returns, the item evicts with importance=0.5 (neutral) and still promotes to L1. The judge then finds it in L1 via l1.get(item.id) and updates it there. Acceptable outcome.
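The lock discipline can be illustrated with a small stand-alone example. Everything here is a stand-in: slow_judge() replaces the Ollama call, the store dict replaces L0/L1, and a single RLock replaces the per-agent locks in concurrency.py.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
agent_lock = threading.RLock()               # one per agent in the real module
store = {"item-1": {"importance": 0.5}}      # stand-in for the L0/L1 stores


def slow_judge(content):
    # Stand-in for the Ollama call, which can take seconds on CPU.
    return 0.9 if "deadline" in content else 0.2


def judge_job(item_id, content):
    score = slow_judge(content)              # slow call runs with no lock held
    with agent_lock:                         # lock only around the store write
        if item_id in store:
            store[item_id]["importance"] = score
    return score


future = executor.submit(judge_job, "item-1", "the deadline is Friday")
future.result()
executor.shutdown(wait=True)
```

If the lock were taken before slow_judge(), a chat turn that needs the same lock to append to L0 would wait for the whole LLM call.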
The original design had two agents (planner + implementation) sharing an L3 warm tier. The single-agent redesign drops this complexity: one implementation agent, no cross-agent sharing, no L3.
Why: The two-agent model added coordination overhead without clear benefit for the use cases being targeted. L3 is replaced by the shared L2 KG which already provides cross-session persistence.
The relevance gate does NOT call judge_importance() synchronously before L0 write. Only a fast heuristic check (length, stopwords, infra noise regex) runs inline.
Why: On CPU, a single Ollama call takes 10–30 seconds. Putting two LLM calls per chat turn (relevance check + reply generation) made the system unusable. The heuristic blocks obvious garbage; low-importance items are discarded at L0 eviction time instead.
Nothing below 0.6 cosine similarity is injected into the LLM prompt, even if it would otherwise rank in the top-k.
Why: BGE-large embeddings in the 0.5–0.6 range often represent "same general domain" rather than "same topic". Injecting these caused the LLM to hallucinate connections — e.g., talking about MongoDB when asked about C++ because session-caching notes scored 0.52 to the C++ query. The 0.6 floor keeps the context tight.
Trade-off: Some genuinely relevant memories with similarity just below 0.6 get filtered. This is preferable to polluting the context with loosely related content.
co_occurs edges form only between nodes that BOTH score ≥ 0.6 to the same query. This is stricter than earlier designs that only used selected (top-k) nodes.
Why: A node that scores 0.52 to a database query is only loosely relevant — forming a co_occurs edge based on it would create semantically weak connections. Nodes that both clear the 0.6 bar are genuinely about the same topic.
Two turns must have combined embedding cosine ≥ 0.70 to land in the same KG node.
Why: At 0.70, nodes represent tight topic clusters. Lowering it would merge "we use Redis for session caching" with "can I use MongoDB for session caching" into one node — they're related but distinct questions that deserve separate registry entries. The KG's job is indexing, not deduplication.
Contradiction detection was dropped. When two contradicting memories exist (e.g., "we use PostgreSQL" and "we switched to MySQL"), the LLM reconciles them in-context based on recency/importance scores.
Why: Detecting contradiction requires an LLM call per new write against all similar existing memories. The latency cost outweighs the benefit, especially since importance scoring already causes low-relevance or stale items to decay out.
Items write to L1 on eviction from L0 (write-back), not on entry to L0 (write-through).
Why: Write-through would persist everything including small talk that the importance judge later scores as low. Write-back ensures the background judge has had a chance to update the importance score before the promotion decision is made.
All knobs are in config.py:
| Name | Default | Description |
|---|---|---|
| L0_MAXLEN | 20 | L0 deque capacity |
| L0_PROMOTE_THRESHOLD | 0.5 | Min importance to promote from L0 to L1 |
| L1_MAXLEN | 40 | L1 hard cap (fallback evictor; decay sweep is primary) |
| L1_DECAY_SWEEP_SECS | 300 | How often the decay sweep runs (seconds) |
| L1_DECAY_THRESHOLD | 0.2 | recency_score below this = cold |
| L1_MIN_AGE_MINS | 5 | Item must also be this old before migrating to L2 |
| K_L1 | 10 | L1 candidates fetched per query |
| KG_CLUSTER_THRESHOLD | 0.70 | Cosine threshold for node membership |
| KG_NODE_MEMBER_CAP | 20 | Max registry rows per node before split |
| K_L2_NODES | 6 | KG nodes searched at retrieval time |
| REGISTRY_RESULTS_PER_NODE | 2 | Registry rows fetched per KG node |
| HOT_BUFFER_SIZE | 20 | Hot buffer LRU capacity |
| HOT_BUFFER_HIT_THRESHOLD | 0.6 | Min cosine to return from hot buffer |
| RETRIEVAL_MIN_SIM | 0.6 | Min similarity to inject memory into prompt |
| W_SEMANTIC | 0.6 | Score fusion: semantic weight |
| W_RECENCY | 0.3 | Score fusion: recency weight |
| W_IMPORTANCE | 0.1 | Score fusion: importance weight |
| OLLAMA_MODEL | qwen2.5:3b | LLM for replies + importance scoring |
| EMBED_MODEL | BAAI/bge-large-en-v1.5 | Sentence-transformers embedding model |
Prerequisites:
- Python 3.12
- Ollama running on localhost:11434
- The model pulled: ollama pull qwen2.5:3b (or qwen2.5:0.5b for faster CPU testing)
# One-time setup
cd simple_memory
python3.12 -m venv .venv
.venv/bin/pip install -r requirements.txt
# Start the web UI (from the ai-twinnn/ parent directory)
OLLAMA_MODEL=qwen2.5:0.5b simple_memory/.venv/bin/uvicorn simple_memory.server:app --host 0.0.0.0 --port 8000
# Open http://localhost:8000
First run downloads the embedding model (~1.3 GB for BAAI/bge-large-en-v1.5).
Environment variables:
| Variable | Default | Notes |
|---|---|---|
| OLLAMA_URL | http://localhost:11434/api/generate | |
| OLLAMA_MODEL | qwen2.5:3b | Use qwen2.5:0.5b for faster CPU testing |
| EMBED_MODEL | BAAI/bge-large-en-v1.5 | |
| EMBED_QUERY_PREFIX | Represent this sentence for searching relevant passages: | Set to "" for non-BGE models |
| Method | Path | Description |
|---|---|---|
| GET | / | Web UI |
| GET | /api/meta | Embed model, LLM model, agent list |
| GET | /api/agents | ["implementation"] |
| GET | /api/state | Full tier state for all agents + KG |
| GET | /api/state/{agent} | L0/L1/hot_buffer/registry counts + KG |
| GET | /api/l4 | KG nodes + edges |
| GET | /api/jobs | {"in_flight": N} background job count |
| GET | /api/registry/{agent} | Recent registry rows (debug view) |
| POST | /api/chat/{agent} | {"message": "…"} → reply + retrieval trace |
| POST | /api/explain/{agent} | {"query": "…"} → retrieval trace without LLM |
| POST | /api/consolidate/{agent} | Force L0+L1 → L2 migration now (bypasses age check) |
| POST | /api/abstract/{agent} | Super-cluster KG nodes → concept nodes (manual) |
| POST | /api/kg/dedupe | Merge near-duplicate KG nodes (cosine ≥ 0.93) |
| POST | /api/seed/{agent} | {"facts": ["…"]} → bulk-remember, skips relevance gate |
| POST | /api/reset | Drain background jobs, close stores, wipe data/ |
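As a usage illustration, a chat request can be built with the standard library alone. The base URL assumes the server was started on port 8000 as in the Running section; build_chat_request() and send() are hypothetical helper names, and treating the response as JSON is an assumption, since the response schema is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"   # assumes the uvicorn command from the Running section


def build_chat_request(message, agent="implementation", base_url=BASE_URL):
    """Build a POST /api/chat/{agent} request with a {"message": ...} body."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat/{agent}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling send(build_chat_request("We use Redis for session caching")) requires the server to be running; the generous timeout reflects the 10–30 second Ollama latency on CPU mentioned earlier.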
simple_memory/
    config.py               # all tunable knobs
    models.py               # MemoryItem dataclass
    concurrency.py          # ThreadPoolExecutor + per-agent RLocks
    llm/
        embed.py            # embed() / embed_query() / embed_many()
        ollama.py           # chat() / condense() / judge_importance()
    memory/
        l0_active.py        # deque ring buffer + evict callback + get_by_id + drain
        l1_recent.py        # ChromaDB wrapper; search / search_doc_space / upsert
        registry.py         # SQLite paired-turn store
        hot_buffer.py       # in-memory LRU; get_nearby returns sub_sim-annotated copies
        l4_kg.py            # SQLite KG: nodes, edges, members; find_or_create, expand_with_neighbors
    lifecycle/
        relevance.py        # should_store(): heuristic gate + L1 dup check
        consolidator.py     # move_decayed_l1_to_registry() + abstract_l2()
        decay_sweep.py      # DecaySweep: background timer → consolidator
        evictor.py          # enforce_l1() + enforce_registry() (cap fallbacks)
    retrieval/
        scorer.py           # recency_score(), fused_score()
        pipeline.py         # explain_retrieve(): 4-tier hierarchy + graph expansion
    agents/
        agent.py            # Agent: write path, chat(), background jobs, shutdown()
        registry.py         # AgentSpec: "implementation" agent definition
    static/
        index.html          # single-page UI (Tailwind + Chart.js + vis-network)
    server.py               # FastAPI app
    cli.py                  # interactive terminal chat
    README.md               # this file
data/                       # gitignored; created on first write
    implementation/
        l1.chroma/          # ChromaDB vector store
        registry.sqlite     # paired-turn registry
    shared/
        l4.sqlite           # KG nodes, edges, members