simple_memory

A single-agent persistent memory system for LLMs, modeled on CPU cache architecture. Memories flow from a fast, volatile ring buffer (L0) through a vector store (L1) into a registry indexed by a knowledge graph (L2). Retrieval checks the fastest tier first: the in-memory hot buffer, then L0, L1, and finally L2.

Architecture overview

┌─────────────────────────────────────────────────────────────────────┐
│  USER MESSAGE                                                        │
│       │                                                              │
│       ▼                                                              │
│  RELEVANCE GATE  ──reject──▶  dropped (logged in skipped[])         │
│  (heuristic + dup check)                                             │
│       │ pass                                                         │
│       ▼                                                              │
│  L0 Active  deque(20)  ◀── background: judge_importance() updates   │
│       │                    item.importance on live object            │
│       │ eviction (FIFO, head drops when deque is full)               │
│       │   importance ≥ 0.5  →  write to L1                          │
│       │   importance < 0.5  →  discard                              │
│       ▼                                                              │
│  L1 Recent  ChromaDB   ◀── decay sweep every 5 min                  │
│       │                                                              │
│       │ cold items (recency < 0.2, age > 5 min)                     │
│       │   pair user+agent turns                                      │
│       │   embed combined text                                        │
│       │   find_or_create KG node (cosine cluster, threshold=0.70)   │
│       │   write to Registry                                          │
│       ▼                                                              │
│  L2 Registry (SQLite)  ←─→  L2 KG (SQLite)                          │
│  raw paired turns              cluster centroids                     │
│  embeddings + metadata         co_occurs edges                       │
│                                                                      │
│  HOT BUFFER (in-memory LRU, 20 slots)                               │
│  ← populated from L1 and L2 hits during retrieval                   │
│  → checked first on every query                                      │
└─────────────────────────────────────────────────────────────────────┘

Memory tiers

L0 — Active context (RAM, volatile)

Implementation: memory/l0_active.py — collections.deque(maxlen=20)
Lifetime: current process only; lost on restart
Contents: raw MemoryItem objects (user messages, agent replies, promoted recalls)
Capacity: 20 items (configurable L0_MAXLEN)
Eviction: FIFO — oldest item drops when the deque is full, triggering _l0_evicted()

Every item enters L0 with importance=0.5. A background Ollama job (judge_importance()) updates the live object's importance score. When the item is later evicted, the final importance value decides promotion vs discard.

Promoted entries from L1 or L2 (source="promoted_l1" / "promoted_l2") are written directly into L0's deque but bypass the evict-to-L1 path — they're cache recalls, not new information, so re-promoting them would create duplicates.
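
The eviction rules above can be sketched as follows. This is a minimal illustration, not the code in memory/l0_active.py: the MemoryItem fields, the L0Active class, and route_evicted() are hypothetical names, and the buffer pops the head explicitly because a plain deque(maxlen=...) discards it silently.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional

L0_MAXLEN = 20
L0_PROMOTE_THRESHOLD = 0.5


@dataclass
class MemoryItem:
    # Hypothetical minimal shape; the real dataclass lives in models.py.
    content: str
    importance: float = 0.5
    source: str = "user"


class L0Active:
    """Bounded FIFO buffer that reports the item it evicts."""

    def __init__(self, maxlen: int = L0_MAXLEN,
                 on_evict: Optional[Callable[[MemoryItem], None]] = None):
        self._items = deque()
        self._maxlen = maxlen
        self._on_evict = on_evict

    def append(self, item: MemoryItem) -> None:
        # Pop the head ourselves so the evicted item reaches the callback.
        if len(self._items) >= self._maxlen:
            evicted = self._items.popleft()
            if self._on_evict is not None:
                self._on_evict(evicted)
        self._items.append(item)

    def __len__(self) -> int:
        return len(self._items)


def route_evicted(item: MemoryItem, l1: list) -> str:
    """Decide what happens to an item leaving L0."""
    if item.source.startswith("promoted_"):
        return "skip"       # cache recall, already stored in L1 or L2
    if item.importance < L0_PROMOTE_THRESHOLD:
        return "discard"
    l1.append(item)         # stand-in for L1.write(item)
    return "promote"
```

Wiring it up would look like L0Active(on_evict=lambda item: route_evicted(item, l1_store)), where l1_store is any list-like stand-in for the L1 tier.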

L1 — Recent vector store (ChromaDB, persistent)

Implementation: memory/l1_recent.py
Lifetime: survives restarts; persists in data/{agent_id}/l1.chroma/
Contents: items that survived L0 eviction with importance ≥ 0.5
Capacity: soft cap at 40 items (L1_MAXLEN); primary drain is the decay sweep
Search: BGE-large-en-v1.5 embeddings (1024-dim), cosine similarity via ChromaDB

L1 is the "recently relevant" tier — content that's been scored as important enough to keep but hasn't aged into long-term storage yet. The decay sweep migrates cold items to L2 every 5 minutes.

L1 supports two search modes:

Query-space search (search()): uses the BGE query prefix ("Represent this sentence for searching...") for retrieval
Document-space search (search_doc_space()): raw document encoder, used by the duplicate gate in the relevance check

Hot Buffer — Session LRU cache (RAM, volatile)

Implementation: memory/hot_buffer.py
Lifetime: current process only
Contents: items recently retrieved from L1 or L2 this session, stored with their embeddings
Capacity: 20 slots (LRU eviction)
Hit threshold: cosine ≥ 0.6 to the current query embedding

The hot buffer is the fastest tier — a flat cosine scan over ≤ 20 embeddings. It prevents re-traversing the KG and re-querying the registry for content that was already fetched this session.

Entries are populated by the retrieval pipeline:

L1 hits → stored with their full embedding, so later similar queries are served from the hot buffer
L2 hits → row fetched from the registry and stored with its registry embedding

The hot_nodes set on the buffer tracks which KG node IDs have been fetched this session, allowing the L2 search to skip nodes already in the buffer.
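
A rough sketch of such a buffer, assuming an OrderedDict for LRU ordering and a pure-Python cosine. The method names put() and get_nearby() appear elsewhere in this README, but their signatures here are guesses, and this sketch does not remove a node ID from hot_nodes when its entry is evicted.

```python
import math
from collections import OrderedDict

HOT_BUFFER_SIZE = 20
HOT_BUFFER_HIT_THRESHOLD = 0.6


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class HotBuffer:
    def __init__(self, size=HOT_BUFFER_SIZE, threshold=HOT_BUFFER_HIT_THRESHOLD):
        self._entries = OrderedDict()   # item_id -> (embedding, payload)
        self._size = size
        self._threshold = threshold
        self.hot_nodes = set()          # KG node IDs fetched this session

    def put(self, item_id, embedding, payload, kg_node_id=None):
        self._entries[item_id] = (embedding, payload)
        self._entries.move_to_end(item_id)
        if kg_node_id is not None:
            self.hot_nodes.add(kg_node_id)
        while len(self._entries) > self._size:
            self._entries.popitem(last=False)   # drop least recently used

    def get_nearby(self, query_emb):
        """Flat scan; returns (item_id, payload, sim) sorted by sim."""
        hits = []
        for item_id, (emb, payload) in list(self._entries.items()):
            sim = cosine(query_emb, emb)
            if sim >= self._threshold:
                self._entries.move_to_end(item_id)   # a hit refreshes recency
                hits.append((item_id, payload, sim))
        return sorted(hits, key=lambda h: h[2], reverse=True)
```

With at most 20 entries, the linear scan stays cheap enough that no index structure is needed.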

L2 — Long-term storage (SQLite, persistent)

L2 has two components that work together:

Registry (memory/registry.py)

Table: registry — one row per conversation turn pair (user message + agent reply)
Each row stores: user_content, agent_content, combined embedding (float32 BLOB), importance, timestamps, access_count, kg_node_id
Embedding is embed("User: {msg}\nAgent: {reply}") — the combined turn embedding
Retrieved via KG node membership lookup, ranked by 0.5·sub_sim + 0.3·recency + 0.2·importance
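
To illustrate the float32 BLOB storage, here is a small sketch using only the standard library. The table schema is an assumption built from the column names listed above (the timestamp columns are omitted), and to_blob() / from_blob() are hypothetical helpers, not functions from registry.py.

```python
import sqlite3
from array import array


def to_blob(vec):
    # array("f") packs each value as a 4-byte C float.
    return array("f", vec).tobytes()


def from_blob(blob):
    out = array("f")
    out.frombytes(blob)
    return list(out)


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE registry (
           row_id TEXT PRIMARY KEY,
           user_content TEXT,
           agent_content TEXT,
           embedding BLOB,
           importance REAL,
           access_count INTEGER DEFAULT 0,
           kg_node_id TEXT
       )"""
)
conn.execute(
    "INSERT INTO registry (row_id, user_content, agent_content, embedding,"
    " importance, kg_node_id) VALUES (?, ?, ?, ?, ?, ?)",
    ("r1", "Which cache do we use?", "Redis.", to_blob([0.25, 0.5, -1.0]), 0.8, "n1"),
)
(blob,) = conn.execute(
    "SELECT embedding FROM registry WHERE row_id = ?", ("r1",)
).fetchone()
restored = from_blob(blob)
```

A 1024-dimensional BGE embedding stored this way occupies 4096 bytes per row.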

Knowledge Graph (memory/l4_kg.py)

Table: kg_nodes — cluster centroids, one node per topic cluster
Table: kg_edges — relations between nodes (co_occurs, derived_from, abstracted_from)
Table: kg_node_members — registry row → KG node membership
Node centroids are running means of the embeddings of all their member turns
Nodes split when member_count > 20 (re-clustered at threshold + 0.05)

The KG is shared across the agent's lifetime and acts as the index for the registry. Registry rows without a KG node cannot be retrieved by semantic search.

Write path

user_msg
    │
    ▼
relevance gate (should_store)
    ├── too short / all stopwords → drop
    ├── infra noise (connection errors etc.) → drop
    └── doc-space dup in L1 (cosine ≥ 0.92) → drop, bump existing

    │ pass
    ▼
MemoryItem(importance=0.5, source="user")
    │
    ├── appended to L0 deque
    └── background job submitted: judge_importance(content)
            → updates item.importance on live L0 object
            → if already evicted to L1, updates there instead

    (L0 fills up → head evicted)
    │
    ├── source is "promoted_*" → skip (don't re-promote)
    ├── importance < L0_PROMOTE_THRESHOLD (0.5) → discard
    └── importance ≥ 0.5 → L1.write(item)
            └── background: condense(content) → set item.summary

    (decay sweep fires every 5 min)
    │
    └── for each cold L1 item (recency_score < 0.2 AND age > 5 min):
            pair user+agent items by timestamp proximity
            embed("User: {user}\nAgent: {agent}")
            find_or_create_node(emb, threshold=0.70)
                → if existing node centroid cosine ≥ 0.70: update centroid (running mean)
                → else: create new node with label=user_text[:80]
            registry.write(row_id, user_content, agent_content, ...)
            kg.add_member(node_id, row_id)
            if node.member_count > 20: kg.split_node(...)
            l1.delete(item_ids)

Key property: Nothing on the critical path (the chat response) blocks on an LLM call except the main reply generation. Importance scoring and summary condensing always run off-thread.

Retrieval pipeline

retrieval/pipeline.py:explain_retrieve() — called on every chat turn.

query_emb = embed_query(user_msg)   # BGE query prefix applied

1. HOT BUFFER  (in-memory cosine scan, ≤ 20 entries)
   │  threshold: cosine ≥ 0.6
   │  returns entries with sub_sim set to computed cosine
   │
2. L0 tail  (last 5 items from conversation deque)
   │  always shown verbatim — not scored, not filtered
   │
3. L1 search  (ChromaDB ANN, k=10)
   │  filters out items with superseded_by set
   │  returns (item, sim, embedding) triples
   │
4. L2: KG node search → registry dereference
   │  only when L1 produces fewer than k_total=6 items above 0.6 sim
   │  kg.search_nodes(query_emb, k=6, exclude=hot_nodes)
   │    → top-6 nodes by centroid cosine, skipping hot_nodes
   │  graph expansion: expand_with_neighbors(found_ids, query_emb)
   │    → also fetch neighbor nodes (via any edge) with sim ≥ 0.6
   │  registry.get_members(node_id, query_emb, top_n=2) per node
   │    → ranked by 0.5·sub_sim + 0.3·recency + 0.2·importance
   │
▼
all_candidates = hot + l1 + l2
sort by fused_score = 0.6·sim + 0.3·recency + 0.1·importance

FILTER: sim < RETRIEVAL_MIN_SIM (0.6) → mark "below_sim_threshold", never inject

DEDUPE + SELECT TOP-6:
  hash content, skip duplicates
  first 6 non-dropped items → selected=True

SIDE EFFECTS (only for selected items):
  L1 hits → hot_buffer.put(entry with embedding)
           → promote summary to L0 as source="promoted_l1"
  L2 hits → hot_buffer.put(registry row)
           → registry.bump_access_count()
           → promote "User: … → Agent: …" summary to L0 as source="promoted_l2"
           → hot_nodes.add(kg_node_id)

CO_OCCURS EDGES:
  all node_ids where sim ≥ 0.6 (selected or not) form co_occurs pairs
  kg.bump_cooccurs(a, b) for every pair above threshold
  — captures semantic co-relevance even for nodes crowded out of top-k

Similarity cutoff of 0.6: anything below this threshold is excluded from the prompt entirely. This prevents memories from unrelated domains (e.g., database notes bleeding into a C/C++ discussion) from polluting the context.
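
The sort, filter, dedupe, and select steps can be sketched as below. This is an illustration under assumptions: candidates are plain dicts, select_candidates() is a hypothetical name, SHA-256 stands in for whatever content hash the pipeline uses, and the "duplicate" reason label is invented ("below_sim_threshold" is the label quoted above).

```python
import hashlib

RETRIEVAL_MIN_SIM = 0.6
K_TOTAL = 6
W_SEMANTIC, W_RECENCY, W_IMPORTANCE = 0.6, 0.3, 0.1


def fused_score(sim, recency, importance):
    return W_SEMANTIC * sim + W_RECENCY * recency + W_IMPORTANCE * importance


def select_candidates(candidates, k=K_TOTAL, min_sim=RETRIEVAL_MIN_SIM):
    """candidates: list of dicts with content, sim, recency, importance."""
    ranked = sorted(
        candidates,
        key=lambda c: fused_score(c["sim"], c["recency"], c["importance"]),
        reverse=True,
    )
    seen, selected_count = set(), 0
    for cand in ranked:
        cand["selected"] = False
        cand["drop_reason"] = None
        if cand["sim"] < min_sim:
            # A high fused score cannot rescue a low-similarity item.
            cand["drop_reason"] = "below_sim_threshold"
            continue
        digest = hashlib.sha256(cand["content"].encode("utf-8")).hexdigest()
        if digest in seen:
            cand["drop_reason"] = "duplicate"
            continue
        seen.add(digest)
        if selected_count < k:
            cand["selected"] = True
            selected_count += 1
    return ranked
```

Note that the similarity floor is applied after ranking, so an old but important memory with sim 0.52 can rank near the top by fused score and still be excluded from the prompt.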

Graph expansion: when a KG node is found by cosine search, its graph neighbors (via any edge relation) are also scored. Neighbors that score ≥ 0.6 to the query get their registry rows fetched too. This ensures that once two nodes build up a co_occurs edge (proven co-relevance), a query hitting one will pull in the other.

Knowledge graph

Nodes

Each node is a topic cluster centroid. When a new paired turn is consolidated from L1:

Its embedding is compared to all existing node centroids
If any centroid has cosine ≥ 0.70 → the turn joins that node (centroid updated via running mean)
If none match → a new node is created, labeled with the first 80 chars of the user message

The 0.70 threshold means nodes are topically tight — "we use Redis for session caching" and "when do we use MongoDB" produce separate nodes despite both being database-related.
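
A minimal sketch of the join-or-create step with a running-mean centroid. The in-memory dict, the node ID format, and the choice to join the best-matching node when several clear the threshold are assumptions; the real find_or_create_node in memory/l4_kg.py works against SQLite.

```python
import math

KG_CLUSTER_THRESHOLD = 0.70


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def find_or_create_node(nodes, emb, label, threshold=KG_CLUSTER_THRESHOLD):
    """nodes: dict node_id -> {"centroid", "member_count", "label"}."""
    best_id, best_sim = None, -1.0
    for node_id, node in nodes.items():
        sim = cosine(emb, node["centroid"])
        if sim > best_sim:
            best_id, best_sim = node_id, sim
    if best_id is not None and best_sim >= threshold:
        node = nodes[best_id]
        n = node["member_count"]
        # Running mean: new_centroid = (n * old_centroid + emb) / (n + 1)
        node["centroid"] = [
            (n * c + e) / (n + 1) for c, e in zip(node["centroid"], emb)
        ]
        node["member_count"] = n + 1
        return best_id
    node_id = f"node-{len(nodes) + 1}"   # hypothetical ID scheme
    nodes[node_id] = {"centroid": list(emb), "member_count": 1, "label": label[:80]}
    return node_id
```

The running mean only needs the previous centroid and the member count, so a node never has to reload its members' embeddings on update.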

Edges

co_occurs — The primary and most semantically meaningful edge type. Created when two nodes both have cosine ≥ 0.6 to the same retrieval query in the same batch. This is a proven signal: both nodes were genuinely relevant to the same question.

Weight incremented with each co-retrieval event
Edge is undirected (canonical ordering by node ID)
Used by graph expansion to discover related nodes
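
As a sketch of how these edges could accumulate, the function below bumps one undirected edge per pair of nodes that clear the threshold in a single retrieval. record_cooccurrence() and the dict-based edge store are hypothetical, and the increment of 1 per event is an assumption.

```python
from itertools import combinations

RETRIEVAL_MIN_SIM = 0.6


def record_cooccurrence(edges, node_sims, min_sim=RETRIEVAL_MIN_SIM):
    """edges: dict (node_a, node_b) -> weight; node_sims: dict node_id -> sim."""
    relevant = sorted(n for n, sim in node_sims.items() if sim >= min_sim)
    for a, b in combinations(relevant, 2):
        # sorted() plus combinations() guarantees a < b, the canonical order,
        # so (a, b) and (b, a) always map to the same edge key.
        edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges


edges = {}
record_cooccurrence(edges, {"n3": 0.8, "n1": 0.65, "n2": 0.52})
record_cooccurrence(edges, {"n1": 0.9, "n3": 0.61, "n4": 0.7})
```

After these two calls the n1/n3 edge has weight 2, and n2 has no edges because it never reached 0.6.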

derived_from — Created by add_node() (the old path used by abstract_l2). Points from a newer node to an older similar node. Not used in normal consolidation flow.

abstracted_from — Created by POST /api/abstract/{agent}. Synthetic concept nodes (LLM-summarized labels) point to their constituent cluster nodes. Manual trigger only.

Why not cosine-based "related_to" edges?

BGE-large-en-v1.5 often produces high cosine similarity (0.60–0.70) between two short tech questions regardless of topic. For example, "When do we use MongoDB?" and "Is Python the fastest language?" score ~0.67. A fixed cosine threshold would therefore create nonsensical edges between unrelated topics. Only co_occurs edges, which require actual retrieval co-occurrence, carry reliable semantic meaning.

Node splits

When a node's member_count exceeds KG_NODE_MEMBER_CAP (20), split_node() re-clusters its members at KG_CLUSTER_THRESHOLD + 0.05. If two or more sub-clusters form, new nodes replace the old one and registry rows are reassigned.

Scoring and ranking

Fused score (used for final candidate ranking):

score = 0.6 × sim + 0.3 × recency + 0.1 × importance

Recency score (exponential decay):

recency = exp(-age_days / half_life)
half_life: L1=7d, L2=30d
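
A worked example of the formula as written. One caveat worth knowing: with exp(-age / half_life), the score at age = half_life is 1/e (about 0.37), not 0.5, so the parameter behaves as an e-folding time rather than a strict half-life. Whether retrieval/scorer.py matches this formula exactly is an assumption here.

```python
import math

HALF_LIFE_DAYS = {"L1": 7.0, "L2": 30.0}


def recency_score(age_days, tier):
    return math.exp(-age_days / HALF_LIFE_DAYS[tier])


# Score after exactly one "half_life" of age: 1/e, roughly 0.368.
at_one_constant = recency_score(7.0, "L1")

# Age at which an L1 item crosses the 0.2 "cold" threshold:
# exp(-t / 7) = 0.2  =>  t = 7 * ln(5), roughly 11.27 days.
cold_age_days = HALF_LIFE_DAYS["L1"] * math.log(5)
```

Under these numbers, an L1 item would only be considered cold after about 11 days, so the 5-minute minimum age would never be the binding condition.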

Importance score (0.0–1.0):

Default: 0.5 on entry to L0
Updated by judge_importance() (Ollama) in background
High (0.7–1.0): facts, decisions, rules, technical choices, deadlines
Medium (0.4–0.6): useful context, questions with domain content
Low (0.0–0.3): small talk, filler — discarded at L0 eviction

Registry member ranking (used within a single KG node):

score = 0.5 × sub_sim + 0.3 × recency + 0.2 × importance

Background jobs

The concurrency.py module manages a ThreadPoolExecutor(max_workers=4) with named job tracking and per-agent RLocks.

Job	Trigger	Action
`importance-judge:{id}`	Every `remember()` call	Calls `judge_importance(content)` via Ollama; updates `item.importance` on live L0 object or L1 if already evicted
`l1-finalize:{agent}:{id}`	Every L0→L1 promotion	Calls `condense(content)` via Ollama; sets `item.summary` on the L1 entry
`enforce-caps:{agent}`	Every 20th turn	Runs `evictor.enforce_l1()` — evicts oldest low-importance L1 items if count cap is hit
Decay sweep	`threading.Timer`, every 300s	Calls `move_decayed_l1_to_registry()` for cold L1 items

Lock discipline: locks are held only around store writes, never during LLM calls. This prevents a slow Ollama response from blocking the chat turn.

Race condition (importance judge vs L0 eviction): if the deque fills before the importance judge returns, the item is evicted with importance=0.5 (neutral) and is still promoted to L1, because 0.5 meets the promotion threshold. The judge then finds it in L1 via l1.get(item.id) and updates it there. This outcome is acceptable.
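
The lock discipline can be illustrated with a small sketch. Everything here is a stand-in: slow_judge() replaces the Ollama call with a short sleep, importance_by_id replaces the stores, and a single RLock replaces the per-agent locks in concurrency.py.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
agent_lock = threading.RLock()
importance_by_id = {"item-1": 0.5}   # stand-in for the L0/L1 stores


def slow_judge(content):
    # Stand-in for judge_importance(); a real call would hit Ollama.
    time.sleep(0.05)
    return 0.9 if "deadline" in content else 0.2


def importance_job(item_id, content):
    score = slow_judge(content)      # slow call runs with no lock held
    with agent_lock:                 # lock only around the store write
        if item_id in importance_by_id:
            importance_by_id[item_id] = score
    return score


future = executor.submit(importance_job, "item-1", "deadline is Friday")
# The chat thread can still take the lock while the judge is running.
acquired_while_judging = agent_lock.acquire(timeout=0.01)
if acquired_while_judging:
    agent_lock.release()
future.result()
executor.shutdown(wait=True)
```

If the lock were taken before slow_judge() instead, the chat thread's acquire would stall for the full duration of the LLM call.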

Design decisions and trade-offs

Single agent, no L3

The original design had two agents (planner + implementation) sharing an L3 warm tier. The single-agent redesign drops this complexity: one implementation agent, no cross-agent sharing, no L3.

Why: The two-agent model added coordination overhead without clear benefit for the use cases being targeted. L3 is replaced by the shared L2 KG which already provides cross-session persistence.

No LLM on the write hot path

The relevance gate does NOT call judge_importance() synchronously before L0 write. Only a fast heuristic check (length, stopwords, infra noise regex) runs inline.

Why: On CPU, a single Ollama call takes 10–30 seconds. Putting two LLM calls per chat turn (relevance check + reply generation) made the system unusable. The heuristic blocks obvious garbage; low-importance items are discarded at L0 eviction time instead.
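
A sketch of what such an inline heuristic might look like. The stopword list, the minimum length, the regex, and the reason strings are all invented for illustration; the real should_store() in lifecycle/relevance.py also runs the L1 duplicate check, which is omitted here.

```python
import re

# Hypothetical values; the real gate's lists and thresholds are not shown
# in this README.
STOPWORDS = {"a", "an", "the", "is", "it", "ok", "okay", "yes", "no",
             "so", "and", "or", "to", "of"}
INFRA_NOISE = re.compile(
    r"connection (refused|reset|error)|timed? ?out|traceback", re.IGNORECASE
)
MIN_CHARS = 8


def passes_heuristics(text):
    """Return (keep, reason). Cheap checks only; no LLM or embedding call."""
    stripped = text.strip()
    if len(stripped) < MIN_CHARS:
        return False, "too_short"
    words = re.findall(r"[a-z0-9']+", stripped.lower())
    if not words or all(w in STOPWORDS for w in words):
        return False, "all_stopwords"
    if INFRA_NOISE.search(stripped):
        return False, "infra_noise"
    return True, "pass"
```

Each check is a string operation that completes in microseconds, which is what keeps the write path off the 10–30 second Ollama latency.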

RETRIEVAL_MIN_SIM = 0.6

Nothing below 0.6 cosine similarity is injected into the LLM prompt, even if it would otherwise rank in the top-k.

Why: BGE-large embeddings in the 0.5–0.6 range often represent "same general domain" rather than "same topic". Injecting these caused the LLM to hallucinate connections — e.g., talking about MongoDB when asked about C++ because session-caching notes scored 0.52 to the C++ query. The 0.6 floor keeps the context tight.

Trade-off: Some genuinely relevant memories with similarity just below 0.6 get filtered. This is preferable to polluting the context with loosely related content.

co_occurs threshold = RETRIEVAL_MIN_SIM

co_occurs edges form only between nodes that BOTH score ≥ 0.6 to the same query. This is stricter than earlier designs that only used selected (top-k) nodes.

Why: A node that scores 0.52 to a database query is only loosely relevant — forming a co_occurs edge based on it would create semantically weak connections. Nodes that both clear the 0.6 bar are genuinely about the same topic.

KG_CLUSTER_THRESHOLD = 0.70

Two turns must have combined embedding cosine ≥ 0.70 to land in the same KG node.

Why: At 0.70, nodes represent tight topic clusters. Lowering it would merge "we use Redis for session caching" with "can I use MongoDB for session caching" into one node — they're related but distinct questions that deserve separate registry entries. The KG's job is indexing, not deduplication.

No contradiction detection

Contradiction detection was dropped. When two contradicting memories exist (e.g., "we use PostgreSQL" and "we switched to MySQL"), the LLM reconciles them in-context based on recency/importance scores.

Why: Detecting contradiction requires an LLM call per new write against all similar existing memories. The latency cost outweighs the benefit, especially since importance scoring already causes low-relevance or stale items to decay out.

L0 eviction as the L1 write trigger

Items write to L1 on eviction from L0 (write-back), not on entry to L0 (write-through).

Why: Write-through would persist everything including small talk that the importance judge later scores as low. Write-back ensures the background judge has had a chance to update the importance score before the promotion decision is made.

Configuration reference

All knobs are in config.py:

Name	Default	Description
`L0_MAXLEN`	20	L0 deque capacity
`L0_PROMOTE_THRESHOLD`	0.5	Min importance to promote from L0 to L1
`L1_MAXLEN`	40	L1 soft cap (enforced by the fallback evictor; the decay sweep is the primary drain)
`L1_DECAY_SWEEP_SECS`	300	How often the decay sweep runs (seconds)
`L1_DECAY_THRESHOLD`	0.2	recency_score below this = cold
`L1_MIN_AGE_MINS`	5	Item must also be this old before migrating to L2
`K_L1`	10	L1 candidates fetched per query
`KG_CLUSTER_THRESHOLD`	0.70	Cosine threshold for node membership
`KG_NODE_MEMBER_CAP`	20	Max registry rows per node before split
`K_L2_NODES`	6	KG nodes searched at retrieval time
`REGISTRY_RESULTS_PER_NODE`	2	Registry rows fetched per KG node
`HOT_BUFFER_SIZE`	20	Hot buffer LRU capacity
`HOT_BUFFER_HIT_THRESHOLD`	0.6	Min cosine to return from hot buffer
`RETRIEVAL_MIN_SIM`	0.6	Min similarity to inject memory into prompt
`W_SEMANTIC`	0.6	Score fusion: semantic weight
`W_RECENCY`	0.3	Score fusion: recency weight
`W_IMPORTANCE`	0.1	Score fusion: importance weight
`OLLAMA_MODEL`	`qwen2.5:3b`	LLM for replies + importance scoring
`EMBED_MODEL`	`BAAI/bge-large-en-v1.5`	Sentence-transformers embedding model

Running

Prerequisites:

Python 3.12
Ollama running on localhost:11434
The model pulled: ollama pull qwen2.5:3b (or qwen2.5:0.5b for faster CPU testing)

# One-time setup
cd simple_memory
python3.12 -m venv .venv
.venv/bin/pip install -r requirements.txt

# Start the web UI (from the ai-twinnn/ parent directory)
OLLAMA_MODEL=qwen2.5:0.5b simple_memory/.venv/bin/uvicorn simple_memory.server:app --host 0.0.0.0 --port 8000

# Open http://localhost:8000

First run downloads the embedding model (~1.3 GB for BAAI/bge-large-en-v1.5).

Environment variables:

Variable	Default	Notes
`OLLAMA_URL`	`http://localhost:11434/api/generate`
`OLLAMA_MODEL`	`qwen2.5:3b`	Use `qwen2.5:0.5b` for faster CPU testing
`EMBED_MODEL`	`BAAI/bge-large-en-v1.5`
`EMBED_QUERY_PREFIX`	`Represent this sentence for searching relevant passages:`	Set to `""` for non-BGE models
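
For illustration, the overrides in the table above could be read like this. load_settings() and with_query_prefix() are hypothetical helpers, and joining the prefix to the text with a single space is an assumption about how the embedding code applies it.

```python
import os

DEFAULTS = {
    "OLLAMA_URL": "http://localhost:11434/api/generate",
    "OLLAMA_MODEL": "qwen2.5:3b",
    "EMBED_MODEL": "BAAI/bge-large-en-v1.5",
    "EMBED_QUERY_PREFIX": "Represent this sentence for searching relevant passages:",
}


def load_settings(env=None):
    """Read each variable from the environment, falling back to the default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


def with_query_prefix(text, settings):
    # An empty prefix (for non-BGE models) leaves the query text untouched.
    prefix = settings["EMBED_QUERY_PREFIX"]
    return f"{prefix} {text}" if prefix else text
```

Passing a plain dict as env makes the lookup easy to exercise without touching the real process environment.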

API reference

Method	Path	Description
GET	`/`	Web UI
GET	`/api/meta`	Embed model, LLM model, agent list
GET	`/api/agents`	`["implementation"]`
GET	`/api/state`	Full tier state for all agents + KG
GET	`/api/state/{agent}`	L0/L1/hot_buffer/registry counts + KG
GET	`/api/l4`	KG nodes + edges
GET	`/api/jobs`	`{"in_flight": N}` background job count
GET	`/api/registry/{agent}`	Recent registry rows (debug view)
POST	`/api/chat/{agent}`	`{"message": "…"}` → reply + retrieval trace
POST	`/api/explain/{agent}`	`{"query": "…"}` → retrieval trace without LLM
POST	`/api/consolidate/{agent}`	Force L0+L1 → L2 migration now (bypasses age check)
POST	`/api/abstract/{agent}`	Super-cluster KG nodes → concept nodes (manual)
POST	`/api/kg/dedupe`	Merge near-duplicate KG nodes (cosine ≥ 0.93)
POST	`/api/seed/{agent}`	`{"facts": ["…"]}` → bulk-remember, skips relevance gate
POST	`/api/reset`	Drain background jobs, close stores, wipe `data/`
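
A minimal client sketch for the chat endpoint, using only the standard library. It assumes the server is reachable at http://localhost:8000 as in the Running section; build_chat_request() and send() are hypothetical helpers, and the response is returned as parsed JSON without assuming its field names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_chat_request(message, agent="implementation", base_url=BASE_URL):
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat/{agent}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    # Requires the server to be running. The timeout is generous because
    # reply generation on CPU can be slow.
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage would be send(build_chat_request("Which cache do we use?")); swapping the path for /api/explain/{agent} with a {"query": "…"} body follows the same pattern.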

Directory layout

simple_memory/
  config.py                   # all tunable knobs
  models.py                   # MemoryItem dataclass
  concurrency.py              # ThreadPoolExecutor + per-agent RLocks

  llm/
    embed.py                  # embed() / embed_query() / embed_many()
    ollama.py                 # chat() / condense() / judge_importance()

  memory/
    l0_active.py              # deque ring buffer + evict callback + get_by_id + drain
    l1_recent.py              # ChromaDB wrapper; search / search_doc_space / upsert
    registry.py               # SQLite paired-turn store
    hot_buffer.py             # in-memory LRU; get_nearby returns sub_sim-annotated copies
    l4_kg.py                  # SQLite KG: nodes, edges, members; find_or_create, expand_with_neighbors

  lifecycle/
    relevance.py              # should_store(): heuristic gate + L1 dup check
    consolidator.py           # move_decayed_l1_to_registry() + abstract_l2()
    decay_sweep.py            # DecaySweep: background timer → consolidator
    evictor.py                # enforce_l1() + enforce_registry() (cap fallbacks)

  retrieval/
    scorer.py                 # recency_score(), fused_score()
    pipeline.py               # explain_retrieve(): 4-tier hierarchy + graph expansion

  agents/
    agent.py                  # Agent: write path, chat(), background jobs, shutdown()
    registry.py               # AgentSpec: "implementation" agent definition

  static/
    index.html                # single-page UI (Tailwind + Chart.js + vis-network)

  server.py                   # FastAPI app
  cli.py                      # interactive terminal chat
  README.md                   # this file

  data/                       # gitignored — created on first write
    implementation/
      l1.chroma/              # ChromaDB vector store
      registry.sqlite         # paired-turn registry
    shared/
      l4.sqlite               # KG nodes, edges, members
