barkhaaroraa/simple-memory

simple_memory

A single-agent persistent memory system for LLMs, modeled on CPU cache architecture. Memories flow from a fast volatile ring buffer through a vector store into a knowledge-graph-backed registry, with retrieval always pulling from the most recently hot tier first.


Table of Contents

  1. Architecture overview
  2. Memory tiers
  3. Write path
  4. Retrieval pipeline
  5. Knowledge graph
  6. Scoring and ranking
  7. Background jobs
  8. Design decisions and trade-offs
  9. Configuration reference
  10. Running
  11. API reference
  12. Directory layout

Architecture overview

┌─────────────────────────────────────────────────────────────────────┐
│  USER MESSAGE                                                        │
│       │                                                              │
│       ▼                                                              │
│  RELEVANCE GATE  ──reject──▶  dropped (logged in skipped[])         │
│  (heuristic + dup check)                                             │
│       │ pass                                                         │
│       ▼                                                              │
│  L0 Active  deque(20)  ◀── background: judge_importance() updates   │
│       │                    item.importance on live object            │
│       │ eviction (FIFO, head drops when deque is full)               │
│       │   importance ≥ 0.5  →  write to L1                          │
│       │   importance < 0.5  →  discard                              │
│       ▼                                                              │
│  L1 Recent  ChromaDB   ◀── decay sweep every 5 min                  │
│       │                                                              │
│       │ cold items (recency < 0.2, age > 5 min)                     │
│       │   pair user+agent turns                                      │
│       │   embed combined text                                        │
│       │   find_or_create KG node (cosine cluster, threshold=0.70)   │
│       │   write to Registry                                          │
│       ▼                                                              │
│  L2 Registry (SQLite)  ←─→  L2 KG (SQLite)                          │
│  raw paired turns              cluster centroids                     │
│  embeddings + metadata         co_occurs edges                       │
│                                                                      │
│  HOT BUFFER (in-memory LRU, 20 slots)                               │
│  ← populated from L1 and L2 hits during retrieval                   │
│  → checked first on every query                                      │
└─────────────────────────────────────────────────────────────────────┘

Memory tiers

L0 — Active context (RAM, volatile)

  • Implementation: memory/l0_active.py, a collections.deque(maxlen=20)
  • Lifetime: current process only; lost on restart
  • Contents: raw MemoryItem objects (user messages, agent replies, promoted recalls)
  • Capacity: 20 items (configurable L0_MAXLEN)
  • Eviction: FIFO — oldest item drops when the deque is full, triggering _l0_evicted()

Every item enters L0 with importance=0.5. A background Ollama job (judge_importance()) updates the live object's importance score. When the item is later evicted, the final importance value decides promotion vs discard.

Promoted entries from L1 or L2 (source="promoted_l1" / "promoted_l2") are written directly into L0's deque but bypass the evict-to-L1 path — they're cache recalls, not new information, so re-promoting them would create duplicates.
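
The eviction rule above can be sketched as follows. This is a minimal illustration, not the code in memory/l0_active.py: the `L0Active` class, the `on_evict` callback, and the `should_promote` helper are hypothetical names. A `deque(maxlen=...)` drops its head silently, so the sketch captures the head before appending.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional

L0_MAXLEN = 20
L0_PROMOTE_THRESHOLD = 0.5


@dataclass
class MemoryItem:
    # Reduced to the fields this sketch needs; the real dataclass has more.
    content: str
    importance: float = 0.5
    source: str = "user"


class L0Active:
    """Bounded FIFO that reports the evicted head before it is lost."""

    def __init__(self, maxlen: int = L0_MAXLEN,
                 on_evict: Optional[Callable[[MemoryItem], None]] = None):
        self._items = deque(maxlen=maxlen)
        self._on_evict = on_evict

    def append(self, item: MemoryItem) -> None:
        # deque(maxlen=...) discards the head without notice, so grab it first.
        evicted = None
        if len(self._items) == self._items.maxlen:
            evicted = self._items[0]
        self._items.append(item)
        if evicted is not None and self._on_evict is not None:
            self._on_evict(evicted)


def should_promote(item: MemoryItem) -> bool:
    """Promotion rule at eviction time: skip recalls, then apply the threshold."""
    if item.source.startswith("promoted_"):
        return False  # cache recall, already stored in a lower tier
    return item.importance >= L0_PROMOTE_THRESHOLD
```

An `on_evict` callback would typically call `should_promote(item)` and write to L1 only when it returns True.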


L1 — Recent vector store (ChromaDB, persistent)

  • Implementation: memory/l1_recent.py
  • Lifetime: survives restarts; persists in data/{agent_id}/l1.chroma/
  • Contents: items that survived L0 eviction with importance ≥ 0.5
  • Capacity: soft cap at 40 items (L1_MAXLEN); primary drain is the decay sweep
  • Search: BGE-large-en-v1.5 embeddings (1024-dim), cosine similarity via ChromaDB

L1 is the "recently relevant" tier — content that's been scored as important enough to keep but hasn't aged into long-term storage yet. The decay sweep migrates cold items to L2 every 5 minutes.

L1 supports two search modes:

  • Query-space search (search()): uses the BGE query prefix ("Represent this sentence for searching...") for retrieval
  • Document-space search (search_doc_space()): raw document encoder, used by the duplicate gate in the relevance check

Hot Buffer — Session LRU cache (RAM, volatile)

  • Implementation: memory/hot_buffer.py
  • Lifetime: current process only
  • Contents: items recently retrieved from L1 or L2 this session, stored with their embeddings
  • Capacity: 20 slots (LRU eviction)
  • Hit threshold: cosine ≥ 0.6 to the current query embedding

The hot buffer is the fastest tier — a flat cosine scan over ≤ 20 embeddings. It prevents re-traversing the KG and re-querying the registry for content that was already fetched this session.

Entries are populated by the retrieval pipeline:

  • L1 hits → stored with their full embedding so future queries fire as HOT
  • L2 hits → row fetched from registry, stored with registry embedding

The hot_nodes set on the buffer tracks which KG node IDs have been fetched this session, allowing the L2 search to skip nodes already in the buffer.
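
A rough sketch of such a buffer, assuming an `OrderedDict` for LRU ordering and a pure-Python cosine. The `put` and `get_nearby` signatures here are illustrative guesses, not the actual interface of memory/hot_buffer.py.

```python
import math
from collections import OrderedDict

HOT_BUFFER_SIZE = 20
HOT_BUFFER_HIT_THRESHOLD = 0.6


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class HotBuffer:
    def __init__(self, size=HOT_BUFFER_SIZE, threshold=HOT_BUFFER_HIT_THRESHOLD):
        self._entries = OrderedDict()  # item_id -> (embedding, payload)
        self._size = size
        self._threshold = threshold
        self.hot_nodes = set()  # KG node IDs fetched this session

    def put(self, item_id, embedding, payload, kg_node_id=None):
        self._entries[item_id] = (embedding, payload)
        self._entries.move_to_end(item_id)  # mark as most recently used
        if kg_node_id is not None:
            self.hot_nodes.add(kg_node_id)
        while len(self._entries) > self._size:
            self._entries.popitem(last=False)  # drop least recently used

    def get_nearby(self, query_emb):
        """Flat scan; returns hits annotated with sub_sim, best first."""
        hits = []
        for item_id, (emb, payload) in list(self._entries.items()):
            sim = cosine(query_emb, emb)
            if sim >= self._threshold:
                self._entries.move_to_end(item_id)  # a hit refreshes recency
                hits.append({"id": item_id, "payload": payload, "sub_sim": sim})
        return sorted(hits, key=lambda h: h["sub_sim"], reverse=True)
```

This sketch does not prune `hot_nodes` when an entry is evicted; whether the real buffer does is not stated here.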


L2 — Long-term storage (SQLite, persistent)

L2 has two components that work together:

Registry (memory/registry.py)

  • Table: registry — one row per conversation turn pair (user message + agent reply)
  • Each row stores: user_content, agent_content, combined embedding (float32 BLOB), importance, timestamps, access_count, kg_node_id
  • Embedding is embed("User: {msg}\nAgent: {reply}") — the combined turn embedding
  • Retrieved via KG node membership lookup, ranked by 0.5·sub_sim + 0.3·recency + 0.2·importance

Knowledge Graph (memory/l4_kg.py)

  • Table: kg_nodes — cluster centroids, one node per topic cluster
  • Table: kg_edges — relations between nodes (co_occurs, derived_from, abstracted_from)
  • Table: kg_node_members — registry row → KG node membership
  • Node centroids are running means of the embeddings of all their member turns
  • Nodes split when member_count > 20 (re-clustered at threshold + 0.05)

The KG is shared across the agent's lifetime and acts as the index for the registry. Registry rows without a KG node cannot be retrieved by semantic search.


Write path

user_msg
    │
    ▼
relevance gate (should_store)
    ├── too short / all stopwords → drop
    ├── infra noise (connection errors etc.) → drop
    └── doc-space dup in L1 (cosine ≥ 0.92) → drop, bump existing

    │ pass
    ▼
MemoryItem(importance=0.5, source="user")
    │
    ├── appended to L0 deque
    └── background job submitted: judge_importance(content)
            → updates item.importance on live L0 object
            → if already evicted to L1, updates there instead

    (L0 fills up → head evicted)
    │
    ├── source is "promoted_*" → skip (don't re-promote)
    ├── importance < L0_PROMOTE_THRESHOLD (0.5) → discard
    └── importance ≥ 0.5 → L1.write(item)
            └── background: condense(content) → set item.summary

    (decay sweep fires every 5 min)
    │
    └── for each cold L1 item (recency_score < 0.2 AND age > 5 min):
            pair user+agent items by timestamp proximity
            embed("User: {user}\nAgent: {agent}")
            find_or_create_node(emb, threshold=0.70)
                → if existing node centroid cosine ≥ 0.70: update centroid (running mean)
                → else: create new node with label=user_text[:80]
            registry.write(row_id, user_content, agent_content, ...)
            kg.add_member(node_id, row_id)
            if node.member_count > 20: kg.split_node(...)
            l1.delete(item_ids)

Key property: Nothing on the critical path (chat response) blocks on an LLM call except the main reply generation. Importance scoring and summary condensing always run off-thread.
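
The pairing step in the decay sweep can be illustrated with a small sketch. The write path only says turns are paired "by timestamp proximity", so the rule used here (each user item pairs with the next agent item within `max_gap_secs`) and the names `Turn`, `pair_turns`, `max_gap_secs`, and the "agent" source label are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    source: str       # "user" or "agent" (the "agent" label is assumed)
    content: str
    timestamp: float  # seconds since epoch


def pair_turns(items, max_gap_secs=120.0):
    """Pair each user item with the next agent item that follows it closely."""
    pairs = []
    pending_user = None
    for item in sorted(items, key=lambda t: t.timestamp):
        if item.source == "user":
            pending_user = item
        elif item.source == "agent" and pending_user is not None:
            if item.timestamp - pending_user.timestamp <= max_gap_secs:
                pairs.append((pending_user, item))
            pending_user = None  # an agent reply closes the open user turn
    return pairs


def combined_text(user, agent):
    """Text that gets embedded for the registry row."""
    return f"User: {user.content}\nAgent: {agent.content}"
```

Items left unpaired (an agent reply with no preceding user turn, or a reply that arrives too late) are simply skipped in this sketch; how the consolidator handles them is not covered here.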


Retrieval pipeline

retrieval/pipeline.py:explain_retrieve() — called on every chat turn.

query_emb = embed_query(user_msg)   # BGE query prefix applied

1. HOT BUFFER  (in-memory cosine scan, ≤ 20 entries)
   │  threshold: cosine ≥ 0.6
   │  returns entries with sub_sim set to computed cosine
   │
2. L0 tail  (last 5 items from conversation deque)
   │  always shown verbatim — not scored, not filtered
   │
3. L1 search  (ChromaDB ANN, k=10)
   │  filters out items with superseded_by set
   │  returns (item, sim, embedding) triples
   │
4. L2: KG node search → registry dereference
   │  only when L1 produces fewer than k_total=6 items above 0.6 sim
   │  kg.search_nodes(query_emb, k=6, exclude=hot_nodes)
   │    → top-6 nodes by centroid cosine, skipping hot_nodes
   │  graph expansion: expand_with_neighbors(found_ids, query_emb)
   │    → also fetch neighbor nodes (via any edge) with sim ≥ 0.6
   │  registry.get_members(node_id, query_emb, top_n=2) per node
   │    → ranked by 0.5·sub_sim + 0.3·recency + 0.2·importance
   │
▼
all_candidates = hot + l1 + l2
sort by fused_score = 0.6·sim + 0.3·recency + 0.1·importance

FILTER: sim < RETRIEVAL_MIN_SIM (0.6) → mark "below_sim_threshold", never inject

DEDUPE + SELECT TOP-6:
  hash content, skip duplicates
  first 6 non-dropped items → selected=True

SIDE EFFECTS (only for selected items):
  L1 hits → hot_buffer.put(entry with embedding)
           → promote summary to L0 as source="promoted_l1"
  L2 hits → hot_buffer.put(registry row)
           → registry.bump_access_count()
           → promote "User: … → Agent: …" summary to L0 as source="promoted_l2"
           → hot_nodes.add(kg_node_id)

CO_OCCURS EDGES:
  all node_ids where sim ≥ 0.6 (selected or not) form co_occurs pairs
  kg.bump_cooccurs(a, b) for every pair above threshold
  — captures semantic co-relevance even for nodes crowded out of top-k

Similarity cutoff of 0.6: anything below this threshold is excluded from the prompt entirely. This prevents memories from unrelated domains (e.g., database notes bleeding into a C/C++ discussion) from polluting the context.

Graph expansion: when a KG node is found by cosine search, its graph neighbors (via any edge relation) are also scored. Neighbors that score ≥ 0.6 to the query get their registry rows fetched too. This ensures that once two nodes build up a co_occurs edge (proven co-relevance), a query hitting one will pull in the other.
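
The filter, dedupe, and top-6 selection steps of the pipeline might look roughly like this. It is a sketch under assumptions: candidates are plain dicts that already carry a fused `score`, SHA-256 stands in for whatever content hash the pipeline uses, and the "duplicate" drop label is invented for the example ("below_sim_threshold" is the label named above).

```python
import hashlib

RETRIEVAL_MIN_SIM = 0.6
K_TOTAL = 6


def select_candidates(candidates, k=K_TOTAL, min_sim=RETRIEVAL_MIN_SIM):
    """Rank by fused score, drop low-sim and duplicate items, select the top k."""
    # Work on copies so the caller's dicts are not mutated.
    ranked = sorted((dict(c) for c in candidates),
                    key=lambda c: c["score"], reverse=True)
    seen = set()
    n_selected = 0
    for c in ranked:
        c["selected"] = False
        c["dropped"] = None
        if c["sim"] < min_sim:
            c["dropped"] = "below_sim_threshold"  # never injected
            continue
        digest = hashlib.sha256(c["content"].encode("utf-8")).hexdigest()
        if digest in seen:
            c["dropped"] = "duplicate"  # hypothetical label
            continue
        seen.add(digest)
        if n_selected < k:
            c["selected"] = True
            n_selected += 1
    return ranked
```

Note that a candidate with a high fused score can still be dropped: the similarity floor is applied independently of rank.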


Knowledge graph

Nodes

Each node is a topic cluster centroid. When a new paired turn is consolidated from L1:

  1. Its embedding is compared to all existing node centroids
  2. If any centroid has cosine ≥ 0.70 → the turn joins that node (centroid updated via running mean)
  3. If none match → a new node is created, labeled with the first 80 chars of the user message

The 0.70 threshold means nodes are topically tight — "we use Redis for session caching" and "when do we use MongoDB" produce separate nodes despite both being database-related.
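
The three steps above can be sketched with an in-memory dict in place of the SQLite tables. This is illustrative only: the node ID format and the choice to join the best-matching node when several clear the threshold are assumptions, not confirmed behaviour of find_or_create_node().

```python
import math

KG_CLUSTER_THRESHOLD = 0.70


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def find_or_create_node(nodes, emb, label, threshold=KG_CLUSTER_THRESHOLD):
    """nodes maps node_id -> {"centroid": [...], "member_count": int, "label": str}."""
    best_id, best_sim = None, -1.0
    for node_id, node in nodes.items():
        sim = cosine(node["centroid"], emb)
        if sim > best_sim:
            best_id, best_sim = node_id, sim

    if best_id is not None and best_sim >= threshold:
        node = nodes[best_id]
        n = node["member_count"]
        # Running mean: new_centroid = (old_centroid * n + emb) / (n + 1)
        node["centroid"] = [(c * n + e) / (n + 1)
                            for c, e in zip(node["centroid"], emb)]
        node["member_count"] = n + 1
        return best_id

    new_id = f"node-{len(nodes) + 1}"  # hypothetical ID scheme
    nodes[new_id] = {"centroid": list(emb), "member_count": 1,
                     "label": label[:80]}
    return new_id
```

The running mean keeps the update O(dim) per write and avoids re-reading every member embedding.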

Edges

co_occurs — The primary and most semantically meaningful edge type. Created when two nodes both have cosine ≥ 0.6 to the same retrieval query in the same batch. This is a proven signal: both nodes were genuinely relevant to the same question.

  • Weight incremented with each co-retrieval event
  • Edge is undirected (canonical ordering by node ID)
  • Used by graph expansion to discover related nodes
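
A minimal sketch of the co_occurs update, assuming edge weights live in a `Counter` keyed by an ordered `(a, b)` pair rather than in the kg_edges table. The function below is a stand-in for kg.bump_cooccurs(), whose real signature may differ.

```python
from collections import Counter
from itertools import combinations

RETRIEVAL_MIN_SIM = 0.6


def bump_cooccurs(edge_weights, node_sims, min_sim=RETRIEVAL_MIN_SIM):
    """node_sims maps node_id -> similarity to the current query."""
    # Sorting gives a canonical (smaller_id, larger_id) key for each
    # undirected edge, so (a, b) and (b, a) never become separate rows.
    relevant = sorted(n for n, s in node_sims.items() if s >= min_sim)
    for a, b in combinations(relevant, 2):
        edge_weights[(a, b)] += 1
    return edge_weights
```

For example, two queries that both surface `n1` and `n2` above the threshold leave a single edge `("n1", "n2")` with weight 2.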

derived_from — Created by add_node() (the old path used by abstract_l2). Points from a newer node to an older similar node. Not used in normal consolidation flow.

abstracted_from — Created by POST /api/abstract/{agent}. Synthetic concept nodes (LLM-summarized labels) point to their constituent cluster nodes. Manual trigger only.

Why not cosine-based "related_to" edges?

BGE-large-en-v1.5 produces high cosine similarity (0.60–0.70) between any two short tech questions regardless of topic. "When do we use MongoDB?" and "Is Python the fastest language?" score ~0.67. A fixed cosine threshold would create nonsensical edges between unrelated topics. Only co_occurs edges — which require actual retrieval co-occurrence — carry reliable semantic meaning.

Node splits

When a node's member_count exceeds KG_NODE_MEMBER_CAP (20), split_node() re-clusters its members at KG_CLUSTER_THRESHOLD + 0.05. If two or more sub-clusters form, new nodes replace the old one and registry rows are reassigned.


Scoring and ranking

Fused score (used for final candidate ranking):

score = 0.6 × sim + 0.3 × recency + 0.1 × importance

Recency score (exponential decay):

recency = exp(-age_days / half_life)
half_life: L1=7d, L2=30d

Importance score (0.0–1.0):

  • Default: 0.5 on entry to L0
  • Updated by judge_importance() (Ollama) in background
  • High (0.7–1.0): facts, decisions, rules, technical choices, deadlines
  • Medium (0.4–0.6): useful context, questions with domain content
  • Low (0.0–0.3): small talk, filler — discarded at L0 eviction

Registry member ranking (used within a single KG node):

score = 0.5 × sub_sim + 0.3 × recency + 0.2 × importance
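
The three formulas above translate directly into code. The function names follow retrieval/scorer.py where they are listed in the directory layout; `member_score` and the argument shapes are assumptions.

```python
import math

HALF_LIFE_DAYS = {"L1": 7.0, "L2": 30.0}


def recency_score(age_days, tier):
    # With exp(-age / half_life) as written, the score at age == half_life
    # is 1/e (about 0.37), so the parameter acts as a time constant.
    return math.exp(-age_days / HALF_LIFE_DAYS[tier])


def fused_score(sim, recency, importance):
    """Final candidate ranking across hot buffer, L1, and L2."""
    return 0.6 * sim + 0.3 * recency + 0.1 * importance


def member_score(sub_sim, recency, importance):
    """Ranking of registry rows within a single KG node."""
    return 0.5 * sub_sim + 0.3 * recency + 0.2 * importance
```

As a worked case, an L1 item that is 7 days old with sim 0.8 and importance 0.5 gets recency of about 0.37 and a fused score of about 0.64 (0.48 + 0.11 + 0.05).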

Background jobs

The concurrency.py module manages a ThreadPoolExecutor(max_workers=4) with named job tracking and per-agent RLocks.

| Job | Trigger | Does |
| --- | --- | --- |
| importance-judge:{id} | Every remember() call | Calls judge_importance(content) via Ollama; updates item.importance on live L0 object or L1 if already evicted |
| l1-finalize:{agent}:{id} | Every L0→L1 promotion | Calls condense(content) via Ollama; sets item.summary on the L1 entry |
| enforce-caps:{agent} | Every 20th turn | Runs evictor.enforce_l1(), which evicts oldest low-importance L1 items if count cap is hit |
| Decay sweep | threading.Timer, every 300s | Calls move_decayed_l1_to_registry() for cold L1 items |

Lock discipline: locks are held only around store writes, never during LLM calls. This prevents a slow Ollama response from blocking the chat turn.

Race condition (importance judge vs L0 eviction): if the deque fills before the importance judge returns, the item evicts with importance=0.5 (neutral) and still promotes to L1. The judge then finds it in L1 via l1.get(item.id) and updates it there. Acceptable outcome.
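
The lock discipline can be sketched as below. This is not the code in concurrency.py: the stub judge, the dict-based item, and the single module-level lock are simplifications chosen so the example runs without Ollama.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
agent_lock = threading.RLock()  # one lock here; the real module keeps one per agent


def judge_importance_stub(content):
    """Stand-in for the Ollama-backed judge_importance(); scoring is made up."""
    return 0.9 if "deadline" in content.lower() else 0.2


def importance_job(item, judge=judge_importance_stub):
    score = judge(item["content"])  # slow call runs with no lock held
    with agent_lock:                # lock only around the store write
        item["importance"] = score
    return score


def submit_importance_job(item):
    """Queue the job and return immediately; the chat turn does not wait."""
    return executor.submit(importance_job, item)
```

Because the lock is taken only after the judge returns, a slow model response delays the importance update but never a concurrent chat turn.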


Design decisions and trade-offs

Single agent, no L3

The original design had two agents (planner + implementation) sharing an L3 warm tier. The single-agent redesign drops this complexity: one implementation agent, no cross-agent sharing, no L3.

Why: The two-agent model added coordination overhead without clear benefit for the use cases being targeted. L3 is replaced by the shared L2 KG, which already provides cross-session persistence.

No LLM on the write hot path

The relevance gate does NOT call judge_importance() synchronously before L0 write. Only a fast heuristic check (length, stopwords, infra noise regex) runs inline.

Why: On CPU, a single Ollama call takes 10–30 seconds. Putting two LLM calls per chat turn (relevance check + reply generation) made the system unusable. The heuristic blocks obvious garbage; low-importance items are discarded at L0 eviction time instead.
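
An inline heuristic of this kind could look like the sketch below. The stopword list, the minimum length, the regex, and the reason strings are all invented for illustration; the actual values in lifecycle/relevance.py are not shown in this README. The L1 duplicate check is left out because it needs an embedding model.

```python
import re

# All three constants are illustrative guesses, not the project's values.
STOPWORDS = {"a", "an", "the", "is", "it", "ok", "so", "and", "or",
             "to", "of", "yes", "no"}
INFRA_NOISE = re.compile(
    r"connection (refused|reset|error)|timed? ?out|traceback", re.IGNORECASE)
MIN_CHARS = 8


def passes_heuristics(text):
    """Return (ok, reason) using only cheap string checks, no LLM call."""
    stripped = text.strip()
    if len(stripped) < MIN_CHARS:
        return False, "too_short"
    words = re.findall(r"[a-z0-9']+", stripped.lower())
    if not words or all(w in STOPWORDS for w in words):
        return False, "all_stopwords"
    if INFRA_NOISE.search(stripped):
        return False, "infra_noise"
    return True, "pass"
```

Each check costs microseconds, compared with the 10–30 seconds quoted above for a single Ollama call on CPU.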

RETRIEVAL_MIN_SIM = 0.6

Nothing below 0.6 cosine similarity is injected into the LLM prompt, even if it would otherwise rank in the top-k.

Why: BGE-large embeddings in the 0.5–0.6 range often represent "same general domain" rather than "same topic". Injecting these caused the LLM to hallucinate connections — e.g., talking about MongoDB when asked about C++ because session-caching notes scored 0.52 to the C++ query. The 0.6 floor keeps the context tight.

Trade-off: Some genuinely relevant memories with similarity just below 0.6 get filtered. This is preferable to polluting the context with loosely related content.

co_occurs threshold = RETRIEVAL_MIN_SIM

co_occurs edges form only between nodes that BOTH score ≥ 0.6 to the same query. This is stricter than earlier designs that only used selected (top-k) nodes.

Why: A node that scores 0.52 to a database query is only loosely relevant — forming a co_occurs edge based on it would create semantically weak connections. Nodes that both clear the 0.6 bar are genuinely about the same topic.

KG_CLUSTER_THRESHOLD = 0.70

Two turns must have combined embedding cosine ≥ 0.70 to land in the same KG node.

Why: At 0.70, nodes represent tight topic clusters. Lowering it would merge "we use Redis for session caching" with "can I use MongoDB for session caching" into one node — they're related but distinct questions that deserve separate registry entries. The KG's job is indexing, not deduplication.

No contradiction detection

Contradiction detection was dropped. When two contradicting memories exist (e.g., "we use PostgreSQL" and "we switched to MySQL"), the LLM reconciles them in-context based on recency/importance scores.

Why: Detecting contradiction requires an LLM call per new write against all similar existing memories. The latency cost outweighs the benefit, especially since importance scoring already causes low-relevance or stale items to decay out.

L0 eviction as the L1 write trigger

Items write to L1 on eviction from L0 (write-back), not on entry to L0 (write-through).

Why: Write-through would persist everything including small talk that the importance judge later scores as low. Write-back ensures the background judge has had a chance to update the importance score before the promotion decision is made.


Configuration reference

All knobs are in config.py:

| Name | Default | Description |
| --- | --- | --- |
| L0_MAXLEN | 20 | L0 deque capacity |
| L0_PROMOTE_THRESHOLD | 0.5 | Min importance to promote from L0 to L1 |
| L1_MAXLEN | 40 | L1 hard cap (fallback evictor; decay sweep is primary) |
| L1_DECAY_SWEEP_SECS | 300 | How often the decay sweep runs (seconds) |
| L1_DECAY_THRESHOLD | 0.2 | recency_score below this = cold |
| L1_MIN_AGE_MINS | 5 | Item must also be this old before migrating to L2 |
| K_L1 | 10 | L1 candidates fetched per query |
| KG_CLUSTER_THRESHOLD | 0.70 | Cosine threshold for node membership |
| KG_NODE_MEMBER_CAP | 20 | Max registry rows per node before split |
| K_L2_NODES | 6 | KG nodes searched at retrieval time |
| REGISTRY_RESULTS_PER_NODE | 2 | Registry rows fetched per KG node |
| HOT_BUFFER_SIZE | 20 | Hot buffer LRU capacity |
| HOT_BUFFER_HIT_THRESHOLD | 0.6 | Min cosine to return from hot buffer |
| RETRIEVAL_MIN_SIM | 0.6 | Min similarity to inject memory into prompt |
| W_SEMANTIC | 0.6 | Score fusion: semantic weight |
| W_RECENCY | 0.3 | Score fusion: recency weight |
| W_IMPORTANCE | 0.1 | Score fusion: importance weight |
| OLLAMA_MODEL | qwen2.5:3b | LLM for replies + importance scoring |
| EMBED_MODEL | BAAI/bge-large-en-v1.5 | Sentence-transformers embedding model |

Running

Prerequisites:

  • Python 3.12
  • Ollama running on localhost:11434
  • The model pulled: ollama pull qwen2.5:3b (or qwen2.5:0.5b for faster CPU testing)

# One-time setup
cd simple_memory
python3.12 -m venv .venv
.venv/bin/pip install -r requirements.txt

# Start the web UI (from the ai-twinnn/ parent directory)
OLLAMA_MODEL=qwen2.5:0.5b simple_memory/.venv/bin/uvicorn simple_memory.server:app --host 0.0.0.0 --port 8000

# Open http://localhost:8000

First run downloads the embedding model (~1.3 GB for BAAI/bge-large-en-v1.5).

Environment variables:

| Variable | Default | Notes |
| --- | --- | --- |
| OLLAMA_URL | http://localhost:11434/api/generate | |
| OLLAMA_MODEL | qwen2.5:3b | Use qwen2.5:0.5b for faster CPU testing |
| EMBED_MODEL | BAAI/bge-large-en-v1.5 | |
| EMBED_QUERY_PREFIX | Represent this sentence for searching relevant passages: | Set to "" for non-BGE models |
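
One way to read these variables with their documented defaults is sketched below. `load_settings` and `with_query_prefix` are hypothetical helpers, not functions from config.py, and the trailing space after the prefix's colon is an assumption.

```python
import os

# Trailing space after the colon is assumed, so prefix + text reads naturally.
DEFAULT_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def load_settings(env=None):
    """Read the four variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "OLLAMA_URL": env.get("OLLAMA_URL", "http://localhost:11434/api/generate"),
        "OLLAMA_MODEL": env.get("OLLAMA_MODEL", "qwen2.5:3b"),
        "EMBED_MODEL": env.get("EMBED_MODEL", "BAAI/bge-large-en-v1.5"),
        "EMBED_QUERY_PREFIX": env.get("EMBED_QUERY_PREFIX", DEFAULT_QUERY_PREFIX),
    }


def with_query_prefix(text, settings):
    """Query-side text; an empty prefix leaves the text unchanged."""
    return settings["EMBED_QUERY_PREFIX"] + text
```

Setting `EMBED_QUERY_PREFIX=""` in the environment therefore makes query-space and document-space inputs identical, which is what a non-BGE model would expect.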

API reference

| Method | Path | Description |
| --- | --- | --- |
| GET | / | Web UI |
| GET | /api/meta | Embed model, LLM model, agent list |
| GET | /api/agents | ["implementation"] |
| GET | /api/state | Full tier state for all agents + KG |
| GET | /api/state/{agent} | L0/L1/hot_buffer/registry counts + KG |
| GET | /api/l4 | KG nodes + edges |
| GET | /api/jobs | {"in_flight": N} background job count |
| GET | /api/registry/{agent} | Recent registry rows (debug view) |
| POST | /api/chat/{agent} | {"message": "…"} → reply + retrieval trace |
| POST | /api/explain/{agent} | {"query": "…"} → retrieval trace without LLM |
| POST | /api/consolidate/{agent} | Force L0+L1 → L2 migration now (bypasses age check) |
| POST | /api/abstract/{agent} | Super-cluster KG nodes → concept nodes (manual) |
| POST | /api/kg/dedupe | Merge near-duplicate KG nodes (cosine ≥ 0.93) |
| POST | /api/seed/{agent} | {"facts": ["…"]} → bulk-remember, skips relevance gate |
| POST | /api/reset | Drain background jobs, close stores, wipe data/ |
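
A small client sketch for the chat endpoint, using only the standard library. It assumes the server from the Running section is listening on localhost:8000 and that the response body is JSON; the response fields are not documented here, so the sketch returns the parsed body as-is. `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_chat_request(agent, message, base_url=BASE_URL):
    """Build the POST /api/chat/{agent} request without sending it."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat/{agent}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and parse the body as JSON (assumed response format)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `send(build_chat_request("implementation", "We use Redis for session caching"))` would return the reply together with the retrieval trace. The generous timeout reflects the slow CPU inference mentioned earlier.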

Directory layout

simple_memory/
  config.py                   # all tunable knobs
  models.py                   # MemoryItem dataclass
  concurrency.py              # ThreadPoolExecutor + per-agent RLocks

  llm/
    embed.py                  # embed() / embed_query() / embed_many()
    ollama.py                 # chat() / condense() / judge_importance()

  memory/
    l0_active.py              # deque ring buffer + evict callback + get_by_id + drain
    l1_recent.py              # ChromaDB wrapper; search / search_doc_space / upsert
    registry.py               # SQLite paired-turn store
    hot_buffer.py             # in-memory LRU; get_nearby returns sub_sim-annotated copies
    l4_kg.py                  # SQLite KG: nodes, edges, members; find_or_create, expand_with_neighbors

  lifecycle/
    relevance.py              # should_store(): heuristic gate + L1 dup check
    consolidator.py           # move_decayed_l1_to_registry() + abstract_l2()
    decay_sweep.py            # DecaySweep: background timer → consolidator
    evictor.py                # enforce_l1() + enforce_registry() (cap fallbacks)

  retrieval/
    scorer.py                 # recency_score(), fused_score()
    pipeline.py               # explain_retrieve(): 4-tier hierarchy + graph expansion

  agents/
    agent.py                  # Agent: write path, chat(), background jobs, shutdown()
    registry.py               # AgentSpec: "implementation" agent definition

  static/
    index.html                # single-page UI (Tailwind + Chart.js + vis-network)

  server.py                   # FastAPI app
  cli.py                      # interactive terminal chat
  README.md                   # this file

  data/                       # gitignored — created on first write
    implementation/
      l1.chroma/              # ChromaDB vector store
      registry.sqlite         # paired-turn registry
    shared/
      l4.sqlite               # KG nodes, edges, members
