mem0ai · rohram04 · Jun 20, 2026 · Jun 22, 2026 · Jun 22, 2026 · Jun 23, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,106 @@
+# memory-benchmarks — CLAUDE.md
+
+Harness for memory-system benchmarks (LongMemEval, LoCoMo, BEAM). Upstream it
+benchmarks **mem0**; this repo has been extended with a **MemoryManager (MM)**
+backend so the two can be compared head-to-head on LongMemEval, varying only the
+memory system. See the repo `README.md` for the original mem0 usage.
+
+## The comparison goal
+
+Fair LongMemEval head-to-head: **only the memory system differs.** Conversation
+chunks, the answerer (`gpt-4o`), the judge (`gpt-4o-mini`), and the embedder
+(`text-embedding-3-small`) are identical across both backends. mem0 OSS's defaults
+(`gpt-4o-mini` extraction + `text-embedding-3-small`) already match, and MM is
+configured to the same models.
+
+**Everything routes through OpenRouter on one key — no OpenAI key.** MM chat, MM
+embeddings, the harness answerer/judge, and the mem0 server all hit OpenRouter's
+OpenAI-compatible endpoint.
+
+## How the MM integration works
+
+MM lives in a **separate repo** (`~/MemoryManager`, override `MEMORYMANAGER_PATH`)
+and is a stateful managed-context system, not a stateless extract→search store. So
+the integration is **at the runner's call sites**, not a drop-in `Mem0Client`:
+
+- `benchmarks/common/mm_bridge.py` — builds a fresh, isolated `Agent` per question
+  (in-memory `LongTermStore`), and exposes `mm_ingest` / `mm_surface_and_format`
+  (PREP) / `mm_persist` (PERSIST). The **harness answerer** generates the answer
+  from MM's surfaced blocks (MM's own REPLY phase is bypassed — the fairness crux);
+  the **judge is unchanged**.
+- `benchmarks/longmemeval/run.py` — `--backend memorymanager` branch at the
+  ingest/answer call sites; all mem0 paths untouched. MM yields one managed-context
+  window (no top-k cutoffs), so it reports a single cutoff.
+
+MM-side dependencies (the OpenRouter `LLMBackend` and the `memory/embeddings.py`
+Embedder abstraction) are on MM's **`main`** branch — run against a `main` checkout.
+
+## All-OpenRouter routing (3 points + env)
+
+| Component | How it routes to OpenRouter |
+|---|---|
+| MM chat (gpt-4o / gpt-4o-mini) | `make_llm_backend(backend="openrouter")` (reads `OPENROUTER_API_KEY` from MM's `.env`) |
+| MM embedder | `--mm-embedding-model openrouter:openai/text-embedding-3-small` (default) → `OpenAIEmbedder(base_url=OpenRouter)` |
+| Harness answerer + judge | `LLMClient` (provider `openai`) honors `OPENAI_BASE_URL` + `OPENAI_API_KEY` env |
+| mem0 OSS server | `mem0-config.yaml` (`openai_base_url: …openrouter…`) mounted via `docker-compose.yml`; key via `OPENAI_API_KEY` |
+
+`.env` (repo root, gitignored — **create it yourself**, the key comes from MM's `.env`):
+```
+OPENAI_API_KEY=<your OpenRouter key>
+OPENAI_BASE_URL=https://openrouter.ai/api/v1
+OPENROUTER_API_KEY=<your OpenRouter key>
+```
+One-liner to create it from MM's `.env`:
+```bash
+K=$(grep -E '^OPENROUTER_API_KEY=' ~/MemoryManager/.env | head -1 | cut -d= -f2-) && \
+printf 'OPENAI_API_KEY=%s\nOPENAI_BASE_URL=https://openrouter.ai/api/v1\nOPENROUTER_API_KEY=%s\n' "$K" "$K" > .env
+```
+
+## Environment
+
+No `.venv` is committed here; the default `python3` may be too new for some wheels.
+**Reuse MM's venv** (Python 3.12, already has every dep + `aiolimiter` added):
+`~/MemoryManager/.venv/bin/python`. mem0 is reached over raw HTTP, so no mem0 SDK
+is needed. Ensure `~/MemoryManager` is on `main` (has `memory/embeddings.py`).
+
+## Running
+
+```bash
+cd ~/memory-benchmarks
+PY=~/MemoryManager/.venv/bin/python
+
+# mem0 OSS server (OpenRouter-backed via mem0-config.yaml), at localhost:8888
+docker compose up -d
+
+# MemoryManager backend
+$PY -m benchmarks.longmemeval.run --backend memorymanager \
+  --answerer-model openai/gpt-4o --judge-model openai/gpt-4o-mini --provider openai \
+  --mm-max-tokens 8000 --mm-embedding-model openrouter:openai/text-embedding-3-small \
+  --per-type 1 --max-workers 2 --project-name mm_smoke
+
+# mem0 OSS backend (same answerer/judge flags; no --mm-* flags)
+$PY -m benchmarks.longmemeval.run --backend oss \
+  --answerer-model openai/gpt-4o --judge-model openai/gpt-4o-mini --provider openai \
+  --per-type 1 --max-workers 4 --project-name mem0_smoke
+```
+
+MM fires several gpt-4o calls per question (PREP/PERSIST tool loops, and LLM-mode
+ingest is ~2 calls/haystack-pair) → **expensive**. Scale up deliberately:
+`--per-type 1` smoke → `--per-type 5` → `--all-questions` (500) only when intended.
+Use a lower `--max-workers` for MM than for mem0.
+
+### Key MM flags (`memorymanager` backend)
+- `--mm-max-tokens` (default 8000) — MM context-window budget; set near the token
+  size of the mem0 cutoff you compare against.
+- `--mm-model` / `--mm-util-model` — default `openai/gpt-4o` / `openai/gpt-4o-mini`.
+- `--mm-embedding-model` — default `openrouter:openai/text-embedding-3-small`; also
+  `openai:<m>`/`text-embedding-3-small` (OpenAI direct) or a sentence-transformers
+  name (local, keyless).
+
+## Fairness notes / known caveats
+- **Single managed window vs mem0 cutoffs:** MM has no top-k; it reports one cutoff.
+  Compare against the mem0 cutoff whose token footprint ≈ `--mm-max-tokens`.
+- **Per-memory dates:** MM blocks carry `created_at: None` unless `mm_ingest` is given
+  per-pair `dates`; undated blocks can't be date-grouped (mem0's can), a temporal-reasoning disadvantage.
+  Tracked follow-up in MM (thread session timestamps onto blocks).
+- **`--mode answerer` only** for the MM backend (retrieval mode is rejected).
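Review note on the "single managed window vs mem0 cutoffs" caveat: the matching step can be made explicit with a small helper. The sketch below is not part of the PR. `closest_cutoff` and the `measured` numbers are hypothetical, and it assumes you have already measured the mean token footprint of each mem0 top-k cutoff yourself.

```python
def closest_cutoff(cutoff_tokens: dict[int, float], mm_max_tokens: int) -> int:
    """Return the top-k cutoff whose mean token footprint is nearest the MM budget.

    cutoff_tokens maps a mem0 top-k cutoff to its measured mean token count.
    Ties are broken in favour of the smaller cutoff.
    """
    if not cutoff_tokens:
        raise ValueError("no cutoffs measured")
    return min(
        cutoff_tokens,
        key=lambda k: (abs(cutoff_tokens[k] - mm_max_tokens), k),
    )


# Illustrative numbers only, not measured results.
measured = {10: 900.0, 50: 4200.0, 100: 8300.0, 200: 16500.0}
print(closest_cutoff(measured, 8000))
```

With these made-up footprints, the cutoff at 100 is the one to compare against an MM run using `--mm-max-tokens 8000`.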
diff --git a/benchmarks/common/mm_bridge.py b/benchmarks/common/mm_bridge.py
@@ -0,0 +1,208 @@
+"""Bridge to the custom MemoryManager (MM) system for the LongMemEval runner.
+
+MM lives in a separate repo (default ~/MemoryManager, override with
+MEMORYMANAGER_PATH). Unlike mem0's stateless extract→search pipeline, MM is a
+stateful managed-context system. Each benchmark question gets a fresh Agent with
+an isolated in-memory LongTermStore.
+
+Two memory modes mirror Agent's native turn structures (REPLY is always the
+external harness answerer, not MM's model):
+
+  llm mode — ingest:  PREP(user) → PERSIST(user, assistant) per haystack pair
+             query:   PREP(question) → harness answer → PERSIST(question, answer)
+
+  algorithmic mode — ingest:  receive(user) → receive(assistant) per pair
+                     query:   receive(question) → harness answer → receive(answer)
+
+Only the surfaced memory differs from the mem0 path — the answerer prompt,
+model, and judge stay identical, so the memory system is the sole variable.
+
+Each pair is ingested under its real-world session date (via the scoped
+ContextManager.using_source_date), so surfaced blocks carry a created_at the
+answerer can date-order — parity with mem0's per-memory timestamps.
+
+LLM-mode ingest is expensive (~2 tool-loop calls per haystack pair).
+
+The LLM backend (OpenRouter, gpt-4o family) and embedder are process-wide
+singletons built once and shared across per-question agents.
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+import threading
+
+_MM_PATH = os.environ.get("MEMORYMANAGER_PATH", os.path.expanduser("~/MemoryManager"))
+if _MM_PATH not in sys.path:
+    sys.path.insert(0, _MM_PATH)
+
+# MM imports are resolved lazily inside _ensure()/make_mm_agent so that importing
+# this module (e.g. for a syntax/import smoke test) does not require API keys or
+# pull in heavy deps until an agent is actually built.
+
+_LOCK = threading.Lock()
+_BACKEND = None
+_EMBEDDER = None
+
+
+def _ensure(embedding_model: str):
+    """Build (once) and return the shared (backend, embedder) singletons."""
+    global _BACKEND, _EMBEDDER
+    with _LOCK:
+        if _BACKEND is None:
+            from llm import make_llm_backend  # MM's llm package
+
+            # OpenRouter chat-completions backend (gpt-4o / gpt-4o-mini). Reads
+            # OPENROUTER_API_KEY from env / MM's repo-root .env.
+            _BACKEND = make_llm_backend(backend="openrouter")
+        if _EMBEDDER is None:
+            from memory.embeddings import make_embedder  # MM's embedder factory
+
+            # Resolves a spec to the right backend: "text-embedding-3-small" /
+            # "openai:<m>" (OpenAI direct), "openrouter:<m>" (via OpenRouter), or
+            # a sentence-transformers name (local). Matches mem0 OSS's embedder.
+            _EMBEDDER = make_embedder(embedding_model)
+    return _BACKEND, _EMBEDDER
+
+
+def make_mm_agent(
+    max_tokens: int,
+    model: str = "openai/gpt-4o",
+    util_model: str = "openai/gpt-4o-mini",
+    embedding_model: str = "openrouter:openai/text-embedding-3-small",
+    novelty_mode=None,
+    mode: str = "llm",
+    clock_seconds_per_turn: float = 0.0,
+):
+    """Build a fresh, isolated MemoryManager Agent for one benchmark question."""
+    from agent import Agent, MemoryMode
+    from controller import MemoryController
+    from ContextManager import ContextManager
+    from functions.llm_fns import make_compress_fn, make_merge_fn
+    from memory.config import MemoryConfig
+    from memory.longterm import LongTermStore
+    from memory.novelty import NoveltyMode
+    from memory.store import ContextStore
+
+    backend, embedder = _ensure(embedding_model)
+
+    cfg = MemoryConfig(clock_seconds_per_turn=clock_seconds_per_turn)
+    store = ContextStore(max_tokens=max_tokens, config=cfg)
+    lt = LongTermStore("sqlite:///:memory:")  # isolated per agent; StaticPool = thread-safe
+    cm = ContextManager(store, lt, embedding_model=embedder, config=cfg)
+    controller = MemoryController(
+        cm,
+        compress_fn=make_compress_fn(backend, util_model),
+        merge_fn=make_merge_fn(backend, cm, util_model),
+        config=cfg,
+    )
+    mm_mode = MemoryMode.LLM if str(mode).lower() == "llm" else MemoryMode.ALGORITHMIC
+    return Agent(
+        controller,
+        backend,
+        model=model,
+        mode=mm_mode,
+        novelty_mode=(novelty_mode or NoveltyMode.EMBEDDING),
+        novelty_model=util_model,
+    )
+
+
+def _split_pair(chunk) -> tuple[str, str]:
+    """Extract (user, assistant) content from one haystack pair; the last message per role wins."""
+    if isinstance(chunk, str):
+        return chunk, ""
+    user = assistant = ""
+    for msg in chunk:
+        if not isinstance(msg, dict):
+            continue
+        role = msg.get("role", "")
+        content = msg.get("content", "")
+        if role == "user":
+            user = content
+        elif role == "assistant":
+            assistant = content
+    return user, assistant
+
+
+def _algorithmic_receive(agent, content: str) -> None:
+    """One receive() pass — mirrors the memory half of _algorithmic_turn."""
+    if not content.strip():
+        return
+    embedding = agent._controller.embed(content)
+    agent._controller.receive(
+        content, embedding, agent._novelty_fn(content, embedding)
+    )
+
+
+def _format_blocks(agent) -> list[dict]:
+    """Return in-context blocks shaped like mem0 search hits for the harness.
+
+    created_at is the block's source_date (the real-world date of the memory's
+    content) as an ISO string, so the answerer prompt can date-group/-order
+    memories just like it does for mem0's per-memory timestamps.
+    """
+    blocks = agent._controller._cm._store.all_blocks()
+    out: list[dict] = []
+    for b in blocks:
+        sd = getattr(b, "source_date", None)
+        out.append(
+            {
+                "memory": b.content,
+                "score": float(getattr(b, "novelty_score", 0.0)),
+                "created_at": sd.isoformat() if sd else None,
+            }
+        )
+    return out
+
+
+def mm_ingest(agent, pairs, dates=None) -> None:
+    """Ingest haystack pairs using the agent's native mode lifecycle (no REPLY).
+
+    ``dates`` is an optional list parallel to ``pairs`` of datetimes (the real-world
+    session date of each pair); when given, each pair's blocks are stamped with that
+    source_date via the scoped using_source_date() so the answerer can date-order
+    them. None entries (or dates=None) leave blocks undated.
+    """
+    from agent import MemoryMode
+
+    cm = agent._controller._cm
+    for i, chunk in enumerate(pairs):
+        user, assistant = _split_pair(chunk)
+        if not user.strip() and not assistant.strip():
+            continue
+        source_date = dates[i] if dates is not None else None
+        with cm.using_source_date(source_date):
+            if agent._mode == MemoryMode.LLM:
+                if user.strip():
+                    agent._llm_prep_phase(user)
+                if user.strip() or assistant.strip():
+                    agent._llm_persist_phase(user, assistant)
+            else:
+                _algorithmic_receive(agent, user)
+                _algorithmic_receive(agent, assistant)
+
+
+def mm_surface_and_format(agent, question_text: str) -> list[dict]:
+    """Surface relevant memory into context, then return blocks for the harness.
+
+    Surfaced blocks carry the source_date stamped at ingest (see mm_ingest), which
+    _format_blocks emits as created_at for date-ordering in the answerer prompt.
+    """
+    from agent import MemoryMode
+
+    if agent._mode == MemoryMode.LLM:
+        agent._llm_prep_phase(question_text)
+    else:
+        _algorithmic_receive(agent, question_text)
+    return _format_blocks(agent)
+
+
+def mm_persist(agent, question_text: str, answer: str) -> None:
+    """Persist the Q+A exchange after the harness answer."""
+    from agent import MemoryMode
+
+    if agent._mode == MemoryMode.LLM:
+        agent._llm_persist_phase(question_text, answer)
+    else:
+        _algorithmic_receive(agent, answer)
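Review note on the `created_at` field emitted by `_format_blocks`: the docstrings say the answerer prompt can date-group these blocks. A minimal sketch of what that grouping might look like on the harness side follows. `group_by_date` and the sample blocks are hypothetical, not taken from the PR or from the harness; the only assumption carried over is the block shape (`memory`, `score`, `created_at` as an ISO string or `None`).

```python
from collections import defaultdict
from datetime import datetime


def group_by_date(blocks: list[dict]) -> dict[str, list[str]]:
    """Group memory texts by calendar date, oldest first, undated blocks last."""
    groups: dict[str, list[str]] = defaultdict(list)
    for block in blocks:
        created = block.get("created_at")
        # created_at is an ISO-8601 string, or None when the block was not dated.
        key = datetime.fromisoformat(created).date().isoformat() if created else "undated"
        groups[key].append(block["memory"])

    # ISO dates sort chronologically as plain strings.
    ordered = {k: groups[k] for k in sorted(k for k in groups if k != "undated")}
    if "undated" in groups:
        ordered["undated"] = groups["undated"]
    return ordered


# Made-up blocks in the shape _format_blocks returns.
sample = [
    {"memory": "Booked flights to Lisbon", "score": 0.4, "created_at": "2023-06-01T09:30:00"},
    {"memory": "Prefers window seats", "score": 0.2, "created_at": None},
    {"memory": "Started a new job", "score": 0.7, "created_at": "2023-05-20T18:00:00"},
]
for day, memories in group_by_date(sample).items():
    print(day, memories)
```

Blocks ingested without a date land in the `undated` bucket, which is where the temporal-reasoning gap described in `CLAUDE.md` would show up.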