MemoryCD

Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

A minimal, end-to-end evaluation harness for lifelong cross-domain personalization with LLM agents.

Dataset · Quick Start · Tasks & Metrics · Methods · Citation

Highlights

Lifelong, real user behavior. Memory is built from authentic Amazon reviews spanning years, not synthetic LLM-generated transcripts.
Two evaluation settings. Single-domain (memory and test from the same domain) and cross-domain (memory pooled from other domains, test on a target domain).
Four end-to-end tasks. Rating prediction, review summarization, review generation, and item ranking — covering both discriminative and generative use cases.
Two memory-selection methods out of the box. A long-context baseline and a BM25-based RAG retriever, both behind a simple plug-in interface.
One unified data file. Both eval settings read the same cross-domain JSONL — no separate per-domain dumps.
OpenAI-compatible. Works with OpenAI, OpenRouter, or any compatible gateway via env vars — no SDK lock-in.

Comparison with existing memory benchmarks

Benchmark	Time horizon	Domains	User behavior
MemoryCD (ours)	Long	Cross-domain	Real users
LaMP (Salemi et al., 2024)	Short	Single-domain	Real users
LoCoMo (Maharana et al., 2024)	Long	Single-domain	LLM-simulated

Overview

Figure 2. MemoryCD spans 12 real-world domains and evaluates 6 SOTA memory methods. Unlike benchmarks that probe only one memory stage (mostly retrieval), we design 4 basic tasks × 2 settings to provide end-to-end user-satisfaction evaluation grounded in lifelong real user behaviors.

Quick start

# 1. Install (Python 3.10+)
pip install -r requirements.txt

# 2. Configure your LLM provider
export OPENAI_API_KEY=...                # or OPENROUTER_API_KEY=... and LLM_PROVIDER=openrouter

# 3. Grab the dataset
pip install -U huggingface_hub
huggingface-cli download WZDavid/MemoryCD --repo-type dataset --local-dir .

# 4. Run a quick single-domain sanity check (first 5 users)
python evaluation_all_4task.py \
    --task rating_prediction \
    --domain Books \
    --input data/cross_domain.jsonl \
    --method long_context \
    --max-users 5

You should see per-user progress and a final summary printed to stdout. The run also writes logs/single/Books/predictions_*.jsonl and logs/single/Books/summary_*.json to disk.

Setup

# Python 3.10+
pip install -r requirements.txt
# or
pip install -e .

Configure the LLM client (any OpenAI-compatible endpoint works):

# OpenAI
export OPENAI_API_KEY=...

# OpenRouter
export OPENROUTER_API_KEY=...
export LLM_PROVIDER=openrouter
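
As a rough illustration of how these variables could be turned into a client configuration, here is a minimal sketch. It is not the code in api.py: `resolve_llm_config`, `LLMConfig`, and the assumption that `openai` is the default provider are all hypothetical, and the two default base URLs are simply the public endpoints of each service.

```python
import os
from dataclasses import dataclass

# Public default endpoints; the harness's own defaults may differ.
DEFAULT_BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "openrouter": "https://openrouter.ai/api/v1",
}


@dataclass
class LLMConfig:
    provider: str
    api_key: str
    base_url: str


def resolve_llm_config(env=None) -> LLMConfig:
    """Hypothetical helper: pick provider, API key, and base URL from env vars."""
    env = os.environ if env is None else env
    # Assumption: when LLM_PROVIDER is unset, fall back to OpenAI.
    provider = env.get("LLM_PROVIDER", "openai").lower()
    if provider not in DEFAULT_BASE_URLS:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
    prefix = provider.upper()  # "OPENAI" or "OPENROUTER"
    api_key = env.get(f"{prefix}_API_KEY")
    if not api_key:
        raise RuntimeError(f"{prefix}_API_KEY is not set")
    # A *_BASE_URL override lets the same code talk to a self-hosted gateway.
    base_url = env.get(f"{prefix}_BASE_URL", DEFAULT_BASE_URLS[provider])
    return LLMConfig(provider, api_key, base_url)
```

Passing a plain dict instead of `os.environ` makes the resolution logic easy to check without touching your shell.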

Dataset

The unified cross-domain JSONL plus the per-domain metadata files are released on Hugging Face:

https://huggingface.co/datasets/WZDavid/MemoryCD

After downloading, place the files like this (or pass --input and --meta-dir to point elsewhere):

data/cross_domain.jsonl
meta/meta_Beauty_and_Personal_Care.jsonl.gz
meta/meta_Books.jsonl.gz
meta/meta_Electronics.jsonl.gz
meta/meta_Home_and_Kitchen.jsonl.gz

Quick download with the Hugging Face CLI:

pip install -U huggingface_hub
huggingface-cli download WZDavid/MemoryCD --repo-type dataset --local-dir .

Input-data format

Both evaluation scripts read a single cross-domain JSONL file. Each line is one user:

{"user_id": "...", "interactions": {
    "Beauty_and_Personal_Care": [{"parent_asin": "...", "rating": 5, "title": "...",
                                  "text": "...", "timestamp": 1700000000,
                                  "negative_parent_asins": ["...", "..."]}],
    "Books": [...],
    "Electronics": [...],
    "Home_and_Kitchen": [...]
}}

Metadata files meta_<Domain>.jsonl[.gz] (one item per line, with parent_asin, title, average_rating, main_category, optional description) should live in meta/ by default.
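
If you want to inspect the metadata outside the harness, a loader along these lines should work. This is a sketch, not the loader in eval_core.py: `load_meta` is a hypothetical name, and it assumes every metadata line carries a `parent_asin` key as described above.

```python
import gzip
import json
from pathlib import Path


def load_meta(meta_dir, domain):
    """Return {parent_asin: item} for one domain (hypothetical helper)."""
    meta_dir = Path(meta_dir)
    gz_path = meta_dir / f"meta_{domain}.jsonl.gz"
    plain_path = meta_dir / f"meta_{domain}.jsonl"
    # Prefer the gzipped file, fall back to the plain JSONL.
    if gz_path.exists():
        handle = gzip.open(gz_path, "rt", encoding="utf-8")
    elif plain_path.exists():
        handle = open(plain_path, "r", encoding="utf-8")
    else:
        raise FileNotFoundError(f"No metadata file for {domain} in {meta_dir}")
    items = {}
    with handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            items[item["parent_asin"]] = item
    return items
```

For example, `load_meta("meta", "Books")` would return a dictionary keyed by ASIN for the Books domain.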

Required fields per interaction:

parent_asin, rating, title, text, timestamp
negative_parent_asins (only needed for item_ranking)
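
Before launching a long run on your own data, it can help to check each record against the field list above. The sketch below does that; `iter_users` and `validate_user` are hypothetical helpers written for this README, not functions exported by the repository.

```python
import json

REQUIRED_FIELDS = ("parent_asin", "rating", "title", "text", "timestamp")


def iter_users(path):
    """Yield one user record per non-empty line of the JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def validate_user(record, task="rating_prediction"):
    """Return a list of human-readable problems found in one user record."""
    problems = []
    if "user_id" not in record:
        problems.append("missing user_id")
    required = list(REQUIRED_FIELDS)
    if task == "item_ranking":
        # Negatives are only required for the ranking task.
        required.append("negative_parent_asins")
    for domain, interactions in record.get("interactions", {}).items():
        for idx, interaction in enumerate(interactions):
            for field in required:
                if field not in interaction:
                    problems.append(f"{domain}[{idx}]: missing {field}")
    return problems
```

An empty list means the record has every field the chosen task needs.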

Tasks and metrics

Task	Output	Metrics
`rating_prediction`	integer 1–5	MAE, RMSE
`review_summarization`	short review title	ROUGE-1/L, BLEU-1/4, BERTScore
`review_generation`	full review body	ROUGE-1/L, BLEU-1/4, BERTScore
`item_ranking`	ranked ASIN list	NDCG@K, Recall@K (default K = 1, 3, 5)
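
For reference, the numeric metrics in the table can be written in a few lines. This is a sketch of the standard definitions, not the evaluator in eval_core.py, and the ranking functions assume a single ground-truth item per test interaction (the purchased item ranked against its negatives), which is an assumption about the setup rather than something stated here.

```python
import math


def mae(preds, golds):
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)


def rmse(preds, golds):
    """Root mean squared error between predicted and true ratings."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(golds))


def recall_at_k(ranked, positive, k):
    """1.0 if the single relevant item appears in the top k, else 0.0."""
    return 1.0 if positive in ranked[:k] else 0.0


def ndcg_at_k(ranked, positive, k):
    """NDCG@K with one relevant item: ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    top = ranked[:k]
    if positive not in top:
        return 0.0
    rank = top.index(positive) + 1  # 1-based position of the relevant item
    return 1.0 / math.log2(rank + 1)
```

With one relevant item, Recall@K is a hit indicator and NDCG@K only rewards how high that item was placed.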

Methods

`--method`	Behavior
`long_context`	Keep the N most recent interactions.
`rag`	BM25 retrieval over memory; top-K via `--max-memory-items`. Persistent disk cache under `cache/`.

Adding a new method is a single file in methods/ plus one branch in eval_core.build_method().
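
To make the two selection strategies in the table concrete, here is a self-contained sketch of recency selection and Okapi BM25 ranking. It is not the code in methods/long_context.py or methods/rag.py: the whitespace tokenizer, the `k1` and `b` values, and the choice to index each memory item by its title plus text are all assumptions made for illustration.

```python
import math
from collections import Counter


def select_recent(memory, n):
    """Recency selection: the n interactions with the largest timestamps."""
    if n <= 0:
        return []
    return sorted(memory, key=lambda m: m["timestamp"])[-n:]


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_top_k(query, memory, k, k1=1.5, b=0.75):
    """Rank memory items against the query with Okapi BM25; return the top k."""
    docs = [tokenize(f"{m['title']} {m['text']}") for m in memory]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many memory items each term occurs.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / (avg_len or 1))
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    order = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return [memory[i] for i in order[:k]]
```

In this sketch `k` plays the role of `--max-memory-items`, and the query would typically be text describing the test item, such as its title.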

Settings

Single-domain — `evaluation_all_4task.py`

Memory and test both come from the same domain. For each user, the last --num-test interactions in --domain are the test set; earlier ones are the memory.
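
The split described above can be sketched as follows. `split_memory_test` is a hypothetical helper, and sorting by `timestamp` is an assumption about what "last" means; the released file may already be in chronological order.

```python
def split_memory_test(interactions, num_test=3):
    """Chronological split: the last num_test interactions are test, the rest memory."""
    ordered = sorted(interactions, key=lambda x: x["timestamp"])
    if num_test <= 0:
        return ordered, []
    return ordered[:-num_test], ordered[-num_test:]
```

Note that in this sketch a user with no more than `num_test` interactions ends up with an empty memory; how the harness treats such users is not specified here.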

python evaluation_all_4task.py \
    --task rating_prediction \
    --domain Books \
    --input data/cross_domain.jsonl \
    --llm-model openai/gpt-5

python evaluation_all_4task.py \
    --task item_ranking \
    --domain Electronics \
    --input data/cross_domain.jsonl \
    --method rag --max-memory-items 10 \
    --k-values 1 3 5

Cross-domain — `evaluation_all_4task_cross_domain.py`

Memory comes from one or more source domains; test comes from a target domain.

# Default: memory = the 3 non-target domains
python evaluation_all_4task_cross_domain.py \
    --task rating_prediction \
    --target-domain Home_and_Kitchen \
    --input data/cross_domain.jsonl \
    --llm-model openai/gpt-5

# Subset of source domains
python evaluation_all_4task_cross_domain.py \
    --task review_summarization \
    --target-domain Home_and_Kitchen \
    --source-domains Books Electronics

# No-memory baseline
python evaluation_all_4task_cross_domain.py \
    --task rating_prediction \
    --target-domain Books \
    --no-memory
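
The three variants above differ only in which interactions end up in memory. A minimal sketch of that pooling logic, assuming the record layout from the input-data format section, might look like this. `pool_cross_domain_memory` is a hypothetical name, and merging the pooled items in timestamp order is an assumption.

```python
def pool_cross_domain_memory(user, target_domain, source_domains=None, no_memory=False):
    """Collect memory items from the source domains of one user record."""
    if no_memory:
        return []  # mirrors the --no-memory baseline
    interactions = user["interactions"]
    if source_domains is None:
        # Default: every domain the user has except the target.
        source_domains = [d for d in interactions if d != target_domain]
    if target_domain in source_domains:
        raise ValueError("target domain must not be used as a source domain")
    pooled = []
    for domain in source_domains:
        for interaction in interactions.get(domain, []):
            # Tag each item with its origin so prompts can mention the domain.
            pooled.append({**interaction, "domain": domain})
    pooled.sort(key=lambda x: x["timestamp"])
    return pooled
```

Tagging each item with its source domain is a design choice of this sketch; it keeps the pooled list self-describing once items from different domains are interleaved.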

Common flags

Flag	Description
`--task`	One of the 4 tasks above.
`--input`	Path to the unified cross-domain JSONL.
`--meta-dir`	Directory with `meta_<Domain>.jsonl[.gz]` (default `meta/`).
`--llm-model`	Model identifier for the OpenAI-compatible client.
`--method`	`long_context` (default) or `rag`.
`--max-memory-items`	Cap on memory items kept after selection.
`--num-test`	Number of last interactions per user used as test (default 3).
`--max-users`	Limit to first N users for quick smoke tests.
`--k-values`	Cutoffs K for NDCG@K and Recall@K, e.g. `--k-values 1 3 5` (item_ranking only).
`--log-dir`	Output log directory (default `logs/single` or `logs/cross`).

Outputs

Each run writes two files:

logs/<setting>/<domain>/predictions_<run_id>.jsonl — one entry per prediction with the full prompt, raw LLM response, prediction, ground truth, and per-prediction metrics.
logs/<setting>/<domain>/summary_<run_id>.json — aggregated final metrics plus RAG cache statistics (when applicable).

<run_id> encodes setting + domain + task + method + model + caps + timestamp, so every experiment lands in its own pair of files.
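
To compare runs afterwards, you can gather every summary file under the log directory. The sketch below relies only on the path layout described above; `collect_summaries` is a hypothetical helper, and it makes no assumption about the keys inside each summary.

```python
import json
from pathlib import Path


def collect_summaries(log_root="logs"):
    """Map each summary file path to its parsed JSON content."""
    results = {}
    # Layout: <log_root>/<setting>/<domain>/summary_<run_id>.json
    for path in sorted(Path(log_root).glob("*/*/summary_*.json")):
        with open(path, encoding="utf-8") as fh:
            results[str(path)] = json.load(fh)
    return results
```

Printing `collect_summaries().keys()` is a quick way to see which experiments have already finished.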

Reproduce the full sweep

Two parallel drivers run the full grid of tasks × methods × models × domains:

# Single-domain sweep
export API_KEY=$OPENAI_API_KEY
./run_evaluation_pa.sh 4               # 4 parallel jobs

# Cross-domain sweep
./run_evaluation_cross_domain.sh 4

Edit the DOMAINS / TARGET_DOMAINS / TASKS / METHODS / MODELS arrays at the top of each script to scope the run.
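
If you would rather drive a sweep from Python, the same grid can be enumerated with `itertools.product`. This is a sketch, not a port of the shell scripts: `build_single_domain_commands` and `run_all` are hypothetical helpers, and only flags documented in this README are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def build_single_domain_commands(tasks, domains, methods, models,
                                 input_path="data/cross_domain.jsonl"):
    """Return one argv list per (task, domain, method, model) combination."""
    commands = []
    for task, domain, method, model in product(tasks, domains, methods, models):
        commands.append([
            "python", "evaluation_all_4task.py",
            "--task", task,
            "--domain", domain,
            "--input", input_path,
            "--method", method,
            "--llm-model", model,
        ])
    return commands


def run_all(commands, jobs=4):
    """Run the commands with at most `jobs` in flight; return their exit codes."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

For example, `run_all(build_single_domain_commands(["rating_prediction"], ["Books"], ["long_context", "rag"], ["openai/gpt-5"]), jobs=2)` would launch two evaluation runs side by side.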

Repository layout

api.py                                   LLM gateway client (OpenAI / OpenRouter)
eval_core.py                             Shared library: loaders, evaluators, predictor, prompts
evaluation_all_4task.py                  Single-domain entry point
evaluation_all_4task_cross_domain.py     Cross-domain entry point
methods/
  __init__.py
  base_method.py                         Abstract memory-selection interface
  long_context.py                        Recency-based selection
  rag.py                                 BM25 retrieval with persistent cache
run_evaluation_pa.sh                     Parallel driver — single domain
run_evaluation_cross_domain.sh           Parallel driver — cross domain
requirements.txt                         Pinned core dependencies
pyproject.toml                           PEP-621 manifest
assets/                                  Figures used in this README

FAQ

Do I need GPUs?

No. All four tasks call an external LLM API. The only optional GPU usage is BERTScore, which falls back to CPU automatically (it's a touch slower but still runs end-to-end).

Can I evaluate my own model?

Yes — set OPENAI_BASE_URL (or OPENROUTER_BASE_URL) to any OpenAI-compatible endpoint and pass the model name via --llm-model. Self-hosted servers (vLLM, llama.cpp, TGI, Ollama with its OpenAI shim) all work out of the box.

How do I add a new memory method?

Subclass BaseMethod in methods/your_method.py and implement select_memory(...).
Register it in methods/__init__.py.
Add one elif branch in eval_core.build_method() and a CLI choice in both eval entry points.

The shared evaluator, logger, and prompt builders in eval_core.py don't need to change.
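
The steps above can be sketched with a toy method that samples memory items at random. Everything here is illustrative: the `BaseMethod` class is a stand-in for the one in methods/base_method.py, the `select_memory(memory, query, max_items)` signature is an assumption (the real one may differ), and `build_method` only mimics the shape of the branch you would add in eval_core.py.

```python
import random
from abc import ABC, abstractmethod


class BaseMethod(ABC):
    """Stand-in for methods/base_method.py; the real interface may differ."""

    @abstractmethod
    def select_memory(self, memory, query, max_items):
        """Return the subset of memory to place in the prompt."""


class RandomSampleMethod(BaseMethod):
    """Toy baseline: keep a seeded random sample, preserving original order."""

    def __init__(self, seed=0):
        self.seed = seed

    def select_memory(self, memory, query, max_items):
        if len(memory) <= max_items:
            return list(memory)
        rng = random.Random(self.seed)  # fresh RNG so repeated calls agree
        picked = sorted(rng.sample(range(len(memory)), max_items))
        return [memory[i] for i in picked]


def build_method(name):
    """Mimics the dispatch in eval_core.build_method() for this one method."""
    if name == "random_sample":
        return RandomSampleMethod()
    raise ValueError(f"Unknown method: {name!r}")
```

A random-sample baseline like this is mainly useful as a sanity check: a retrieval method that cannot beat it is probably not using the query.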

Where is the data-prep pipeline?

Out of scope for this repository. The released dataset on Hugging Face (the unified JSONL plus the per-domain metadata files) is self-contained. If you want to rebuild it from the raw Amazon Reviews dumps, see the dataset card.

License

MIT — see LICENSE.

Citation

If you use MemoryCD in your research, please cite:

@article{zhang2026memorycd,
  title={MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization},
  author={Zhang, Weizhi and Wei, Xiaokai and Huang, Wei-Chieh and Hui, Zheng and Wang, Chen and Gong, Michelle and Yu, Philip S},
  journal={arXiv preprint arXiv:2603.25973},
  year={2026}
}

Acknowledgements

The benchmark builds on the public Amazon Reviews dataset and is inspired by prior memory benchmarks including LaMP and LoCoMo. We thank the authors of those works for releasing their resources.

Made with research-grade duct tape. PRs welcome.
