A minimal, end-to-end evaluation harness for lifelong cross-domain personalization with LLM agents.
- Lifelong, real user behavior. Memory is built from authentic Amazon reviews spanning years, not synthetic LLM-generated transcripts.
- Two evaluation settings. Single-domain (memory and test from the same domain) and cross-domain (memory pooled from other domains, test on a target domain).
- Four end-to-end tasks. Rating prediction, review summarization, review generation, and item ranking — covering both discriminative and generative use cases.
- Two memory-selection methods out of the box. A long-context baseline and a BM25-based RAG retriever, both behind a simple plug-in interface.
- One unified data file. Both eval settings read the same cross-domain JSONL — no separate per-domain dumps.
- OpenAI-compatible. Works with OpenAI, OpenRouter, or any compatible gateway via env vars — no SDK lock-in.
| Benchmark | Time horizon | Domains | User behavior |
|---|---|---|---|
| MemoryCD (ours) | Long | Cross-domain | Real users |
| LaMP (Salemi et al., 2024) | Short | Single-domain | Real users |
| LoCoMo (Maharana et al., 2024) | Long | Single-domain | LLM-simulated |
Figure 2. MemoryCD spans 12 real-world domains and evaluates 6 SOTA memory methods. Unlike benchmarks that probe only one memory stage (mostly retrieval), we design 4 basic tasks × 2 settings to provide end-to-end user-satisfaction evaluation grounded in lifelong real user behaviors.
# 1. Install (Python 3.10+)
pip install -r requirements.txt
# 2. Configure your LLM provider
export OPENAI_API_KEY=... # or OPENROUTER_API_KEY=... and LLM_PROVIDER=openrouter
# 3. Grab the dataset
pip install -U huggingface_hub
huggingface-cli download WZDavid/MemoryCD --repo-type dataset --local-dir .
# 4. Run a quick single-domain sanity check (first 5 users)
python evaluation_all_4task.py \
--task rating_prediction \
--domain Books \
--input data/cross_domain.jsonl \
--method long_context \
--max-users 5
You should see per-user progress and a final summary printed to stdout, plus
logs/single/Books/predictions_*.jsonl and summary_*.json on disk.
# Python 3.10+
pip install -r requirements.txt
# or
pip install -e .
Configure the LLM client (any OpenAI-compatible endpoint works):
# OpenAI
export OPENAI_API_KEY=...
# OpenRouter
export OPENROUTER_API_KEY=...
export LLM_PROVIDER=openrouter
The unified cross-domain JSONL and the per-domain metadata files are released on Hugging Face as the dataset repository WZDavid/MemoryCD.
After downloading, place the files like this (or pass --input and --meta-dir to point elsewhere):
data/cross_domain.jsonl
meta/meta_Beauty_and_Personal_Care.jsonl.gz
meta/meta_Books.jsonl.gz
meta/meta_Electronics.jsonl.gz
meta/meta_Home_and_Kitchen.jsonl.gz
Quick download with the Hugging Face CLI:
pip install -U huggingface_hub
huggingface-cli download WZDavid/MemoryCD --repo-type dataset --local-dir .
Input-data format
Both evaluation scripts read a single cross-domain JSONL file. Each line is one user:
{"user_id": "...", "interactions": {
"Beauty_and_Personal_Care": [{"parent_asin": "...", "rating": 5, "title": "...",
"text": "...", "timestamp": 1700000000,
"negative_parent_asins": ["...", "..."]}],
"Books": [...],
"Electronics": [...],
"Home_and_Kitchen": [...]
}}
Metadata files meta_<Domain>.jsonl[.gz] (one item per line, with parent_asin, title,
average_rating, main_category, optional description) should live in meta/ by default.
Required fields per interaction:
- parent_asin, rating, title, text, timestamp
- negative_parent_asins (only needed for item_ranking)
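As a rough illustration of the format above, the following sketch reads the JSONL line by line and reports interactions that lack a required field. The helper names `iter_users` and `validate_user` are hypothetical and not part of the repository; the field names come from the list above.

```python
import json

# Fields listed above as required for every interaction.
REQUIRED_FIELDS = {"parent_asin", "rating", "title", "text", "timestamp"}


def iter_users(path):
    """Yield one parsed user record per non-empty line of the JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def validate_user(record, require_negatives=False):
    """Return human-readable problems found in one user record (hypothetical helper)."""
    problems = []
    user_id = record.get("user_id")
    for domain, interactions in record.get("interactions", {}).items():
        for idx, item in enumerate(interactions):
            missing = REQUIRED_FIELDS - set(item)
            # negative_parent_asins only matters for the item_ranking task.
            if require_negatives and "negative_parent_asins" not in item:
                missing = missing | {"negative_parent_asins"}
            if missing:
                problems.append(f"{user_id}/{domain}[{idx}]: missing {sorted(missing)}")
    return problems
```

A typical use would be to loop over `iter_users("data/cross_domain.jsonl")` and print whatever `validate_user` returns before starting a long run.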
| Task | Output | Metrics |
|---|---|---|
| rating_prediction | integer 1–5 | MAE, RMSE |
| review_summarization | short review title | ROUGE-1/L, BLEU-1/4, BERTScore |
| review_generation | full review body | ROUGE-1/L, BLEU-1/4, BERTScore |
| item_ranking | ranked ASIN list | NDCG@K, Recall@K (default K = 1, 3, 5) |
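For readers unfamiliar with the numeric metrics in the table, here is a minimal sketch of their textbook definitions. It assumes each ranking test case has exactly one positive item (the one the user actually reviewed) ranked against its negatives; the repository's own evaluators in eval_core.py may differ in detail.

```python
import math


def mae(preds, golds):
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)


def rmse(preds, golds):
    """Root mean squared error between predicted and true ratings."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(golds))


def recall_at_k(ranked, positive, k):
    """1.0 if the single positive item appears in the top k, else 0.0."""
    return 1.0 if positive in ranked[:k] else 0.0


def ndcg_at_k(ranked, positive, k):
    """NDCG with one relevant item: ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    for rank, asin in enumerate(ranked[:k], start=1):
        if asin == positive:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

With a single positive item, Recall@K reduces to a hit indicator, and NDCG@K additionally rewards placing that item nearer the top.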
| --method | Behavior |
|---|---|
| long_context | Keep the N most recent interactions. |
| rag | BM25 retrieval over memory; top-K set via --max-memory-items. Persistent disk cache under cache/. |
Adding a new method takes a single file in methods/ plus one branch in eval_core.build_method().
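To make the two selection strategies concrete, the sketch below implements a recency filter and a plain Okapi BM25 ranker in pure Python. It is not the code in methods/long_context.py or methods/rag.py. The whitespace tokenizer, the `k1` and `b` defaults, the non-negative IDF variant, and the choice to index title plus text are all assumptions made for illustration, and the disk cache is omitted.

```python
import math
from collections import Counter


def select_recent(memory, n):
    """Recency baseline: keep the n interactions with the largest timestamps."""
    if n <= 0:
        return []
    return sorted(memory, key=lambda m: m["timestamp"])[-n:]


def bm25_top_k(memory, query, k, k1=1.5, b=0.75):
    """Rank memory items against the query with Okapi BM25 and keep the top k."""
    docs = [f"{m['title']} {m['text']}".lower().split() for m in memory]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many memory items each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(query.lower().split())

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    ranked = sorted(zip(memory, docs), key=lambda pair: score(pair[1]), reverse=True)
    return [m for m, _ in ranked[:k]]
```

In this sketch the query would be whatever text describes the test item (for example its title), and `k` plays the role of --max-memory-items.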
Memory and test both come from the same domain. For each user, the last --num-test
interactions in --domain are the test set; earlier ones are the memory.
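The split described above can be sketched as follows. The function name `split_single_domain` is hypothetical, and sorting by timestamp is an assumption: the released file may already store interactions in chronological order.

```python
def split_single_domain(user, domain, num_test=3):
    """Return (memory, test) for one user in one domain.

    The last num_test interactions (by timestamp) form the test set;
    everything earlier becomes the memory.
    """
    history = sorted(user["interactions"].get(domain, []), key=lambda it: it["timestamp"])
    if num_test <= 0:
        return history, []
    return history[:-num_test], history[-num_test:]
```

A user with fewer than `num_test` interactions ends up with an empty memory in this sketch; how the real scripts treat such users is not specified here.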
python evaluation_all_4task.py \
--task rating_prediction \
--domain Books \
--input data/cross_domain.jsonl \
--llm-model openai/gpt-5
python evaluation_all_4task.py \
--task item_ranking \
--domain Electronics \
--input data/cross_domain.jsonl \
--method rag --max-memory-items 10 \
--k-values 1 3 5
Memory comes from one or more source domains; test comes from a target domain.
# Default: memory = the 3 non-target domains
python evaluation_all_4task_cross_domain.py \
--task rating_prediction \
--target-domain Home_and_Kitchen \
--input data/cross_domain.jsonl \
--llm-model openai/gpt-5
# Subset of source domains
python evaluation_all_4task_cross_domain.py \
--task review_summarization \
--target-domain Home_and_Kitchen \
--source-domains Books Electronics
# No-memory baseline
python evaluation_all_4task_cross_domain.py \
--task rating_prediction \
--target-domain Books \
--no-memory
| Flag | Description |
|---|---|
| --task | One of the 4 tasks above. |
| --input | Path to the unified cross-domain JSONL. |
| --meta-dir | Directory with meta_<Domain>.jsonl[.gz] (default meta/). |
| --llm-model | Model identifier for the OpenAI-compatible client. |
| --method | long_context (default) or rag. |
| --max-memory-items | Cap on memory items kept after selection. |
| --num-test | Number of last interactions per user used as test (default 3). |
| --max-users | Limit to first N users for quick smoke tests. |
| --k-values | Ranking cutoffs, e.g. --k-values 1 3 5 (item_ranking only). |
| --log-dir | Output log directory (default logs/single or logs/cross). |
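For the cross-domain setting, the way --target-domain, --source-domains, --num-test, and --no-memory interact can be sketched roughly as below. The function name and the added `domain` key are hypothetical. The sketch pools all source-domain interactions without any temporal filtering against the test items, which is an assumption rather than a statement about what evaluation_all_4task_cross_domain.py does.

```python
def split_cross_domain(user, target_domain, source_domains=None, num_test=3, no_memory=False):
    """Return (memory, test): memory pooled from source domains, test from the target."""
    interactions = user["interactions"]
    if source_domains is None:
        # Default: every domain except the target acts as a memory source.
        source_domains = [d for d in interactions if d != target_domain]
    target = sorted(interactions.get(target_domain, []), key=lambda it: it["timestamp"])
    test = target[-num_test:] if num_test > 0 else []
    if no_memory:
        return [], test
    # Tag each pooled interaction with its domain so prompts can show provenance.
    memory = [
        {**it, "domain": d}
        for d in source_domains
        for it in interactions.get(d, [])
    ]
    memory.sort(key=lambda it: it["timestamp"])
    return memory, test
```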
Each run writes two files:
- logs/<setting>/<domain>/predictions_<run_id>.jsonl: one entry per prediction with the full prompt, raw LLM response, prediction, ground truth, and per-prediction metrics.
- logs/<setting>/<domain>/summary_<run_id>.json: aggregated final metrics plus RAG cache statistics (when applicable).
<run_id> encodes setting + domain + task + method + model + caps + timestamp, so
every experiment lands in its own pair of files.
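Because each run lands in its own pair of files, a small helper can be handy for picking up the newest result. The sketch below relies only on the directory layout described above and makes no assumption about the keys inside the summary; `latest_summary` is a hypothetical name.

```python
import glob
import json
import os


def latest_summary(log_root="logs", setting="single", domain="Books"):
    """Load the most recently modified summary_*.json for one setting/domain.

    Returns the parsed JSON, or None when no summary file exists yet.
    """
    pattern = os.path.join(log_root, setting, domain, "summary_*.json")
    paths = sorted(glob.glob(pattern), key=os.path.getmtime)
    if not paths:
        return None
    with open(paths[-1], encoding="utf-8") as fh:
        return json.load(fh)
```

Printing the result with `json.dumps(summary, indent=2)` gives a quick view of the aggregated metrics after a run.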
Two parallel drivers run the full grid of tasks × methods × models × domains:
# Single-domain sweep
export API_KEY=$OPENAI_API_KEY
./run_evaluation_pa.sh 4 # 4 parallel jobs
# Cross-domain sweep
./run_evaluation_cross_domain.sh 4
Edit the DOMAINS / TARGET_DOMAINS / TASKS / METHODS / MODELS arrays at the
top of each script to scope the run.
api.py LLM gateway client (OpenAI / OpenRouter)
eval_core.py Shared library: loaders, evaluators, predictor, prompts
evaluation_all_4task.py Single-domain entry point
evaluation_all_4task_cross_domain.py Cross-domain entry point
methods/
__init__.py
base_method.py Abstract memory-selection interface
long_context.py Recency-based selection
rag.py BM25 retrieval with persistent cache
run_evaluation_pa.sh Parallel driver — single domain
run_evaluation_cross_domain.sh Parallel driver — cross domain
requirements.txt Pinned core dependencies
pyproject.toml PEP-621 manifest
assets/ Figures used in this README
Do I need GPUs?
No. All four tasks call an external LLM API. The only optional GPU usage is BERTScore, which falls back to CPU automatically (it's a touch slower but still runs end-to-end).
Can I evaluate my own model?
Yes — set OPENAI_BASE_URL (or OPENROUTER_BASE_URL) to any OpenAI-compatible endpoint
and pass the model name via --llm-model. Self-hosted servers (vLLM, llama.cpp,
TGI, Ollama with its OpenAI shim) all work out of the box.
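One way to picture how these environment variables fit together is the sketch below. It is not the code in api.py. The function name `resolve_llm_config`, the fallback to the openai provider when LLM_PROVIDER is unset, and the strict requirement for an API key are assumptions; the two default URLs are the public OpenAI and OpenRouter endpoints.

```python
import os

# Public default endpoints; a *_BASE_URL variable overrides them.
DEFAULT_BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "openrouter": "https://openrouter.ai/api/v1",
}


def resolve_llm_config(env=None):
    """Pick provider, API key, and base URL from environment variables."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "openai").lower()
    if provider not in DEFAULT_BASE_URLS:
        raise ValueError(f"unsupported LLM_PROVIDER: {provider!r}")
    prefix = provider.upper()
    api_key = env.get(f"{prefix}_API_KEY")
    if not api_key:
        raise ValueError(f"{prefix}_API_KEY is not set")
    base_url = env.get(f"{prefix}_BASE_URL", DEFAULT_BASE_URLS[provider])
    return {"provider": provider, "api_key": api_key, "base_url": base_url.rstrip("/")}
```

For a self-hosted server, this would mean exporting OPENAI_BASE_URL (for example a local vLLM address) together with a placeholder OPENAI_API_KEY, since many local servers accept any key.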
How do I add a new memory method?
- Subclass BaseMethod in methods/your_method.py and implement select_memory(...).
- Register it in methods/__init__.py.
- Add one elif branch in eval_core.build_method() and a CLI choice in both eval entry points.
The shared evaluator, logger, and prompt builders in eval_core.py don't need to change.
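The steps above might look roughly like this. Everything here is a stand-in: the real BaseMethod lives in methods/base_method.py and its select_memory signature is not documented in this README, so the `(memory, query, max_items)` parameters, the `RatingExtremesMethod` class, and the local `build_method` function are hypothetical illustrations only.

```python
from abc import ABC, abstractmethod


class BaseMethod(ABC):
    """Stand-in for methods/base_method.py; the real signature may differ."""

    @abstractmethod
    def select_memory(self, memory, query, max_items):
        """Return the subset of memory to place in the prompt."""


class RatingExtremesMethod(BaseMethod):
    """Hypothetical method: keep interactions whose rating is furthest from the user's mean."""

    def select_memory(self, memory, query, max_items):
        if not memory or max_items <= 0:
            return []
        mean = sum(m["rating"] for m in memory) / len(memory)
        ranked = sorted(memory, key=lambda m: abs(m["rating"] - mean), reverse=True)
        return ranked[:max_items]


def build_method(name):
    """Stand-in for the extra branch added to eval_core.build_method()."""
    if name == "rating_extremes":
        return RatingExtremesMethod()
    raise ValueError(f"unknown method: {name}")
```

The design point is that a method only decides which memory items to keep; prompting, prediction, and scoring stay in the shared evaluator.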
Where is the data-prep pipeline?
Out of scope for this repository. The released dataset is a single self-contained JSONL on Hugging Face. If you want to rebuild from raw Amazon-reviews dumps, see the dataset card.
MIT — see LICENSE.
If you use MemoryCD in your research, please cite:
@article{zhang2026memorycd,
title={Memorycd: Benchmarking long-context user memory of llm agents for lifelong cross-domain personalization},
author={Zhang, Weizhi and Wei, Xiaokai and Huang, Wei-Chieh and Hui, Zheng and Wang, Chen and Gong, Michelle and Yu, Philip S},
journal={arXiv preprint arXiv:2603.25973},
year={2026}
}
The benchmark builds on the public Amazon Reviews dataset and is inspired by prior memory benchmarks including LaMP and LoCoMo. We thank the authors of those works for releasing their resources.

