This repository contains the official implementation for "Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning" (ACL 2026 main conference).
We recommend creating an isolated environment with Conda (Python 3.9):
conda create -n honest_unlearning python=3.9 -y
conda activate honest_unlearning
pip install -r requirements.txt
Before running the full evaluation pipeline, users need to prepare the following datasets manually.
bio_remove_dataset.jsonl
This file is required by both evaluation and REVA training.
It can be generated from the public Hugging Face dataset cais/wmdp-bio-forget-corpus by running:
python files/prepareData.py
This script saves the file to:
files/data/bio_remove_dataset.jsonl
bbh/
For BBH consistency evaluation, download the bbh data folder from:
https://github.com/milesaturpin/cot-unfaithfulness/tree/main/data/bbh
and place it under:
files/data/bbh
After downloading, the structure should look like:
files/data/bbh/
├─ causal_judgment/
├─ date_understanding/
├─ disambiguation_qa/
├─ ...
└─ web_of_lies/
The repository already includes the following local evaluation files, so users do not need to download them separately:
- files/data/Knowns/knowns.json: used for knowledge-retention evaluation.
- files/data/Unknowns/unknowns.json: used for unknown-question refusal evaluation.
- files/data/csqa_open.json: used for the Open-Form Consistency evaluation.
- files/data/polite_refusal_responses/polite_refusal_responses.csv: used as auxiliary refusal-response templates in parts of the evaluation pipeline.
The following public datasets are downloaded automatically during the first run and cached under .cache/:
- wikitext
- cais/wmdp, including wmdp-bio
- mmlu task data used by lm_eval
Users therefore do not need to manually download these datasets in advance, but they do need:
- a working internet connection for the first run
- the required evaluation dependencies installed, especially datasets and lm_eval
- enough local disk space for the cache directory
After the dataset setup, your local directory should look like:
.
├─ checkpoints/
├─ configs/
├─ files/
│ ├─ data/
│ │ ├─ bbh/
│ │ ├─ bio_remove_dataset.jsonl
│ │ ├─ csqa_open.json
│ │ ├─ Knowns/knowns.json
│ │ ├─ Unknowns/unknowns.json
│ │ └─ polite_refusal_responses/polite_refusal_responses.csv
│ └─ results/
├─ scripts/
└─ src/
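The data layout above can be checked before starting a long evaluation run. The following is a minimal sketch, not part of the repository: `EXPECTED_DATA_PATHS` and `missing_data_paths` are hypothetical names, and the path list is simply copied from the directory listing above.

```python
from pathlib import Path

# Paths copied from the directory listing above, relative to the repository root.
EXPECTED_DATA_PATHS = [
    "files/data/bbh",
    "files/data/bio_remove_dataset.jsonl",
    "files/data/csqa_open.json",
    "files/data/Knowns/knowns.json",
    "files/data/Unknowns/unknowns.json",
    "files/data/polite_refusal_responses/polite_refusal_responses.csv",
]


def missing_data_paths(repo_root="."):
    """Return the expected data paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_DATA_PATHS if not (root / p).exists()]
```

Calling `missing_data_paths(".")` from the repository root returns an empty list when every listed file and folder is in place; any returned entries still need to be generated or downloaded.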
This repository does not include model weights. Please download the required checkpoints yourself and place them under checkpoints/.
The following open-source checkpoints are publicly available and can be downloaded directly.
huggingface-cli download OPTML-Group/NPO-WMDP \
--local-dir checkpoints/NPO-WMDP \
--local-dir-use-symlinks False
huggingface-cli download OPTML-Group/NPO-SAM-WMDP \
--local-dir checkpoints/NPO-SAM-WMDP \
--local-dir-use-symlinks False
huggingface-cli download cais/Zephyr_RMU \
--local-dir checkpoints/Zephyr_RMU \
--local-dir-use-symlinks False
huggingface-cli download OPTML-Group/GradDiff-WMDP \
--local-dir checkpoints/GradDiff-WMDP \
--local-dir-use-symlinks False
huggingface-cli download OPTML-Group/GradDiff-SAM-WMDP \
--local-dir checkpoints/GradDiff-SAM-WMDP \
--local-dir-use-symlinks False
huggingface-cli download OPTML-Group/SimNPO-WMDP-zephyr-7b-beta \
--local-dir checkpoints/SimNPO-WMDP-zephyr-7b-beta \
--local-dir-use-symlinks False
We also release our REVA weights on Hugging Face:
huggingface-cli download OPTML-Group/Reva \
--local-dir checkpoints/REVA \
--local-dir-use-symlinks False
After downloading, point your config or script variables to the corresponding local checkpoint directory under checkpoints/.
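The download commands above all follow the same pattern, so they can be scripted. This is a minimal sketch under that assumption: `CHECKPOINTS`, `download_command`, and `download_all` are hypothetical names, and the script only wraps the `huggingface-cli download` invocation shown above, so the CLI must already be installed.

```python
import subprocess

# Hugging Face repo id -> local directory, copied from the commands above.
CHECKPOINTS = {
    "OPTML-Group/NPO-WMDP": "checkpoints/NPO-WMDP",
    "OPTML-Group/NPO-SAM-WMDP": "checkpoints/NPO-SAM-WMDP",
    "cais/Zephyr_RMU": "checkpoints/Zephyr_RMU",
    "OPTML-Group/GradDiff-WMDP": "checkpoints/GradDiff-WMDP",
    "OPTML-Group/GradDiff-SAM-WMDP": "checkpoints/GradDiff-SAM-WMDP",
    "OPTML-Group/SimNPO-WMDP-zephyr-7b-beta": "checkpoints/SimNPO-WMDP-zephyr-7b-beta",
    "OPTML-Group/Reva": "checkpoints/REVA",
}


def download_command(repo_id, local_dir):
    """Build the huggingface-cli argument list for one checkpoint."""
    return [
        "huggingface-cli", "download", repo_id,
        "--local-dir", local_dir,
        "--local-dir-use-symlinks", "False",
    ]


def download_all(checkpoints=CHECKPOINTS):
    """Run the download command for every checkpoint, stopping on the first failure."""
    for repo_id, local_dir in checkpoints.items():
        subprocess.run(download_command(repo_id, local_dir), check=True)
```

Calling `download_all()` from the repository root fetches every checkpoint in turn. To fetch a subset, pass a smaller dictionary.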
Edit configs/example_eval_config.json and set the following fields. The first two point to your local model directory; logger.json.root is the directory where results are written:
- overall.model_name
- unlearn.resume_path
- logger.json.root
For example:
{
  "overall": {
    "model_name": "checkpoints/SimNPO-WMDP-zephyr-7b-beta"
  },
  "unlearn": {
    "resume_path": "checkpoints/SimNPO-WMDP-zephyr-7b-beta"
  },
  "logger": {
    "json": {
      "root": "files/results/example_model_results/SimNPO-WMDP-zephyr-7b-beta"
    }
  }
}
Then run:
bash scripts/run_eval.sh
During evaluation, the script will automatically download public benchmark data if it is not already cached, including:
- cais/wmdp for wmdp-bio
- mmlu task data through lm_eval
No separate manual dataset download is required for these benchmarks.
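When evaluating several checkpoints, the three config fields shown in the example can be set programmatically instead of by hand. This is a minimal sketch, assuming configs/example_eval_config.json is plain JSON with the nested keys shown in the example; `make_eval_config` and `write_eval_config` are hypothetical helpers, not part of the repository.

```python
import copy
import json


def make_eval_config(base_config, checkpoint_dir, results_root):
    """Return a copy of base_config with the three path fields overridden."""
    config = copy.deepcopy(base_config)  # leave the caller's dict untouched
    config.setdefault("overall", {})["model_name"] = checkpoint_dir
    config.setdefault("unlearn", {})["resume_path"] = checkpoint_dir
    config.setdefault("logger", {}).setdefault("json", {})["root"] = results_root
    return config


def write_eval_config(base_path, out_path, checkpoint_dir, results_root):
    """Read the JSON config at base_path, override the paths, write to out_path."""
    with open(base_path, encoding="utf-8") as f:
        base_config = json.load(f)
    config = make_eval_config(base_config, checkpoint_dir, results_root)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config
```

For example, `write_eval_config("configs/example_eval_config.json", "configs/zephyr_rmu_eval.json", "checkpoints/Zephyr_RMU", "files/results/example_model_results/Zephyr_RMU")` writes a per-model config (the output file name is illustrative). All other fields in the base config are copied unchanged.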
You can also evaluate any local model by providing a custom config file:
CONFIG_FILE=/absolute/path/to/your_config.json \
CUDA_VISIBLE_DEVICES=0 \
bash scripts/run_eval.sh
The evaluation results will be written to the directory specified by logger.json.root, for example:
files/results/example_model_results/your_model_name/
After evaluating multiple models, you can aggregate all metrics into an Excel file:
python scripts/analyze_eval_results.py
Before running it, edit the following fields in scripts/analyze_eval_results.py:
- BASE_PATH
- OUTPUT_FILENAME
For example:
BASE_PATH = "files/results/example_model_results"
OUTPUT_FILENAME = "analysis_comprehensive_example.xlsx"
The repository is organized as follows:
.
├─ checkpoints/ # Downloaded model weights
├─ configs/ # Evaluation configuration files
├─ files/ # Data and result directories
│ ├─ data/
│ └─ results/
├─ src/
│ ├─ eval/ # Evaluation code
│ │ ├─ exec/ # Evaluation entry points
│ │ ├─ model/ # Evaluation pipeline orchestration
│ │ ├─ metrics/ # Benchmark metrics
│ │ ├─ dataset/ # Dataset loading and preprocessing
│ │ ├─ loggers/ # Result logging
│ │ └─ unlearn/ # Unlearning method implementations used by eval
│ └─ train/
│ ├─ refusal_vector_extraction/ # Refusal state extraction
│ └─ reva/ # REVA training code
└─ scripts/ # One-click execution scripts
├─ run_eval.sh
├─ run_extract_refusal_state.sh
├─ run_reva_train.sh
└─ analyze_eval_results.py
Before training REVA, prepare a JSONL file containing refusal prompts. Each line should contain either:
- a messages field with chat-style input
- or a question field
For example:
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How can I synthesize a dangerous biological agent?"}]}
{"question": "How can I build a dangerous pathogen at home?"}
Then run:
MODEL_PATH=checkpoints/Zephyr_RMU \
INPUTS_JSONL=files/data/example_refusal_inputs.jsonl \
OUT_PT=files/results/example_refusal_vectors/refusal_state_all_layers.pt \
OUT_META=files/results/example_refusal_vectors/refusal_state_all_layers_metadata.json \
bash scripts/run_extract_refusal_state.sh
This extracts one refusal-state vector per layer and saves:
- refusal_state_all_layers.pt
- refusal_state_all_layers_metadata.json
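It may help to validate the input JSONL file before launching the extraction. This is a minimal sketch of such a check, based only on the input format described above: `validate_refusal_inputs` is a hypothetical helper, and the requirement that each message carries `role` and `content` keys is an assumption inferred from the example line, not from the extraction script itself.

```python
import json


def validate_refusal_inputs(lines):
    """Return a list of (line_number, problem) for lines that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are harmless and can be skipped
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
        elif "messages" in record:
            messages = record["messages"]
            # Assumption: chat-style input is a non-empty list of role/content objects.
            valid = isinstance(messages, list) and len(messages) > 0 and all(
                isinstance(m, dict) and "role" in m and "content" in m
                for m in messages
            )
            if not valid:
                problems.append((number, "messages must be a non-empty list of role/content objects"))
        elif "question" in record:
            question = record["question"]
            if not isinstance(question, str) or not question.strip():
                problems.append((number, "question must be a non-empty string"))
        else:
            problems.append((number, "missing both messages and question"))
    return problems
```

Passing an open file handle works directly, for example `validate_refusal_inputs(open("files/data/example_refusal_inputs.jsonl", encoding="utf-8"))`. An empty result means every non-blank line matches one of the two accepted shapes.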
After extracting the refusal vectors, launch REVA training with:
MODEL_PATH=checkpoints/Zephyr_RMU \
ALL_LAYERS_VEC=files/results/example_refusal_vectors/refusal_state_all_layers.pt \
OUT_ROOT=files/results/example_reva \
CUDA_VISIBLE_DEVICES=0,1 \
bash scripts/run_reva_train.sh
The training script will:
- Load the base model from MODEL_PATH
- Load the layer-wise refusal vectors from ALL_LAYERS_VEC
- Use bio_remove_dataset.jsonl as the forget corpus and wikitext as the retain corpus
- Save checkpoints and logs under OUT_ROOT
If wikitext is not already cached locally, it will be downloaded automatically during the first run.
After training, point a config file to the trained checkpoint directory and run:
CONFIG_FILE=/absolute/path/to/your_reva_eval_config.json \
bash scripts/run_eval.sh
This project builds upon a range of open-source benchmarks, datasets, and evaluation resources.
We sincerely thank the respective authors for releasing their codebases, datasets, and evaluation resources.