An engine for training and observing LLMs playing the Avalon board game. Supports multi-LLM battles, human participation, real-time spectating, game replay, statistics, batch execution, and RL training data export.
- Multi-LLM Battles: Support for OpenAI, Anthropic, DeepSeek, vLLM, and more
- Human Participation: Play alongside AI agents
- Real-time Spectating: Watch games live through the web UI
- Game Replay: Step through historical games move by move
- Statistics: View win rates by model, role, and more
- Batch Execution: Run games in bulk via CLI with parallel support
- Training Data Export: Export game trajectories as JSONL for model training
- RL Training: On-policy self-play with Episode-level GAE + external Critic (Verl + PPO)
- Multi-turn Incremental Context (v2): Tool-calling based multi-turn conversation mode with incremental observations, designed for agentic RL training
cp .env.example .env
Edit .env with your LLM API keys and database connection:
# OpenAI
OPENAI_API_KEY=sk-xxx
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODELS=gpt-4o,gpt-4o-mini
# Anthropic
ANTHROPIC_API_KEY=sk-ant-xxx
ANTHROPIC_MODELS=claude-3-5-sonnet-20241022
# DeepSeek (optional)
DEEPSEEK_API_KEY=xxx
DEEPSEEK_BASE_URL=https://api.deepseek.com
DEEPSEEK_MODELS=deepseek-chat
# MongoDB
MONGODB_URI=mongodb://localhost:27017
MONGODB_DATABASE=avalon
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Linux/macOS
# or venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# macOS (Homebrew)
brew services start mongodb-community
# or using Docker
docker run -d -p 27017:27017 --name avalon-mongo mongo:latest
cd web
pnpm install
Start the backend server (port 8001):
uvicorn server.main:asgi_app --host 0.0.0.0 --port 8001
Start the frontend dev server (port 5173):
cd web
pnpm dev
Visit http://localhost:5173 to get started.
Use the training/run_batch.py CLI tool to run games in bulk and export training data:
# Run 100 games (single model)
python -m training.run_batch run -n 100 -m "qwen-plus:qwen"
# Run 100 games (multiple models, rotating)
python -m training.run_batch run -n 100 -m "qwen-plus:qwen,gpt-4o:openai"
# Parallel execution (4 games at once)
python -m training.run_batch run -n 100 -m "gpt-4o:openai" --parallel 4
# Without MongoDB (write directly to JSONL)
python -m training.run_batch run -n 100 -m "gpt-4o:openai" --no-mongo --output ./data/games.jsonl
# With experiment tag
python -m training.run_batch run -n 100 -m "gpt-4o:openai" --tag "exp_v1"
# List all batches
python -m training.run_batch list
# Export training trajectories
python -m training.run_batch export --batch-id <BATCH_ID> --output ./data/training.jsonl
# Export by tag
python -m training.run_batch export --tag "exp_v1" --output ./data/exp_v1.jsonl
Train LLMs via on-policy self-play with Episode-level GAE. See training/README.md for full details.
# Copy and edit the config template
cp training/configs/ppo_avalon.yaml training/configs/my_exp.yaml
# Start self-play training
bash training/scripts/self_play.sh training/configs/my_exp.yaml
# Resume from a checkpoint
RESUME_FROM_ROUND=3 RESUME_FROM_STEP=5 \
bash training/scripts/self_play.sh training/configs/my_exp.yaml
Each round of self-play runs the full pipeline automatically:
- Run games with the current model via vLLM
- Compute game statistics (win rates → wandb)
- Critic inference: estimate V(s) for each decision point
- Episode-level GAE: compute advantage with credit assignment
- Preprocess data → parquet (with precomputed advantage)
- Verl PPO: train actor
- Train Critic
- Merge checkpoint → next round model
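The critic-inference and GAE steps above follow the standard GAE recursion, applied over each player's decision points within one episode. A minimal generic sketch (not the repo's exact implementation), where rewards[t] and values[t] are assumed per-decision-point arrays:

```python
def episode_gae(rewards: list[float], values: list[float],
                gamma: float = 1.0, lam: float = 0.95) -> list[float]:
    """Compute GAE advantages over one episode's decision points.

    rewards[t] is the reward received after decision t; values[t] is the
    critic's V(s_t) estimate. The value after the final decision is 0,
    since the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # terminal state has no future value
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        # TD error at decision t
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages
```

With lam=0 this degenerates to the one-step TD error; with lam=1 it is the full Monte Carlo return minus the value baseline.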
All outputs are saved under experiments/<experiment_name>/.
The engine supports two prompt/context modes for LLM players:
Full-state (v1): Each decision point reconstructs the complete game state from scratch as a single [prompt, response] pair; the model sees the full history every time.
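A hedged sketch of what a v1-style sample could look like (function and field names are illustrative, not the repo's actual schema):

```python
def build_v1_sample(system_prompt: str, game_history: list[str],
                    instruction: str, response: str) -> dict:
    """Rebuild the full context for one decision point (v1 mode).

    Every sample repeats the entire history, so samples are independent
    of each other but grow linearly with game length.
    """
    prompt = "\n".join([system_prompt, *game_history, instruction])
    return {"prompt": prompt, "response": response}
```

Each sample is self-contained, which simplifies offline training at the cost of repeated tokens.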
Enabled via the "多轮增量上下文 (v2)" (Multi-turn Incremental Context) toggle in the web UI, or use_incremental_context=True in batch config.
The conversation accumulates across the entire game using the standard tool-calling protocol:
[system] — Game rules + role info (fixed)
[user] — Initial observation + first phase instruction
[assistant] — {tool_calls: [{speak, ...}]}
[tool] — Environment feedback (events only: votes, quest results, discussions...)
[user] — Next phase instruction
[assistant] — {tool_calls: [{vote_team, ...}]}
[tool] — Environment feedback
[user] — Next phase instruction
...
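In OpenAI-style chat terms, that transcript corresponds to a message list like the following (an illustrative sketch; the contents and tool-call arguments are placeholders, not the engine's actual payloads):

```python
messages = [
    {"role": "system", "content": "Game rules + role info"},
    {"role": "user", "content": "Initial observation + first phase instruction"},
    # The model acts only through tool calls.
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "speak",
                                  "arguments": '{"message": "..."}'}}]},
    # The tool role carries environment feedback for that call.
    {"role": "tool", "tool_call_id": "call_1",
     "content": "New events: votes, quest results, discussions..."},
    {"role": "user", "content": "Next phase instruction"},
]
```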
Key design decisions:
- [tool] = environment feedback: Only contains events that happened (vote results, quest outcomes, other players' speeches). No action instructions.
- [user] = action directive: Only contains the phase instruction telling the model what to do next.
- Incremental observations: Each tool response only includes NEW events since the last action, avoiding redundancy.
- Tool calling retained: All game tools (speak, propose_team, vote_team, vote_quest, assassinate) are preserved. Only update_memory is removed; the conversation history itself serves as memory.
- Events in chronological order: vote results → quest results → round transitions → discussions → team proposals.
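The incremental-observation rule can be sketched as a per-player cursor into the game's event log (hypothetical names; see game/prompts_v2.py for the real builder):

```python
class IncrementalObserver:
    """Yield only the events a player has not yet seen (v2 mode)."""

    def __init__(self) -> None:
        self._cursor = 0  # index of the first unseen event

    def observe(self, event_log: list[str]) -> list[str]:
        """Return new events since the last call and advance the cursor."""
        new_events = event_log[self._cursor:]
        self._cursor = len(event_log)
        return new_events
```

Each call returns only the delta, so already-seen history never re-enters the context.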
Enable v2 mode:
# Batch config
config = BatchConfig(use_incremental_context=True, ...)
# Batch API
POST /api/batch/run
{"use_incremental_context": true, ...}
Related files: game/prompts_v2.py (incremental observation builder), server/llm/player_v2.py (multi-turn player).
avalon/
├── game/ # Game engine (standalone)
│ ├── roles.py # Role definitions & team logic
│ ├── rules.py # Rule configuration (5-10 players)
│ ├── state.py # Game state management
│ ├── engine.py # Core game logic
│ ├── prompts.py # Full-state prompt builder (v1)
│ ├── prompts_v2.py # Incremental observation builder (v2)
│ └── manager.py # Game manager (orchestrates engine + LLM + DB)
├── server/ # Python backend
│ ├── main.py # FastAPI + Socket.IO entry point
│ ├── config.py # Configuration
│ ├── llm/ # LLM integration
│ │ ├── base.py # Abstract base classes
│ │ ├── providers.py # Multi-provider support
│ │ ├── player.py # LLM player (v1, full-state)
│ │ ├── player_v2.py # LLM player (v2, multi-turn incremental)
│ │ └── tools.py # LLM tools / function calling
│ ├── api/ # REST API
│ │ ├── batch.py # Batch operations API
│ │ ├── config.py # Config API
│ │ ├── games.py # Games API
│ │ └── stats.py # Statistics API
│ ├── batch/ # Batch execution
│ │ ├── runner.py # Batch runner
│ │ └── exporter.py # Training data exporter
│ ├── socket/ # Socket.IO handlers
│ │ └── handlers.py # WebSocket event handlers
│ ├── models/ # Data models
│ │ ├── database.py # Database initialization
│ │ └── schemas.py # Pydantic schemas
│ └── storage/ # Data storage
│ └── repository.py # Repository
├── training/ # RL training (Verl + Episode-level GAE)
│ ├── run_batch.py # Batch game CLI tool
│ ├── data/ # Data preprocessing
│ ├── reward/ # Reward functions (GAE + length penalty)
│ ├── critic/ # Critic model (value head train/infer)
│ ├── advantage/ # Episode-level GAE computation
│ ├── stats/ # Per-round game statistics & wandb logging
│ ├── verl_extensions/ # Custom Verl advantage estimator
│ ├── configs/ # Training config templates (YAML)
│ ├── scripts/ # Self-play loop & PPO wrapper
│ └── eval/ # Model evaluation
├── web/ # React frontend
│ ├── src/
│ │ ├── App.tsx # App entry (routing)
│ │ ├── components/ # UI components
│ │ ├── pages/ # Pages
│ │ ├── hooks/ # Custom hooks
│ │ └── stores/ # State management
│ └── package.json
├── .env.example # Environment variables example
├── requirements.txt # Python dependencies
└── README.md
Avalon is a social deduction game where players are divided into Good and Evil teams.
Good Team:
- Merlin: Knows all Evil players, but must stay hidden
- Loyal Servant: No special abilities
Evil Team:
- Assassin: Can attempt to assassinate Merlin at the end
- Minion: Knows other Evil players
- Role Assignment: Roles are randomly assigned to each player
- Night Phase: Players receive their role-specific information
- Quest Phase (up to 5 rounds):
- The leader selects team members
- All players discuss
- Vote to approve or reject the team
- Team members execute the quest
- Assassination Phase: If Good completes 3 quests, the Assassin may attempt to kill Merlin
- Good wins: Complete 3 quests and Merlin survives
- Evil wins: Fail 3 quests, or 5 consecutive vote rejections, or successfully assassinate Merlin
Backend:
- FastAPI + python-socketio
- Motor (async MongoDB driver)
- OpenAI / Anthropic SDK
- Pydantic validation
Frontend:
- React 19 + TypeScript
- Vite + Tailwind CSS
- Zustand (state management) + Socket.IO Client
- Recharts (charts)
- Lucide React (icons)
RL Training:
- Verl (PPO framework)
- vLLM (self-play inference)
- PyTorch + HuggingFace Transformers (Critic model)
- wandb (experiment tracking)
MIT