Official implementation of the SkillCraft paper: *SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?*
Real-world tool-using agents operate over long-horizon workflows with recurring structure. In this setting, strong behavior depends not only on calling atomic tools, but also on discovering, abstracting, and applying higher-level tool compositions.
SkillCraft is designed to explicitly evaluate this capability. The benchmark stress-tests whether agents can form and apply higher-level tool compositions (called Skills) under realistic, compositional tool-use scenarios.
Task difficulty is scaled along two axes (see the illustrative sketch after this list):
- Quantitative scaling: increase the number of entities/items an agent must process.
- Structural scaling: compose subtasks into longer and more complex tool-use chains.
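As a toy illustration of the two axes, the snippet below sketches how a single task family might be scaled; the field names and subtask names are hypothetical and do not reflect the actual task schema under `tasks/scaled_tasks/`.

```python
# Toy illustration only; not the repository's actual task format.

# Quantitative scaling: the same subtask applied to more and more items.
quantitative_variants = [
    {"task": "cat-facts-collector", "num_items": n} for n in (5, 20, 50)
]

# Structural scaling: more subtasks composed into a longer tool-use chain.
structural_variants = [
    ["fetch_facts"],
    ["fetch_facts", "deduplicate"],
    ["fetch_facts", "deduplicate", "summarize", "write_report"],
]

print(quantitative_variants)
print(structural_variants)
```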
The accompanying protocol enables agents to compose atomic tools into executable skills, cache them, and apply them across tasks as a persistent skill library. In the paper’s evaluation, this leads to substantial efficiency gains (up to 80% token reduction) while preserving strong task performance.
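To make the skill-library idea concrete, here is a minimal sketch of caching and reusing composed tool chains. All class, method, and tool names below are hypothetical and are not taken from this repository's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Skill:
    """A named, reusable composition of atomic tool calls (hypothetical)."""
    name: str
    description: str
    # Ordered steps; each step reads and updates a shared context dict.
    steps: list[Callable[[dict], dict]]

    def run(self, context: dict) -> dict:
        for step in self.steps:
            context = step(context)
        return context


@dataclass
class SkillLibrary:
    """Caches composed skills so later tasks can reuse them directly."""
    skills: dict[str, Skill] = field(default_factory=dict)

    def add(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def get(self, name: str) -> Skill | None:
        return self.skills.get(name)


# Example: compose two stand-in atomic tools into one skill and reuse it.
def fetch_facts(ctx: dict) -> dict:
    ctx["facts"] = ["cats sleep 12-16 hours a day"]  # stand-in for a real tool call
    return ctx


def write_report(ctx: dict) -> dict:
    ctx["report"] = "\n".join(ctx["facts"])
    return ctx


library = SkillLibrary()
library.add(Skill("collect_and_report", "Fetch facts, then write a report",
                  [fetch_facts, write_report]))
print(library.get("collect_and_report").run({}))
```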
Project website (tasks + trajectories):
Repository layout:

- `tasks/scaled_tasks/`: evaluation tasks used in the paper
- `test_all_tasks.py`: main batch evaluation entrypoint
- `run.sh`: single-task runner
- `prefix.sh`: environment loading and runtime defaults
- Linux (recommended)
- Python 3.10+
- `uv` package manager
- Node.js 22+ with `npx` available (`scaled_tasks` launches the `filesystem` MCP server via `npx`)
- Valid LLM API endpoint/key (e.g., OpenRouter-compatible)
- Docker/Podman (recommended for environment consistency)
Install uv if needed:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

From repo root:

```bash
uv sync
```

Create `.env` in repo root (or export env vars in shell):
```
TOOLATHLON_OPENAI_API_KEY=YOUR_API_KEY
TOOLATHLON_OPENAI_BASE_URL=https://openrouter.ai/api/v1
TOOLATHLON_MODEL=deepseek-v3.2-exp
TOOLATHLON_PROVIDER=openrouter
```

`prefix.sh` will load `.env` automatically.
Base mode:
```bash
bash run.sh scaled_tasks/cat-facts-collector/e1 base --model deepseek-v3.2-exp --provider openrouter
```

Skill mode:

```bash
bash run.sh scaled_tasks/cat-facts-collector/e1 skill --model deepseek-v3.2-exp --provider openrouter
```

Main command for reproducing the complete scaled-task pipeline:
```bash
uv run python test_all_tasks.py \
  --scaled-tasks \
  --mode base,skill \
  --model deepseek-v3.2-exp \
  --provider openrouter
```

To continue a previous run from its timestamped folder:

```bash
uv run python test_all_tasks.py \
  --continue-run test_runs/run_YYYYMMDD_HHMMSS \
  --scaled-tasks \
  --mode base,skill \
  --model deepseek-v3.2-exp \
  --provider openrouter
```

Each run produces a timestamped folder under `test_runs/`, including:
- `run_info.json`
- `test_results_<provider>_<model>.json`
- `summary_<provider>_<model>.json`
- `dumps_base_test/...` (base trajectories)
- `dumps_skill_test/...` (skill trajectories)
For result validation, check:
- Mode-level summary in `summary_*.json`
- Per-task `eval_res.json`
- Per-task `traj_log.json` completeness and tool call traces
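To sanity-check a finished run programmatically, a short script along these lines can load the files above; the run directory name is a placeholder and the exact nesting under `dumps_*_test/` may differ.

```python
import json
from pathlib import Path

# Point this at your own timestamped run folder under test_runs/.
run_dir = Path("test_runs") / "run_YYYYMMDD_HHMMSS"  # placeholder name

# Mode-level summaries.
for summary in sorted(run_dir.glob("summary_*.json")):
    print(summary.name)
    print(json.dumps(json.loads(summary.read_text()), indent=2))

# Per-task evaluation results from base and skill trajectories.
for eval_file in sorted(run_dir.glob("dumps_*_test/**/eval_res.json")):
    print(eval_file, json.loads(eval_file.read_text()))
```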
If you use this repository, please cite the SkillCraft paper:
```bibtex
@misc{chen2026skillcraftllmagentslearn,
  title={SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?},
  author={Shiqi Chen and Jingze Gai and Ruochen Zhou and Jinghan Zhang and Tongyao Zhu and Junlong Li and Kangrui Wang and Zihan Wang and Zhengyu Chen and Klara Kaleb and Ning Miao and Siyang Gao and Cong Lu and Manling Li and Junxian He and Yee Whye Teh},
  year={2026},
  eprint={2603.00718},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.00718},
}
```