Skip to content

HKUSTDial/Alpha-SQL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

13 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿš€ Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search

Homepage ICML 2025 arXiv Slides Python License

โœจ If you find our work helpful, please don't hesitate to give us a star โญ !

Introduction Figure

๐Ÿ“– Introduction

Text-to-SQL, which enables natural language interaction with databases, serves as a pivotal method across diverse industries. With new, more powerful large language models (LLMs) emerging every few months, fine-tuning has become incredibly costly, labor-intensive, and error-prone. As an alternative, zero-shot Text-to-SQL, which leverages the growing knowledge and reasoning capabilities encoded in LLMs without task-specific fine-tuning, presents a promising and more challenging direction.

To address this challenge, we propose Alpha-SQL, a novel approach that leverages a Monte Carlo Tree Search (MCTS) framework to iteratively infer SQL construction actions based on partial SQL query states. To enhance the framework's reasoning capabilities, we introduce LLM-as-Action-Model to dynamically generate SQL construction actions during the MCTS process, steering the search toward more promising SQL queries. Moreover, Alpha-SQL employs a self-supervised reward function to evaluate the quality of candidate SQL queries, ensuring more accurate and efficient query generation.

Overview Figure

๐Ÿ“ Project Structure

AlphaSQL/
โ”œโ”€โ”€ ๐Ÿ“‚ data/
โ”‚   โ””โ”€โ”€ ๐Ÿ“‚ bird/
โ”‚       โ””โ”€โ”€ ๐Ÿ“‚ dev/
โ”‚           โ”œโ”€โ”€ ๐Ÿ“„ dev.json
โ”‚           โ””โ”€โ”€ ๐Ÿ“‚ dev_databases/
โ”œโ”€โ”€ ๐Ÿ“‚ config/
โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ qwen7b_sds_exp.yaml
โ”‚   โ””โ”€โ”€ ๐Ÿ“„ qwen32b_bird_dev.yaml
โ”œโ”€โ”€ ๐Ÿ“‚ results/
โ”‚   โ””โ”€โ”€ ๐Ÿ“„ dev_pred_sqls.json
โ”œโ”€โ”€ ๐Ÿ“‚ script/
โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ preprocess.sh
โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ qwen32b_bird_dev_exp.sh
โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ qwen7b_sds_exp.sh
โ”‚   โ””โ”€โ”€ ๐Ÿ“„ sql_selection.sh
โ”œโ”€โ”€ ๐Ÿ“‚ alphasql/
โ”‚   โ”œโ”€โ”€ ๐Ÿ“‚ runner/
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ preprocessor.py
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ sql_selection.py
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ mcts_runner.py
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ selection_runner.py
โ”‚   โ”‚   โ””โ”€โ”€ ๐Ÿ“„ task.py
โ”‚   โ”œโ”€โ”€ ๐Ÿ“‚ templates/
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ schema_selection.txt
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ sql_revision.txt
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ sql_generation.txt
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ raphrase_question.txt
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ identify_column_functions.txt
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ identify_column_values.txt
โ”‚   โ”‚   โ””โ”€โ”€ ๐Ÿ“„ keywords_extraction.txt
โ”‚   โ”œโ”€โ”€ ๐Ÿ“‚ config/
โ”‚   โ”‚   โ””โ”€โ”€ ๐Ÿ“„ mcts_config.py
โ”‚   โ”œโ”€โ”€ ๐Ÿ“‚ database/
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ sql_execution.py
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ utils.py
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ sql_parse.py
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ schema.py
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ database_manager.py
โ”‚   โ”‚   โ””โ”€โ”€ ๐Ÿ“„ lsh_index.py
โ”‚   โ”œโ”€โ”€ ๐Ÿ“‚ llm_call/
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ cost_recoder.py
โ”‚   โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ openai_llm.py
โ”‚   โ”‚   โ””โ”€โ”€ ๐Ÿ“„ prompt_factory.py
โ”‚   โ””โ”€โ”€ ๐Ÿ“‚ algorithm/
โ”‚       โ”œโ”€โ”€ ๐Ÿ“‚ selection/
โ”‚       โ”‚   โ””โ”€โ”€ ๐Ÿ“„ utils.py
โ”‚       โ””โ”€โ”€ ๐Ÿ“‚ mcts/
โ”‚           โ”œโ”€โ”€ ๐Ÿ“„ mcts_node.py
โ”‚           โ”œโ”€โ”€ ๐Ÿ“„ mcts_action.py
โ”‚           โ”œโ”€โ”€ ๐Ÿ“„ mcts.py
โ”‚           โ””โ”€โ”€ ๐Ÿ“„ reward.py
โ”œโ”€โ”€ ๐Ÿ“„ README.md
โ”œโ”€โ”€ ๐Ÿ“„ requirements.txt
โ””โ”€โ”€ ๐Ÿ“„ .env

๐Ÿ“ฅ Dataset Preparation

  1. Download required resources:

  2. Unzip the dataset to data/bird directoty following the project structure above.

๐Ÿ› ๏ธ Environment Setup

  1. AlphaSQL Env

    conda create -n alphasql python=3.11
    conda activate alphasql
    
    pip install -r requirements.txt
  2. VLLM Env

    conda create -n vllm python=3.12 -y
    conda activate vllm
    
    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    pip install -e .

๐Ÿš€ Deploy Local LLM Using VLLM

conda activate vllm

# For 4 GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen/Qwen2.5-Coder-32B-Instruct --served-model-name Qwen/Qwen2.5-Coder-32B-Instruct --port 9999 -tp 4

# For 8 GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve Qwen/Qwen2.5-Coder-32B-Instruct --served-model-name Qwen/Qwen2.5-Coder-32B-Instruct --port 9999 -tp 8

๐Ÿƒโ€โ™‚๏ธRun AlphaSQL

1. Switch AlphaSQL Conda Env

conda activate alphasql

2. Dataset Preprocessing

  1. Configure your .env file based on env.example:

    # Required: OpenAI API Configuration (for LLM)
    OPENAI_API_KEY=your-api-key
    OPENAI_BASE_URL=https://api.openai.com/v1  # Or your custom endpoint
    
    # Required: Embedding Model Configuration
    EMBEDDING_MODEL=text-embedding-3-large  # Or text-embedding-3-small, text-embedding-ada-002
    
    # Optional: Separate embedding service (if using different base URL or API key)
    # EMBEDDING_API_KEY=your-embedding-api-key
    # EMBEDDING_BASE_URL=https://api.openai.com/v1  # or http://localhost:8080/v1

    Important Note:

    • If your embedding model uses a different base URL than your LLM, configure EMBEDDING_BASE_URL and EMBEDDING_API_KEY separately
    • If not set, embedding will automatically use OPENAI_BASE_URL and OPENAI_API_KEY
    • Example: LLM uses local VLLM (http://localhost:9999/v1), embedding uses OpenAI API (https://api.openai.com/v1)
    • See EMBEDDING_CONFIG.md for detailed configuration examples
  2. Run the following:

    bash script/preprocess.sh

3. Generate SQL Candidates

  1. Modify OPENAI_API_KEY and OPENAI_BASE_URL in .env file (we need to access Qwen/Qwen2.5-Coder-32B-Instruct model of VLLM delopyment)

    OPENAI_API_KEY="EMPTY"
    OPENAI_BASE_URL="http://0.0.0.0:9999/v1"
  2. Run the following:

    bash script/qwen32b_bird_dev_exp.sh

4. Select Final SQL

  1. Run the following:

    bash script/sql_selection.sh
  2. The final pred_sqls.json will in the project root dir (defined in script/sql_selection.sh OUTPUT_PATH variable)

๐Ÿ“ Citation

If you find our work useful or inspiring, please kindly cite:

@inproceedings{alpha-sql,
  author       = {Boyan Li and
                  Jiayi Zhang and
                  Ju Fan and
                  Yanwei Xu and
                  Chong Chen and
                  Nan Tang and
                  Yuyu Luo},
  title        = {Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search},
  booktitle    = {Forty-Second International Conference on Machine Learning, {ICML} 2025,
                  Vancouver, Canada, July 13-19, 2025},
  publisher    = {OpenReview.net},
  year         = {2025}
}

About

๐Ÿ”ฅ[ICML'25] Official repository for the paper "Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search"

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors