GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models

GradeSQL is a Python-based framework for ranking SQL queries generated by Large Language Models (LLMs). It introduces an Outcome Reward Model (ORM) that evaluates candidate SQL queries at inference time, improving reliability and execution accuracy in Text-to-SQL tasks.

This toolkit enables researchers and practitioners to:

  • Run test-time scaling with candidate SQL queries.
  • Build a synthetic dataset and train an ORM-based ranker in one step.
  • Reproduce experiments from the paper on benchmarks such as Spider and BIRD.
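
At a high level, ORM-based Best-of-N scores every candidate query with the reward model and keeps the highest-scoring one. The snippet below is a minimal conceptual sketch of that idea; best_of_n and score_with_orm are hypothetical names used only for illustration, not functions exposed by GradeSQL.

# Conceptual sketch of ORM-based Best-of-N ranking (illustrative only).
# `score_with_orm` is a hypothetical callable returning a correctness score.

def best_of_n(question: str, candidates: list[str], score_with_orm) -> str:
    """Return the candidate SQL query the ORM scores highest (candidates must be non-empty)."""
    scored = [(score_with_orm(question, sql), sql) for sql in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]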

If you use GradeSQL in your research, please cite the following paper:

@misc{gradesqloutcomerewardmodels2025,
      title={GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models}, 
      author={Mattia Tritto and Giuseppe Farano and Dario Di Palma and Gaetano Rossiello and Fedelucio Narducci and Dharmashankar Subramanian and Tommaso Di Noia},
      year={2025},
      eprint={2509.01308},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.01308}, 
}

Prerequisites

  • Python Version: Python 3.11 (recommended)

Installation and Setup

Follow these steps to set up and configure the GradeSQL framework:

Step 1: Create the Project Directory Structure

First, create the main project directory where GradeSQL will be installed:

mkdir project_dir
cd project_dir

Step 2: Clone the Repository

Clone the GradeSQL repository from GitHub into your project directory:

git clone https://github.com/sisinflab/GradeSQL

After cloning, your directory structure should look like:

project_dir/
└── GradeSQL/
    ├── src/
    ├── README.md
    └── ...

Step 3: Navigate to the Project Directory

Change into the cloned repository directory:

cd GradeSQL

You are now in the main project directory where all configuration and execution commands will be run.

Step 4: Configure the Main Settings

Open the configuration file and update the primary settings:

nano src/config/config.py

In the config.py file, locate and modify the following parameters:

self.normal_or_hpc = "normal"
self.name_dir_exp = "project_dir"

  • normal_or_hpc: Choose "normal" for standard environments with internet access. Choose "hpc" for high-performance computing clusters or environments without internet connectivity.
  • name_dir_exp: This must exactly match the name of the parent directory you created in Step 1.

Step 5: Configure the Setup Script

Open the setup script and update the project path:

nano src/config/setup.sh

In the setup.sh file, locate the PROJECT_DIR variable and update it with the absolute path to your GradeSQL directory:

PROJECT_DIR=/absolute/path/to/your/GradeSQL

Step 6: Execute the Setup Script

Make the setup script executable:

chmod +x src/config/setup.sh

Source the setup script to initialize the environment:

source src/config/setup.sh

Step 7: Download the Data Directory

Move to the GradeSQL main directory, download the data directory from Hugging Face, and unzip it directly in the GradeSQL folder:

cd GradeSQL
huggingface-cli download sisinflab-ai/GradeSQL-reproducibility-data --repo-type=dataset --local-dir=./
unzip data.zip

Usage

Once GradeSQL is successfully installed and configured, you can begin running experiments and evaluating generated SQL queries with this framework. The following sections detail how to use the various features and capabilities.

Running Individual Experiments

Step 1: Create a Configuration File

Start by creating a new configuration file based on the provided template:

cp config_template.yaml test.yaml

Step 2: Customize Your Configuration

Edit your newly created configuration file (referred to as a recipe from now on) to match your specific experiment requirements:

nano test.yaml

Key configuration parameters you may want to modify:

general:
  seed: 42 # Set this for reproducibility
  debug: "false" # Set to true to quickly test the pipeline flow; set to false to run the whole experiment



dataset:

  is_train_dev_test: "test" # Choose between train, dev, or test set
  dataset_name: "spider" # Choose between spider or bird dataset
  dataset_path: "../../data/spider/test_merged.json" # Dataset with schema linking
  tables_path: "../../data/spider/test_tables.json" # Structures of tables
  db_sqlite: '../../data/spider/test_database/' # SQLite databases
  ground_truth_path: '../../data/spider/test_gold.sql' # Ground truth SQL queries for evaluation

  # is_train_dev_test: "train" # Choose between train, dev, or test set
  # dataset_name: "bird" # Choose between spider or bird dataset
  # dataset_path: "../../data/bird/train/train_merged.json" # Dataset with schema linking
  # tables_path: "../../data/bird/train/train_tables.json" # Structures of tables
  # db_sqlite: '../../data/bird/train/train_databases/' # SQLite databases

  # is_train_dev_test: "dev" # Choose between train, dev, or test set
  # dataset_name: "bird" # Choose between spider or bird dataset
  # dataset_path: "../../data/bird/dev_20240627/dev_merged.json" # Dataset with schema linking
  # tables_path: "../../data/bird/dev_20240627/dev_tables.json" # Structures of tables
  # db_sqlite: '../../data/bird/dev_20240627/dev_databases/' # SQLite databases
  # ground_truth_path: '../../data/bird/dev_20240627/' # Ground truth SQL queries for evaluation



inference:
  model_name: "seeklhy/OmniSQL-7B"
  path_model_pretrained: "../../../../hf_cache/Omni7B" # LLM used for making predictions
  max_tokens: 2048 # Maximum number of tokens the LLM is allowed to generate

  temperature: 0.8
  test_time_strategy: "majority_voting" # Choose between best_of_n_orm, best_of_n_heuristic, majority_voting, pass@k
  n: 32 # Number of predictions
  num_samples_per_batch: 32 # Each batch contains this number of samples
  n_decreasing: False # If True, the number of samples per batch decreases with each iteration
  n_step: 1 # Number of steps to decrease the number of samples per batch

  start_query: 0
  offset_start_query: 2147 # Start and offset enable multi-node inference: you can parallelize by writing multiple recipes with different starts and offsets
  timeout_sql: 30 # Timeout set for evaluating queries 
  limited_rows: "none" # Number of rows extracted when executing a query


orm:
  orm_prompt:  "Question: {QUESTION}\nSQL: {SQL}\nIs the SQL correct?" # Prompt used to ask the LLM if the SQL is correct
  max_new_tokens: 5 # Maximum number of tokens the LLM is allowed to generate when checking the SQL
  use_logits: True   # If True, the LLM will return logits for the generated tokens
  dataset_train_version: "v1" # Version of the training dataset
  dataset_test_version: "v1" # Version of the test dataset
  finetuning_version: "v1" # Version of the finetuning script to use
  orm_or_prm: "orm" # Type of reward model to finetune (ORM or PRM)
  base_model_name: "seeklhy/OmniSQL-7B" # Base model name
  version_model_finetuned: 10 # Used to name the new finetuned model
  resume_from_checkpoint: False # If there's a previous finetuning checkpoint, resume from the latest one
  finetuning_iterations: 1 # Number of finetuning iterations
  r: 16 # LoRA rank of the low-rank update
  lora_alpha: 64 # Scaling factor: how much the LoRA update affects the original weights
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] # Transformer modules affected by LoRA finetuning
  lora_dropout: 0.05 # Dropout applied to LoRA layers to prevent overfitting
  bias: "none" # Bias terms are not finetuned
  fp16: True # Half precision floating point
  learning_rate: 7e-5
  per_device_train_batch_size: 5
  num_train_epochs: 50
  upload_at_end: "true" # Upload LoRA weights to HF repositories
  hf_username: "enter_your_HF_username_here"
  hf_token: "enter_your_HF_token_here" # For uploading models

  • Ensure dataset paths are correct and accessible. Relative paths are resolved from the config directory, so the paths above point to the data directory placed inside the GradeSQL directory.
  • Adjust model parameters based on your computational resources.
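
For intuition on how the orm settings above can be used at inference time: the model is prompted with the question and a candidate SQL query via orm_prompt, and with use_logits enabled a score can be read from the model's next-token probabilities. The sketch below illustrates one common way to do this with the Hugging Face transformers API, deriving the score from the probability of a "Yes" continuation; this is an assumption for illustration, not GradeSQL's actual implementation.

# Illustrative sketch only: score a candidate SQL query with a causal LM by
# reading the probability assigned to a "Yes" continuation. This mirrors the
# orm_prompt / use_logits settings above, but is not GradeSQL's actual code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "seeklhy/OmniSQL-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")  # device_map requires accelerate

def orm_score(question: str, sql: str) -> float:
    prompt = f"Question: {question}\nSQL: {sql}\nIs the SQL correct?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    return probs[yes_id].item()  # higher = more likely correct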

To download the generator model, run the following from the GradeSQL directory:

mkdir ../../hf_cache
cd ../../hf_cache
mkdir Omni7B
cd Omni7B
huggingface-cli download seeklhy/OmniSQL-7B --local-dir=./

Step 3: Execute Your Experiment

Run your configured experiment using the framework's execution command:

cd GradeSQL
$RUN_EXPERIMENT test

What happens during execution:

  • The framework loads your configuration from test.yaml
  • The generator model first produces all CoTs; SQL queries are then extracted (see the extraction sketch below) and passed to the test_time_scaling module
  • SQL query ranking is performed by the test_time_scaling module
  • Predictions are cached and reused in later experiments if the configuration is unchanged
  • Results are saved to the results directory
  • Progress is displayed in the terminal
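
The extraction step isolates the final SQL query from the model's chain-of-thought output. A minimal sketch, assuming the generator wraps its answer in a ```sql fenced block (GradeSQL's actual extraction logic may differ):

import re

def extract_sql(generation: str) -> str:
    """Return the last ```sql fenced block, or fall back to the first SELECT statement onward."""
    blocks = re.findall(r"```sql\s*(.*?)```", generation, flags=re.DOTALL | re.IGNORECASE)
    if blocks:
        return blocks[-1].strip()
    # Fallback: take everything from the first SELECT onward.
    match = re.search(r"(SELECT\b.*)", generation, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else generation.strip()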

Expected output:

   ____               _      ____   ___  _
  / ___|_ __ __ _  __| | ___/ ___| / _ \| |
 | |  _| '__/ _` |/ _` |/ _ \___ \| | | | |
 | |_| | | | (_| | (_| |  __/___) | |_| | |___
  \____|_|  \__,_|\__,_|\___|____/ \__\_\_____|

[Progress]: Running experiment test...
==================================================
          Experiment Settings
==================================================
Model Name               : seeklhy/OmniSQL-7B
Max Tokens               : 2048
N                        : 32
N Decreasing             : False
N Step                   : 1
Seed                     : 42
Temperature              : 0.8
Start Query              : 0
Offset Start Query       : 1534
Test Time Strategy       : best_of_n_orm
Dataset Type             : dev
Spider or Bird           : bird
==================================================
[Progress]: Generating batch 1/1 with 32 samples, seed=42
Processed prompts:  100%|██████████| 1534/1534 [41:25<00:00,  1.62s/it]
[Progress]: Loading predictions generated...
[Progress]: Starting test time scaling module...
Choosing the best query: 100%|██████████| 1534/1534 [41:25<00:00,  1.62s/it]
[Progress]: Script completed successfully with exit code 0. Exiting loop.

Running Batch Experiments

For comprehensive evaluation across multiple recipes, you can run all experiments in the recipes folder sequentially:

cd GradeSQL
$RUN_BATCH_EXPERIMENTS

Batch processing features:

  • Automatically discovers and runs all experiment configurations in the recipes/ folder (see the conceptual sketch after this list)
  • Processes experiments sequentially
  • Useful for hyperparameter tuning and model comparison
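
Conceptually, batch execution behaves like the loop below. This is an illustration of the behavior, not the actual script; it assumes recipes live in recipes/ and that $RUN_EXPERIMENT is the command exported by setup.sh, as used earlier in this README.

# Illustration only: run every recipe in recipes/ sequentially, one by one.
import glob
import os
import subprocess

for recipe in sorted(glob.glob("recipes/*.yaml")):
    name = os.path.splitext(os.path.basename(recipe))[0]
    # $RUN_EXPERIMENT is presumed to be exported by src/config/setup.sh
    subprocess.run(f"$RUN_EXPERIMENT {name}", shell=True, check=True)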

Accessing Experiment Results

Navigate to your experiment's results directory to view outputs:

cd GradeSQL/results/test

Results directory structure:

results/test/
├── config.yaml                  # Copy of your experiment configuration
├── results.txt                  # High-level metrics and performance
├── predict.json                 # All generated queries
├── predict_dev.json             # Chosen queries
└── logging.log                  # Execution logs and debugging
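
The JSON outputs can be inspected directly. A minimal example, assuming predict.json is a standard JSON file (its exact schema is defined by GradeSQL):

import json

# Load the generated queries and inspect the top-level structure.
with open("results/test/predict.json") as f:
    predictions = json.load(f)

print(type(predictions), len(predictions))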

Contributing

We welcome contributions to improve GradeSQL! Here's how you can get involved:

Reporting issues and requesting features

  • Use the GitHub issue tracker to report bugs or request features
  • Provide detailed information about your environment and use case
  • Include configuration files and error logs when reporting issues

📄 License

This project is released under the Apache 2.0 License.


🙏 Acknowledgements
