GradeSQL is a Python-based framework for ranking SQL queries generated by Large Language Models (LLMs). It introduces an Outcome Reward Model (ORM) that evaluates candidate SQL queries at inference time, improving reliability and execution accuracy in Text-to-SQL tasks.
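Conceptually, the ORM assigns each candidate SQL query a correctness score, and the highest-scoring candidate is returned as the final prediction. The snippet below is only a minimal sketch of this best-of-n selection; generate_candidates and orm_score are hypothetical placeholders, not the framework's actual API:

# Conceptual sketch of ORM-based best-of-n selection.
# generate_candidates and orm_score are placeholders, not GradeSQL's real API.

def generate_candidates(question: str, schema: str, n: int = 32) -> list[str]:
    """Sample n candidate SQL queries from the generator LLM (placeholder)."""
    raise NotImplementedError

def orm_score(question: str, sql: str) -> float:
    """Ask the ORM how likely it is that the SQL answers the question (placeholder)."""
    raise NotImplementedError

def best_of_n(question: str, schema: str, n: int = 32) -> str:
    candidates = generate_candidates(question, schema, n)
    # Rank candidates by the ORM's correctness score and keep the best one.
    return max(candidates, key=lambda sql: orm_score(question, sql))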
This toolkit enables researchers and practitioners to:
- Run test-time scaling with candidate SQL queries.
- Build a synthetic dataset and train an ORM-based ranker in one step.
- Reproduce experiments from the paper on benchmarks such as Spider and BIRD.
If you use GradeSQL in your research, please cite the following paper:
@misc{gradesqloutcomerewardmodels2025,
title={GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models},
author={Mattia Tritto and Giuseppe Farano and Dario Di Palma and Gaetano Rossiello and Fedelucio Narducci and Dharmashankar Subramanian and Tommaso Di Noia},
year={2025},
eprint={2509.01308},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2509.01308},
}
- Python version: 3.11 (recommended)
Follow these steps to set up and configure the GradeSQL framework:
First, create the main project directory where GradeSQL will be installed:
mkdir project_dir
cd project_dir
Clone the GradeSQL repository from GitHub into your project directory:
git clone https://github.com/sisinflab/GradeSQL
After cloning, your directory structure should look like:
project_dir/
└── GradeSQL/
├── src/
├── README.md
└── ...
Change into the cloned repository directory:
cd GradeSQL
You are now in the main project directory where all configuration and execution commands will be run.
Navigate to the configuration file and update the primary settings:
nano src/config/config.py
In the config.py file, locate and modify the following parameters:
self.normal_or_hpc = "normal"
self.name_dir_exp = "project_dir"
- normal_or_hpc: Choose "normal" for standard environments with internet access, or "hpc" for high-performance computing clusters or environments without internet connectivity.
- name_dir_exp: This must exactly match the name of the parent directory you created in Step 1.
Navigate to the setup script and update the project path:
nano src/config/setup.sh
In the setup.sh file, locate the PROJECT_DIR variable and update it with the absolute path to your GradeSQL directory:
PROJECT_DIR=/absolute/path/to/your/GradeSQL
Make the setup script executable:
chmod +x src/config/setup.sh
Source the setup script to initialize the environment:
source src/config/setup.sh
Move to the GradeSQL main directory, download the data archive from Hugging Face, and unzip it directly in the GradeSQL folder:
cd GradeSQL
huggingface-cli download sisinflab-ai/GradeSQL-reproducibility-data --repo-type=dataset --local-dir=./
unzip data.zip
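As a quick sanity check, you can verify that the expected files are present. The sketch below assumes the archive unpacks into a data/ directory with the Spider and BIRD layouts referenced by the configuration template later in this README:

# Sanity check: confirm the unpacked data directory matches the paths used
# in the configuration template (assumed layout; adjust if yours differs).
from pathlib import Path

expected = [
    "data/spider/test_merged.json",
    "data/spider/test_tables.json",
    "data/spider/test_database",
    "data/spider/test_gold.sql",
    "data/bird/dev_20240627/dev_merged.json",
]

for rel in expected:
    print(f"{rel}: {'found' if Path(rel).exists() else 'MISSING'}")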
Once GradeSQL is successfully installed and configured, you can begin running experiments and evaluating generated SQL queries with the framework. The following sections detail how to use its various features and capabilities.
Start by creating a new configuration file based on the provided template:
cp config_template.yaml test.yaml
Edit your newly created configuration file (referred to as a recipe from now on) to match your specific experiment requirements:
nano test.yaml
Key configuration parameters you may want to modify:
general:
  seed: 42 # Set this for reproducibility
  debug: "false" # Set to "true" to quickly test the pipeline flow; set to "false" to run the whole experiment

dataset:
  is_train_dev_test: "test" # Choose between the train, dev, or test set
  dataset_name: "spider" # Choose between the spider or bird dataset
  dataset_path: "../../data/spider/test_merged.json" # Dataset with schema linking
  tables_path: "../../data/spider/test_tables.json" # Structures of tables
  db_sqlite: '../../data/spider/test_database/' # SQLite databases
  ground_truth_path: '../../data/spider/test_gold.sql' # Ground truth SQL queries for evaluation

  # is_train_dev_test: "train" # Choose between the train, dev, or test set
  # dataset_name: "bird" # Choose between the spider or bird dataset
  # dataset_path: "../../data/bird/train/train_merged.json" # Dataset with schema linking
  # tables_path: "../../data/bird/train/train_tables.json" # Structures of tables
  # db_sqlite: '../../data/bird/train/train_databases/' # SQLite databases

  # is_train_dev_test: "dev" # Choose between the train, dev, or test set
  # dataset_name: "bird" # Choose between the spider or bird dataset
  # dataset_path: "../../data/bird/dev_20240627/dev_merged.json" # Dataset with schema linking
  # tables_path: "../../data/bird/dev_20240627/dev_tables.json" # Structures of tables
  # db_sqlite: '../../data/bird/dev_20240627/dev_databases/' # SQLite databases
  # ground_truth_path: '../../data/bird/dev_20240627/' # Ground truth SQL queries for evaluation

inference:
  model_name: "seeklhy/OmniSQL-7B"
  path_model_pretrained: "../../../../hf_cache/Omni7B" # LLM used for making predictions
  max_tokens: 2048 # Maximum number of tokens the LLM is allowed to generate
  temperature: 0.8
  test_time_strategy: "majority_voting" # Choose between best_of_n_orm, best_of_n_heuristic, majority_voting, pass@k
  n: 32 # Number of predictions
  num_samples_per_batch: 32 # Each batch contains this number of samples
  n_decreasing: False # If True, the number of samples per batch decreases with each iteration
  n_step: 1 # Number of steps to decrease the number of samples per batch
  start_query: 0
  offset_start_query: 2147 # Start and offset let you parallelize multi-node inference by writing multiple recipes with different starts and offsets
  timeout_sql: 30 # Timeout for evaluating queries
  limited_rows: "none" # Number of rows extracted when executing a query

orm:
  orm_prompt: "Question: {QUESTION}\nSQL: {SQL}\nIs the SQL correct?" # Prompt used to ask the LLM whether the SQL is correct
  max_new_tokens: 5 # Maximum number of tokens the LLM is allowed to generate when checking the SQL
  use_logits: True # If True, the LLM returns logits for the generated tokens
  dataset_train_version: "v1" # Version of the training dataset
  dataset_test_version: "v1" # Version of the test dataset
  finetuning_version: "v1" # Version of the finetuning script to use
  orm_or_prm: "orm" # Which type of model is being finetuned
  base_model_name: "seeklhy/OmniSQL-7B" # Base model name
  version_model_finetuned: 10 # Used to name the new finetuned model
  resume_from_checkpoint: False # If there is a previous finetuning checkpoint, resume from the latest one
  finetuning_iterations: 1 # Number of finetuning iterations
  r: 16 # LoRA rank of the low-rank update
  lora_alpha: 64 # How much the LoRA update affects the original weights
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] # Transformer modules affected by LoRA finetuning
  lora_dropout: 0.05 # Prevents overfitting
  bias: "none" # Bias terms are not finetuned
  fp16: True # Half-precision floating point
  learning_rate: 7e-5
  per_device_train_batch_size: 5
  num_train_epochs: 50
  upload_at_end: "true" # Upload LoRA weights to HF repositories
  hf_username: "enter_your_HF_username_here"
  hf_token: "enter_your_HF_token_here" # For uploading models
- Ensure dataset paths are correct and accessible. They are resolved relative to the config directory, so the relative paths above point to the data directory placed inside the GradeSQL directory.
- Adjust model parameters based on your computational resources.
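To make the orm_prompt and use_logits settings concrete, here is a minimal sketch of how the template can be filled for a single candidate and scored from the next-token logits of a Hugging Face causal LM. The orm_score helper and the choice of "Yes"/"No" tokens are assumptions for illustration; the framework's actual scoring code may differ:

# Sketch: fill the orm_prompt template and score a candidate from the logits
# of the model's next token (assumed approach; not the framework's exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ORM_PROMPT = "Question: {QUESTION}\nSQL: {SQL}\nIs the SQL correct?"

def orm_score(model, tokenizer, question: str, sql: str) -> float:
    prompt = ORM_PROMPT.replace("{QUESTION}", question).replace("{SQL}", sql)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # "Yes"/"No" token ids are an assumption; the real verifier may use other labels.
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    # Probability mass on "Yes" vs "No" serves as the correctness score.
    return torch.softmax(logits[[yes_id, no_id]], dim=-1)[0].item()

# Usage (model path taken from the recipe; loading details may vary):
# tokenizer = AutoTokenizer.from_pretrained("../../../../hf_cache/Omni7B")
# model = AutoModelForCausalLM.from_pretrained("../../../../hf_cache/Omni7B", torch_dtype=torch.float16, device_map="auto")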
To download the generator model, follow these steps. From the GradeSQL directory, run:
mkdir ../../hf_cache
cd ../../hf_cache
mkdir Omni7B
cd Omni7B
huggingface-cli download seeklhy/OmniSQL-7B --local-dir=./
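To verify the download and see how the recipe's inference parameters map onto candidate sampling, here is a minimal sketch using vLLM, which the project relies on for efficient inference. The prompt is purely illustrative (the framework builds much richer prompts with schema linking), and the model path assumes you run the snippet from the GradeSQL directory:

# Minimal vLLM sketch: sample n candidate completions with the recipe's
# inference parameters (illustrative prompt only).
from vllm import LLM, SamplingParams

llm = LLM(model="../../hf_cache/Omni7B")
params = SamplingParams(n=32, temperature=0.8, max_tokens=2048)

prompt = "-- Write a SQL query that counts the rows of table singer.\nSELECT"
outputs = llm.generate([prompt], params)
for candidate in outputs[0].outputs:
    print(candidate.text)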
Run your configured experiment using the framework's execution command:
cd GradeSQL
$RUN_EXPERIMENT test
What happens during execution:
- The framework loads your configuration from test.yaml
- The generator model first produces all chain-of-thought (CoT) outputs; the SQL queries are then extracted and passed to the test_time_scaling module
- SQL query ranking is performed by the test_time_scaling module (sketched after this list)
- Predictions are cached and reused in later experiments if the configuration is the same
- Results are saved to the results directory
- Progress is displayed in the terminal
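To make the test-time strategies more concrete, here is a minimal sketch of how a strategy like majority_voting can pick a query: each candidate is executed against the SQLite database, results are grouped, and a query from the most common result group is returned. The execute and majority_voting helpers are illustrative only; the framework additionally applies timeout_sql and limited_rows, which this sketch omits:

# Sketch of majority voting over execution results (simplified; not GradeSQL's
# actual test_time_scaling code).
import sqlite3
from collections import Counter

def execute(db_path: str, sql: str):
    """Run a candidate query and return its result set as a hashable value."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(sql).fetchall()
        return frozenset(rows)  # order-insensitive comparison of result sets
    except Exception:
        return None             # invalid SQL produces no vote
    finally:
        conn.close()

def majority_voting(db_path: str, candidates: list[str]) -> str:
    results = {sql: execute(db_path, sql) for sql in candidates}
    votes = Counter(r for r in results.values() if r is not None)
    if not votes:
        return candidates[0]    # fall back if nothing executes
    winner = votes.most_common(1)[0][0]
    # Return one of the queries that produced the most common result.
    return next(sql for sql, r in results.items() if r == winner)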
Expected output:
____ _ ____ ___ _
/ ___|_ __ __ _ __| | ___/ ___| / _ \| |
| | _| '__/ _` |/ _` |/ _ \___ \| | | | |
| |_| | | | (_| | (_| | __/___) | |_| | |___
\____|_| \__,_|\__,_|\___|____/ \__\_\_____|
[Progress]: Running experiment test...
==================================================
Experiment Settings
==================================================
Model Name : seeklhy/OmniSQL-7B
Max Tokens : 2048
N : 32
N Decreasing : False
N Step : 1
Seed : 42
Temperature : 0.8
Start Query : 0
Offset Start Query : 1534
Test Time Strategy : best_of_n_orm
Dataset Type : dev
Spider or Bird : bird
==================================================
[Progress]: Generating batch 1/1 with 32 samples, seed=42
Processed prompts: 100%|██████████| 1534/1534 [41:25<00:00, 1.62s/it]
[Progress]: Loading predictions generated...
[Progress]: Starting test time scaling module...
Choosing the best query: 100%|██████████| 1534/1534 [41:25<00:00, 1.62s/it]
[Progress]: Script completed successfully with exit code 0. Exiting loop.
For comprehensive evaluation across multiple recipes, you can run all experiments in the recipes folder sequentially:
cd GradeSQL
$RUN_BATCH_EXPERIMENTS
Batch processing features:
- Automatically discovers and runs all experiment configurations in the recipes/ folder
- Processes experiments sequentially
- Useful for hyperparameter tuning and model comparison
Navigate to your experiment's results directory to view outputs:
cd GradeSQL/results/test
Results directory structure:
results/test/
├── config.yaml # Copy of your experiment configuration
├── results.txt # High-level metrics and performance
├── predict.json # All queries generated
├── predict_dev.json # Queries chosen
└── logging.log # Execution logs and debugging
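To inspect the outputs programmatically, here is a minimal sketch. The file names follow the layout above; the internal structure of the JSON files is an assumption and may differ between versions:

# Sketch: load the experiment outputs for inspection (JSON structure assumed).
import json
from pathlib import Path

results_dir = Path("results/test")

print((results_dir / "results.txt").read_text())  # high-level metrics

with open(results_dir / "predict.json") as f:
    all_candidates = json.load(f)                 # all generated queries

with open(results_dir / "predict_dev.json") as f:
    chosen = json.load(f)                         # queries chosen by the strategy

print(f"Loaded {len(all_candidates)} generated entries and {len(chosen)} chosen entries.")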
We welcome contributions to improve GradeSQL! Here's how you can get involved:
- Use the GitHub issue tracker to report bugs or request features
- Provide detailed information about your environment and use case
- Include configuration files and error logs when reporting issues
This project is released under the Apache 2.0 License.
- Hugging Face Transformers for model hosting.
- Spider & BIRD datasets.
- Optuna for hyperparameter optimization.
- vLLM for efficient inference.