Recipe: Self-Play Fine-Tuning (SPIN)
Last updated: 05/31/2025.
verl provides a recipe inspired by the paper "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models" (SPIN). SPIN is a language model fine-tuning algorithm that enables iterative self-improvement through a self-play mechanism inspired by game theory.
Core Idea: Models learn by playing against themselves, reducing reliance on external preference datasets or stronger teacher models:
Synthetic Data Generation: The current model generates responses, creating its own training data from previous iterations.
Two-Player Game Setup: A two-player game in which both roles are played by the same LLM, taken from different iterations.
Iterative Training: The model progressively improves by refining its policy, with each iteration’s model becoming the opponent for the next iteration.
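The loop below is a minimal structural sketch of this idea in Python. The helper callables (generate, label_pairs, dpo_step, clone) are placeholders supplied by the caller for illustration only; they are not functions from this recipe, whose actual logic lives in SpinTrainer.

```python
def self_play_loop(policy, prompts, generate, label_pairs, dpo_step, clone, iterations=3):
    """Structural sketch of SPIN-style self-play training.

    The caller supplies the callables: `generate(policy, batch)` returns
    candidate responses, `label_pairs(batch, responses)` builds chosen/rejected
    pairs, `dpo_step(policy, opponent, pairs)` updates the policy, and
    `clone(policy)` snapshots weights for the opponent/reference model.
    """
    opponent = clone(policy)  # the first iteration plays against a copy of itself
    for _ in range(iterations):
        for batch in prompts:
            responses = generate(policy, batch)
            pairs = label_pairs(batch, responses)
            policy = dpo_step(policy, opponent, pairs)
        # The freshly trained model becomes the opponent for the next iteration.
        opponent = clone(policy)
    return policy
```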
Paper Authors: Zixiang Chen*, Yihe Deng*, Huizhuo Yuan*, Kaixuan Ji, Quanquan Gu
[Webpage] [Huggingface] [Paper] [Original Implementation]
verl Implementation Authors: Chendong Wang, Chenyang Zhao
Our Online DPO Implementation
Our compute_online_dpo_loss function adapts verl's existing PPO infrastructure (based on verl v0.3.0.post1) for this iterative online DPO. Key aspects of our implementation include:
No Critic: Unlike PPO, we omit the value function critic.
Dynamic Reference Model: An explicit reference policy (ref_policy_wg) is used for the DPO loss. This reference model's weights can be periodically updated from the actor (ref_update_freq), providing a dynamic baseline.
Online Preference Generation: The compute_onlineDPO_pref function (in core_algos.py) dynamically creates chosen/rejected pairs based on a reward source (e.g., rule-based ranking for math problems).
DPO Loss Integration: We replace PPO's policy loss with our compute_online_dpo_loss (in core_algos.py) within the actor update (dp_actor.py), directly optimizing the policy using the generated preferences.
Iterative Training Orchestration: The SpinTrainer (in spin_trainer.py) manages the entire self-play loop: generation, preference labeling, optional reference model updates, and policy updates, enabling continuous self-improvement aligned with SPIN's principles.
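For reference, the core of a DPO-style preference loss can be written in a few lines of PyTorch. The sketch below is illustrative only: it assumes per-response summed log-probabilities are already computed, and its signature may differ from the actual compute_online_dpo_loss in core_algos.py; beta corresponds to the dpo_beta hyperparameter.

```python
import torch
import torch.nn.functional as F

def online_dpo_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                           ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative DPO (sigmoid) loss.

    Each argument is a 1-D tensor of summed token log-probabilities of the
    chosen/rejected responses under the current policy or the reference policy.
    """
    # Log-ratios of policy vs. reference for chosen and rejected responses.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Prefer the chosen response over the rejected one.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss
```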
Algorithm
This recipe implements an online DPO algorithm adapted to the verl reinforcement learning framework, providing an alternative to PPO for fine-tuning language models.
Online Loop: Instead of maximizing a scalar reward signal as in PPO, this approach directly optimizes the policy model to align with preference data generated online during training:
Generation: The current model generates multiple responses for each prompt in a batch.
Preference Labeling: A function evaluates these generated responses to determine which one is preferred (chosen) and which is dispreferred (rejected). This can be done using a reward function or implicit ranking based on specific rules. (In this recipe, we use rule-based ranking on math problems.)
Update: This preference tuple (prompt, chosen_response, rejected_response) is used to update the actor model using compute_online_dpo_loss, comparing against a reference model.
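As an illustration of the preference-labeling step, a rule-based labeler for math problems can score each generated response against the ground-truth answer and pair the best and worst ones. The helper below is a hypothetical sketch, not the actual compute_onlineDPO_pref implementation; score_fn is an assumed scoring callable.

```python
def label_preferences_sketch(prompt, responses, ground_truth, score_fn):
    """Illustrative rule-based preference labeling for a single prompt.

    `responses` is a list of generated strings; `score_fn(response, ground_truth)`
    returns a scalar reward (e.g., 1.0 if the extracted final answer matches,
    0.0 otherwise). Returns a (prompt, chosen, rejected) tuple, or None if all
    responses tie and no informative pair can be formed.
    """
    scores = [score_fn(r, ground_truth) for r in responses]
    best, worst = max(scores), min(scores)
    if best == worst:
        return None  # no preference signal for this prompt
    chosen = responses[scores.index(best)]
    rejected = responses[scores.index(worst)]
    return prompt, chosen, rejected
```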
Connection with SPIN: Instead of relying only on a fixed target data distribution, the online generation and preference-labeling loop (steps 1-2) dynamically changes the target data distribution through the chosen labeling method (in this recipe, rule-based ranking that selects the better response to a math problem). This explores the direction discussed in Section 7 of the SPIN paper about a "dynamically changing target data distribution", which may lift LLM performance beyond the ceiling of fixed human-annotated data.
Reproduce the Experiment (Example Setup)
The following steps outline how to set up the environment and run the SPIN recipe, based on the provided test log using GSM8K and Qwen2.5-3B-Instruct.
Setup Environment (Example using Docker):
# Start a container with GPU access and shared memory
docker run -it --name spin_test --gpus all \
  --shm-size=32g \
  --ipc=host \
  -v /path/to/host/.cache:/root/.cache \
  -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
  lmsysorg/sglang:latest \
  /bin/bash

# Inside the container or on your host machine:
# Ensure /tmp is writable
mkdir -p /tmp
chmod 1777 /tmp

# Install Python 3.10 (if not present) and venv
sudo apt update
sudo apt install -y python3.10 python3.10-venv tmux
python3 -m ensurepip --upgrade

# Create and activate a virtual environment
python3 -m venv ~/.python/spin_env
source ~/.python/spin_env/bin/activate

# Install uv (fast package installer)
python3 -m pip install uv
Install verl and Dependencies:
# Clone the verl repository and checkout the spin branch
cd ~
git clone git@github.com:volcengine/verl.git && cd verl

# Install flash-attn (handle potential build issues)
python3 -m uv pip install wheel packaging
python3 -m uv pip install flash-attn --no-build-isolation --no-deps

# Install verl with sglang extras
python3 -m uv pip install -e ".[sglang]"
Note: If flash-attn installation fails, try the manual steps again or consult its documentation.
Login & Download Data/Model:
# Login to Weights & Biases (optional, for logging)
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
# wandb login

# Download the GSM8K dataset
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k  # Adjusted path

# Download the base model (Example: Qwen2.5-3B-Instruct)
huggingface-cli download Qwen/Qwen2.5-3B-Instruct --local-dir $HOME/models/Qwen2.5-3B-Instruct
Configure:
Modify the configuration file (e.g., config/spin_trainer.yaml or the one specified in the run script) with the correct paths to your downloaded model and data, the desired hyperparameters (dpo_beta, learning rate, etc.), and the distributed training settings (nodes, GPUs per node).
Pay attention to actor_rollout_ref.model_path, data paths, reward_model config (if using one), and trainer.ref_update_freq.
Run Training:
# Set CUDA visible devices (adjust based on your hardware and config)
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Launch the training script (e.g., run_spin.sh or a custom script)
# Ensure the script points to the correct config and main entry point
bash recipe/spin/run_spin.sh
Configuration
The primary configuration is typically managed through a YAML file specified in the launch script (e.g., config/spin_trainer.yaml).
Key configuration sections:
data: Paths to training/validation prompt files, batch sizes, sequence lengths.
actor_rollout_ref: Paths to the base model (used for the actor and initial reference), FSDP settings, optimization parameters (learning rate, scheduler).
reward_model: Configuration for the reward model used for online preference labeling (path, batch size, etc.). Can be omitted if using a simpler reward function.
algorithm: DPO-specific hyperparameters like dpo_beta, dpo_loss_type.
trainer: Distributed training settings (nodes, GPUs per node), logging (WandB), checkpointing frequency, and ref_update_freq (set > 0 to enable periodic reference model updates from the actor).
Key Files
main_spin.py: Main entry point using Hydra to load the config and launch the SpinTrainer.
spin_trainer.py: Defines the SpinTrainer class, orchestrating the online DPO training loop.
fsdp_workers.py: Implements Ray workers (Actor, Reference), potentially using FSDP.
dp_actor.py: Contains the actor class, including the DPO policy update logic.
core_algos.py: Includes helper functions for compute_online_dpo_loss and compute_onlineDPO_pref.
config/spin_trainer.yaml (or similar): Main Hydra configuration file for the recipe.
run_spin.sh (or similar): Example bash script for launching a training run.
README.md: This file.
Acknowledgement
We sincerely thank the verl community and advisors for their contributions and guidance, including (adapted from SPPO):