Algorithm Baselines
Last updated: 06/18/2025.
Math-related datasets
GSM8k
Assuming the GSM8k/MATH datasets have been preprocessed via the corresponding script under examples/data_preprocess (e.g. gsm8k.py for GSM8k):

```bash
python3 examples/data_preprocess/*.py
```
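If preprocessing succeeded, the resulting parquet files can be sanity-checked before launching training. A minimal sketch, assuming the output landed in ~/data/gsm8k (this path is an assumption about the script's default output location; adjust to your preprocessing arguments):

```python
# Quick sanity check of the preprocessed GSM8k parquet output.
# The ~/data/gsm8k path is an assumed default; adjust to your setup.
import os
import pandas as pd

df = pd.read_parquet(os.path.expanduser("~/data/gsm8k/train.parquet"))
print(df.columns.tolist())       # inspect the schema the preprocessing script produced
print(len(df), "training examples")
```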
Refer to the table below to reproduce RL training from different pre-trained checkpoints. Unless specified otherwise, scores are on the GSM8k test set. More comprehensive benchmark results are available in the recipe folder.
| Hardware | Model | Method | Test score | Details |
|---|---|---|---|---|
| NVIDIA GPU | google/gemma-2-2b-it | hf checkpoint | 23.9 | |
| NVIDIA GPU | google/gemma-2-2b-it | SFT | 52.06 | |
| NVIDIA GPU | google/gemma-2-2b-it | SFT + PPO | 64.02 | |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | hf checkpoint | 36.4 | |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PRIME | 58.7 | |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | GRPO-LoRA | 54.3 | |
| NVIDIA GPU | Qwen/Qwen2.5-1.5B-Instruct | GRPO-LoRA | 77.9 | |
| NVIDIA GPU | Qwen/Qwen2.5-3B-Instruct | GRPO-LoRA | 86.1 | |
| NVIDIA GPU | deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron) | 69.5 [1] | |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO | 89 | |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (FSDP2) | 89.8 | |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (Megatron) | 89.6 | |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | ReMax | 97 | |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPPO | 65.6 (MATH) | |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | GRPO-LoRA | 93.4 | |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | Instruct model | 83.7 | |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | RLOO (Megatron) | 92.3 | |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPIN | 92 | |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG | 88 | |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG (Megatron) | 88 | |
| NVIDIA GPU | Qwen/Qwen2.5-VL-7B-Instruct | GRPO (Megatron) | 65.4 (GEO3k) | |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | PPO | 70.5 [1] | |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | GRPO | 71.4 [1] | |
| NVIDIA GPU | Qwen/Qwen2.5-14B-Instruct | GRPO-LoRA | 94.6 | |
| NVIDIA GPU | Qwen/Qwen2.5-32B-Instruct | GRPO-LoRA | 95.8 | |
| NVIDIA GPU | Qwen/Qwen2.5-72B-Instruct | GRPO-LoRA | 96.0 | |
DAPO math-17k
Training set: DAPO-Math-17k: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k
Test set: AIME’24: https://huggingface.co/datasets/BytedTsinghua-SIA/AIME-2024
Note:
For Qwen/Qwen2.5-Math-7B, we directly modify max_position_embeddings to 32768 so that longer response lengths can be trained; we did not observe performance degradation from this change.
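A minimal sketch of one way to apply such a change, editing a local copy of the checkpoint config via transformers (the exact mechanism used for the reported run may differ):

```python
# Hedged sketch: extend the context window of a local copy of the checkpoint.
# The local output path is an arbitrary example; this is one way to apply the
# change described above, not necessarily how the reported run did it.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-Math-7B")
config.max_position_embeddings = 32768  # raised from the shipped default
config.save_pretrained("./Qwen2.5-Math-7B-32k")  # point training at this local path
```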
| Hardware | Model | Method | Test score | Details |
|---|---|---|---|---|
| NVIDIA GPU | Qwen/Qwen2.5-Math-7B (32k) | DAPO | 36.3 | |
Coding-related datasets
Unless specified otherwise, the results below are on the LeetCode dataset.
| Hardware | Model | Method | Test score | Details |
|---|---|---|---|---|
| NVIDIA GPU | PRIME-RL/Eurus-2-7B-SFT | PRIME | 36.1 | |
Notes
[1] During evaluation, we only extracted answers following the format "####". A more flexible answer extraction, longer response lengths, and better prompt engineering may lead to higher scores.
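For illustration, a minimal sketch of such a strict extraction rule; the regex and normalization are assumptions, not verl's exact evaluation code:

```python
# Hedged sketch of the strict "####"-based extraction described in [1].
import re

def extract_answer(completion: str):
    """Return the text after '####', or None if the marker is absent."""
    match = re.search(r"####\s*(.+)", completion)
    if match is None:
        return None  # completions without "####" score zero under this rule
    return match.group(1).strip().replace(",", "")  # e.g. "1,234" -> "1234"

print(extract_answer("She sells 5 * 12 = 60 eggs. #### 60"))  # -> "60"
print(extract_answer("The final answer is 60."))              # -> None
```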
[2] The default value of actor_rollout_ref.actor.entropy_coeff has been 0.0 since verl 0.3.x (2025-05-30), which differs from previous versions.
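To make the effect of this flag concrete, below is a hedged sketch of how an entropy bonus typically enters a PPO-style actor objective; the tensor names are illustrative and this is not verl's exact implementation:

```python
# Hedged sketch: the role of entropy_coeff in a PPO-style actor loss.
# pg_loss and entropy stand in for the trainer's internal tensors.
import torch

def actor_loss(pg_loss: torch.Tensor, entropy: torch.Tensor,
               entropy_coeff: float = 0.0) -> torch.Tensor:
    # With entropy_coeff = 0.0 (the default since verl 0.3.x), the bonus term
    # vanishes and the objective reduces to the plain policy-gradient loss.
    return pg_loss - entropy_coeff * entropy
```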