Performance Tuning Guide
Last updated: 06/23/2025.
Authors: Guangming Sheng, Jiali Zheng
In this section, we will discuss how to tune the performance of all the stages in verl, including:
- Rollout generation throughput
- Enable use_remove_padding=True for sequence packing (i.e., data packing and removing padding)
- Batch size tuning for forward and backward computation
- Enable use_dynamic_bsz=True for higher throughput
- Utilize Ulysses Sequence Parallel for long context training
- LigerKernel for SFT performance optimization
- Forward prefetch in the FSDP training backend
- Memory optimization for entropy calculation from logits
Rollout Generation Tuning
verl currently supports two rollout backends: vLLM and TGI (with SGLang support coming soon).
Below are key factors for tuning vLLM-based rollout. Before tuning, we recommend setting actor_rollout_ref.rollout.disable_log_stats=False so that rollout statistics are logged.
- Increase gpu_memory_utilization. For vLLM v0.7.0 and later, the vLLM instance will only use gpu_memory_utilization of the total memory. For SGLang, it is the fraction of the free GPU memory used for static memory such as model weights and the KV cache; however, the remaining (1 - gpu_memory_utilization) will also be used during inference. In either case, if model parameters and optimizer states are not offloaded, using too high a fraction can lead to OOM. A value between 0.5 and 0.7 often strikes a good balance between high throughput and avoiding OOM. Note: since the definition of gpu_memory_utilization varies across inference engines, a value that works well for one engine may cause OOM for another.
- Adjust max_num_seqs or max_num_batched_tokens. If the GPU cache utilization is relatively low in the log, increasing max_num_seqs or max_num_batched_tokens can enlarge the effective batch size in the decoding stage, allowing more concurrent requests per batch. We recommend setting max_num_batched_tokens > 2048 for higher throughput.
- Use a smaller tensor_parallel_size. When GPU resources allow, a smaller tensor parallel size spawns more vLLM replicas. Data parallelism (DP) can yield higher throughput than tensor parallelism (TP), but also increases KVCache consumption. Carefully balance the trade-off between more replicas and higher memory usage. Our experiment in Sec. 8.4 of the HybridFlow paper evaluates this trade-off.
More tuning details, such as dealing with preemption and chunked-prefill, can be found in the vLLM official tuning guide.
For optimal performance, we recommend using vLLM v0.8.3 or later. See https://github.com/volcengine/verl/blob/main/docs/README_vllm0.8.md for details.
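As a reference, these knobs are usually passed as Hydra-style overrides on the training command line. The snippet below is an illustrative sketch only: the numeric values are placeholders, and key names such as tensor_model_parallel_size or max_num_batched_tokens should be verified against the rollout config of your verl version.

    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.rollout.disable_log_stats=False \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
        actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
        actor_rollout_ref.rollout.max_num_batched_tokens=8192 \
        "$@"   # plus your usual data/model/trainer settings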
Enable remove padding (sequence packing)
Currently, for llama, mistral, gemma1 and qwen based models, users can enable use_remove_padding=True to utilize the sequence packing implementation provided by the transformers library.
For other models, the transformers library may also support it, but we haven't tested it yet. Users can add the desired model config to the test_transformer.py file and test its functionality by running the following command:
pytest -s tests/models/test_transformer.py
If the test passes, you can add your desired model to the registry.py file. Then you can enjoy the performance boost of sequence packing, and you are welcome to PR your tested model to verl!
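For reference, enabling sequence packing in a PPO run is typically a pair of Hydra-style overrides. The sketch below assumes the standard main_ppo entrypoint; verify the key paths against your verl version.

    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.model.use_remove_padding=True \
        critic.model.use_remove_padding=True \
        "$@"   # plus your usual data/model/trainer settings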
Batch Size Tuning
To achieve higher throughput in experience preparation (i.e., model fwd) and model update (i.e., actor/critic fwd/bwd), users may need to tune the *micro_batch_size_per_gpu parameters for the different computations.
In verl, the core principles for setting batch sizes are:
- Algorithmic metrics (train batch size, PPO mini-batch size) are global (from a single-controller perspective), normalized in each worker. See the normalization code.
- Performance-related parameters (micro batch size, max token length for dynamic batch size) are local parameters that define the per-GPU data allocations. See the normalization code.
Note
In your training script, please use *micro_batch_size_per_gpu instead of *micro_batch_size, so that you don't need to account for its normalization across workers. The *micro_batch_size parameters will be deprecated.
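To make the distinction concrete, here is a hypothetical 8-GPU example; the numbers are purely illustrative, not recommendations.

    # Global values, normalized in each worker (assuming 8 GPUs, i.e., a data-parallel size of 8):
    #   data.train_batch_size=1024                        -> 1024 / 8 = 128 samples per GPU per rollout step
    #   actor_rollout_ref.actor.ppo_mini_batch_size=256   -> 256 / 8  = 32 samples per GPU per PPO update
    # Local value, already per GPU (no normalization):
    #   actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 -> 4 samples per fwd/bwd pass on each GPU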
Batch Size Tuning tips
Therefore, users may need to tune the *micro_batch_size_per_gpu parameters to accelerate training. Here are some tips:
- Enable gradient checkpointing: Set actor_rollout_ref.model.enable_gradient_checkpointing=True and critic.model.enable_gradient_checkpointing=True. This often allows for larger micro-batch sizes and is beneficial for large mini-batch training.
- Increase *micro_batch_size_per_gpu as much as possible, until it equals the normalized mini_batch_size.
- Use larger forward-only parameters: Forward-only parameters, such as actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu, actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu and critic.forward_micro_batch_size_per_gpu, can be larger (e.g., 2x) than training-related micro batch sizes, such as actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu and critic.ppo_micro_batch_size_per_gpu.
- Allow larger micro-batch sizes for Critic and Reward models: The micro batch size of the Critic and Reward models can be larger than that of the Actor model. This is because the actor model has a much larger vocab size in the final layer.
- Enable activation offloading: Set actor_rollout_ref.model.enable_activation_offload=True and critic.model.enable_activation_offload=True. This often works together with gradient checkpointing to allow larger micro-batch sizes, and it is currently only available in the FSDP backend.
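As an illustration, the overrides below sketch how these micro batch sizes might be set together. The specific values are hypothetical and should be tuned for your model size and GPU memory.

    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        critic.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
        critic.ppo_micro_batch_size_per_gpu=8 \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
        critic.forward_micro_batch_size_per_gpu=16 \
        "$@"   # plus your usual data/model/trainer settings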
Tuning for Dynamic Batch Size
Dynamic batch size is a technique that allows the model to process a similar number of tokens in each forward pass (with different actual batch sizes). This can significantly improve training efficiency and reduce memory usage.
To utilize this technique, users can set use_dynamic_bsz=True in the actor, ref, critic and reward models.
With use_dynamic_bsz=True, users don't need to tune *micro_batch_size_per_gpu.
Instead, users should tune the following parameters:
- actor_rollout_ref.actor.ppo_max_token_len_per_gpu and critic.ppo_max_token_len_per_gpu: The maximum number of tokens to be processed in the fwd and bwd passes of update_policy and update_critic.
- actor_rollout_ref.ref.log_prob_max_token_len_per_gpu and actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu: The maximum number of tokens to be processed in the fwd computation of compute_log_prob and compute_ref_log_prob.
- critic.forward_max_token_len_per_gpu and reward_model.forward_max_token_len_per_gpu: The maximum number of tokens to be processed in the fwd computation of compute_values and compute_rm_score.
Dynamic Batch Size Tuning tips
Here are some tips for tuning the above parameters:
- Increase actor_rollout_ref.actor.ppo_max_token_len_per_gpu: Make it at least 2 x (max_prompt_length + max_response_length). We set it to 3x in run_qwen2-7b_rm_seq_balance.sh. Try to increase it to get higher throughput.
- Forward-only parameters can be larger: Similar to the non-dynamic-batch scenario, forward-only token limits can exceed those used in forward/backward operations.
- Use larger limits for Critic and Reward models: Critic and Reward parameters can be set to at least 2× the Actor's limits. For instance, we set them to 4× here: run_qwen2-7b_rm_seq_balance.sh
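Putting the above together, a dynamic-batch-size configuration might look like the sketch below. It assumes hypothetical values max_prompt_length=1024 and max_response_length=1024; scale the limits for your own sequence lengths and verify the key paths against your verl config.

    # ppo_max_token_len_per_gpu = 3 x (1024 + 1024) = 6144; forward-only and critic limits set 2x higher
    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.actor.use_dynamic_bsz=True \
        critic.use_dynamic_bsz=True \
        actor_rollout_ref.actor.ppo_max_token_len_per_gpu=6144 \
        critic.ppo_max_token_len_per_gpu=12288 \
        actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=12288 \
        actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=12288 \
        "$@"   # plus your usual data/model/trainer settings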
Ulysses Sequence Parallel for Long Context Training
To utilize this technique, users can set ulysses_sequence_parallel_size>1 in the actor, ref, critic and reward models.
We support different models using different ulysses_sequence_parallel_size values.
To train with long sequences (>32k tokens), users may need to decrease *micro_batch_size_per_gpu and *max_token_len_per_gpu to avoid OOM.
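As a sketch, enabling Ulysses sequence parallelism for the actor and critic might look like the following; a size of 4 is purely illustrative, and the ref and reward models can be configured analogously (check your config for the exact key paths).

    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.actor.ulysses_sequence_parallel_size=4 \
        critic.ulysses_sequence_parallel_size=4 \
        "$@"   # plus your usual data/model/trainer settings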
LigerKernel for SFT
LigerKernel is a high-performance kernel for Supervised Fine-Tuning (SFT) that can improve training efficiency. To enable LigerKernel in your SFT training:
1. Install liger-kernel via pip3 install liger-kernel.
2. In your SFT configuration file (e.g., verl/trainer/config/sft_trainer.yaml), set the use_liger parameter:

    model:
      use_liger: True  # Enable LigerKernel for SFT

The default value is False. Enable it only when you want to use LigerKernel's optimizations. LigerKernel is particularly useful for improving training performance in SFT scenarios.
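Alternatively, the flag can be passed on the command line. The sketch below assumes the FSDP SFT trainer entrypoint and 8 GPUs on a single node; it is illustrative only, with the remaining data/model/optim arguments omitted.

    torchrun --standalone --nnodes=1 --nproc_per_node=8 \
        -m verl.trainer.fsdp_sft_trainer \
        model.use_liger=True \
        "$@"   # plus your usual data/model/optim settings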
Forward prefetch in FSDP training backend
During the training phase, users can enable forward prefetching in FSDP by setting fsdp_config.forward_prefetch=True, e.g., actor_rollout_ref.actor.fsdp_config.forward_prefetch=True. This configuration prefetches the next forward-pass all-gather operation before completing the current forward computation, overlapping communication with computation and improving efficiency. For further details, refer to the FSDP forward_prefetch documentation.
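For instance, here is a sketch of enabling it for the actor, reference, and critic workers; the exact fsdp_config key paths are assumed here and should be verified against your verl version.

    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.actor.fsdp_config.forward_prefetch=True \
        actor_rollout_ref.ref.fsdp_config.forward_prefetch=True \
        critic.model.fsdp_config.forward_prefetch=True \
        "$@"   # plus your usual data/model/trainer settings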
Note
Backward prefetch is unsupported because the BACKWARD_POST policy may prefetch incorrectly in nested-module cases. For details, see the FSDP documentation.
Memory optimization for entropy calculation from logits
The logits tensor (typically of shape [bsz*seq_len, voc]) can consume significant memory. When using compute_entropy_from_logits, memory usage reaches approximately [bsz*seq_len, voc] × (4 bytes (float32) + 2 bytes (autocast for softmax+logsumexp) + 1 byte (softmax output)).
To reduce this memory peak, enable chunked computation by setting:
actor_rollout_ref.ref.entropy_from_logits_with_chunking = True
This processes the tensor in chunks of shape [chunk_size, voc] (e.g., 2048) rather than over the full sequence length, exclusively during the model's forward pass.
Additionally, during training, standard gradient checkpointing (enable_gradient_checkpointing=True) does not apply to entropy calculations. To reduce memory peaks in this context, set:
actor_rollout_ref.actor.entropy_checkpointing = True
This enables entropy recomputation specifically for the entropy calculation, lowering memory usage during training.
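Both options can be combined in one run. The sketch below assumes the actor-side chunking key mirrors the ref-side one shown above; verify the exact names against your verl config.

    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.ref.entropy_from_logits_with_chunking=True \
        actor_rollout_ref.actor.entropy_from_logits_with_chunking=True \
        actor_rollout_ref.actor.entropy_checkpointing=True \
        "$@"   # plus your usual data/model/trainer settings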