RL(HF) Algorithms with LoRA Support

Last updated: 06/05/2025.

We support LoRA (Low-Rank Adaptation) for reinforcement learning algorithms such as PPO and GRPO.

LoRA is a parameter-efficient fine-tuning technique that freezes the pre-trained weights and injects trainable low-rank matrices alongside them (typically into linear layers). This reduces the memory footprint and compute cost, making it possible to fine-tune large models on limited hardware.
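
To make the mechanism concrete, here is a minimal PyTorch sketch of the LoRA forward pass (an illustrative toy example, not verl's implementation; all names and sizes are arbitrary):

import torch

# Toy sizes: one linear layer of shape (d_out, d_in), adapted with
# rank-r factors A and B, scaled by alpha / r.
d_in, d_out, r, alpha = 1024, 1024, 32, 32
x = torch.randn(4, d_in)                 # a batch of activations

W = torch.randn(d_out, d_in)             # frozen pre-trained weight
A = torch.randn(r, d_in) * 0.01          # trainable low-rank "down" projection
B = torch.zeros(d_out, r)                # trainable "up" projection, zero-initialized

# Base output plus the scaled low-rank delta. Only A and B
# (r * (d_in + d_out) parameters) are trained, instead of the
# full d_out * d_in weight matrix.
h = x @ W.T + (alpha / r) * (x @ A.T) @ B.T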

The benefits include:

  • reinforcement learning with very large models (e.g. 70B+) on modest hardware (e.g. 8x 80GB GPUs),

  • larger batch sizes thanks to the reduced memory usage,

  • simpler model transfer and deployment, since only the LoRA adapters need to be saved (see the sketch after this list),

  • efficient serving of multiple LoRA adapters with techniques such as SLoRA or CCoE.
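
For example, because only the adapter weights are checkpointed, a trained policy can be shipped as a small file and merged into the base model at deployment time. A minimal sketch using Hugging Face PEFT (the model name and adapter path below are placeholders):

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the frozen base model, then attach the trained LoRA adapter
# (the adapter checkpoint is small compared to the full model).
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")

# Optionally fold the adapter back into the base weights for serving.
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged_model")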

This guide explains how to enable LoRA in RL training and configure related parameters.

Usage Guide

  1. LoRA is available in verl.trainer.ppo.ray_trainer.RayPPOTrainer. Examples are provided via the verl.trainer.main_ppo entry point.

  2. Currently, LoRA is supported via Hugging Face PEFT, and only with the FSDP/FSDP2 training strategy and the vLLM rollout backend (SGLang support is coming soon):

  • strategy=fsdp or strategy=fsdp2

  • rollout.name=vllm

  3. Required configurations for LoRA (a combined launch sketch follows this list):

  • actor_rollout_ref.model.lora_rank: int, set to a reasonable value greater than 0 (e.g., 8, 16, 32, 64)

  • actor_rollout_ref.model.lora_alpha: float, the alpha scaling term in LoRA (the low-rank update is scaled by alpha / lora_rank)

  • actor_rollout_ref.rollout.load_format="safetensors": required. This enables vLLM to load the base model.

  • actor_rollout_ref.model.target_modules: the target modules for LoRA. Typically set to "all-linear".
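
Putting the backend and required flags together, a minimal LoRA-enabled launch could look like the following (a sketch only; here the strategy is assumed to be exposed as actor_rollout_ref.actor.strategy, and dataset, model, and remaining hyperparameters still need to be configured):

python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.strategy=fsdp \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.load_format=safetensors \
    actor_rollout_ref.model.lora_rank=32 \
    actor_rollout_ref.model.lora_alpha=32 \
    actor_rollout_ref.model.target_modules=all-linear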

  4. Recommended options:

  • actor_rollout_ref.model.use_shm=True: preload the model into /dev/shm to improve model loading speed.

  • actor_rollout_ref.rollout.layered_summon=True: this lets the actor model gather the FSDP shards layer by layer when synchronizing the LoRA adapter to vLLM, reducing peak GPU memory. Recommended if the model is very large (70B+) or GPU memory is limited (< 48GB).

Best Practices and Notes

  1. Learning rate: it is recommended to increase the learning rate by an order of magnitude compared to full-parameter fine-tuning.

  2. LoRA Rank:

  • Too small a rank can hurt convergence.

  • LoRA rank recommendation from @thelongestusernameofall:

    • A very small lora_rank can lead to slower convergence or worse training performance. It is recommended to set lora_rank >= 32. Tests have shown that for a 0.5B model with lora_rank=32, the training convergence speed and final performance are almost identical to non-LoRA training.

    • For a 32B model with lora_rank=128, the training convergence speed and final performance are also almost identical to non-LoRA training.

    • More comprehensive reference results are coming soon.

Reference results: https://github.com/eric-haibin-lin/verl-community/blob/f2b80b8b26829124dd393b7a795a0640eff11644/docs/lora.jpg?raw=true

  3. Reference configuration for RL training with the Qwen2.5-72B model using 8 x 80GB GPUs (increase lora_rank if needed):

data.train_batch_size=64 \
actor_rollout_ref.model.use_shm=True \
actor_rollout_ref.model.lora_rank=32 \
actor_rollout_ref.model.lora_alpha=32 \
actor_rollout_ref.model.target_modules=all-linear \
actor_rollout_ref.actor.optim.lr=3e-5 \
actor_rollout_ref.actor.fsdp_config.fsdp_size=8 \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.rollout.max_num_seqs=64 \
actor_rollout_ref.rollout.max_model_len=1536 \
actor_rollout_ref.rollout.max_num_batched_tokens=1536 \
actor_rollout_ref.rollout.load_format=safetensors \
actor_rollout_ref.rollout.layered_summon=True \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=1

Example Script

For an end-to-end example, refer to the script below:

examples/grpo_trainer/run_qwen2_5-3b_gsm8k_grpo_lora.sh
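
From the repository root, the script can be run directly, assuming the dataset and model it references are available in your environment:

bash examples/grpo_trainer/run_qwen2_5-3b_gsm8k_grpo_lora.sh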