verl

Quickstart

  • Installation
  • Quickstart: PPO training on GSM8K dataset
  • Multinode Training
  • Ray Debug Tutorial
  • More Resources

Programming guide

  • HybridFlow Programming Guide
  • The Design of verl.single_controller

Data Preparation

  • Prepare Data for Post-Training
  • Implement Reward Function for Dataset

Configurations

  • Config Explanation

PPO Example

  • PPO Example Architecture
  • GSM8K Example
  • Multi-Modal Example Architecture

Algorithms

  • Proximal Policy Optimization (PPO)
  • Group Relative Policy Optimization (GRPO)
  • Recipe: Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)
  • Recipe: Self-Play Fine-Tuning (SPIN)
  • Recipe: Self-Play Preference Optimization (SPPO)
  • Recipe: Entropy Mechanism
  • On-Policy RL with Optimal Reward Baseline (OPO)
  • Algorithm Baselines
  • Group Policy Gradient (GPG)

PPO Trainer and Workers

  • PPO Ray Trainer
  • PyTorch FSDP Backend
  • Megatron-LM Backend
  • SGLang Backend

Performance Tuning Guide

  • Training DeepSeek 671b
  • Performance Tuning Guide
  • Upgrading to vLLM >= 0.8
  • Hardware Resource Needed for RL
  • NVIDIA Nsight Systems profiling in verl

Adding new models

  • Add models with the FSDP backend
  • Add models with the Megatron-LM backend

Advanced Features

  • Using Checkpoints to Support Fault-Tolerant Training
  • RoPE Scaling Override
  • RL(HF) algorithms with LoRA Support
  • Multi-turn Rollout Support
  • Interaction System for Multi-turn RL Training
  • Ray API Design Tutorial
  • Extend to other RL(HF) algorithms
  • Sandbox Fusion Example

Hardware Support

  • Getting started with AMD (ROCm Kernel)
  • verl performance tuning for AMD (ROCm Kernel)
  • verl x Ascend

API References

  • Data Interface
  • Single Controller Interface
  • Trainer Interface
  • Utilities

FAQ

  • Frequently Asked Questions

Development Notes

  • Sandbox Fusion Tool Integration