Training DeepSeek 671b

Last updated: 06/13/2025.

verl integrates Megatron to support large MoE models such as Qwen3-235B-A22B and deepseek-ai/DeepSeek-V3. This is an ongoing community effort.

Along the way, the community added the following features and optimizations that enable verl to scale to larger models:

  • per-tensor weight resharding between rollout and training

  • context parallelism and expert parallelism enabled via Megatron

  • dynamic batch size (sequence balancing) for Megatron, as shown in the sketch after this list

  • reduced Ray-related serialization overhead

  • optimizer offloading, recomputation, and efficient kernels

  • various debugging metrics and utilities

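Dynamic batch sizing, for instance, is controlled through hydra-style overrides on the trainer entrypoint. Below is a minimal sketch, assuming the key names use_dynamic_bsz and ppo_max_token_len_per_gpu; verify them against the config shipped with your verl version:

    # Sketch: enable sequence-balanced dynamic batching via hydra overrides.
    # Key names are assumptions and may differ across verl versions; the data
    # and model overrides required for a real run are omitted here.
    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.actor.use_dynamic_bsz=True \
        actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24576
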
The Megatron backend now supports a wider range of models:

  • DeepSeek-V3

  • Moonlight

  • Qwen3

  • Qwen2.5-VL (to be merged soon)

  • Qwen2

  • Mixtral

Getting Started

DeepSeek 671b

The recommended image with pre-built Megatron dependencies is whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.1-te2.3-deepseekv3, built from the Dockerfile at docker/Dockerfile.vllm.sglang.megatron.deepseek.
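
To get started, pull the image and launch an interactive container. Plain Docker is shown here for illustration; adapt this to your cluster's launcher, and note that the mount path is a placeholder:

    docker pull whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.1-te2.3-deepseekv3

    # Launch an interactive shell with all GPUs visible; /path/to/verl is a
    # placeholder for your local verl checkout.
    docker run --gpus all --rm -it \
        --shm-size=32g \
        -v /path/to/verl:/workspace/verl \
        whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.1-te2.3-deepseekv3 \
        bash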

For checkpoint loading, we rely on Megatron dist-ckpt for resharding. A converted dist-ckpt for DeepSeek-V3 is available on Hugging Face at BearBiscuit05/dpsk-v3-671B-BF16-dist_ckpt.
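
The following sketch downloads the converted checkpoint and points the Megatron workers at it. huggingface-cli is the standard Hugging Face CLI; the two override keys (use_dist_checkpointing, dist_checkpointing_path) are assumptions to verify against your verl version:

    # Download the pre-converted dist-ckpt (the 671B BF16 checkpoint is very
    # large; make sure the target disk has sufficient space).
    huggingface-cli download BearBiscuit05/dpsk-v3-671B-BF16-dist_ckpt \
        --local-dir /data/dpsk-v3-671B-BF16-dist_ckpt

    # Point the trainer at the dist-ckpt (key names are assumptions).
    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
        actor_rollout_ref.actor.megatron.dist_checkpointing_path=/data/dpsk-v3-671B-BF16-dist_ckpt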

To run end-to-end training on the DAPO dataset, use recipe/dapo/test_dapo_dspk_671b_megatron.sh. It runs on 512 H20 (96GB) GPUs with the following setup:

  • vLLM rollout with TP=32, bfloat16

  • Megatron training with attention DP, MoE EP=32, PP=16, bfloat16

MTP (multi-token prediction) is disabled during RL training.
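
This parallelism layout roughly corresponds to overrides like the sketch below. The exact key names (e.g. expert_model_parallel_size) are assumptions, so treat the recipe script as the authoritative source:

    # Sketch of the parallelism overrides behind the 512-GPU DAPO run.
    # Key names are assumptions, and the DAPO recipe may use its own
    # entrypoint; see recipe/dapo/test_dapo_dspk_671b_megatron.sh.
    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.tensor_model_parallel_size=32 \
        actor_rollout_ref.rollout.dtype=bfloat16 \
        actor_rollout_ref.actor.megatron.expert_model_parallel_size=32 \
        actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=16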

Qwen3 236b

For Qwen3-236b, please refer to examples/grpo_trainer/run_qwen3-236b_megatron.sh, which runs on 128 H20 (96GB) GPUs.
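
Assuming a working verl installation, with the dataset and model paths inside the script adjusted for your environment, the launch reduces to:

    bash examples/grpo_trainer/run_qwen3-236b_megatron.sh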

Upcoming Optimizations

The community continues to optimize training for large MoE models; ongoing efforts include:

  • further reducing memory consumption and providing recommended/tuned configurations for various machine types

  • optimizing long-context RL training performance

  • performance improvements with SGLang x Megatron

We invite the community to try verl and improve it together. Connect with us on Slack, WeChat, or GitHub issues!

Acknowledgement

@vermouth1992 @ISEEKYAN @ETOgaosion @yzlnew @ShareLer @BearBiscuit05 @ccclyu @ann-qin-lu @SwordFaith @zzong2006 @zhaochenyang20 @ocss884 @eric-haibin-lin