Multi-Modal Example Architecture
=================================

Last updated: 04/28/2025.

Introduction
------------

verl now supports multi-modal training. You can use FSDP together with
vLLM or SGLang to start a multi-modal RL task. Megatron support is also
on the way.

Follow the steps below to quickly start a multi-modal RL task.

Step 1: Prepare dataset
-----------------------

.. code:: bash

    # the dataset will be saved in the $HOME/data/geo3k folder
    python examples/data_preprocess/geo3k.py

Step 2: Download Model
----------------------

.. code:: bash

    # download the model from Hugging Face
    python3 -c "import transformers; transformers.pipeline(model='Qwen/Qwen2.5-VL-7B-Instruct')"

Step 3: Perform GRPO training with multi-modal model on Geo3K Dataset
---------------------------------------------------------------------

.. code:: bash

    # run the task
    bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
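
Before launching Step 3, it can help to confirm that Step 1 actually produced the dataset. The following is a minimal sketch, not part of verl: the helper ``missing_dataset_files`` is hypothetical, and the file names ``train.parquet`` and ``test.parquet`` are an assumption about what the preprocessing script writes, so adjust them if your version differs.

```python
from pathlib import Path

# Assumption: the Step 1 script writes these two files into the output folder.
EXPECTED_FILES = ("train.parquet", "test.parquet")


def missing_dataset_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from data_dir."""
    root = Path(data_dir).expanduser()
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Output location mentioned in Step 1: $HOME/data/geo3k
    data_dir = Path.home() / "data" / "geo3k"
    missing = missing_dataset_files(data_dir)
    if missing:
        print(f"Missing in {data_dir}: {', '.join(missing)}. Re-run Step 1.")
    else:
        print(f"Dataset files found in {data_dir}.")
```

Running this check first gives a clear message about missing data up front, rather than a failure partway through the training script's startup.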