# MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning

## Introduction
We present MM-Eureka-Qwen, a multimodal reasoning model that successfully extends large-scale rule-based reinforcement learning (RL) to multimodal reasoning. Compared to the previous InternVL-based version of MM-EUREKA, we have made improvements in model architecture, algorithms, and data. Using only out-of-domain training data, MM-Eureka-Qwen achieves significant improvements over Qwen2.5-VL-7B-Instruct across multiple benchmarks (e.g., 73.0 on MathVista). We release all of our code, models, and data at https://github.com/ModalMinds/MM-EUREKA.
Improvements:
- We further iterated on the codebase to support additional algorithms, including Online Filter, ADORA, and DAPO.
- We expanded our K12 dataset to 15,000 high-quality samples.
- We trained the MM-Eureka-Qwen-7B model with GRPO and the Online Filter, achieving better results at a significantly lower cost than the previous version. We open-source our training code and models, and hope to facilitate future studies on multimodal reasoning.
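The Online Filter is not spelled out above; a minimal sketch of the usual idea, assuming binary rule-based correctness rewards (function and variable names are illustrative, not from the repo):

```python
def online_filter(groups):
    """Keep only prompt groups that carry a learning signal.

    groups: list of (prompt, rollout_rewards) pairs, where
    rollout_rewards holds the rule-based reward of each rollout
    sampled for that prompt in the current batch.
    """
    kept = []
    for prompt, rewards in groups:
        # Drop groups where every rollout got the same reward
        # (all correct or all wrong): group-relative advantages
        # would be zero there, so the update learns nothing.
        if len(set(rewards)) > 1:
            kept.append((prompt, rewards))
    return kept
```

Unlike offline filtering, this check runs on freshly sampled rollouts each step, so the kept set tracks the current policy's abilities rather than a fixed pre-computed difficulty estimate.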
## MM-EUREKA-Qwen
Based on the key factors for stable training identified in the original MM-EUREKA (https://github.com/ModalMinds/MM-EUREKA), we enhanced the model, dataset, and algorithmic modules. Specifically, we kept the strategy of omitting the KL divergence term and applying data filtering, while making the following critical modifications:
- The base model was upgraded from InternVL2.5-8B-Instruct to the more powerful Qwen2.5-VL-7B-Instruct.
- The Vision Transformer (ViT) module was kept frozen during training.
- The underlying RL algorithm was switched from the previously used RLOO to GRPO.
- The data filtering strategy was moved from an offline approach to an online approach.
- Additional K12 data was collected, expanding the total dataset to 15,000 samples.
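GRPO's core step, as commonly described, normalizes each rollout's reward against its own group; the following is a minimal sketch of that group-relative advantage with the KL term omitted, as in the setup above (the function name and `eps` cutoff are our illustrative choices):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage for one prompt's G rollouts:
    A_i = (r_i - mean(r)) / std(r). No per-token KL penalty is
    added, mirroring the KL-omission strategy described above."""
    r = np.asarray(group_rewards, dtype=np.float64)
    std = r.std()
    if std < eps:
        # All rollouts scored identically -> zero advantage everywhere;
        # online filtering drops such groups before the policy update.
        return np.zeros_like(r)
    return (r - r.mean()) / std
</imports>

Because the baseline is the group mean rather than a learned value function, no critic network is needed, which is one reason GRPO is cheaper to run than PPO-style methods.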
Finally, MM-EUREKA-Qwen achieves 73.0 on MathVista, surpassing the original Qwen2.5-VL-7B-Instruct by 4.8 points.