One RL stage uses 22k data pairs focusing on textual accuracy.
The paper addresses the "SFT plateau," a phenomenon where the performance of Large Language Models (LLMs) trained with Supervised Fine-Tuning (SFT) stops improving even as the dataset size increases [11, 22]. The authors use chart-to-code data to demonstrate this limitation and propose Multimodal Structured Reinforcement Learning (MSRL) as a solution [11, 22].
2. Methodology
Supervised Fine-Tuning (SFT) Phase :
Baseline Model : Qwen2.5-VL-7B-Instruct [11, 22].
The paper demonstrates that MSRL significantly outperforms pure SFT models by optimizing for both textual structure and visual fidelity, surpassing the performance limit reached at 2.8M SFT samples [11, 25].

| | SFT Stage | MSRL Stage |
|---|---|---|
| Max Dataset Size | 2.8 million samples [11, 22] | 33k curated samples [11] |
| GPU Requirement | 16 H800 GPUs [11] | 24 H800 GPUs [11] |
| Training Goal | Min. Negative Log-Likelihood [22] | Hybrid Text-Visual Reward [11] |
| Outcome | Performance Plateaus [22] | Breaks SFT Performance Limit [11] |
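The two training goals in the comparison above can be illustrated with a minimal sketch. The negative log-likelihood function follows the standard definition. The `hybrid_reward` function, its equal default weighting, and the assumption that both scores lie in [0, 1] are hypothetical and are not taken from the paper, which may combine its textual and visual rewards differently.

```python
import math
from typing import Sequence


def negative_log_likelihood(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the reference tokens.

    token_probs holds the probability the model assigned to each
    reference token; SFT minimizes this quantity.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def hybrid_reward(text_score: float, visual_score: float,
                  text_weight: float = 0.5) -> float:
    """Hypothetical blend of a textual and a visual score.

    Assumption: both scores are already normalized to [0, 1] and are
    combined by a simple weighted average.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be in [0, 1]")
    return text_weight * text_score + (1.0 - text_weight) * visual_score
```

The contrast is that the first objective only scores how closely the generated tokens match the reference code, while a reward of the second kind can also credit output whose rendered chart looks right.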
To break the plateau, the authors implement a two-stage Reinforcement Learning (RL) process [11].
Increasing the data from 2M to 2.8M samples yields no further performance gains, confirming the plateau [22].
Multimodal Structured Reinforcement Learning (MSRL) :
During SFT, the model is tested on subsets ranging from 200k to 2.8 million samples.