Vision language models (VLMs) unify the modeling of images and text, enabling them to tackle complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, with mathematical reasoning serving as a prominent example: it probes the high-level capability of VLMs to comprehend mathematical information embedded in images and to carry out sophisticated reasoning processes. Recently, numerous visual mathematical reasoning benchmarks have been proposed to evaluate this capability. However, these benchmarks suffer from several limitations: they are typically restricted to geometry problems, lack comprehensive evaluation of math word problems, and rarely assess the ability to reason across multiple images. To fill this gap, we introduce GSM8K-V, a purely visual, multi-image mathematical reasoning benchmark. GSM8K-V is constructed by systematically mapping each sample from the widely used text-based mathematical reasoning benchmark GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate a benchmark comprising 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Our results reveal that, although existing VLMs achieve nearly saturated performance on the text-based GSM8K, there remains substantial room for improvement on the purely visual GSM8K-V. For instance, the best-performing model, Gemini-2.5-Pro, attains 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a detailed analysis of GSM8K-V, systematically examining the limitations of existing models on this benchmark as well as potential directions for improvement.
GSM8K-V provides a new perspective on visual mathematical reasoning and establishes a novel evaluation benchmark that can guide the research community toward developing more robust and generalizable VLMs.
An overview of our pipeline that converts the text-based mathematical reasoning dataset GSM8K into its purely visual version, GSM8K-V.
We evaluate a broad range of open- and closed-source vision-language models on GSM8K-V, comparing their performance against the text-only GSM8K baseline to quantify the modality gap.
Performance comparison between text-only GSM8K and visual GSM8K-V across different model architectures, highlighting the substantial modality gap in mathematical reasoning capabilities.
@misc{yuan2025gsm8kvvisionlanguagemodels,
  title={GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts},
  author={Fan Yuan and Yuchen Yan and Yifan Jiang and Haoran Zhao and Tao Feng and Jinyan Chen and Yanwei Lou and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
  year={2025},
  eprint={2509.25160},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.25160},
}