
GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts?

Fan Yuan1,*, Yuchen Yan1,*, Yifan Jiang1, Haoran Zhao1, Tao Feng1, Jinyan Chen1,
Yanwei Lou1, Wenqi Zhang1, Yongliang Shen1,†, Weiming Lu1, Jun Xiao1, Yueting Zhuang1

1Zhejiang University

Preprint. Under review.
*Equal contribution, †Corresponding author
Code · arXiv · 🤗 Dataset

GSM8K-V systematically maps each GSM8K math word problem into its visual counterpart for clean cross-modal evaluation.

Abstract

Vision language models (VLMs) unify the modeling of images and text, enabling them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these capabilities, mathematical reasoning is particularly representative: it requires VLMs to comprehend mathematical information embedded in images and to carry out sophisticated multi-step reasoning. Numerous visual mathematical reasoning benchmarks have recently been proposed to evaluate this ability, but they suffer from several limitations: they are typically restricted to geometry problems, lack comprehensive evaluation of math word problems, and rarely assess reasoning across multiple images. To fill this gap, we introduce GSM8K-V, a purely visual, multi-image mathematical reasoning benchmark. GSM8K-V is constructed by systematically mapping each sample from the widely used text-based mathematical reasoning benchmark GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate a benchmark of 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Our results reveal that, although existing VLMs achieve nearly saturated performance on the text-based GSM8K, substantial room for improvement remains on the purely visual GSM8K-V: the best-performing model, Gemini-2.5-Pro, attains 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We further conduct a comprehensive and detailed analysis of GSM8K-V, systematically examining the limitations of existing models on this benchmark as well as potential directions for improvement. GSM8K-V provides a new perspective on visual mathematical reasoning and establishes a novel evaluation benchmark that can guide the research community toward developing more robust and generalizable VLMs.

Benchmark Construction


An overview of our pipeline that converts the text-based mathematical reasoning dataset GSM8K into its purely visual version, GSM8K-V.

Benchmark Examples

Overall Performance

We evaluate a broad range of open- and closed-source vision-language models on GSM8K-V, comparing their performance against the text-only GSM8K baseline to quantify the modality gap.


Performance comparison between text-only GSM8K and visual GSM8K-V across different model architectures, highlighting the substantial modality gap in mathematical reasoning capabilities.
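
To make the evaluation protocol concrete, below is a minimal sketch of how such an accuracy measurement could be run. This is not the authors' harness: the dataset ID "your-org/GSM8K-V", the split name, the field names "images" and "answer", and the query_vlm helper are all illustrative assumptions; substitute the identifiers from the 🤗 Dataset release and your own VLM client.

# Minimal sketch of an accuracy evaluation on GSM8K-V (assumptions noted above).
import re
from datasets import load_dataset

def query_vlm(images, prompt):
    """Hypothetical stand-in for a VLM call: send the problem images plus an
    instruction and return the model's free-form answer text."""
    raise NotImplementedError("plug in your VLM client here")

def extract_number(text):
    """Take the last number in the response as the model's final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

# Assumed dataset ID and split; check the 🤗 Dataset page for the real ones.
dataset = load_dataset("your-org/GSM8K-V", split="test")

correct = 0
for sample in dataset:
    response = query_vlm(
        sample["images"],  # one or more images encoding the word problem
        "Solve the math problem shown in the images. "
        "End your response with the final numeric answer.",
    )
    prediction = extract_number(response)
    if prediction is not None and abs(prediction - float(sample["answer"])) < 1e-6:
        correct += 1

print(f"Accuracy: {correct / len(dataset):.2%}")

Because GSM8K answers are numeric, exact-match against the parsed final number suffices; the tolerance guards only against float formatting differences.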

BibTeX

@misc{yuan2025gsm8kvvisionlanguagemodels,
      title={GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts?},
      author={Fan Yuan and Yuchen Yan and Yifan Jiang and Haoran Zhao and Tao Feng and Jinyan Chen and Yanwei Lou and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
      year={2025},
      eprint={2509.25160},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25160},
}