Vision language models (VLMs) unify the modeling of images and text, enabling them to tackle complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, with mathematical reasoning serving as a prominent example: it probes the high-level capability of VLMs to comprehend mathematical information embedded in images and to carry out sophisticated reasoning processes. Recently, numerous visual mathematical reasoning benchmarks have been proposed to evaluate this capability. However, these benchmarks suffer from several limitations: they are typically restricted to geometry problems, lack comprehensive evaluation of math word problems, and rarely assess the ability to reason across multiple images. To fill this gap, we introduce GSM8K-V, a purely visual, multi-image mathematical reasoning benchmark. GSM8K-V is constructed by systematically mapping each sample from the widely used text-based mathematical reasoning benchmark GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate a benchmark comprising 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Our results reveal that, although existing VLMs achieve nearly saturated performance on the text-based GSM8K, there remains substantial room for improvement on the purely visual GSM8K-V. For instance, the best-performing model, Gemini-2.5-Pro, attains 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a detailed analysis of GSM8K-V, systematically examining the limitations of existing models on this benchmark as well as potential directions for improvement.
GSM8K-V provides a new perspective on visual mathematical reasoning and establishes a novel evaluation benchmark that can guide the research community toward developing more robust and generalizable VLMs.
An overview of our pipeline that converts the text-based mathematical reasoning dataset GSM8K into its purely visual version, GSM8K-V.
We evaluate a broad range of open- and closed-source vision-language models on GSM8K-V, comparing their performance against the text-only GSM8K baseline to quantify the modality gap.
Performance comparison between text-only GSM8K and visual GSM8K-V across different model architectures, highlighting the substantial modality gap in mathematical reasoning capabilities.
@misc{yuan2025gsm8kvvisionlanguagemodels,
  title={GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts},
  author={Fan Yuan and Yuchen Yan and Yifan Jiang and Haoran Zhao and Tao Feng and Jinyan Chen and Yanwei Lou and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
  year={2025},
  eprint={2509.25160},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.25160},
}