SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

Hongxing Li1*, Dingming Li1*, Zixuan Wang1, Yuchen Yan1, Hang Wu1, Wenqi Zhang1, Yongliang Shen1†, Weiming Lu1, Jun Xiao1, Yueting Zhuang1
1Zhejiang University
Preprint. Under review.
*Equal Contribution, †Corresponding Author

Abstract

Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single-image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with a 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with a 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.

SpatialLadder-26k


SpatialLadder-26k is a large-scale, high-quality dataset designed to advance spatial reasoning in multimodal models through a progressive learning curriculum. It contains 26,610 samples spanning four hierarchical task categories — object localization (5,929 samples), single-image spatial reasoning (5,929 samples), multi-view spatial reasoning (5,752 samples), and video spatial reasoning (9,000 samples) — enabling systematic skill development from low-level perception to high-level spatiotemporal understanding. The dataset covers three modalities and seven reasoning dimensions, including relative direction, relative and absolute distance, object size, counting, room size, and appearance order. Constructed through a standardized three-stage pipeline, SpatialLadder-26k integrates 3D scene reconstructions from ScanNet and video sequences from SR-91k, followed by precise 3D–2D transformation, data unification, and automatic question–answer generation adapted from VSI-Bench templates. This design ensures diverse, well-structured, and high-fidelity data for training and evaluating multimodal large language models on comprehensive spatial understanding tasks.
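To make the template-based question-answer generation step concrete, the following is a minimal sketch in Python. The record schema, field names (scene_id, task, images, etc.), and template wording are illustrative assumptions for this page and do not reflect the released file format or the exact VSI-Bench-style templates.

```python
from dataclasses import dataclass, field

@dataclass
class SpatialSample:
    """Illustrative record for one SpatialLadder-26k-style sample (schema assumed)."""
    scene_id: str                                # source scene, e.g. a ScanNet reconstruction
    task: str                                    # "localization" | "single_image" | "multi_view" | "video"
    dimension: str                               # one of the seven reasoning dimensions
    images: list = field(default_factory=list)   # frame paths obtained after 3D-to-2D projection
    question: str = ""
    answer: str = ""

# Question templates keyed by reasoning dimension; wording is illustrative,
# in the spirit of the VSI-Bench-adapted templates mentioned above.
TEMPLATES = {
    "relative_distance": "Which object is closer to the {anchor}: the {a} or the {b}?",
    "absolute_distance": "What is the distance in meters between the {a} and the {b}?",
    "object_size": "What is the longest dimension of the {a} in centimeters?",
}

def generate_sample(scene_id, dimension, objects, frames, answer):
    """Fill a template with object names extracted from the annotated 3D scene."""
    question = TEMPLATES[dimension].format(
        anchor=objects.get("anchor", ""), a=objects["a"], b=objects.get("b", ""))
    return SpatialSample(scene_id=scene_id, task="single_image", dimension=dimension,
                         images=frames, question=question, answer=answer)

# Example: a relative-distance question grounded in one projected frame.
sample = generate_sample("scene0001_00", "relative_distance",
                         {"anchor": "door", "a": "sofa", "b": "lamp"},
                         ["frames/scene0001_00/000123.jpg"], "sofa")
print(sample.question)
```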

Progressive Three-stage Training Framework


Building upon SpatialLadder-26k, our framework systematically cultivates spatial intelligence through three progressive stages, mirroring the cognitive hierarchy from perception to reasoning. Stage 1 (Spatial Perception) establishes strong visual grounding by fine-tuning on object localization tasks, enabling precise mapping between language and spatially referenced visual regions. Stage 2 (Spatial Understanding) expands comprehension across seven spatial dimensions — including direction, distance, size, and order — through multimodal fine-tuning on single-image, multi-view, and video tasks, fostering robust and generalizable spatial representations. Stage 3 (Spatial Reasoning) further enhances reasoning capability using Group Relative Policy Optimization (GRPO)-based reinforcement learning with chain-of-thought (CoT) generation, guided by rewards for structured reasoning and accurate answers. Together, these stages form a unified, curriculum-inspired training process that progressively transforms perception into structured spatial reasoning in multimodal large language models.
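As a rough illustration of the Stage 3 verifiable rewards, the sketch below scores a sampled response for format (well-formed reasoning and answer tags) and accuracy, then computes group-relative advantages in the GRPO style. The <think>/<answer> tag convention, reward weights, and numeric tolerance are assumptions made for this sketch, not the paper's exact specification.

```python
import re
import numpy as np

# Assumed response format: <think>...</think><answer>...</answer>
ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response contains well-formed reasoning and answer tags, else 0.0."""
    return 1.0 if ANSWER_RE.search(response) else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer matches the gold answer.

    Numerical answers are accepted within a 10% relative tolerance and
    multiple-choice answers by case-insensitive string match (both assumptions).
    """
    m = ANSWER_RE.search(response)
    if not m:
        return 0.0
    pred = m.group(1).strip()
    try:
        return 1.0 if abs(float(pred) - float(gold)) <= 0.1 * abs(float(gold)) else 0.0
    except ValueError:
        return 1.0 if pred.lower() == gold.strip().lower() else 0.0

def grpo_advantages(responses, gold, w_fmt=0.5, w_acc=1.0):
    """Group-relative advantages: total rewards standardized within the sampled group."""
    rewards = np.array([w_fmt * format_reward(r) + w_acc * accuracy_reward(r, gold)
                        for r in responses])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```

In GRPO, these per-response advantages replace a learned value baseline: each group of sampled completions for the same question is scored against its own mean, so responses that are both well-formatted and correct receive positive advantage relative to their peers.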

In-domain Performance


SpatialLadder achieves state-of-the-art performance across multiple spatial reasoning benchmarks, reaching an overall accuracy of 62.3% and outperforming all baseline and proprietary models. The gains are especially pronounced on our proposed benchmarks: 70.2% on SPBench-SI and 70.9% on SPBench-MV, improvements of +29.9% and +34.3% over the base model, respectively. Even without specialized 3D encoders, SpatialLadder attains 45.7% accuracy on VSI-Bench (+16.3% over the base model), comparable to the 47.3% achieved by Spatial-MLLM with 3D encoders. These results indicate that progressive training can substitute for architectural complexity, yielding robust spatial understanding that generalizes across both numerical (NQ) and multiple-choice (MCQ) question formats.

Generalization Analysis


SpatialLadder demonstrates strong out-of-domain generalization with an overall accuracy of 50.8%, surpassing GPT-4o (48.1%) and maintaining a 7.2% improvement over the base model. The performance gains are consistent across diverse evaluation settings — CV-Bench for classical vision tasks, SPAR-Bench for multi-level reasoning, and ViewSpatial-Bench for perspective-dependent understanding. Notably, a 16.5% improvement on person-perspective tasks in ViewSpatial-Bench highlights the model’s ability to learn robust, transferable spatial representations that generalize effectively to novel viewpoints and unseen scenarios.

BibTeX

@misc{li2025spatialladderprogressivetrainingspatial,
      title={SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models}, 
      author={Hongxing Li and Dingming Li and Zixuan Wang and Yuchen Yan and Hang Wu and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
      year={2025},
      eprint={2510.08531},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.08531}, 
}