SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

Hongxing Li1*, Dingming Li1*, Zixuan Wang1, Yuchen Yan1, Hang Wu1, Wenqi Zhang1, Yongliang Shen1†, Weiming Lu1, Jun Xiao1, Yueting Zhuang1
1Zhejiang University
Preprint. Under review.
*Equal Contribution, †Corresponding Author

Abstract

Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single-image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with a 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with a 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.

SpatialLadder-26k


SpatialLadder-26k is a large-scale, high-quality dataset designed to advance spatial reasoning in multimodal models through a progressive learning curriculum. It contains 26,610 samples spanning four hierarchical task categories — object localization (5,929 samples), single-image spatial reasoning (5,929 samples), multi-view spatial reasoning (5,752 samples), and video spatial reasoning (9,000 samples) — enabling systematic skill development from low-level perception to high-level spatiotemporal understanding. The dataset covers three modalities and seven reasoning dimensions, including relative direction, relative and absolute distance, object size, counting, room size, and appearance order. Constructed through a standardized three-stage pipeline, SpatialLadder-26k integrates 3D scene reconstructions from ScanNet and video sequences from SR-91k, followed by precise 3D–2D transformation, data unification, and automatic question–answer generation adapted from VSI-Bench templates. This design ensures diverse, well-structured, and high-fidelity data for training and evaluating multimodal large language models on comprehensive spatial understanding tasks.
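To make the template-based question-answer generation step concrete, the following is a minimal sketch in Python. The record schema, field names (scene_id, task, images, etc.), and template wording are illustrative assumptions for this page and do not reflect the released file format or the exact VSI-Bench-style templates.

```python
from dataclasses import dataclass, field

@dataclass
class SpatialSample:
    """Illustrative record for one SpatialLadder-26k-style sample (schema assumed)."""
    scene_id: str                                # source scene, e.g. a ScanNet reconstruction
    task: str                                    # "localization" | "single_image" | "multi_view" | "video"
    dimension: str                               # one of the seven reasoning dimensions
    images: list = field(default_factory=list)   # frame paths obtained after 3D-to-2D projection
    question: str = ""
    answer: str = ""

# Question templates keyed by reasoning dimension; wording is illustrative,
# in the spirit of the VSI-Bench-adapted templates mentioned above.
TEMPLATES = {
    "relative_distance": "Which object is closer to the {anchor}: the {a} or the {b}?",
    "absolute_distance": "What is the distance in meters between the {a} and the {b}?",
    "object_size": "What is the longest dimension of the {a} in centimeters?",
}

def generate_sample(scene_id, dimension, objects, frames, answer):
    """Fill a template with object names extracted from the annotated 3D scene."""
    question = TEMPLATES[dimension].format(
        anchor=objects.get("anchor", ""), a=objects["a"], b=objects.get("b", ""))
    return SpatialSample(scene_id=scene_id, task="single_image", dimension=dimension,
                         images=frames, question=question, answer=answer)

# Example: a relative-distance question grounded in one projected frame.
sample = generate_sample("scene0001_00", "relative_distance",
                         {"anchor": "door", "a": "sofa", "b": "lamp"},
                         ["frames/scene0001_00/000123.jpg"], "sofa")
print(sample.question)
```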

Progressive Three-stage Training Framework


Building upon SpatialLadder-26k, our framework systematically cultivates spatial intelligence through three progressive stages, mirroring the cognitive hierarchy from perception to reasoning. Stage 1 (Spatial Perception) establishes strong visual grounding by fine-tuning on object localization tasks, enabling precise mapping between language and spatially referenced visual regions. Stage 2 (Spatial Understanding) expands comprehension across seven spatial dimensions — including direction, distance, size, and order — through multimodal fine-tuning on single-image, multi-view, and video tasks, fostering robust and generalizable spatial representations. Stage 3 (Spatial Reasoning) further enhances reasoning capability using Group Relative Policy Optimization (GRPO)-based reinforcement learning with chain-of-thought (CoT) generation, guided by rewards for structured reasoning and accurate answers. Together, these stages form a unified, curriculum-inspired training process that progressively transforms perception into structured spatial reasoning in multimodal large language models.
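As a rough illustration of the Stage 3 verifiable rewards, the sketch below scores a sampled response for format (well-formed reasoning and answer tags) and accuracy, then computes group-relative advantages in the GRPO style. The <think>/<answer> tag convention, reward weights, and numeric tolerance are assumptions made for this sketch, not the paper's exact specification.

```python
import re
import numpy as np

# Assumed response format: <think>...</think><answer>...</answer>
ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response contains well-formed reasoning and answer tags, else 0.0."""
    return 1.0 if ANSWER_RE.search(response) else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer matches the gold answer.

    Numerical answers are accepted within a 10% relative tolerance and
    multiple-choice answers by case-insensitive string match (both assumptions).
    """
    m = ANSWER_RE.search(response)
    if not m:
        return 0.0
    pred = m.group(1).strip()
    try:
        return 1.0 if abs(float(pred) - float(gold)) <= 0.1 * abs(float(gold)) else 0.0
    except ValueError:
        return 1.0 if pred.lower() == gold.strip().lower() else 0.0

def grpo_advantages(responses, gold, w_fmt=0.5, w_acc=1.0):
    """Group-relative advantages: total rewards standardized within the sampled group."""
    rewards = np.array([w_fmt * format_reward(r) + w_acc * accuracy_reward(r, gold)
                        for r in responses])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```

In GRPO, these per-response advantages replace a learned value baseline: each group of sampled completions for the same question is scored against its own mean, so responses that are both well-formatted and correct receive positive advantage relative to their peers.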

In-domain Performance


SpatialLadder achieves state-of-the-art performance across multiple spatial reasoning benchmarks, reaching an overall accuracy of 62.3% and outperforming all baseline and proprietary models. The gains are especially pronounced on our proposed benchmarks: 70.2% on SPBench-SI and 70.9% on SPBench-MV, improvements of +29.9% and +34.3% over the base model, respectively. Even without specialized 3D encoders, SpatialLadder attains 45.7% accuracy on VSI-Bench (+16.3% over the base model), comparable to the 47.3% achieved by Spatial-MLLM with 3D encoders. These results indicate that progressive training can substitute for architectural complexity, yielding robust spatial understanding that generalizes across both numerical (NQ) and multiple-choice (MCQ) question formats.

Generalization Analysis


SpatialLadder demonstrates strong out-of-domain generalization with an overall accuracy of 50.8%, surpassing GPT-4o (48.1%) and maintaining a 7.2% improvement over the base model. The performance gains are consistent across diverse evaluation settings — CV-Bench for classical vision tasks, SPAR-Bench for multi-level reasoning, and ViewSpatial-Bench for perspective-dependent understanding. Notably, a 16.5% improvement on person-perspective tasks in ViewSpatial-Bench highlights the model’s ability to learn robust, transferable spatial representations that generalize effectively to novel viewpoints and unseen scenarios.

BibTeX

@misc{li2025spatialladderprogressivetrainingspatial,
      title={SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models}, 
      author={Hongxing Li and Dingming Li and Zixuan Wang and Yuchen Yan and Hang Wu and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
      year={2025},
      eprint={2510.08531},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.08531}, 
}