SpatialLadder-26k
SpatialLadder-26k is a large-scale, high-quality dataset designed to advance spatial reasoning in multimodal models through a progressive learning curriculum. It contains 26,610 samples spanning four hierarchical task categories: object localization (5,929 samples), single-image spatial reasoning (5,929 samples), multi-view spatial reasoning (5,752 samples), and video spatial reasoning (9,000 samples). Together, these categories enable systematic skill development from low-level perception to high-level spatiotemporal understanding.

The dataset covers three visual modalities (single image, multi-view images, and video) and seven reasoning dimensions: relative direction, relative distance, absolute distance, object size, counting, room size, and appearance order.

SpatialLadder-26k is constructed through a standardized three-stage pipeline: it integrates 3D scene reconstructions from ScanNet and video sequences from SR-91k, applies precise 3D-to-2D transformation and data unification, and automatically generates question–answer pairs from templates adapted from VSI-Bench. This design ensures diverse, well-structured, and high-fidelity data for training and evaluating multimodal large language models on comprehensive spatial understanding tasks.
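To illustrate the 3D-to-2D transformation step, the sketch below projects a 3D world-space point from a reconstructed scene into pixel coordinates under a standard pinhole camera model. This is a minimal sketch of the kind of projection such a pipeline relies on; the names (`project_to_pixel`, `world_to_cam`, `K`) and the toy intrinsics are illustrative assumptions, not the dataset's actual code.

```python
# Hedged sketch: pinhole projection of a 3D annotation into image space.
# Matrix and function names are illustrative, not SpatialLadder's pipeline.
import numpy as np

def project_to_pixel(point_world: np.ndarray,
                     world_to_cam: np.ndarray,
                     K: np.ndarray) -> tuple[float, float]:
    """Project a 3D world-space point into pixel coordinates.

    point_world:  (3,) XYZ point in world coordinates (meters).
    world_to_cam: (4, 4) extrinsic matrix mapping world -> camera frame.
    K:            (3, 3) camera intrinsic matrix.
    """
    # Lift to homogeneous coordinates and move into the camera frame.
    p_cam = world_to_cam @ np.append(point_world, 1.0)
    if p_cam[2] <= 0:
        raise ValueError("Point is behind the camera.")
    # Apply the intrinsics, then divide by depth (perspective division).
    uvw = K @ p_cam[:3]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# Toy intrinsics: fx = fy = 500, principal point at (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
u, v = project_to_pixel(np.array([0.5, 0.2, 2.0]), np.eye(4), K)
print(f"pixel: ({u:.1f}, {v:.1f})")  # pixel: (445.0, 290.0)
```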
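Similarly, the automatic question–answer generation stage can be pictured as filling templates with object annotations and deriving the answer from 3D geometry. The sketch below shows one such pattern for the absolute-distance dimension; the template wording, annotation fields, and `make_abs_distance_qa` helper are hypothetical placeholders in the spirit of VSI-Bench-style templates, not the dataset's actual schema.

```python
# Hedged sketch: template-based QA generation from 3D object annotations.
# Template text and field names are assumptions, not SpatialLadder's schema.
import numpy as np

ABS_DISTANCE_TEMPLATE = (
    "What is the distance between the {obj_a} and the {obj_b} in meters?"
)

def make_abs_distance_qa(obj_a: dict, obj_b: dict) -> dict:
    """Build one absolute-distance QA pair from two annotated objects.

    Each object dict is assumed to carry a 'label' and a 3D 'center'
    (meters); the answer here is center-to-center distance for simplicity.
    """
    dist = float(np.linalg.norm(np.asarray(obj_a["center"]) -
                                np.asarray(obj_b["center"])))
    return {
        "question": ABS_DISTANCE_TEMPLATE.format(obj_a=obj_a["label"],
                                                 obj_b=obj_b["label"]),
        "answer": round(dist, 2),
        "dimension": "absolute_distance",
    }

qa = make_abs_distance_qa(
    {"label": "sofa", "center": [1.0, 0.5, 0.0]},
    {"label": "lamp", "center": [3.0, 0.5, 0.0]},
)
print(qa["question"])  # What is the distance between the sofa and the lamp...
print(qa["answer"])    # 2.0
```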