Large language models (LLMs) have achieved remarkable progress on mathematical tasks through Chain-of-Thought (CoT) reasoning. However, existing mathematical CoT datasets often suffer from Thought Leaps caused by experts omitting intermediate steps, which negatively impacts model learning and generalization. We propose the CoT Thought Leap Bridge Task, which aims to automatically detect leaps and generate the missing intermediate reasoning steps to restore the completeness and coherence of CoT. To facilitate this, we construct a specialized training dataset, ScaleQM+, based on the structured ScaleQuestMath dataset, and train CoT-Bridge to bridge thought leaps. Through comprehensive experiments on mathematical reasoning benchmarks, we demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on the original datasets, with improvements of up to +5.87% on NuminaMath. Our approach effectively enhances distilled data (+3.02%) and provides better starting points for reinforcement learning (+3.1%), functioning as a plug-and-play module compatible with existing optimization techniques. Furthermore, CoT-Bridge demonstrates improved generalization to out-of-domain logical reasoning tasks, confirming that enhancing reasoning completeness yields broadly applicable benefits.
To address Thought Leaps in Chain-of-Thought reasoning, we first construct the ScaleQM+ training data by removing intermediate steps from complete reasoning chains. The CoT-Bridge model, trained on this data, then identifies the resulting gaps and generates the missing intermediate steps. This process repairs incomplete reasoning chains, thereby improving the quality of CoT data and the model's learning efficiency.
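For concreteness, here is a minimal Python sketch of how leap-style training examples could be derived from complete reasoning chains; the `make_leap_example` helper, the random interior-step removal, and the example chain are illustrative assumptions, not the exact ScaleQM+ construction procedure.

```python
import random

def make_leap_example(steps, n_remove=1, seed=0):
    """Turn a complete reasoning chain into a Thought-Leap training pair.

    `steps` is an ordered list of reasoning steps. We drop `n_remove`
    interior steps (never the first or last), so the dropped steps become
    the bridging targets the model must learn to generate.
    Illustrative sketch only; the actual ScaleQM+ recipe may differ.
    """
    rng = random.Random(seed)
    interior = list(range(1, len(steps) - 1))
    removed = sorted(rng.sample(interior, min(n_remove, len(interior))))
    leap_chain = [s for i, s in enumerate(steps) if i not in removed]
    missing = [steps[i] for i in removed]
    return {"input_chain": leap_chain, "missing_steps": missing}

# Example: a four-step chain with one interior step removed.
chain = [
    "Step 0: Let x be the unknown number.",
    "Step 1: Translate the problem into the equation 2x + 3 = 11.",
    "Step 2: Subtract 3 from both sides to get 2x = 8.",
    "Step 3: Divide by 2, so x = 4.",
]
print(make_leap_example(chain, n_remove=1))
```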
Illustration of our work. The left panel shows data construction for training, where we strategically remove intermediate steps (e.g., between Step 0 and Step 1, or Step 2 and Step 3) from complete reasoning chains in ScaleQuestMath to create ScaleQM+ with Thought Leaps. The right panel demonstrates inference, where CoT-Bridge identifies gaps and generates appropriate intermediate steps to restore coherence in reasoning.
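Below is a hedged sketch of how a bridging model might be invoked at inference time to fill detected gaps, using the Hugging Face transformers API; the checkpoint path, the prompt wording, and the `generate_bridge` wrapper are assumptions for illustration and do not reflect the released CoT-Bridge interface.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint path; substitute the actual CoT-Bridge weights.
MODEL_NAME = "path/to/cot-bridge-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def generate_bridge(question, leap_chain, max_new_tokens=256):
    """Ask the bridging model to complete an incomplete reasoning chain.

    The prompt template is illustrative; the real CoT-Bridge input
    format may differ.
    """
    prompt = (
        f"Question: {question}\n"
        "The following solution may skip intermediate steps. "
        "Rewrite it with all missing steps filled in.\n"
        + "\n".join(leap_chain)
        + "\nCompleted solution:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Return only the newly generated tokens (the bridged solution).
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```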
To evaluate the effectiveness of CoT-Bridge, we conducted Supervised Fine-Tuning (SFT) experiments on prominent LLMs (Llama3.1-8B, Qwen2.5-Math-1.5B) using datasets including MetaMathQA and NuminaMath. Models trained with data enhanced by CoT-Bridge demonstrated significant performance improvements across various mathematical benchmarks (e.g., up to +5.87% on NuminaMath). This validates the efficacy of bridging thought leaps in enhancing reasoning capabilities.
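As a rough illustration of the fine-tuning stage, the sketch below runs standard causal-LM SFT on bridged examples with Hugging Face transformers and PyTorch; the field names (`problem`, `bridged_solution`), the tiny inline dataset, and the hyperparameters are placeholders rather than the paper's actual training configuration.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Math-1.5B"  # one of the base models used in the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy stand-in for a bridged dataset: each example pairs a problem with its
# leap-free (bridged) solution. Real training uses bridged MetaMathQA/NuminaMath.
bridged_dataset = [
    {"problem": "Solve 2x + 3 = 11.",
     "bridged_solution": "Subtract 3 from both sides: 2x = 8. Divide by 2: x = 4."},
]

def collate(batch):
    texts = [ex["problem"] + "\n" + ex["bridged_solution"] for ex in batch]
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=1024, return_tensors="pt")
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(bridged_dataset, batch_size=1, shuffle=True, collate_fn=collate)

model.train()
for batch in loader:
    loss = model(**batch).loss  # causal-LM cross-entropy on bridged text
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```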
Main results (%) on mathematical benchmarks. MATH, GaoKao, Odyssey, and Olympiad correspond to the MATH500, GaoKao2023EN, MathOdyssey, and OlympiadBenchEN benchmarks, respectively. QwenBridger-S and QwenBridger-L represent zero-shot bridging based on Qwen2.5-Instruct-7B and Qwen2.5-Instruct-72B, respectively. CoT-Bridge-R stands for CoT-Bridge-Random.
@misc{xu2025mindgapbridgingthought,
      title={Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning},
      author={Haolei Xu and Yuchen Yan and Yongliang Shen and Wenqi Zhang and Guiyang Hou and Shengpei Jiang and Kaitao Song and Weiming Lu and Jun Xiao and Yueting Zhuang},
      year={2025},
      eprint={2505.14684},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.14684},
}