Training Large Language Models To Reason In Parallel With Global Forking Tokens

University of Toronto, Amazon, Vector Institute
ICLR 2026
1. Set Supervised Fine Tuning (SSFT)
Distill diverse reasoning & learn Global Forking Tokens simultaneously
2. Global Forking Policy Optimization (GFPO)
RL on only Global Forking Tokens to optimize reasoning mode selection

SSFT learns emergent global forking tokens (<think1> ... <think6>) that trigger distinct reasoning modes. GFPO leverages these steerable tokens to incentivize complex reasoning, outperforming SFT+GRPO on math and code benchmarks.

SSFT computes set NTP loss via bipartite matching

Abstract

Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using bipartite matching between global forking tokens and unique reasoning traces. We observe that whereas naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Global Forking Policy Optimization (GFPO) leverages these maximally steerable tokens to incentivize complex reasoning, and the resulting models consistently outperform their SFT counterparts with GRPO on both math reasoning and execution-based code generation benchmarks.

Motivation for Principled SFT for Parallel Reasoning

Why care about parallel reasoning?

  1. It reveals reasoning capability boundaries, via metrics such as pass@k.
  2. Test-time compute such as self-consistency relies on it.
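The pass@k metric mentioned above is commonly computed with the unbiased estimator from the HumanEval/Codex evaluation; a minimal stdlib-only sketch (function name and signature are our own, not from this paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 10 generations of which c = 2 are correct, pass@1 is the plain accuracy 0.2, while pass@5 is substantially higher, which is why pass@k probes the boundary of reasoning capability.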

Goal. Since the reasoning capacity of RLVR models has been shown to be bounded by that of the base model (Yue et al., 2025), we aim to expand it through SFT. SFT with diverse traces aligns with how reasoning capacity is measured: via parallel generations at inference.

Multi-teacher distillation setup

Limitations of SFT models naively trained with diverse traces.
Mode collapse: parallel generations do not genuinely elicit distinct reasoning modes.

Mode collapse
(a) Mode collapse of parallel generations under metrics such as reasoning effort and accuracy.
Diversity-accuracy trade-off
(b) The recurring trade-off between diversity (pass@k) and accuracy (pass@1) when temperature scaling is used to steer generations at inference (AIME25).

Main Contributions

  • Novel "Global Forking Tokens": a single special token steers the model into a distinct reasoning mode. Our Set-Supervised Fine-Tuning + Global Forking Policy Optimization (SSFT+GFPO) pipeline replaces SFT+RLVR when diverse traces are available.
  • Improved benchmarks on math & coding (SFT+GRPO → SSFT+GFPO). AIME25 pass@1 54.06% → 58.80%. AIME24 pass@1 59.80% → 64.22%. LiveCodeBench-v5 pass@1 47.13% → 52.07%.
  • Prevents reasoning-diversity collapse, with gains in pass@k, diversity of reasoning effort and accuracy, qualitatively distinct reasoning strategies, and matching convergence.

Set Loss with Set-of-Next-Token-Prediction

We introduce global forking tokens g := {g(1), ..., g(N)} instantiated as <think1> ... <think6> tags. For a prompt x, reasoning traces R := {r(1), ..., r(M)} are matched to forking tokens via a bipartite matching map σ:{1,...,M} → {1,...,N}, where σ(j)=i means trace r(j) is paired with token g(i).

Loss: sum of NTP losses where each NTP loss is that of a trace r(j) conditioned on its matched g(i).
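In symbols (using the notation above, with $\mathcal{L}_{\text{NTP}}$ denoting the per-trace next-token-prediction loss and $\theta$ the model parameters), the set loss can be written as:

```latex
\mathcal{L}_{\text{SSFT}}(\theta)
  \;=\; \min_{\sigma}\; \sum_{j=1}^{M}
  \mathcal{L}_{\text{NTP}}\!\left(r^{(j)} \,\middle|\, x,\; g^{(\sigma(j))};\, \theta\right)
```

Minimizing over the matching $\sigma$ makes the loss invariant to the order in which traces appear in the training data.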

Set prediction: SFT vs SSFT comparison

Key insight: Standard SFT uses random or fixed matching, leading to order-dependent loss. SSFT uses min-cost bipartite matching to optimally assign traces to think tags, producing an order-invariant set loss that naturally preserves reasoning diversity.

Set-Supervised Fine Tuning (SSFT)

Step 1: Before each optimization step, solve the linear assignment problem that minimizes the total NTP loss, where each term is the NTP loss of a trace conditioned on its candidate forking token. The minimizer is the optimal bipartite matching σ.

Step 2: Optimize the sum of NTP losses under the optimal matching with respect to the model parameters.

SSFT algorithm: bipartite matching and parameter update

The matching step (Step 1) finds which <think> tag best fits each reasoning trace. The parameter update step (Step 2) trains the model so each tag produces its matched reasoning mode. Over training, think tags converge to emergent specializations.
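The matching step can be sketched with a brute-force search over assignments, which is feasible here since N = 6 and M = 4. The loss matrix below is illustrative, not from the paper; a real implementation would compute NTP losses under the model and use the Hungarian algorithm (e.g., `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations

def min_cost_matching(cost):
    """Step 1 of SSFT: find the assignment sigma minimizing total NTP loss.
    cost[j][i] = NTP loss of trace r(j) conditioned on fork token g(i).
    Returns (sigma, total) with sigma[j] = index of the token matched to r(j)."""
    M, N = len(cost), len(cost[0])
    best_sigma, best_total = None, float("inf")
    for perm in permutations(range(N), M):  # injective maps {1..M} -> {1..N}
        total = sum(cost[j][perm[j]] for j in range(M))
        if total < best_total:
            best_sigma, best_total = perm, total
    return best_sigma, best_total

# Illustrative 3x3 loss matrix: each trace is cheapest under a distinct tag.
cost = [
    [0.9, 0.2, 0.8],   # r(1) fits <think2> best
    [0.3, 0.7, 0.9],   # r(2) fits <think1> best
    [0.8, 0.9, 0.1],   # r(3) fits <think3> best
]
sigma, set_loss = min_cost_matching(cost)
# Step 2 would then backpropagate only the NTP losses selected by sigma.
```

Because the matching is recomputed before every step, a tag keeps being paired with the trace style it already fits best, which is what drives the tags toward emergent specializations.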

Global Forking Policy Optimization (GFPO)

Apply policy gradients only at the global forking token to optimize reasoning mode selection per question. Since <think i> tags globally steer model behavior, optimizing mid-generation tokens is unnecessary.
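A minimal REINFORCE-style sketch of this idea, assuming a per-question categorical distribution over the N fork tokens and a per-token reward (e.g., mean correctness of rollouts started with that token). This illustrates updating only the fork-token logits; it is not the paper's exact objective:

```python
import math

def gfpo_step(logits, rewards, lr=0.1):
    """One policy-gradient update on fork-token logits only.
    logits:  N scores for <think1>..<thinkN> given a question.
    rewards: expected reward of rollouts started with each token.
    Mid-generation tokens receive no gradient at all."""
    Z = sum(math.exp(l) for l in logits)
    probs = [math.exp(l) / Z for l in logits]
    baseline = sum(p * r for p, r in zip(probs, rewards))  # value baseline
    # d E[r] / d logit_i = p_i * (r_i - baseline) for a softmax policy
    return [l + lr * p * (r - baseline)
            for l, p, r in zip(logits, probs, rewards)]
```

Starting from uniform logits, a token whose rollouts earn above-baseline reward has its logit pushed up, so the model learns which reasoning mode to select per question.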

GFPO: Global Forking Policy Optimization pipeline

Experiment Setup

Teacher models: GPT-OSS-high/medium, DeepSeek-R1, Gemini-Flash, Opus 4.1. Base model: Qwen2.5-32B-Instruct. N = 6 global forking tokens, M = 4 traces, Tr = 1000 tokens for matching.

| Finetuning | SSFT data | RLVR data | Evaluation data |
|---|---|---|---|
| Math model | s1k (1k) | DAPO-Math-17k | AIME24/25, MATH500 |
| Coding model | Open-Thoughts (1k) | Intellect-2-RL | LiveCodeBench-v5 |

Key baselines (no set loss): SFT-mixed-distill-...B, SSFT-...B (random σ)

Benchmarks on Math Reasoning & Code Generation

Math fine-tuned models. "-GRPO" and "-GFPO" denote RLVR-trained models.

Pass@1: Average performance of individual generations (* = out-of-distribution)

| Model | AIME 2024 | AIME 2025 | MATH-500 | GPQA-D | LCB(v5)* |
|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 15.80 | 10.40 | 80.40 | 47.00 | 23.35 |
| SFT-mixed-distill-32B-tags | 58.23 | 51.96 | 88.49 | 59.96 | 32.34 |
| SFT-mixed-distill-32B-tags-GRPO | 58.85 | 52.40 | 88.85 | - | 37.13 |
| SSFT-32B (random σ) | 61.77 | 55.10 | 89.95 | 62.28 | 35.33 |
| SSFT-32B | 64.06 | 58.13 | 90.02 | 60.39 | 38.92 |
| SSFT-32B-GFPO | 64.22 | 58.80 | 89.90 | 62.48 | 42.10 |

Pass@1 of Native Cons@6: Majority voting with 6 parallel generations

| Model | AIME 2024 | AIME 2025 | MATH-500 | GPQA-D | LCB(v5)* |
|---|---|---|---|---|---|
| SFT-mixed-distill-32B-tags | 73.94 | 70.00 | 95.88 | 58.75 | - |
| SSFT-32B (random σ) | 73.03 | 67.58 | 95.67 | 61.87 | - |
| SSFT-32B | 75.45 | 73.94 | 96.47 | 63.05 | - |

Cons@32: Majority voting with a large number of parallel generations

| Model | AIME 2024 | AIME 2025 | MATH-500 | GPQA-D | LCB(v5)* |
|---|---|---|---|---|---|
| SFT-mixed-distill-32B-tags | 76.67 | 76.67 | 96.20 | 58.59 | - |
| SSFT-32B (random σ) | 80.00 | 80.00 | 95.60 | 62.63 | - |
| SSFT-32B | 83.33 | 86.67 | 96.80 | 61.62 | - |
| SSFT-32B-GFPO | 83.33 | 83.33 | 96.80 | 62.12 | - |
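The Cons@k metric in the tables above is majority voting over k parallel generations; a minimal sketch (tie-breaking by first occurrence is an assumption of this sketch, not stated in the paper):

```python
from collections import Counter

def cons_at_k(final_answers):
    """Cons@k: return the most common final answer among k parallel
    generations. Counter.most_common breaks ties by insertion order,
    so ties go to the answer that appeared first."""
    return Counter(final_answers).most_common(1)[0][0]

# e.g., 6 parallel generations, each started from a different <think i> tag
votes = ["204", "204", "196", "204", "118", "196"]
consensus = cons_at_k(votes)
```

Diverse-but-accurate generations help here: correlated errors split the vote less than independent ones, which is why preserving distinct reasoning modes improves Cons@k.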

Code fine-tuned models. Math tasks are out-of-distribution.

Pass@1 comparison (* = out-of-distribution for the code-fine-tuned models)

| Model | LCB(v5) | AIME 2024* | AIME 2025* | MATH-500* |
|---|---|---|---|---|
| SFT-mixed-code | 47.13 | 34.69 | 24.17 | 89.39 |
| SSFT-32B-code (random σ) | 45.36 | 39.06 | 31.56 | 89.46 |
| SSFT-32B-code | 52.07 | 43.23 | 32.82 | 89.96 |

Evaluation on Reasoning Diversity after SSFT

SSFT resolves the mode collapse that SFT suffers when trained on diverse traces, and global forking tokens provide a new steerability mechanism that avoids the diversity-accuracy trade-off of temperature scaling.

Global forking tokens elicit distinct reasoning modes
(a) Global forking tokens from SSFT elicit distinct reasoning modes.
Pass@k improvement with global forking
(b) New mechanism of expanding pass@k reasoning boundaries without sacrificing pass@1 accuracy (AIME25).

Qualitative Evaluation. AIME25 (Q11): Find the sum of intersection y-values of the sawtooth function and parabola. Each <think> tag triggers a qualitatively different strategy.

AIME25 Q11: Six distinct reasoning strategies from global forking tokens

BibTeX

@article{jia2025training,
  title={Training Large Language Models To Reason In Parallel With Global Forking Tokens},
  author={Jia, Sheng and Wang, Xiao and Kasiviswanathan, Shiva Prasad},
  journal={arXiv preprint arXiv:2510.05132},
  year={2025}
}