About The Workshop
The workshop aims to spark discussion around test-time scaling, with a particular, but not exclusive, focus on reasoning. By bringing together researchers working on reasoning, planning, alignment, and agentic systems, SCALR seeks to build a community around this rapidly evolving area.
Where & When
Palais des Congrès
Montreal, Canada
October 10, 2025
Call For Papers
Submissions and Review Process
We invite researchers and practitioners to submit non-archival short papers (up to 4 pages, excluding references and appendix) describing novel ideas, preliminary results, or negative results related to test-time scaling and reasoning models. Paper submissions will be hosted on OpenReview. Submissions should follow the COLM submission guidelines and be anonymized for double-blind review. All submissions must be in PDF format and use the LaTeX style files provided by COLM.
Important Dates
- Submission Deadline (submit on OpenReview): June 25, 2025 (extended from June 23, 2025)
- Accept/Reject Notification: July 24, 2025
Topics of Interest
Topics of interest include, but are not limited to:
- Novel test-time algorithms for reasoning, planning, alignment, or agentic tasks
- Test-time scaling techniques for LLM agents
- Innovations in model training (algorithms, data, architecture) that facilitate more efficient and robust test-time scaling
- Test-time scaling for reasoning tasks with verifiable and non-verifiable rewards
- Novel techniques for training and utilization of outcome and process reward models in test-time scaling scenarios
- Evaluation: benchmarks, simulation environments, evaluation protocols and metrics, and human-in-the-loop evaluation of test-time scaling methods
- Theoretical foundations of test-time scaling
- Test-time scaling techniques for multi-modal reasoning
- Studies on the faithfulness, trustworthiness, and other safety aspects of large reasoning models
- Applications of LLM test-time scaling: healthcare, robotics, embodiment, chemistry, education, databases, and beyond, with particular encouragement for less-studied domains
- Societal implications of LLM test-time scaling: bias, equity, misuse, jobs, climate change, and beyond
- Test-time scaling for everyone: multilinguality, multiculturalism, and inference-time adaptation to new values
Invited Speakers

Aviral Kumar
Carnegie Mellon University

Xuezhi Wang
Google DeepMind

Nathan Lambert
Allen Institute for AI

Lewis Tunstall
Hugging Face

Azalia Mirhoseini
Stanford University

Eric Wallace
OpenAI
Schedule
Time | Session |
---|---|
9:00 | Opening Remarks |
9:15 | Invited Talk – Aviral Kumar |
9:45 | Invited Talk – Xuezhi Wang: Teaching LLMs to Think. Abstract: Large Language Models are powerful, yet they often struggle with tasks requiring deep reasoning. This talk traces how we've taught models to "think"—from prompting and decoding to advanced post-training and RL techniques that amplify reasoning depth—showing that reasoning is a crucial scaling dimension alongside pre-training. |
10:15 | Coffee Break |
10:30 | Poster Session |
12:00 | Invited Talk – Nathan Lambert: Olmo-Thinking: Training a fully open reasoning model. Abstract: We cover what it takes to add reasoning to a new 7B open model (Olmo-Thinking) to rival Qwen 3, highlighting fresh results, trade-offs, and methods across midtraining, distillation with high-quality thinking SFT data, and reinforcement learning with verifiable rewards. |
12:30 | Invited Talk – Azalia Mirhoseini |
13:00 | Lunch Break |
14:30 | Invited Talk – Lewis Tunstall: The State of Open Reasoning Models. Abstract: Since the release of DeepSeek-R1, the open-source AI ecosystem has undergone a remarkable transformation in how it approaches reasoning and inference-time compute. This progress has been driven by three key developments: the emergence of open reasoning datasets that enable reproducible research, a deeper understanding of reinforcement learning for LLMs, and advances in tooling for training and inference that make large-scale experimentation accessible to the community. In this talk, I will cover how these components are converging into a coherent open reasoning stack. I will highlight the leading projects, the lessons learned from recent experimentation, and the challenges that remain for open models to achieve competitive, general-purpose reasoning capabilities. |
15:00 | Invited Talk – Eric Wallace: Reasoning Enables Safer Language Models. Abstract: I’ll talk about three recent research directions to make OpenAI models more trustworthy and secure. First, I will do a deep dive into our chain-of-thought reasoning models and how we can align them with human preferences using deliberative alignment. Next, I will discuss how to mitigate prompt injections and jailbreaks by teaching LLMs to follow instructions in a hierarchical manner. Finally, I will discuss the tensions that exist between open model access and system security, whereby providing access to LM output probabilities can allow adversaries to reveal the hidden size of black-box models. |
15:30 | Coffee Break |
15:45 | Spotlight paper: Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach |
16:00 | Spotlight paper: ReCalibrate: RL for Uncertainty-Aware Reasoning in LLMs |
16:15 | Spotlight paper: SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models |
16:30 | Closing Remarks |
Accepted Papers
Oral Presentations
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models
ReCalibrate: RL for Uncertainty-Aware Reasoning in LLMs
Poster Presentations
EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning
Extending AutoCompressors via Surprisal-Based Dynamic Segmentation
Guiding Reasoning in Small Language Models with LLM Assistance
TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling
Process Reward Models That Think
Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks
Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs
Long Chain-of-Thought Reasoning Across Languages
Logit Arithmetic Elicits Long Reasoning Capabilities Without Training
Maximizing Prefix-Confidence at Test-Time Efficiently Improves Mathematical Reasoning
Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
Distilled pretraining improves test-time scaling and hinders in-context learning
Know What You Don't Know: Uncertainty Calibration of Process Reward Models
Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement
HYBRIDMIND: Meta Selection of Natural Language and Symbolic Language for Enhanced LLM Reasoning
Learning to Discover Abstractions for LLM Reasoning
Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess
From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Test-time Computing in Large Language Models
Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards
Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs
When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
Learning to Reason Across Parallel Samples for LLM Reasoning
Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness
Organizers

Muhammad Khalifa
University of Michigan
PhD candidate focusing on constrained generation and test-time techniques for reasoning. Experienced in organizing local events including NLP@Michigan Day.

Yunxiang Zhang
University of Michigan
PhD candidate researching ways to enhance the knowledge and reasoning capabilities of LLMs for scientific discovery.

Lifan Yuan
UIUC
PhD student focusing on scalable solutions for LLM post-training and inference in reasoning.


Hao Peng
Assistant Professor, UIUC
Research focuses on post-pretraining methods, long-context efficiency, and using LLMs for scientific discovery. Recipient of multiple paper awards and industry recognitions.

Sean Welleck
Assistant Professor, CMU
Leads the Machine Learning, Language, and Logic (L3) Lab. Research focuses on LLMs, reasoning, and AI for mathematics. NeurIPS Outstanding Paper Award and NAACL Best Paper Award recipient.

Sewon Min
Assistant Professor, UC Berkeley
Research expertise in LLMs and NLP, with focus on retrieval augmentation and data-centric approaches. Multiple paper awards at ACL, NeurIPS, and ICLR.

Honglak Lee
Professor, University of Michigan
Research spans deep learning, representation learning, and reinforcement learning. Recent work focuses on efficient methods for planning, reasoning, and learning for LLM-based agents. Named to IEEE's AI's 10 to Watch and recipient of the NSF CAREER Award.

Lu Wang
Associate Professor, University of Michigan
Research focuses on enhancing factuality and reasoning capabilities of language models. Extensive experience organizing workshops at major NLP conferences. Program co-chair for NAACL 2025.