The First Workshop on Test-time Scaling and Reasoning Models
(ScalR @ COLM 2025)

October 10, 2025, Montreal, Canada

About The Workshop

The workshop aims to spark discussion around test-time scaling, with a particular (but not exclusive) focus on reasoning. By bringing together researchers working on reasoning, planning, alignment, and agentic systems, ScalR seeks to build a community around this rapidly evolving area.

Where & When

Palais des Congrès
Montreal, Canada

October 10, 2025

Call For Papers

Submissions and Review Process

We invite researchers and practitioners to submit non-archival short papers (up to 4 pages, excluding references and appendix) describing novel ideas, preliminary results, or negative results related to test-time scaling and reasoning models. Paper submissions will be hosted on OpenReview. Submissions should follow the COLM submission guidelines and be anonymized for double-blind review. All submissions must be in PDF format and use the LaTeX style files provided by COLM.

Important Dates

  • Submission Deadline: June 25, 2025 (extended from June 23, 2025; submit on OpenReview)
  • Accept/Reject Notification: July 24, 2025

Topics of Interest

Topics of interest include, but are not limited to:

  • Novel test-time algorithms for reasoning, planning, alignment, or agentic tasks
  • Test-time scaling techniques for LLM agents
  • Innovations in model training (algorithms, data, architecture) that facilitate more efficient and robust test-time scaling
  • Test-time scaling for reasoning tasks with verifiable and non-verifiable rewards
  • Novel techniques for training and utilization of outcome and process reward models in test-time scaling scenarios
  • Evaluation: benchmarks, simulation environments, evaluation protocols and metrics, and human-in-the-loop evaluation of test-time scaling methods
  • Theoretical foundations of test-time scaling
  • Test-time scaling techniques for multi-modal reasoning
  • Studies on the faithfulness, trustworthiness, and other safety aspects of large reasoning models
  • Applications of LLM test-time scaling: healthcare, robotics, embodiment, chemistry, education, databases, and beyond, with extra encouragement for less-studied domains
  • Societal implications of LLM test-time scaling: bias, equity, misuse, jobs, climate change, and beyond
  • Test-time scaling for everyone: multilinguality, multiculturalism, and inference-time adaptation to new values

Invited Speakers

Aviral Kumar
Carnegie Mellon University

Xuezhi Wang
Google DeepMind

Nathan Lambert
Allen Institute for AI

Lewis Tunstall
Hugging Face

Azalia Mirhoseini
Stanford University

Eric Wallace
OpenAI

Schedule

Time Session
9:00 Opening Remarks
9:15 Invited Talk - Aviral Kumar
9:45 Xuezhi Wang - Teaching LLMs to Think

Abstract: Large Language Models are powerful, yet they often struggle with tasks requiring deep reasoning. This talk traces how we've taught models to "think"—from prompting and decoding to advanced post-training and RL techniques that amplify reasoning depth—showing that reasoning is a crucial scaling dimension alongside pre-training.
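
As a concrete illustration of the decoding-side techniques this talk surveys, here is a minimal sketch of self-consistency decoding (sample several chains of thought, then majority-vote on the final answer); this is an illustrative toy, not code from the talk, and sample_cot_answer is a hypothetical stand-in for a real model call.

    from collections import Counter
    import random

    def sample_cot_answer(question: str) -> str:
        # Hypothetical stand-in: in practice, sample a chain-of-thought
        # completion from an LLM at temperature > 0 and parse out its
        # final answer.
        return random.choice(["7", "7", "7", "5"])  # noisy toy answers

    def self_consistency(question: str, k: int = 16) -> str:
        # Scale test-time compute: draw k independent samples and return
        # the majority-vote answer.
        votes = Counter(sample_cot_answer(question) for _ in range(k))
        return votes.most_common(1)[0][0]

    print(self_consistency("What is 3 + 4?"))  # usually "7"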

10:15 Coffee Break
10:30 Poster Session
12:00 Nathan Lambert - Olmo-Thinking: Training a fully open reasoning model

Abstract: We cover what it takes to add reasoning to a new 7B open model (Olmo-Thinking) to rival Qwen 3, highlighting fresh results, trade-offs, and methods across midtraining, distillation with high-quality thinking SFT data, and reinforcement learning with verifiable rewards.

12:30 Invited Talk - Azalia Mirhoseini
13:00 Lunch Break
14:30 Lewis Tunstall - The State of Open Reasoning Models

Abstract: Since the release of DeepSeek-R1, the open-source AI ecosystem has undergone a remarkable transformation in how it approaches reasoning and inference-time compute. This progress has been driven by three key developments: the emergence of open reasoning datasets that enable reproducible research, a deeper understanding of reinforcement learning for LLMs, and advances in tooling for training and inference that make large-scale experimentation accessible to the community. In this talk, I will cover how these components are converging into a coherent open reasoning stack. I will highlight the leading projects, the lessons learned from recent experimentation, and the challenges that remain for open models to achieve competitive, general-purpose reasoning capabilities.

15:00 Eric Wallace - Reasoning Enables Safer Language Models

Abstract: I’ll talk about three recent research directions to make OpenAI models more trustworthy and secure. First, I will do a deep dive into our chain-of-thought reasoning models and how we can align them with human preferences using deliberative alignment. Next, I will discuss how to mitigate prompt injections and jailbreaks by teaching LLMs to follow instructions in a hierarchical manner. Finally, I will discuss the tensions that exist between open model access and system security, whereby providing access to LM output probabilities can allow adversaries to reveal the hidden size of black-box models.
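
To make the last point concrete, below is a toy sketch (not the actual attack) of why full logit access can leak a model's hidden size, under the standard assumption that logits are a linear projection of the final hidden state: a stack of logit vectors then has numerical rank at most the hidden dimension, which the singular values reveal.

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB, HIDDEN, N_QUERIES = 4096, 256, 512  # toy sizes, assumed for the demo

    W = rng.standard_normal((VOCAB, HIDDEN))                  # secret unembedding matrix
    hidden_states = rng.standard_normal((N_QUERIES, HIDDEN))  # final hidden state per prompt
    logits = hidden_states @ W.T                              # vectors an API might expose

    # Every logit vector lies in the column space of W, so the stacked
    # logit matrix has numerical rank at most HIDDEN.
    s = np.linalg.svd(logits, compute_uv=False)
    estimated_hidden = int((s > 1e-6 * s[0]).sum())
    print(estimated_hidden)  # prints 256, recovering the hidden size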

15:30 Coffee Break
15:45 Spotlight paper: Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
16:00 Spotlight paper: ReCalibrate: RL for Uncertainty-Aware Reasoning in LLMs
16:15 Spotlight paper: SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models
16:30 Closing Remarks

Accepted Papers

Oral Presentations

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Jonas Geiping, Sean Michael McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein

SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models

Emil Biju, Shayan Talaei, Zhemin Huang, Mohammadreza Pourreza, Azalia Mirhoseini, Amin Saberi

ReCalibrate: RL for Uncertainty-Aware Reasoning in LLMs

Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Jacob Andreas

Poster Presentations

EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning

Sanchit Ahuja, Praneetha Vaddamanu, Barun Patra

Extending AutoCompressors via Surprisal-Based Dynamic Segmentation

Richard Xu, Raine Ma, Dawson Park, David Guo, Srivishnu Ramamurthi, Charles Duong, Kevin Zhu, Vasu Sharma, Sean O'Brien

Guiding Reasoning in Small Language Models with LLM Assistance

Yujin Kim, Euiin Yi, Minu Kim, Se-Young Yun, Taehyeon Kim

TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling

Weizhe Lin, Xing Li, Zhiyuan Yang, Xiaojin Fu, Hui-Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, Mingxuan Yuan

Process Reward Models That Think

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang

Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks

Jiajing Guo, Kenil Patel, Jorge Henrique Piazentin Ono, Wenbin He, Liu Ren

Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs

Sayan Ghosh, Shahzaib Saqib Warraich, Dhruv Tarsadiya, Gregory Yauney, Swabha Swayamdipta

Long Chain-of-Thought Reasoning Across Languages

Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr

Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang

Maximizing Prefix-Confidence at Test-Time Efficiently Improves Mathematical Reasoning

Matthias Otth, Jonas Hübotter, Ido Hakimi, Andreas Krause

Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods

Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, Akash Srivastava

Distilled pretraining improves test-time scaling and hinders in-context learning

Sachin Goyal, David Lopez-Paz, Kartik Ahuja

Know What You Don't Know: Uncertainty Calibration of Process Reward Models

Young-Jin Park, Kristjan Greenewald, Kaveh Alim, Hao Wang, Navid Azizan

Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Victor Gallego

HYBRIDMIND: Meta Selection of Natural Language and Symbolic Language for Enhanced LLM Reasoning

Simeng Han, Tianyu Liu, Chuhan Li, Xuyuan Xiong, Arman Cohan

Learning to Discover Abstractions for LLM Reasoning

Yuxiao Qu, Anikait Singh, Yoonho Lee, Amrith Setlur, Ruslan Salakhutdinov, Chelsea Finn, Aviral Kumar

Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess

Dongyoon Hwang, Hojoon Lee, Jaegul Choo, Dongmin Park, Jongho Park

From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Test-time Computing in Large Language Models

Jinyi Liu, Yan Zheng, Rong Cheng, Qiyu Wu, Wei Guo, Fei Ni, Hebin Liang, Yifu Yuan, Hangyu Mao, Fuzheng Zhang, Jianye Hao

Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards

Derek Li, Jiaming Zhou, Amirreza Kazemi, Qianyi Sun, Abbas Ghaddar, Liheng Ma, Yu Luo, Dong Li, Jianye Hao, Yingxue Zhang

Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs

Yuanxin Wang, Ganesh Venkatesh

When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning

Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach

e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

Amrith Setlur, Matthew Y. R. Yang, Charlie Victor Snell, Jeremiah Greer, Ian Wu, Virginia Smith, Max Simchowitz, Aviral Kumar

Learning to Reason Across Parallel Samples for LLM Reasoning

Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, Eunsol Choi

Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness

Bo Lei, Tavish Malcolm McDonald, Stanislav Fort, Bhavya Kailkhura, Brian R. Bartoldson

Organizers

Muhammad Khalifa

University of Michigan

PhD candidate focusing on constrained generation and test-time techniques for reasoning. Experienced in organizing local events including NLP@Michigan Day.

Yunxiang Zhang

University of Michigan

PhD candidate researching how to enhance the knowledge and reasoning capabilities of LLMs for scientific discovery.

Lifan Yuan

UIUC

PhD student focusing on scalable solutions for LLM post-training and inference in reasoning.

Shivam Agarwal

UIUC

PhD student researching LLM post-training methods and inference scaling.

Hao Peng

Assistant Professor, UIUC

Research focuses on post-pretraining methods, long-context efficiency, and using LLMs for scientific discovery. Recipient of multiple paper awards and industry honors.

Sean Welleck

Assistant Professor, CMU

Leads the Machine Learning, Language, and Logic (L3) Lab. Research focuses on LLMs, reasoning, and AI for mathematics. NeurIPS Outstanding Paper Award and NAACL Best Paper Award recipient.

Sewon Min

Assistant Professor, UC Berkeley

Research expertise in LLMs and NLP, with focus on retrieval augmentation and data-centric approaches. Multiple paper awards at ACL, NeurIPS, and ICLR.

Honglak Lee

Professor, University of Michigan

Research spans deep learning, representation learning, and reinforcement learning. Recent work focuses on efficient methods for planning, reasoning, and learning for LLM-based agents. IEEE AI's 10 to Watch, NSF CAREER Award recipient.

Lu Wang

Associate Professor, University of Michigan

Research focuses on enhancing factuality and reasoning capabilities of language models. Extensive experience organizing workshops at major NLP conferences. Program co-chair for NAACL 2025.