Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks, including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the opponent's intentions, as well as the board's spatial configuration, to secure a win. We evaluate a diverse set of state-of-the-art LRMs and find that models that excel at hard math problems frequently fail at these simple reasoning games. Further testing reveals that our evaluated reasoning models score on average ↓41% and ↓5% lower on TTT-Bench compared to MATH 500 and AIME 2024, respectively, with larger models achieving higher performance using shorter reasoning traces, while most models struggle in long-term strategic reasoning situations on the simple and novel TTT-Bench tasks.
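To make the programmatic generation idea concrete, below is a minimal sketch for the standard 3x3 Tic-Tac-Toe case only: it enumerates reachable positions, uses exact minimax to keep positions where the player to move has a forced win, and records the set of winning moves as a verifiable answer key. The function names, selection criterion, and board encoding are illustrative assumptions, not the authors' actual generator or the other three TTT-Bench game variants.

```python
# Sketch: generating verifiable Tic-Tac-Toe problems (assumed approach, 3x3 only).
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a line is completed, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Game-theoretic value for `player` to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == player else -1
    if "." not in board:
        return 0
    opponent = "O" if player == "X" else "X"
    best = -1
    for i, cell in enumerate(board):
        if cell == ".":
            child = board[:i] + player + board[i + 1:]
            best = max(best, -value(child, opponent))
    return best

def winning_moves(board, player):
    """Indices of moves that preserve a forced win for `player` (the answer key)."""
    opponent = "O" if player == "X" else "X"
    moves = []
    for i, cell in enumerate(board):
        if cell == ".":
            child = board[:i] + player + board[i + 1:]
            if -value(child, opponent) == 1:
                moves.append(i)
    return moves

def generate_problems():
    """Enumerate reachable positions where the player to move has a forced win."""
    problems, seen = [], set()

    def play(board, player):
        if board in seen or winner(board) or "." not in board:
            return
        seen.add(board)
        if value(board, player) == 1:
            problems.append((board, player, winning_moves(board, player)))
        opponent = "O" if player == "X" else "X"
        for i, cell in enumerate(board):
            if cell == ".":
                play(board[:i] + player + board[i + 1:], opponent)

    play("." * 9, "X")
    return problems

if __name__ == "__main__":
    problems = generate_problems()
    board, player, answer = problems[0]
    print(f"{len(problems)} forced-win positions; e.g. {board!r}, {player} to move, wins at {answer}")
```

Because the answer key is computed exactly by minimax, any model response can be checked automatically against the set of winning moves, which is what makes problems generated this way verifiable at scale.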
Weak reasoning ability of LRMs on simple and intuitive tasks:
In this work, we evaluate a comprehensive set of recent SOTA LRMs on TTT-Bench and conduct a head-to-head comparison of their performance on TTT-Bench versus two widely used mathematics benchmarks, AIME 2024 and MATH 500 (high-school math), to thoroughly investigate the reasoning capabilities of these models. Key findings:
Reasoning models struggle with simple long-term strategic reasoning:
We investigate which types of reasoning tasks these LRMs handle well by analyzing their performance on questions from each individual solution verdict category. Key findings:
LRMs overthink on TTT-Bench, with an increase in model size resulting in improved performance and efficient use of chain-of-thought:
We also perform the following comparisons to further analyze reasoning behavior:
@misc{mishra2025tttbenchbenchmarkevaluatingreasoning,
title={TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games},
author={Prakamya Mishra and Jiang Liu and Jialian Wu and Xiaodong Yu and Zicheng Liu and Emad Barsoum},
year={2025},
eprint={2506.10209},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.10209},
}