Poster
Can Large Language Models Ask the Right Questions in Solving Complex Problems with Incomplete Information?
Zhanke Zhou · Xiao Feng · Zhaocheng Zhu · Jiangchao Yao · Sanmi Koyejo · Bo Han
While numerous benchmarks evaluate the reasoning abilities of large language models (LLMs) across various domains, most focus on passive reasoning tasks—where models are provided with all the necessary information to solve a problem. In contrast, active reasoning tasks, which require models to interact with external systems to gather missing information, remain largely underexplored. To bridge this gap, we introduce AR-Bench, a benchmark specifically designed to systematically assess the active reasoning capabilities of LLMs. AR-Bench features three distinct tasks: detective cases, situation puzzles, and guessing numbers, which respectively probe active reasoning in commonsense, logical, and symbolic settings. Empirical results reveal that modern LLMs struggle significantly with active reasoning, often failing to consistently acquire or utilize relevant information. This exposes a clear disparity between their passive and active reasoning abilities. Further analysis shows that even advanced techniques, including tree-of-thought and post-training methods, provide only marginal improvements and fail to achieve performance levels suitable for practical applications. These findings underscore the urgent need to improve LLMs' active reasoning capabilities to better align with real-world problem-solving requirements.
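To make the passive/active distinction concrete, below is a minimal sketch of the interaction pattern the abstract describes, using a Bulls-and-Cows-style guessing-numbers game: the solver starts with no information about the secret and must repeatedly query the environment and update on the feedback. The function names (`feedback`, `propose_guess`, `active_reasoning_loop`), the distinct-digit assumption, and the random candidate-selection policy (standing in for an LLM's question strategy) are illustrative assumptions, not the actual AR-Bench interface.

```python
"""Sketch of an active-reasoning loop for a guessing-numbers task (assumed setup)."""
import itertools
import random


def feedback(secret: str, guess: str) -> tuple[int, int]:
    """Return (bulls, cows): digits correct in place, and correct but misplaced."""
    bulls = sum(s == g for s, g in zip(secret, guess))
    cows = sum(min(secret.count(d), guess.count(d)) for d in set(guess)) - bulls
    return bulls, cows


def propose_guess(candidates: list[str]) -> str:
    """Stand-in for the model's query policy: pick any candidate consistent so far."""
    return random.choice(candidates)


def active_reasoning_loop(secret: str, max_turns: int = 10) -> int:
    """Solver must actively gather information; it is never shown the secret."""
    # All 4-digit codes with distinct digits (an assumed task variant).
    candidates = ["".join(p) for p in itertools.permutations("0123456789", 4)]
    for turn in range(1, max_turns + 1):
        guess = propose_guess(candidates)
        bulls, cows = feedback(secret, guess)  # information gathered this turn
        if bulls == 4:
            return turn  # solved
        # Keep only candidates consistent with every piece of feedback so far.
        candidates = [c for c in candidates if feedback(c, guess) == (bulls, cows)]
    return -1  # turn budget exhausted without solving


if __name__ == "__main__":
    random.seed(0)
    turns = active_reasoning_loop(secret="3170")
    print(f"solved in {turns} turns" if turns > 0 else "not solved")
```

The key contrast with passive reasoning is that no single prompt contains enough information to answer; performance depends on asking informative questions and integrating the resulting evidence across turns.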