Don’t Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models
- [2025/08] We are delighted that PCBench has been accepted to EMNLP 2025 Findings!
- [2025/05] We released the code for this project.
Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs.
- Most models show limited ability to autonomously critique flawed premises, relying heavily on explicit prompts to detect errors.
- Both question difficulty and the error type can influence models' premise critique ability: models excel at identifying simple, surface-level flaws but struggle with complex inconsistencies or procedural errors.
- There is no consistent correlation between a model’s reasoning capability and its ability to critique premises. Some reasoning models internally catch inconsistencies but fail to articulate them outwardly.
- Flawed premises deepen overthinking in reasoning models, leading to significantly longer responses.
We construct PCBench to systematically evaluate LLMs' premise critique abilities for erroneous inputs via a structured process:
- Error Categories: Define 4 types of premise errors to assess model capabilities in identifying flawed inputs.
- Difficulty Levels:
- Normal: From GSM8K dataset
- Medium: Adapted from Chinese College Entrance Examination (OlympiadBench)
- Difficult: From Omni-MATH (difficulty >6)
- Problem Variants for each base problem (error category + difficulty):
- Original Problem: Correct premises (baseline).
- Flawed Problem: Intentional errors in premises (to test autonomous critique).
- Flawed Problem with Explicit Instruction: Adds prompts to check for errors (comparative reference).
Scale: 100 base problems per error-difficulty combination → 1,200 base problems → 3,600 problems (3 variants each).
PCBench is designed to analyze how error type and task complexity affect premise critique ability.
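The dataset scale follows directly from this grid. The minimal Python sketch below multiplies it out; the category labels are illustrative placeholders, since the four error types are not enumerated here:

```python
# Sketch of PCBench's problem grid: 4 error categories x 3 difficulty levels,
# 100 base problems per combination, each expanded into 3 variants.
# The category labels below are placeholders, not the dataset's actual names.
ERROR_CATEGORIES = ["category_1", "category_2", "category_3", "category_4"]
DIFFICULTIES = ["normal", "medium", "difficult"]
VARIANTS = ["original", "flawed", "flawed_with_instruction"]
BASE_PER_COMBINATION = 100

base_problems = len(ERROR_CATEGORIES) * len(DIFFICULTIES) * BASE_PER_COMBINATION
total_problems = base_problems * len(VARIANTS)
print(base_problems, total_problems)  # -> 1200 3600
```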
Per-sample responses and the corresponding evaluation results can be found in `evaluation/infer_result` and `evaluation/eval_result`. Aggregated statistics over the evaluation results are available in `evaluation/statistics`.
```shell
pip install -r requirements.txt
```
For reference, the command-line scripts are located in the `evaluation/reference_sh` folder.
Run the following command to collect the LLM's responses:

```shell
python ./evaluation/inference.py --model_name <model_name> --mode inference --save_frequency <save_frequency> --dataset_load_proc <dataset_load_proc> --infer_proc <infer_proc>
```

Then validate the inference results for missing samples (repeat until no omissions remain):

```shell
python ./evaluation/inference.py --model_name <model_name> --mode check --save_frequency <save_frequency> --dataset_load_proc <dataset_load_proc> --infer_proc <infer_proc>
```

Reference scripts are located in `evaluation/reference_sh`.
Run the following command to obtain o3-mini-high's evaluation results for the corresponding responses:

```shell
python ./evaluation/eval.py --model_name <model_name> --mode inference --evaluator <evaluator_model> --save_frequency <save_frequency> --infer_proc <infer_proc>
```

Then validate the evaluation results for missing samples (repeat until no omissions remain):

```shell
python ./evaluation/eval.py --model_name <model_name> --mode check --evaluator <evaluator_model> --save_frequency <save_frequency> --infer_proc <infer_proc>
```

Reference scripts are located in `evaluation/reference_sh`.
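Putting the steps together, the sketch below assembles the full command sequence as plain strings and prints them for a dry run. The model name, evaluator, and numeric arguments are illustrative placeholders for the flags shown above; substitute real values before piping the output to a shell (or passing each string to `subprocess.run`).

```python
# Dry-run sketch of the PCBench pipeline: it only PRINTS the commands.
# "gpt-4o", "o3-mini-high", and the numeric values are placeholders.
MODEL = "gpt-4o"
EVALUATOR = "o3-mini-high"

commands = [
    # 1. Collect model responses, then re-check for missing samples.
    f"python ./evaluation/inference.py --model_name {MODEL} --mode inference "
    f"--save_frequency 10 --dataset_load_proc 8 --infer_proc 8",
    f"python ./evaluation/inference.py --model_name {MODEL} --mode check "
    f"--save_frequency 10 --dataset_load_proc 8 --infer_proc 8",
    # 2. Evaluate the responses, then re-check for missing samples.
    f"python ./evaluation/eval.py --model_name {MODEL} --mode inference "
    f"--evaluator {EVALUATOR} --save_frequency 10 --infer_proc 8",
    f"python ./evaluation/eval.py --model_name {MODEL} --mode check "
    f"--evaluator {EVALUATOR} --save_frequency 10 --infer_proc 8",
    # 3. Aggregate statistics once every model has been evaluated.
    "python ./evaluation/statistics.py",
]

for cmd in commands:
    print(cmd)
```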
After evaluating all the models, you can run the following command to get the statistics of the evaluation results.
```shell
python ./evaluation/statistics.py
```

```bibtex
@inproceedings{li-etal-2025-dont,
    title = "Don{'}t Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models",
    author = "Li, Jinzhe and
      Li, Gengxu and
      Chang, Yi and
      Wu, Yuan",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.44/",
    doi = "10.18653/v1/2025.findings-emnlp.44",
    pages = "836--869",
    ISBN = "979-8-89176-335-7",
    abstract = "Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the **Premise Critique Ability** for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the **Premise Critique Bench (PCBench)**, designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs, Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to be detected than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems."
}
```
Please cite our paper if you find our research and code useful.



