
Don’t Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models


Updates

  • [2025/08] We are delighted that PCBench has been accepted to EMNLP 2025 Findings!
  • [2025/05] We released the code for this project.

Contents

  • Introduction
  • Key Findings
  • Data Construction
  • Results
  • Install
  • Run Code
  • Citation

Introduction


Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs.

Key Findings

  • Most models show limited ability to autonomously critique flawed premises, relying heavily on explicit prompts to detect errors.
  • Both question difficulty and error type influence models' premise critique ability: models excel at identifying simple, surface-level flaws but struggle with complex inconsistencies or procedural errors.
  • There is no consistent correlation between a model’s reasoning capability and its ability to critique premises. Some reasoning models internally catch inconsistencies but fail to articulate them outwardly.
  • Flawed premises deepen overthinking in reasoning models, leading to significantly longer responses.

Data Construction

We construct PCBench to systematically evaluate LLMs' premise critique abilities for erroneous inputs via a structured process:

  1. Error Categories: Define four types of premise errors to assess whether models can identify flawed inputs.
  2. Difficulty Levels:
    • Normal: From GSM8K dataset
    • Medium: Adapted from Chinese College Entrance Examination (OlympiadBench)
    • Difficult: From Omni-MATH (difficulty >6)
  3. Problem Variants for each base problem (error category + difficulty):
    • Original Problem: Correct premises (baseline).
    • Flawed Problem: Intentional errors in premises (to test autonomous critique).
    • Flawed Problem with Explicit Instruction: Adds prompts to check for errors (comparative reference).

Scale: 100 base problems per error-difficulty combination (4 error types × 3 difficulty levels) yield 1,200 base problems, which expand to 3,600 problems with 3 variants each.
PCBench is designed to analyze how error type and task complexity affect premise critique ability.
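
To make the variant structure concrete, here is a minimal Python sketch of how one base problem and its three variants might be represented. All field names are hypothetical and may differ from the released dataset's actual schema.

from dataclasses import dataclass

# Hypothetical record for one PCBench base problem; the field names are
# illustrative only, not the repository's actual schema.
@dataclass
class BaseProblem:
    error_category: str   # one of the four premise-error types
    difficulty: str       # "normal", "medium", or "difficult"
    original: str         # variant 1: correct premises (baseline)
    flawed: str           # variant 2: premises with an intentional error
    flawed_explicit: str  # variant 3: flawed premises plus an explicit
                          # instruction to check for errors

# 4 error categories x 3 difficulty levels x 100 base problems = 1,200
# base problems; 3 variants each gives PCBench's 3,600 problems.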


Results


Model responses and the corresponding evaluation results can be found in evaluation/infer_result and evaluation/eval_result.

Aggregated statistics over the evaluation results are available in evaluation/statistics.

Install

pip install -r requirements.txt

Run Code

For reference, the command-line scripts are located in the evaluation/reference_sh folder.

Inference

Run the following command to obtain the LLM's responses:

python ./evaluation/inference.py --model_name <model_name> --mode inference --save_frequency <save_frequency> --dataset_load_proc <dataset_load_proc> --infer_proc <infer_proc>

Then validate the inference results for missing samples, repeating until there are no omissions:

python ./evaluation/inference.py --model_name <model_name> --mode check --save_frequency <save_frequency> --dataset_load_proc <dataset_load_proc> --infer_proc <infer_proc>

Scripts located in evaluation/reference_sh.
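
As a concrete illustration, an inference run followed by a completeness check might look like this; the model name and numeric values below are placeholders, and the settings actually used by the authors are in the scripts under evaluation/reference_sh:

python ./evaluation/inference.py --model_name gpt-4o --mode inference --save_frequency 10 --dataset_load_proc 4 --infer_proc 8
python ./evaluation/inference.py --model_name gpt-4o --mode check --save_frequency 10 --dataset_load_proc 4 --infer_proc 8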

Evaluation

Run the following command to obtain o3-mini-high's evaluation of the corresponding responses:

python ./evaluation/eval.py --model_name <model_name> --mode inference --evaluator <evaluator_model> --save_frequency <save_frequency> --infer_proc <infer_proc>

Then validate the evaluation results for missing samples, repeating until there are no omissions:

python ./evaluation/eval.py --model_name <model_name> --mode check --evaluator <evaluator_model> --save_frequency <save_frequency> --infer_proc <infer_proc>

Scripts located in evaluation/reference_sh.
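
For example, evaluating the same hypothetical model with o3-mini-high as the evaluator might look like this, again with placeholder values:

python ./evaluation/eval.py --model_name gpt-4o --mode inference --evaluator o3-mini-high --save_frequency 10 --infer_proc 8
python ./evaluation/eval.py --model_name gpt-4o --mode check --evaluator o3-mini-high --save_frequency 10 --infer_proc 8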

Statistics

After evaluating all the models, you can run the following command to compute statistics over the evaluation results.

python ./evaluation/statistics.py

Citation

@inproceedings{li-etal-2025-dont,
    title = "Don{'}t Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models",
    author = "Li, Jinzhe  and
      Li, Gengxu  and
      Chang, Yi  and
      Wu, Yuan",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.44/",
    doi = "10.18653/v1/2025.findings-emnlp.44",
    pages = "836--869",
    ISBN = "979-8-89176-335-7",
    abstract = "Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the **Premise Critique Ability** for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the **Premise Critique Bench (PCBench)**, designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs, Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to be detected than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems."
}

Please cite our paper if you find our research and code useful.
