
Don’t Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models


Updates

  • [2025/08] We are delighted that PCBench has been accepted to EMNLP 2025 Findings!
  • [2025/05] We released the code for this project.

Contents

  • Introduction
  • Key Findings
  • Data Construction
  • Results
  • Install
  • Run Code
  • Citation

Introduction


Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs.

Key Findings

  • Most models show limited ability to autonomously critique flawed premises, relying heavily on explicit prompts to detect errors.
  • Both question difficulty and error type influence models' premise critique ability: models excel at identifying simple, surface-level flaws but struggle with complex inconsistencies or procedural errors.
  • There is no consistent correlation between a model’s reasoning capability and its ability to critique premises. Some reasoning models internally catch inconsistencies but fail to articulate them outwardly.
  • Flawed premises deepen overthinking in reasoning models, leading to significantly longer responses.

Data Construction

We construct PCBench to systematically evaluate LLMs' premise critique abilities for erroneous inputs via a structured process:

  1. Error Categories: Define four types of premise errors to assess whether models can identify flawed inputs.
  2. Difficulty Levels:
    • Normal: From GSM8K dataset
    • Medium: Adapted from Chinese College Entrance Examination (OlympiadBench)
    • Difficult: From Omni-MATH (difficulty >6)
  3. Problem Variants for each base problem (error category + difficulty):
    • Original Problem: Correct premises (baseline).
    • Flawed Problem: Intentional errors in premises (to test autonomous critique).
    • Flawed Problem with Explicit Instruction: Adds prompts to check for errors (comparative reference).

Scale: 100 base problems per error-difficulty combination (4 error types × 3 difficulty levels) yield 1,200 base problems, which expand to 3,600 problems with 3 variants each.
PCBench is designed to analyze how error type and task complexity affect premise critique ability.
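
To make the variant structure concrete, here is a minimal Python sketch of how one base problem and its three variants might be represented. All field names are hypothetical and may differ from the released dataset's actual schema.

from dataclasses import dataclass

# Hypothetical record for one PCBench base problem; the field names are
# illustrative only, not the repository's actual schema.
@dataclass
class BaseProblem:
    error_category: str   # one of the four premise-error types
    difficulty: str       # "normal", "medium", or "difficult"
    original: str         # variant 1: correct premises (baseline)
    flawed: str           # variant 2: premises with an intentional error
    flawed_explicit: str  # variant 3: flawed premises plus an explicit
                          # instruction to check for errors

# 4 error categories x 3 difficulty levels x 100 base problems = 1,200
# base problems; 3 variants each gives PCBench's 3,600 problems.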


Results


Model responses and the corresponding evaluation results can be found in evaluation/infer_result and evaluation/eval_result.

Aggregated statistics over the evaluation results are available in evaluation/statistics.

Install

pip install -r requirements.txt

Run Code

For reference, the command-line scripts are located in the evaluation/reference_sh folder.

Inference

Run the following command to obtain the LLM's responses:

python ./evaluation/inference.py --model_name <model_name> --mode inference --save_frequency <save_frequency> --dataset_load_proc <dataset_load_proc> --infer_proc <infer_proc>

Then validate the inference results for missing samples, repeating until there are no omissions:

python ./evaluation/inference.py --model_name <model_name> --mode check --save_frequency <save_frequency> --dataset_load_proc <dataset_load_proc> --infer_proc <infer_proc>

Scripts located in evaluation/reference_sh.
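
As a concrete illustration, an inference run followed by a completeness check might look like this; the model name and numeric values below are placeholders, and the settings actually used by the authors are in the scripts under evaluation/reference_sh:

python ./evaluation/inference.py --model_name gpt-4o --mode inference --save_frequency 10 --dataset_load_proc 4 --infer_proc 8
python ./evaluation/inference.py --model_name gpt-4o --mode check --save_frequency 10 --dataset_load_proc 4 --infer_proc 8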

Evaluation

Run the following command to obtain o3-mini-high's evaluation of the corresponding responses:

python ./evaluation/eval.py --model_name <model_name> --mode inference --evaluator <evaluator_model> --save_frequency <save_frequency> --infer_proc <infer_proc>

Then validate the evaluation results for missing samples, repeating until there are no omissions:

python ./evaluation/eval.py --model_name <model_name> --mode check --evaluator <evaluator_model> --save_frequency <save_frequency> --infer_proc <infer_proc>

Scripts located in evaluation/reference_sh.
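
For example, evaluating the same hypothetical model with o3-mini-high as the evaluator might look like this, again with placeholder values:

python ./evaluation/eval.py --model_name gpt-4o --mode inference --evaluator o3-mini-high --save_frequency 10 --infer_proc 8
python ./evaluation/eval.py --model_name gpt-4o --mode check --evaluator o3-mini-high --save_frequency 10 --infer_proc 8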

Statistics

After evaluating all the models, you can run the following command to compute statistics over the evaluation results.

python ./evaluation/statistics.py

Citation

@inproceedings{li-etal-2025-dont,
    title = "Don{'}t Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models",
    author = "Li, Jinzhe  and
      Li, Gengxu  and
      Chang, Yi  and
      Wu, Yuan",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.44/",
    doi = "10.18653/v1/2025.findings-emnlp.44",
    pages = "836--869",
    ISBN = "979-8-89176-335-7",
    abstract = "Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the **Premise Critique Ability** for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the **Premise Critique Bench (PCBench)**, designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs, Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to be detected than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems."
}

Please cite our paper if you find our research and code useful.
