This repository contains the code and experiments for investigating whether Visual Contrastive Decoding (VCD) can be adapted to mitigate relation hallucinations in Large Vision-Language Models (LVLMs). While VCD effectively reduces object hallucinations, relation hallucinations remain underexplored. Our project evaluates whether targeted, relation-specific perturbations provide a stronger contrastive signal for relational reasoning than full-image corruption.
- LLaVA - Large Language and Vision Assistant
- Grounding DINO - Object detection and grounding
- Visual Contrastive Decoding - Mitigating object hallucinations in LVLMs
- Reefknot - A benchmark for relation hallucination evaluation, analysis, and mitigation in Multimodal Large Language Models
- R-Bench - A benchmark with image- and instance-level Yes/No questions
- Relation-Aware VCD: Adapts standard VCD by applying Gaussian noise only to specific detected objects or regions instead of the entire image (see the first sketch after this list).
- Targeted Perturbation Strategies: Uses Grounding DINO for object detection to perform single-object masking, all-object masking, inter-object region masking, and patch shuffling.
- Counterfactual Prompting: A text-side contrastive strategy that swaps the relation in the original prompt for a counterfactual one (see the second sketch after this list).
- Extended Detect-then-Calibrate (DTC): Extends the standard DTC baseline, originally limited to Yes/No questions, to Multiple Choice (MCQ) and Visual Question Answering (VQA) formats using a generalized candidate token set gathered by top-p or top-k (see the third sketch after this list).
- Datasets Evaluated: Reefknot (comprising Y/N, MCQ, and VQA splits based on Visual Genome) and the R-Bench benchmark.
- Models: LLaVA-1.5-13B (primary) and Qwen-VL-7B.
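To make the relation-aware perturbation concrete, here is a minimal sketch of the first two items above: Gaussian noise restricted to detected boxes, followed by the standard VCD logit combination. It assumes boxes are pixel-coordinate `(x1, y1, x2, y2)` tuples (e.g. from Grounding DINO, rescaled to the image size) and a HuggingFace-LLaVA-style forward pass; `perturb_regions` and `vcd_next_token_logits` are illustrative names, not functions from this repository, and VCD's adaptive plausibility constraint is omitted for brevity.

```python
import torch

def perturb_regions(image: torch.Tensor, boxes, noise_std: float = 0.3) -> torch.Tensor:
    """Add Gaussian noise only inside detected boxes (relation-aware perturbation).

    image: (C, H, W) float tensor; boxes: pixel-coordinate (x1, y1, x2, y2) tuples.
    """
    distorted = image.clone()
    for x1, y1, x2, y2 in boxes:
        region = distorted[:, y1:y2, x1:x2]
        distorted[:, y1:y2, x1:x2] = region + noise_std * torch.randn_like(region)
    return distorted

@torch.no_grad()
def vcd_next_token_logits(model, input_ids, image, boxes, alpha: float = 1.0):
    """Standard VCD combination, with region-level instead of full-image noise."""
    logits_clean = model(input_ids, images=image.unsqueeze(0)).logits[:, -1, :]
    distorted = perturb_regions(image, boxes)
    logits_noisy = model(input_ids, images=distorted.unsqueeze(0)).logits[:, -1, :]
    # Amplify tokens supported by the intact image and penalise tokens that
    # survive the perturbation (i.e. those driven by language priors alone).
    return (1 + alpha) * logits_clean - alpha * logits_noisy
```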
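The counterfactual-prompting variant applies the same contrast on the text side instead: the relation phrase is swapped for an alternative, and the logits of the two prompts are contrasted while the image is kept intact. A minimal sketch (the relation/substitute pair and the `contrastive_logits` signature are illustrative, not the repository's API):

```python
import torch

def make_counterfactual(prompt: str, relation: str, substitute: str) -> str:
    """Build the contrastive prompt by swapping the relation phrase."""
    return prompt.replace(relation, substitute)

@torch.no_grad()
def contrastive_logits(model, tokenizer, prompt, image, relation, substitute, alpha=1.0):
    """Contrast next-token logits of the original and counterfactual prompts."""
    cf_prompt = make_counterfactual(prompt, relation, substitute)
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    cf_ids = tokenizer(cf_prompt, return_tensors="pt").input_ids
    logits = model(ids, images=image).logits[:, -1, :]
    cf_logits = model(cf_ids, images=image).logits[:, -1, :]
    return (1 + alpha) * logits - alpha * cf_logits

# e.g. "Is the cat sitting on the table?" vs. "Is the cat under the table?"
```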
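For the extended DTC baseline, the key generalization is replacing the fixed {Yes, No} pair with a candidate answer-token set gathered by nucleus (top-p) selection over the first-token logits, so the same uncertainty-based detection applies to MCQ and VQA. A sketch of that gathering step with an entropy-style uncertainty score (the threshold and scoring are illustrative assumptions; DTC's subsequent calibration step is not shown):

```python
import torch
import torch.nn.functional as F

def top_p_token_set(logits: torch.Tensor, p: float = 0.9):
    """Smallest set of first-token candidates whose cumulative probability exceeds p.

    logits: 1-D tensor over the vocabulary.
    """
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    cutoff = int((sorted_probs.cumsum(-1) >= p).nonzero()[0]) + 1
    return sorted_ids[:cutoff], sorted_probs[:cutoff]

def dtc_detect(logits: torch.Tensor, p: float = 0.9, tau: float = 2.0):
    """Flag a likely hallucination when the candidate distribution is too flat."""
    ids, probs = top_p_token_set(logits, p)
    probs = probs / probs.sum()                         # renormalise over candidates
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    return entropy.item() > tau, ids                    # high entropy => low confidence
```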
- Targeted contrastive decoding strategies plateau at a ~36% hallucination rate (for LLaVA on the Reefknot Y/N split), failing to offer meaningful improvements over the base model.
- VCD variants do not match or outperform the Detect-then-Calibrate (DTC) baseline.
- Logit distribution and attention analyses reveal that pixel-level perturbations are insufficient to decouple the model's relational reasoning from its language priors.
- Effective mitigation for relation hallucinations likely requires targeting internal model mechanisms rather than corrupting visual input at inference time.
The following models, tools, and environments are necessary to reproduce the experiments of this project:
- LVLMs: Set up the environments for LLaVA-1.5-13B and Qwen-VL-7B.
- Grounding DINO: Required for the object detection and targeted perturbation steps.
- DeBERTa-v2: Used for bidirectional textual entailment when evaluating the VQA question type (see the sketch below).
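Because open-ended VQA answers rarely match the gold relation string exactly, correctness is judged by bidirectional entailment: prediction and reference must each entail the other. A minimal sketch using a DeBERTa-v2 MNLI checkpoint from HuggingFace (the exact checkpoint name and threshold are assumptions, not pinned by this repository):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CKPT = "microsoft/deberta-v2-xlarge-mnli"   # assumed; any DeBERTa-v2 NLI checkpoint works
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT).eval()

@torch.no_grad()
def entails(premise: str, hypothesis: str) -> float:
    """P(premise entails hypothesis) under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    probs = model(**inputs).logits.softmax(-1)[0]
    idx = model.config.label2id.get("ENTAILMENT", 2)   # fall back to the usual MNLI index
    return probs[idx].item()

def answers_match(pred: str, ref: str, threshold: float = 0.5) -> bool:
    """Bidirectional check: each answer must entail the other."""
    return entails(pred, ref) > threshold and entails(ref, pred) > threshold
```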
Built on Visual Genome; includes Y/N, MCQ, and VQA subsets. The following scripts run inference on Reefknot with the different methods:
- VCD (Visual Contrastive Decoding) based methods and the base model: VCD (example script)
- DTC (Detect-then-Calibrate): DTC
The generated result files can be evaluated using: Reefknot Evaluation
Specifically, we use the image-level subset containing Y/N questions. The following scripts run inference on the R-Bench benchmark with the different methods:
- VCD (Visual Contrastive Decoding) Based: VCD
- DTC (Detect-then-Calibrate): DTC
- Base Model: Base LLaVA Model
The generated result files can be evaluated using: R-Bench Evaluation Script
Running inference with large models like LLaVA-13B and contrastive decoding requires significant GPU resources. During development, the following platforms were utilized:
- Initial Development & Debugging: Lightning.ai, Kaggle (free-tier), and Google Colab.
- Full-Scale Experiments: Lichtenberg and ADA HPC clusters. Ensure you have adequate VRAM and compute limits to run the full evaluation suites.
- Keshav Agrawal
- Nico Lick
- Anusha Siddapati Mohanreddy
- Romila Singh
- Manu Thomas