This repository contains the code and experiments for investigating whether Visual Contrastive Decoding (VCD) can be adapted to mitigate relation hallucinations in Large Vision-Language Models (LVLMs). While VCD effectively reduces object hallucinations, relation hallucinations remain underexplored. Our project evaluates whether targeted, relation-specific perturbations provide a stronger contrastive signal for relational reasoning than full-image corruption.
- LLaVA - Large Language and Vision Assistant
- Grounding DINO - Object detection and grounding
- Visual Contrastive Decoding - Mitigating object hallucinations in LVLMs
- Reefknot - A benchmark for relation hallucination evaluation, analysis, and mitigation in Multimodal Large Language Models
- R-Bench - A benchmark with image- and instance-level Yes/No questions
- Relation-Aware VCD: Adapts standard VCD by applying Gaussian noise only to specific detected objects or regions instead of the entire image (see the first sketch after this list).
- Targeted Perturbation Strategies: Uses Grounding DINO for object detection to perform single-object masking, all-object masking, inter-object region masking, and patch shuffling.
- Counterfactual Prompting: A text-side contrastive strategy that swaps the relation in the original prompt for a counterfactual one (see the second sketch after this list).
- Extended Detect-then-Calibrate (DTC): Extends the standard DTC baseline, originally limited to Yes/No questions, to Multiple Choice (MCQ) and Visual Question Answering (VQA) formats using a generalized candidate token set gathered by top-p or top-k (see the third sketch after this list).
- Datasets Evaluated: Reefknot (comprising Y/N, MCQ, and VQA splits based on Visual Genome) and the R-Bench benchmark.
- Models: LLaVA-1.5-13B (primary) and Qwen-VL-7B.
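To make the relation-aware perturbation concrete, here is a minimal sketch of the first two items above: Gaussian noise restricted to detected boxes, followed by the standard VCD logit combination. It assumes boxes are pixel-coordinate `(x1, y1, x2, y2)` tuples (e.g. from Grounding DINO, rescaled to the image size) and a HuggingFace-LLaVA-style forward pass; `perturb_regions` and `vcd_next_token_logits` are illustrative names, not functions from this repository, and VCD's adaptive plausibility constraint is omitted for brevity.

```python
import torch

def perturb_regions(image: torch.Tensor, boxes, noise_std: float = 0.3) -> torch.Tensor:
    """Add Gaussian noise only inside detected boxes (relation-aware perturbation).

    image: (C, H, W) float tensor; boxes: pixel-coordinate (x1, y1, x2, y2) tuples.
    """
    distorted = image.clone()
    for x1, y1, x2, y2 in boxes:
        region = distorted[:, y1:y2, x1:x2]
        distorted[:, y1:y2, x1:x2] = region + noise_std * torch.randn_like(region)
    return distorted

@torch.no_grad()
def vcd_next_token_logits(model, input_ids, image, boxes, alpha: float = 1.0):
    """Standard VCD combination, with region-level instead of full-image noise."""
    logits_clean = model(input_ids, images=image.unsqueeze(0)).logits[:, -1, :]
    distorted = perturb_regions(image, boxes)
    logits_noisy = model(input_ids, images=distorted.unsqueeze(0)).logits[:, -1, :]
    # Amplify tokens supported by the intact image and penalise tokens that
    # survive the perturbation (i.e. those driven by language priors alone).
    return (1 + alpha) * logits_clean - alpha * logits_noisy
```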
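The counterfactual-prompting variant applies the same contrast on the text side instead: the relation phrase is swapped for an alternative, and the logits of the two prompts are contrasted while the image is kept intact. A minimal sketch (the relation/substitute pair and the `contrastive_logits` signature are illustrative, not the repository's API):

```python
import torch

def make_counterfactual(prompt: str, relation: str, substitute: str) -> str:
    """Build the contrastive prompt by swapping the relation phrase."""
    return prompt.replace(relation, substitute)

@torch.no_grad()
def contrastive_logits(model, tokenizer, prompt, image, relation, substitute, alpha=1.0):
    """Contrast next-token logits of the original and counterfactual prompts."""
    cf_prompt = make_counterfactual(prompt, relation, substitute)
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    cf_ids = tokenizer(cf_prompt, return_tensors="pt").input_ids
    logits = model(ids, images=image).logits[:, -1, :]
    cf_logits = model(cf_ids, images=image).logits[:, -1, :]
    return (1 + alpha) * logits - alpha * cf_logits

# e.g. "Is the cat sitting on the table?" vs. "Is the cat under the table?"
```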
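For the extended DTC baseline, the key generalization is replacing the fixed {Yes, No} pair with a candidate answer-token set gathered by nucleus (top-p) selection over the first-token logits, so the same uncertainty-based detection applies to MCQ and VQA. A sketch of that gathering step with an entropy-style uncertainty score (the threshold and scoring are illustrative assumptions; DTC's subsequent calibration step is not shown):

```python
import torch
import torch.nn.functional as F

def top_p_token_set(logits: torch.Tensor, p: float = 0.9):
    """Smallest set of first-token candidates whose cumulative probability exceeds p.

    logits: 1-D tensor over the vocabulary.
    """
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    cutoff = int((sorted_probs.cumsum(-1) >= p).nonzero()[0]) + 1
    return sorted_ids[:cutoff], sorted_probs[:cutoff]

def dtc_detect(logits: torch.Tensor, p: float = 0.9, tau: float = 2.0):
    """Flag a likely hallucination when the candidate distribution is too flat."""
    ids, probs = top_p_token_set(logits, p)
    probs = probs / probs.sum()                         # renormalise over candidates
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    return entropy.item() > tau, ids                    # high entropy => low confidence
```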
- Targeted contrastive decoding strategies plateau at a ~36% hallucination rate (for LLaVA on the Reefknot Y/N split), failing to offer meaningful improvements over the base model.
- VCD variants do not match or outperform the Detect-then-Calibrate (DTC) baseline.
- Logit distribution and attention analyses reveal that pixel-level perturbations are insufficient to decouple the model's relational reasoning from its language priors.
- Effective mitigation for relation hallucinations likely requires targeting internal model mechanisms rather than corrupting visual input at inference time.
The following models, tools, and environments are necessary to reproduce the experiments of this project:
- LVLMs: Set up the environments for LLaVA-1.5-13B and Qwen-VL-7B.
- Grounding DINO: Required for the object detection and targeted perturbation steps.
- DeBERTa-v2: Used for bidirectional textual entailment when evaluating the VQA question type (see the sketch below).
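Because open-ended VQA answers rarely match the gold relation string exactly, correctness is judged by bidirectional entailment: prediction and reference must each entail the other. A minimal sketch using a DeBERTa-v2 MNLI checkpoint from HuggingFace (the exact checkpoint name and threshold are assumptions, not pinned by this repository):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CKPT = "microsoft/deberta-v2-xlarge-mnli"   # assumed; any DeBERTa-v2 NLI checkpoint works
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT).eval()

@torch.no_grad()
def entails(premise: str, hypothesis: str) -> float:
    """P(premise entails hypothesis) under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    probs = model(**inputs).logits.softmax(-1)[0]
    idx = model.config.label2id.get("ENTAILMENT", 2)   # fall back to the usual MNLI index
    return probs[idx].item()

def answers_match(pred: str, ref: str, threshold: float = 0.5) -> bool:
    """Bidirectional check: each answer must entail the other."""
    return entails(pred, ref) > threshold and entails(ref, pred) > threshold
```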
Built on Visual Genome; includes Y/N, MCQ, and VQA subsets. The following scripts run inference on Reefknot with the different methods:
- VCD (Visual Contrastive Decoding) based methods and the base model: VCD (example script)
- DTC (Detect-then-Calibrate): DTC
The generated result files can be evaluated using: Reefknot Evaluation
Specifically, we use the image-level subset containing Y/N questions. The following scripts run inference on the R-Bench benchmark with the different methods:
- VCD (Visual Contrastive Decoding) Based: VCD
- DTC (Detect-then-Calibrate): DTC
- Base Model: Base LLaVA Model
The generated result files can be evaluated using: R-Bench Evaluation Script
Running inference with large models like LLaVA-13B and contrastive decoding requires significant GPU resources. During development, the following platforms were utilized:
- Initial Development & Debugging: Lightning.ai, Kaggle (free-tier), and Google Colab.
- Full-Scale Experiments: Lichtenberg and ADA HPC clusters. Ensure you have adequate VRAM and compute limits to run the full evaluation suites.
- Keshav Agrawal
- Nico Lick
- Anusha Siddapati Mohanreddy
- Romila Singh
- Manu Thomas