# Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [Paper]
**Abstract:** Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods that focus solely on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens that remain maximally relevant to the given image, while simultaneously refining the image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
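To give a rough feel for the idea, the sketch below rescores next-token logits by the conditional pointwise mutual information between a candidate token and the image, `log p(y | x, v) - log p(y | x)`. This is a minimal, hypothetical illustration (the function name, `alpha` weight, and toy logits are our own, not the repository's implementation): tokens whose likelihood rises when the image is present get boosted relative to tokens favored only by the language prior.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cpmi_calibrated_probs(logits_with_image, logits_text_only, alpha=1.0):
    """Rescore next-token probabilities with the conditional PMI between
    the token and the image: log p(y|x,v) - log p(y|x).
    Tokens that gain probability when the image is present are boosted."""
    log_p_v = np.log(softmax(logits_with_image) + 1e-12)   # log p(y|x,v)
    log_p = np.log(softmax(logits_text_only) + 1e-12)      # log p(y|x)
    calibrated = log_p_v + alpha * (log_p_v - log_p)       # PMI-weighted logits
    return softmax(calibrated)

# Toy example: token 0 is strongly favored by the language prior alone,
# while token 1 gains most of its support from the image.
with_img = np.array([2.5, 2.0, 0.0])
text_only = np.array([3.5, 0.5, 0.0])
p = cpmi_calibrated_probs(with_img, text_only)
```

With these toy logits, the image-conditioned distribution alone still ranks token 0 first, but the PMI calibration flips the ranking toward the image-grounded token 1.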
Before using our CMI-VLD framework, please set up the environment with the following commands:
```shell
conda env create -f environment.yml
conda activate CMI-VLD
python -m pip install -e transformers
```
Before evaluation, you need to download the following datasets and the checkpoints of the 7B base models:
- Download the MSCOCO 2014 dataset and extract it to your data path.
- Download the ShareGPT4V dataset and specify its path in `predictor_train.sh`.
- Download the LLaVA-1.5 merged 7B model and specify its path in `eval_configs/llava-1.5_eval.yaml`.
After setting up the environment, you can train the Visual Token Purifier by running:
```shell
bash predictor_train.sh
```
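At inference time, the purifier's role can be sketched as follows. This is a hypothetical illustration, not the repository's code: the function name, the `keep_ratio` parameter, and the dummy scores are our own, and we simply assume the trained predictor outputs one relevance score per image token.

```python
import numpy as np

def purify_visual_tokens(image_tokens, relevance_scores, keep_ratio=0.5):
    """Keep only the image tokens most pertinent to the generated response,
    as scored by a trained predictor; the rest are pruned before decoding."""
    n_keep = max(1, int(len(image_tokens) * keep_ratio))
    keep_idx = np.argsort(relevance_scores)[-n_keep:]  # top-n_keep by score
    keep_idx.sort()                                    # preserve spatial order
    return image_tokens[keep_idx], keep_idx

# 8 dummy image-token embeddings with predictor scores; keep the top half.
tokens = np.arange(8, dtype=float).reshape(8, 1)
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6])
kept, idx = purify_visual_tokens(tokens, scores, keep_ratio=0.5)
```

Pruning low-relevance visual tokens this way is also why the approach can preserve decoding efficiency: fewer image tokens remain in the context at each step.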
After training the Visual Token Purifier, you can run the following code to perform evaluation:
```shell
python pope_eval.py --model llava-1.5 --pope-type coco_random --use-cd --use-cmi --use-fast-v --sample --predictor your/path/to/PREDICTOR  # CMI-VLD
```
During evaluation, we combine the VTI method with our approach. To apply our method to other models, adjust the corresponding parameters accordingly. All code was evaluated on an NVIDIA A6000 GPU.
You can also use our code directly to run several other hallucination mitigation methods: Self-Introspective Decoding (SID), Visual Contrastive Decoding (VCD), Instruction Contrastive Decoding (ICD), OPERA, and VTI.
```shell
python pope_eval.py --model llava-1.5 --pope-type coco_random --use-cd --use-fast-v  # SID
python pope_eval.py --model llava-1.5 --pope-type coco_random --use-vcd --sample  # VCD
python pope_eval.py --model llava-1.5 --pope-type coco_random --use-icd --sample  # ICD
python pope_eval.py --model llava-1.5 --pope-type coco_random --vti --sample  # VTI
python pope_eval.py --model llava-1.5 --pope-type coco_random --beam 5 --opera  # OPERA
```
Evaluation with the CHAIR metric uses the same configuration.
| Argument | Example | Description |
|---|---|---|
| `--model` | `llava-1.5` | Specify the LVLM model. |
| `--data-path` | `dataset/MSCOCO/val2014` | Path to the dataset file or folder. |
| `--data-file` | `dataset/MSCOCO/` | Path to the dataset file or folder. |
| `--pope-type` | `coco_adversarial` | Type for POPE evaluation. |
| `--sample` | `store_true` | Use the modified decoding strategy. |
| `--sample-greedy` | `store_true` | Use CD with sampling and greedy decoding. |
| `--beam` | `5` | Beam search number. |
| `--opera` | `store_true` | Use OPERA. |
| `--vti` | `store_true` | Use VTI. |
Some of our code is based on the LVLM codebases of SID, VTI, OPERA, and VCD. Thanks for their excellent work!
