# Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [Paper]
**Abstract:** Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods that focus solely on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens that remain maximally relevant to the given image, while simultaneously refining the image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
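To give a rough feel for the idea, the sketch below rescores next-token logits by the conditional pointwise mutual information between a candidate token and the image, `log p(y | x, v) - log p(y | x)`. This is a minimal, hypothetical illustration (the function name, `alpha` weight, and toy logits are our own, not the repository's implementation): tokens whose likelihood rises when the image is present get boosted relative to tokens favored only by the language prior.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cpmi_calibrated_probs(logits_with_image, logits_text_only, alpha=1.0):
    """Rescore next-token probabilities with the conditional PMI between
    the token and the image: log p(y|x,v) - log p(y|x).
    Tokens that gain probability when the image is present are boosted."""
    log_p_v = np.log(softmax(logits_with_image) + 1e-12)   # log p(y|x,v)
    log_p = np.log(softmax(logits_text_only) + 1e-12)      # log p(y|x)
    calibrated = log_p_v + alpha * (log_p_v - log_p)       # PMI-weighted logits
    return softmax(calibrated)

# Toy example: token 0 is strongly favored by the language prior alone,
# while token 1 gains most of its support from the image.
with_img = np.array([2.5, 2.0, 0.0])
text_only = np.array([3.5, 0.5, 0.0])
p = cpmi_calibrated_probs(with_img, text_only)
```

With these toy logits, the image-conditioned distribution alone still ranks token 0 first, but the PMI calibration flips the ranking toward the image-grounded token 1.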
Before using our CMI-VLD framework, please set up the environment with the following commands:
```shell
conda env create -f environment.yml
conda activate CMI-VLD
python -m pip install -e transformers
```
Before evaluation, you need to download the following datasets and the checkpoints of the 7B base models:
- Download the MSCOCO 2014 dataset and extract it to your data path.
- Download the ShareGPT4V dataset and specify its path in `predictor_train.sh`.
- Download the LLaVA-1.5 merged 7B model and specify its path in `eval_configs/llava-1.5_eval.yaml`.
After setting up the environment, you can train the Visual Token Purifier by running:
```shell
bash predictor_train.sh
```
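At inference time, the purifier's role can be sketched as follows. This is a hypothetical illustration, not the repository's code: the function name, the `keep_ratio` parameter, and the dummy scores are our own, and we simply assume the trained predictor outputs one relevance score per image token.

```python
import numpy as np

def purify_visual_tokens(image_tokens, relevance_scores, keep_ratio=0.5):
    """Keep only the image tokens most pertinent to the generated response,
    as scored by a trained predictor; the rest are pruned before decoding."""
    n_keep = max(1, int(len(image_tokens) * keep_ratio))
    keep_idx = np.argsort(relevance_scores)[-n_keep:]  # top-n_keep by score
    keep_idx.sort()                                    # preserve spatial order
    return image_tokens[keep_idx], keep_idx

# 8 dummy image-token embeddings with predictor scores; keep the top half.
tokens = np.arange(8, dtype=float).reshape(8, 1)
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6])
kept, idx = purify_visual_tokens(tokens, scores, keep_ratio=0.5)
```

Pruning low-relevance visual tokens this way is also why the approach can preserve decoding efficiency: fewer image tokens remain in the context at each step.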
After training the Visual Token Purifier, you can run the following code to perform evaluation:
```shell
python pope_eval.py --model llava-1.5 --pope-type coco_random --use-cd --use-cmi --use-fast-v --sample --predictor your/path/to/PREDICTOR  # CMI-VLD
```
During evaluation, we combine the VTI method with our approach. To apply our method to other models, adjust the corresponding parameters accordingly. All code was evaluated on an NVIDIA A6000 GPU.
You can also use our code directly to run several other hallucination mitigation methods: Self-Introspective Decoding (SID), Visual Contrastive Decoding (VCD), Instruction Contrastive Decoding (ICD), OPERA, and VTI.
```shell
python pope_eval.py --model llava-1.5 --pope-type coco_random --use-cd --use-fast-v  # SID
python pope_eval.py --model llava-1.5 --pope-type coco_random --use-vcd --sample  # VCD
python pope_eval.py --model llava-1.5 --pope-type coco_random --use-icd --sample  # ICD
python pope_eval.py --model llava-1.5 --pope-type coco_random --vti --sample  # VTI
python pope_eval.py --model llava-1.5 --pope-type coco_random --beam 5 --opera  # OPERA
```
Evaluation with the CHAIR metric uses the same configuration.
| Argument | Example | Description |
|---|---|---|
| `--model` | `llava-1.5` | Specify the LVLM model. |
| `--data-path` | `dataset/MSCOCO/val2014` | Path to the dataset file or folder. |
| `--data-file` | `dataset/MSCOCO/` | Path to the dataset file or folder. |
| `--pope-type` | `coco_adversarial` | Type for POPE evaluation. |
| `--sample` | `store_true` | Use the modified decoding strategy. |
| `--sample-greedy` | `store_true` | Use CD with sampling and greedy decoding. |
| `--beam` | `5` | Beam search number. |
| `--opera` | `store_true` | Use OPERA. |
| `--vti` | `store_true` | Use VTI. |
Some of our code is based on the LVLM codebases of SID, VTI, OPERA, and VCD. Thanks for their excellent work!
