Original Image
Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning on abnormal or hypothetical objects. Though a few benchmarks have been developed to investigate LVLM hallucinations, they mainly rely on hand-crafted corner cases whose fail patterns may hardly generalize, and finetuning on them could undermine their validity. These motivate us to develop the first automatic benchmark generation approach, AUTOHALLUSION, that harnesses a few principal strategies to create diverse hallucination examples. It probes the language modules in LVLMs for context cues and uses them to synthesize images by: (1) adding objects abnormal to the context cues; (2) for two co-occurring objects, keeping one and excluding the other; or (3) removing objects closely tied to the context cues. It then generates image-based questions whose ground-truth answers contradict the language module's prior. A model has to overcome contextual biases and distractions to reach correct answers, while incorrect or inconsistent answers indicate hallucinations. AUTOHALLUSION enables us to create new benchmarks at the minimum cost and thus overcomes the fragility of hand-crafted benchmarks. It also reveals common failure patterns and reasons, providing key insights to detect, avoid, or control hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g., GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show a 97.7% and 98.7% success rate of hallucination induction on synthetic and real-world datasets of AUTOHALLUSION, paving the way for a long battle against hallucinations.
We highlight hallucination context made by Correctness, Inconsistency, or Potentially Mixed.
Original Image
Scene Generation
Scene: Messy office workspace
Detected Objects: Office swivel chair, Open laptop computer, Dell computer monitor,
Office work desk, Black laptop backpack
Edited Image
Image Manipulation
Inserted Object: Sheep
Image Manipulation: Sheep is inserted to the original image.
VQA - GPT-4V
Attack: Adding new irrelevant object Sheep – Existence
Question: Is there a sheep in this image?
GPT-4V: Yes, there is an image of a sheep that seems to
have been digitally added onto a chair in the top half of the image. It looks quite out of place in the
office setting depicted in the photo.
Attack: Adding new irrelevant object Sheep with image-level caption – Existence
Question: We have an image depicting This image shows a contemporary office workspace featuring
several desks cluttered with technological and office equipment. On the left side of the image, there is a
vivid scene where a brown goat is digitally superimposed and sitting atop a purple office chair. This
humorous addition contrasts with the otherwise ordinary office setting. The workspace is equipped with
multiple monitors, laptops, office supplies like pens and staplers, and personal items. The floor is tiled,
and the overall appearance suggests a busy, creative, and somewhat messy environment typical of tech or
design-oriented workplaces. Is there a sheep in this image?
GPT-4V: No, there isn't a sheep in the image.
What you see is a brown goat that is digitally superimposed and sitting on a purple office chair.
There are no sheep visible in the picture.
VQA - LLaVA-1.6
Attack: Adding new irrelevant object Sheep – Existence
Question: Is there a sheep in this image?
LLaVA-1.6: Yes, there is a sheep in this image.
It appears to be superimposed onto the scene, as it is not a typical setting for a sheep to be found.
The sheep is sitting on the back of a purple office chair.
Attack: Adding new irrelevant object Sheep with image-level caption – Existence
Question: We have an image depicting This image shows a contemporary office workspace featuring
several desks cluttered with technological and office equipment. On the left side of the image, there is a
vivid scene where a brown goat is digitally superimposed and sitting atop a purple office chair. This
humorous addition contrasts with the otherwise ordinary office setting. The workspace is equipped with
multiple monitors, laptops, office supplies like pens and staplers, and personal items. The floor is tiled,
and the overall appearance suggests a busy, creative, and somewhat messy environment typical of tech or
design-oriented workplaces. Is there a sheep in this image?
LLaVA-1.6: No, there is no sheep in this image.
The animal in the image is a goat, not a sheep.
Original Image
Scene Generation
Scene: Cozy bedroom interior
Detected Objects: Blue bed comforter, Wooden bookshelf, books, Round wall mirror,
Wooden bedroom dresser, Window with view
Edited Image
Image Manipulation
Inserted Object: Stop Sign
Image Manipulation: Stop Sign is inserted to the image.
VQA - GPT-4V
Attack: Determine spatial relation between the irrelevant object Stop Sign
and the existing object: Round wall mirror.
Question: Is the stop sign on the right of a/an Round wall mirror. in this image, given their center
positions? Where Round wall mirror. is the one that circular mirror with a simple wooden frame, reflecting a room.
GPT-4V: In the image provided, the stop sign is edited into the scene and is located centrally on
the windowpane. The round wall mirror, described as a circular mirror with a simple wooden frame, is on the
left side of the room above a dresser. Based on their
positions, the stop sign is not on the right of the round wall mirror. It is instead to the left of the
mirror when viewed from the perspective of someone standing in front of the bed facing the window
and the dresser.
VQA - LLaVA-1.6
Attack: Determine spatial relation between the irrelevant object Stop Sign
and the existing object: Round wall mirror.
Question: Is the stop sign on the right of a/an Round wall mirror. in this image, given their center
positions? Where Round wall mirror. is the one that circular mirror with a simple wooden frame, reflecting a room.
LLaVA-1.6: Yes, the stop sign is on the right of the Round wall mirror. The mirror is located
to the left of the stop sign in the image.
Original Image
Scene Generation
Scene: Nighttime urban bus
Detected Objects: City transit bus, Street light pole, Urban street tree,
Bus station canopy
Edited Image
Image Manipulation
Inserted Object: Person
Hypothetical Object: Smartphone
Image Manipulation: Person is inserted to the image.
VQA - GPT-4V
Attack: Asking about non-existing, correlated object Smartphone only -- Existence
Question: Is there a/an smartphone in this image?
GPT-4V: No, there is no visible smartphone in the image.
The image features a city street scene with a bus and a person running, but there is no clear indication
of anyone using or holding a smartphone.
Attack: Asking about non-existing, correlated object Smartphone given
existing object person -- Existence
Question: We have person in this image. Is there a/an smartphone in this image as well?
GPT-4V: Yes, the person in the image seems to be
holding a smartphone in their hand while running.
VQA - LLaVA-1.6
Attack: Asking about non-existing, correlated object Smartphone only -- Existence
Question: Is there a/an smartphone in this image?
LLaVA-1.6: No, there is no smartphone visible in this image. The image shows a city street scene
with a bus, pedestrians, and a person running.
Attack: Asking about non-existing, correlated object Smartphone given
existing object person -- Existence
Question: We have person in this image. Is there a/an smartphone in this image as well?
LLaVA-1.6: No, there is no smartphone visible in this image. The image shows a person running
across the street in front of a bus, and there are no electronic devices like smartphones in the scene.
Original Image
Scene Generation
Scene: Foggy lakeside mornings
Detected Objects: Wooden fishing boat, White swan figurine, Foggy lakeside dock
Edited Image
Image Manipulation
Inserted Object: Salt shaker
Hypothetical Object: Pepper shaker
Image Manipulation: Salt shaker is inserted to the image.
VQA - GPT-4V
Attack: Asking about non-existing, correlated object Pepper shaker only -- Existence
Question: Is there a/an pepper shaker in this image?
GPT-4V: Yes, there is an image of a pepper shaker
overlaying the scene of a tranquil lake, a fishing pier, and a boat. The shaker appears to have been edited
into the image and does not naturally belong in the scene.
Attack: Asking about non-existing, correlated object Pepper shaker given existing
object salt shaker -- Existence
Question: We have salt shaker in this image. Is there a/an pepper shaker in this image as well?
GPT-4V: No, there is no pepper shaker in the image.
The image only shows a single salt shaker edited into the scene.
VQA - LLaVA-1.6
Attack: Asking about non-existing, correlated object Pepper shaker only -- Existence
Question: Is there a/an pepper shaker in this image?
LLaVA-1.6: Yes, there is a pepper shaker in the image.
It appears to be floating in the air above the water, which is an unusual and surreal element in the scene.
Attack: Asking about non-existing, correlated object Pepper shaker given existing
object salt shaker -- Existence
Question: We have salt shaker in this image. Is there a/an pepper shaker in this image as well?
LLaVA-1.6: No, there is no pepper shaker in this image.
The image shows a salt shaker floating in the air above a body of water with a dock and a boat.
Original Image
Scene Generation
Scene: Modern kitchen interior
Detected Objects: Modern kitchen island, Glass pendant light, Wicker snack basket,
Round wall clock
Edited Image
Image Manipulation
Removed Object: Bar Stools
Image Manipulation: Bar Stools are replaced with two white towels in the front.
VQA - GPT-4V
Attack: Remove an existing object Bar Stools -- Existence
Question: Is there a/an bar stools in this image?
GPT-4V: Yes, there are bar stools in the image.
You can see two of them at the counter with white towels hanging on them.
Attack: Remove an existing object Bar Stools with object-level caption -- Existence
Question: In this image, We have existing objects including a sleek black kitchen island with an
integrated sink and white stools, a clear glass cone-shaped pendant light with a visible filament bulb, A
wicker basket filled with various snack items, Gold-rimmed wall clock with large, clear Roman numerals.
Is there a/an bar stool in this image?
GPT-4V: Yes, there are white stools visible in the image,
positioned at the kitchen island. These could be considered bar stools given their height and setting by the counter.
VQA - LLaVA-1.6
Attack: Remove an existing object Bar Stools -- Existence
Question: Is there a/an bar stools in this image?
LLaVA-1.6: No, there are no bar stools visible in this
image. The image shows a kitchen with a counter, a sink, and some kitchen items on the counter. The
lighting is provided by hanging pendant lights, and there's a clock on the wall. The kitchen has a modern
design with a mix of black and white elements.
Attack: Remove an existing object Bar Stools with object-level caption -- Existence
Question: In this image, We have existing objects including a sleek black kitchen island with an
integrated sink and white stools, a clear glass cone-shaped pendant light with a visible filament bulb, A
wicker basket filled with various snack items, Gold-rimmed wall clock with large, clear Roman numerals.
Is there a/an bar stool in this image?
LLaVA-1.6: Yes, there is a bar stool in the image.
It is positioned next to the kitchen island, which has an integrated sink. The stool appears to be white,
matching the overall color scheme of the kitchen.
Original Image
Scene Generation
Scene: Pizza, wine, street-view
Detected Objects: Seafood pizza slice, Banana pudding dessert, Pizza cutting knife,
Chef's hat
Edited Image
Image Manipulation
Removed Object: Wine Glasses
Image Manipulation: Wine Glasses are removed from the image.
VQA - GPT-4V
Attack: Remove an existing object Wine Glasses -- Existence
Question: Is there a/an wine glasses in this image?
GPT-4V: There are no wine glasses visible in the image.
The scene primarily shows two pizzas on plates, various condiments in bowls, and the interior of a
restaurant. There is also a slice of pizza being lifted by a person at the table.
Attack: Determine the spatial relation between the removed object Wine Glasses and the
existing object: Seafood pizza slice. The model should insist no.
Question: Is the wine glasses on the left of a/an Seafood pizza slice. in this image, given their
center positions? The Seafood pizza slice is the one that a slice of seafood pizza topped with
shrimp and vegetables.
GPT-4V: From the image provided, we can see two pizza pans: one in the center and one towards the
right. The slice of pizza on the right pan appears to be topped with shrimp and vegetables, characteristic of
a seafood pizza. As for the wine glasses, they are indeed positioned
to the left of the slice of seafood pizza, relative to the image's viewpoint. They sit near the
left edge of the table.
VQA - LLaVA-1.6
Attack: Remove an existing object Wine Glasses -- Existence
Question: Is there a/an wine glasses in this image?
LLaVA-1.6: No, there are no wine glasses in this image.
The table is set with plates of food, including pizza, and various bowls, but there are no wine glasses visible.
Attack: Determine the spatial relation between the removed object Wine Glasses and the
existing object: Seafood pizza slice. The model should insist no.
Question: Is the wine glasses on the left of a/an Seafood pizza slice. in this image, given their
center positions? The Seafood pizza slice is the one that a slice of seafood pizza topped with
shrimp and vegetables.
LLaVA-1.6: Yes, the wine glasses are on the left of
the Seafood pizza slice in this image. The Seafood pizza slice is the one that has a slice of seafood
pizza topped with shrimp and vegetables.
We use the following metrics to evalute our results, including: Attack Success Rate (ASR), Manipulation Attack Success Rate (MASR) (Compare the generated response with the ground truth generated based on the intention of the image generation and editing), and Conflict Attack Success Rate (CASR) (Ask a set of questions and try to find conflicts among all responses to those visual questions).
We conduct evaluation experiments over SOTA LVLMs with our AUTOHALLUSION on synthetic and real-world data. Our proposed three manipulation strategies achieved high success rates (the higher the better) on synthetic and real-world data.
Object Size: Larger objects reduce hallucinations, including those from image manipulation and response conflicts, while LVLMs are more vulnerable to smaller perturbed objects.
Object-scene Alignment: Results show that the abnormal object insertion strategy shows a significantly high attack success rate (ASR) over questions probing the existence of perturbed objects, and the same object insertion strategy shows a greatly lower overall manipulation attack success rate (MASR).
Object Prompting and VQA Alignment: Our results affirm the effectiveness of our pipeline in crafting hallucination cases with a high attack success rate, while using the same model for object prompting and VQA tasks usually causes more hallucinations due to inherited biases.
@article{wu2024autohallusion,
title={AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models},
author={Wu, Xiyang and Guan, Tianrui and Li, Dianqi and Huang, Shuaiyi and Liu, Xiaoyu and Wang, Xijun and Xian, Ruiqi and Shrivastava, Abhinav and Huang, Furong and Boyd-Graber, Jordan Lee and others},
journal={arXiv preprint arXiv:2406.10900},
year={2024}
}