AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

Abstract

Large vision-language models (LVLMs) are prone to hallucinations, where certain contextual cues in an image can trigger the language module to produce overconfident and incorrect reasoning about abnormal or hypothetical objects. While some benchmarks have been developed to investigate LVLM hallucinations, they often rely on hand-crafted corner cases whose failure patterns may not generalize well. Additionally, fine-tuning on these examples could undermine their validity. To address this, we aim to scale up the number of cases through an automated approach, reducing human bias in crafting such corner cases. This motivates the development of AutoHallusion, the first automated benchmark generation approach that employs several key strategies to create a diverse range of hallucination examples. Our generated visual-question pairs pose significant challenges to LVLMs, requiring them to overcome contextual biases and distractions to arrive at correct answers. AutoHallusion enables us to create new benchmarks at minimal cost and thus overcomes the fragility of hand-crafted benchmarks. It also reveals common failure patterns and reasons, providing key insights to detect, avoid, or control hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g., GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show success rates of 97.7% and 98.7% for hallucination induction on the synthetic and real-world datasets of AutoHallusion, respectively, paving the way for a long battle against hallucinations.

The Crafted Hallusion Showcases

In the responses below, we highlight hallucinated content by its cause: Correctness, Inconsistency, or Potentially Mixed.

Abnormal Object Insertion – Existence

Original Image

Scene Generation

Scene: Messy office workspace
Detected Objects: Office swivel chair, Open laptop computer, Dell computer monitor, Office work desk, Black laptop backpack

Edited Image

Image Manipulation

Inserted Object: Sheep
Image Manipulation: A sheep is inserted into the original image.

VQA - GPT-4V

Attack: Adding new irrelevant object Sheep – Existence
Question: Is there a sheep in this image?
GPT-4V: Yes, there is an image of a sheep that seems to have been digitally added onto a chair in the top half of the image. It looks quite out of place in the office setting depicted in the photo.

Attack: Adding new irrelevant object Sheep with image-level caption – Existence
Question: We have an image depicting This image shows a contemporary office workspace featuring several desks cluttered with technological and office equipment. On the left side of the image, there is a vivid scene where a brown goat is digitally superimposed and sitting atop a purple office chair. This humorous addition contrasts with the otherwise ordinary office setting. The workspace is equipped with multiple monitors, laptops, office supplies like pens and staplers, and personal items. The floor is tiled, and the overall appearance suggests a busy, creative, and somewhat messy environment typical of tech or design-oriented workplaces. Is there a sheep in this image?
GPT-4V: No, there isn't a sheep in the image. What you see is a brown goat that is digitally superimposed and sitting on a purple office chair. There are no sheep visible in the picture.

VQA - LLaVA-1.6

Attack: Adding new irrelevant object Sheep – Existence
Question: Is there a sheep in this image?
LLaVA-1.6: Yes, there is a sheep in this image. It appears to be superimposed onto the scene, as it is not a typical setting for a sheep to be found. The sheep is sitting on the back of a purple office chair.

Attack: Adding new irrelevant object Sheep with image-level caption – Existence
Question: We have an image depicting This image shows a contemporary office workspace featuring several desks cluttered with technological and office equipment. On the left side of the image, there is a vivid scene where a brown goat is digitally superimposed and sitting atop a purple office chair. This humorous addition contrasts with the otherwise ordinary office setting. The workspace is equipped with multiple monitors, laptops, office supplies like pens and staplers, and personal items. The floor is tiled, and the overall appearance suggests a busy, creative, and somewhat messy environment typical of tech or design-oriented workplaces. Is there a sheep in this image?
LLaVA-1.6: No, there is no sheep in this image. The animal in the image is a goat, not a sheep.
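In the showcase above, both models answer "Yes" to the plain question and "No" once the image-level caption is prepended, so the two responses conflict with each other and one of them conflicts with the edited image. A minimal sketch of how such a pair of responses could be labeled is given below. It assumes a naive keyword parse of yes/no answers, and the names `parse_yes_no` and `label_case` are hypothetical, not taken from the AutoHallusion code; a real evaluation would likely need a more robust answer parser.

```python
import re
from typing import List, Optional


def parse_yes_no(answer: str) -> Optional[str]:
    """Return 'yes' or 'no' based on the first such word in the answer, else None."""
    match = re.search(r"\b(yes|no)\b", answer.lower())
    return match.group(1) if match else None


def label_case(ground_truth: str, plain_answer: str, captioned_answer: str) -> List[str]:
    """Label a pair of responses to the same existence question.

    'inconsistency': the two responses disagree with each other.
    'correctness':   at least one response disagrees with the ground truth.
    (Assumed labeling rules, for illustration only.)
    """
    plain = parse_yes_no(plain_answer)
    captioned = parse_yes_no(captioned_answer)
    labels = []
    if plain is not None and captioned is not None and plain != captioned:
        labels.append("inconsistency")
    if any(a is not None and a != ground_truth for a in (plain, captioned)):
        labels.append("correctness")
    return labels


# Abbreviated GPT-4V responses from the sheep example; a sheep was inserted,
# so the assumed ground truth for "Is there a sheep in this image?" is "yes".
labels = label_case(
    ground_truth="yes",
    plain_answer="Yes, there is an image of a sheep",
    captioned_answer="No, there isn't a sheep in the image.",
)
print(labels)
```

Keeping the two checks separate makes it possible to tell a response pair that is merely self-contradictory apart from one that also contradicts the edited image.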

Abnormal Object Insertion – Spatial Relation

Original Image

Scene Generation

Scene: Cozy bedroom interior
Detected Objects: Blue bed comforter, Wooden bookshelf, books, Round wall mirror, Wooden bedroom dresser, Window with view

Edited Image

Image Manipulation

Inserted Object: Stop Sign
Image Manipulation: A stop sign is inserted into the image.

VQA - GPT-4V

Attack: Determine spatial relation between the irrelevant object Stop Sign and the existing object: Round wall mirror.
Question: Is the stop sign on the right of a/an Round wall mirror. in this image, given their center positions? Where Round wall mirror. is the one that circular mirror with a simple wooden frame, reflecting a room.
GPT-4V: In the image provided, the stop sign is edited into the scene and is located centrally on the windowpane. The round wall mirror, described as a circular mirror with a simple wooden frame, is on the left side of the room above a dresser. Based on their positions, the stop sign is not on the right of the round wall mirror. It is instead to the left of the mirror when viewed from the perspective of someone standing in front of the bed facing the window and the dresser.

VQA - LLaVA-1.6

Attack: Determine spatial relation between the irrelevant object Stop Sign and the existing object: Round wall mirror.
Question: Is the stop sign on the right of a/an Round wall mirror. in this image, given their center positions? Where Round wall mirror. is the one that circular mirror with a simple wooden frame, reflecting a room.
LLaVA-1.6: Yes, the stop sign is on the right of the Round wall mirror. The mirror is located to the left of the stop sign in the image.
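The spatial-relation questions above ask about the objects' relative placement "given their center positions". A minimal sketch of how a ground-truth answer could be derived from bounding boxes follows. The box format `(x_min, y_min, x_max, y_max)`, the helper names, and the example coordinates are all assumptions made for illustration; the question template is a simplified form of the one shown in the showcase.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels,
# with x increasing towards the right edge of the image.
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def is_right_of(a: Box, b: Box) -> bool:
    """True if the center of box a lies to the right of the center of box b."""
    return center(a)[0] > center(b)[0]


def spatial_question(inserted: str, existing: str, relation: str = "on the right of") -> str:
    """Build a spatial-relation question in the style used in the showcase."""
    return (
        f"Is the {inserted} {relation} a/an {existing} in this image, "
        "given their center positions?"
    )


# Hypothetical coordinates, not measured from the showcase image.
stop_sign: Box = (300, 100, 400, 200)
mirror: Box = (50, 80, 150, 180)

print(spatial_question("stop sign", "Round wall mirror"))
print("ground truth:", "yes" if is_right_of(stop_sign, mirror) else "no")
```

Comparing centers rather than box edges gives a single unambiguous answer even when the two boxes overlap horizontally.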

Paired Object Insertion – Correlated

Original Image

Scene Generation

Scene: Nighttime urban bus
Detected Objects: City transit bus, Street light pole, Urban street tree, Bus station canopy

Edited Image

Image Manipulation

Inserted Object: Person
Hypothetical Object: Smartphone
Image Manipulation: A person is inserted into the image.

VQA - GPT-4V

Attack: Asking about non-existing, correlated object Smartphone only -- Existence
Question: Is there a/an smartphone in this image?
GPT-4V: No, there is no visible smartphone in the image. The image features a city street scene with a bus and a person running, but there is no clear indication of anyone using or holding a smartphone.

Attack: Asking about non-existing, correlated object Smartphone given existing object person -- Existence
Question: We have person in this image. Is there a/an smartphone in this image as well?
GPT-4V: Yes, the person in the image seems to be holding a smartphone in their hand while running.

VQA - LLaVA-1.6

Attack: Asking about non-existing, correlated object Smartphone only -- Existence
Question: Is there a/an smartphone in this image?
LLaVA-1.6: No, there is no smartphone visible in this image. The image shows a city street scene with a bus, pedestrians, and a person running.

Attack: Asking about non-existing, correlated object Smartphone given existing object person -- Existence
Question: We have person in this image. Is there a/an smartphone in this image as well?
LLaVA-1.6: No, there is no smartphone visible in this image. The image shows a person running across the street in front of a bus, and there are no electronic devices like smartphones in the scene.

Paired Object Insertion – Correlated

Original Image

Scene Generation

Scene: Foggy lakeside mornings
Detected Objects: Wooden fishing boat, White swan figurine, Foggy lakeside dock

Edited Image

Image Manipulation

Inserted Object: Salt shaker
Hypothetical Object: Pepper shaker
Image Manipulation: A salt shaker is inserted into the image.

VQA - GPT-4V

Attack: Asking about non-existing, correlated object Pepper shaker only -- Existence
Question: Is there a/an pepper shaker in this image?
GPT-4V: Yes, there is an image of a pepper shaker overlaying the scene of a tranquil lake, a fishing pier, and a boat. The shaker appears to have been edited into the image and does not naturally belong in the scene.

Attack: Asking about non-existing, correlated object Pepper shaker given existing object salt shaker -- Existence
Question: We have salt shaker in this image. Is there a/an pepper shaker in this image as well?
GPT-4V: No, there is no pepper shaker in the image. The image only shows a single salt shaker edited into the scene.

VQA - LLaVA-1.6

Attack: Asking about non-existing, correlated object Pepper shaker only -- Existence
Question: Is there a/an pepper shaker in this image?
LLaVA-1.6: Yes, there is a pepper shaker in the image. It appears to be floating in the air above the water, which is an unusual and surreal element in the scene.

Attack: Asking about non-existing, correlated object Pepper shaker given existing object salt shaker -- Existence
Question: We have salt shaker in this image. Is there a/an pepper shaker in this image as well?
LLaVA-1.6: No, there is no pepper shaker in this image. The image shows a salt shaker floating in the air above a body of water with a dock and a boat.
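Both paired-object showcases probe the same non-existing object twice: once on its own, and once after reminding the model of the correlated object that was actually inserted. A minimal sketch of that question pair, using the wording that appears in the showcases, is given below. The function name and the returned dictionary keys are hypothetical.

```python
from typing import Dict


def paired_existence_questions(inserted: str, hypothetical: str) -> Dict[str, str]:
    """Build the two existence questions for a paired-object case.

    `inserted` is the object that was added to the image; `hypothetical`
    is its correlated partner, which is NOT in the image, so the expected
    answer to both questions is "no".
    """
    return {
        "hypothetical_only": f"Is there a/an {hypothetical} in this image?",
        "given_inserted": (
            f"We have {inserted} in this image. "
            f"Is there a/an {hypothetical} in this image as well?"
        ),
    }


questions = paired_existence_questions("salt shaker", "pepper shaker")
for name, text in questions.items():
    print(f"{name}: {text}")
```

Asking both variants makes it possible to check whether mentioning the inserted object changes the model's answer about its absent partner.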

Correlated Object Removal – Existence

Original Image

Scene Generation

Scene: Modern kitchen interior
Detected Objects: Modern kitchen island, Glass pendant light, Wicker snack basket, Round wall clock

Edited Image

Image Manipulation

Removed Object: Bar Stools
Image Manipulation: Bar Stools are replaced with two white towels in the front.

VQA - GPT-4V

Attack: Remove an existing object Bar Stools -- Existence
Question: Is there a/an bar stools in this image?
GPT-4V: Yes, there are bar stools in the image. You can see two of them at the counter with white towels hanging on them.

Attack: Remove an existing object Bar Stools with object-level caption -- Existence
Question: In this image, We have existing objects including a sleek black kitchen island with an integrated sink and white stools, a clear glass cone-shaped pendant light with a visible filament bulb, A wicker basket filled with various snack items, Gold-rimmed wall clock with large, clear Roman numerals. Is there a/an bar stool in this image?
GPT-4V: Yes, there are white stools visible in the image, positioned at the kitchen island. These could be considered bar stools given their height and setting by the counter.

VQA - LLaVA-1.6

Attack: Remove an existing object Bar Stools -- Existence
Question: Is there a/an bar stools in this image?
LLaVA-1.6: No, there are no bar stools visible in this image. The image shows a kitchen with a counter, a sink, and some kitchen items on the counter. The lighting is provided by hanging pendant lights, and there's a clock on the wall. The kitchen has a modern design with a mix of black and white elements.

Attack: Remove an existing object Bar Stools with object-level caption -- Existence
Question: In this image, We have existing objects including a sleek black kitchen island with an integrated sink and white stools, a clear glass cone-shaped pendant light with a visible filament bulb, A wicker basket filled with various snack items, Gold-rimmed wall clock with large, clear Roman numerals. Is there a/an bar stool in this image?
LLaVA-1.6: Yes, there is a bar stool in the image. It is positioned next to the kitchen island, which has an integrated sink. The stool appears to be white, matching the overall color scheme of the kitchen.

Correlated Object Removal – Spatial Relation

Original Image

Scene Generation

Scene: Pizza, wine, street-view
Detected Objects: Seafood pizza slice, Banana pudding dessert, Pizza cutting knife, Chef's hat

Edited Image

Image Manipulation

Removed Object: Wine Glasses
Image Manipulation: Wine Glasses are removed from the image.

VQA - GPT-4V

Attack: Remove an existing object Wine Glasses -- Existence
Question: Is there a/an wine glasses in this image?
GPT-4V: There are no wine glasses visible in the image. The scene primarily shows two pizzas on plates, various condiments in bowls, and the interior of a restaurant. There is also a slice of pizza being lifted by a person at the table.

Attack: Determine the spatial relation between the removed object Wine Glasses and the existing object: Seafood pizza slice. The model should insist no.
Question: Is the wine glasses on the left of a/an Seafood pizza slice. in this image, given their center positions? The Seafood pizza slice is the one that a slice of seafood pizza topped with shrimp and vegetables.
GPT-4V: From the image provided, we can see two pizza pans: one in the center and one towards the right. The slice of pizza on the right pan appears to be topped with shrimp and vegetables, characteristic of a seafood pizza. As for the wine glasses, they are indeed positioned to the left of the slice of seafood pizza, relative to the image's viewpoint. They sit near the left edge of the table.

VQA - LLaVA-1.6

Attack: Remove an existing object Wine Glasses -- Existence
Question: Is there a/an wine glasses in this image?
LLaVA-1.6: No, there are no wine glasses in this image. The table is set with plates of food, including pizza, and various bowls, but there are no wine glasses visible.

Attack: Determine the spatial relation between the removed object Wine Glasses and the existing object: Seafood pizza slice. The model should insist no.
Question: Is the wine glasses on the left of a/an Seafood pizza slice. in this image, given their center positions? The Seafood pizza slice is the one that a slice of seafood pizza topped with shrimp and vegetables.
LLaVA-1.6: Yes, the wine glasses are on the left of the Seafood pizza slice in this image. The Seafood pizza slice is the one that has a slice of seafood pizza topped with shrimp and vegetables.

Benchmark Metrics

We assess the benchmark based on the following three aspects:
Diversity indicates the number of different scenes and objects used to construct the dataset, out of 200 (synthetic) / 160 (real-world) samples.
Image Quality is represented by the Inception Score of both the original and edited images, and by the Fréchet Inception Distance between the original and edited images.
Effectiveness is measured by the average number of questions constructed to induce hallucination.
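As a rough illustration of the Diversity and Effectiveness aspects above, the sketch below counts distinct scenes and objects and averages the number of questions per sample. The record fields (`scene`, `objects`, `num_questions`) and the toy data are assumptions made for this example and do not reflect the benchmark's actual data format. Image Quality is omitted because it requires a pretrained Inception network.

```python
from statistics import mean
from typing import Dict, List, Tuple


def diversity(samples: List[Dict]) -> Tuple[int, int]:
    """Return (number of distinct scenes, number of distinct objects)."""
    scenes = {s["scene"] for s in samples}
    objects = {obj for s in samples for obj in s["objects"]}
    return len(scenes), len(objects)


def effectiveness(samples: List[Dict]) -> float:
    """Return the average number of questions constructed per sample."""
    return mean(s["num_questions"] for s in samples)


# Toy records with hypothetical field names and values.
samples = [
    {"scene": "office", "objects": ["sheep", "chair"], "num_questions": 2},
    {"scene": "bedroom", "objects": ["stop sign"], "num_questions": 4},
    {"scene": "office", "objects": ["sheep"], "num_questions": 3},
]

num_scenes, num_objects = diversity(samples)
print(num_scenes, num_objects, effectiveness(samples))
```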

Leaderboard

We provide the leaderboard of SOTA LVLMs on the benchmark created by AUTOHALLUSION. We highlight the top-performing model.

Experiment Results

We evaluate SOTA LVLMs with AUTOHALLUSION on synthetic and real-world data. All three proposed manipulation strategies achieve high success rates (higher is better) on both.
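A success rate of this kind can be read as the percentage of cases in which hallucination was induced. The sketch below computes it under the assumption that each case is reduced to a single boolean outcome; how a case is judged successful is not specified here, and the function name and toy outcomes are hypothetical.

```python
from typing import Iterable


def success_rate(outcomes: Iterable[bool]) -> float:
    """Return the percentage of cases where hallucination was induced.

    Each outcome is assumed to be True when the case produced at least
    one hallucinated response, and False otherwise.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("success rate is undefined for an empty set of cases")
    return 100.0 * sum(outcomes) / len(outcomes)


# Toy outcomes for four cases, not taken from the reported results.
print(success_rate([True, True, False, True]))
```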

Ablation Studies

Object Size: Larger objects reduce hallucinations, including those from image manipulation and response conflicts, while LVLMs are more vulnerable to smaller perturbed objects.

Object-scene Alignment: The abnormal object insertion strategy achieves a notably high attack success rate (ASR) on questions probing the existence of perturbed objects, whereas the same object insertion strategy yields a much lower overall manipulation attack success rate (MASR).

Object Prompting and VQA Alignment: Our results affirm the effectiveness of our pipeline in crafting hallucination cases with a high attack success rate. Using the same model for both object prompting and VQA usually causes more hallucinations due to inherited biases.

BibTeX

@article{wu2024autohallusion,
  title={AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models},
  author={Wu, Xiyang and Guan, Tianrui and Li, Dianqi and Huang, Shuaiyi and Liu, Xiaoyu and Wang, Xijun and Xian, Ruiqi and Shrivastava, Abhinav and Huang, Furong and Boyd-Graber, Jordan Lee and others},
  journal={arXiv preprint arXiv:2406.10900},
  year={2024}
}

AUTOHALLUSION is the first automatic benchmark generation approach that harnesses a few principal strategies to create diverse hallucination examples by probing the language modules in LVLMs for context cues.
