VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

1University of Maryland, College Park, 2University of Southern California
*Equal Contribution
[Teaser figure]


VideoHallu is a benchmark of over 3,000 expert-crafted, counterintuitive QA pairs built on synthetic videos, evaluating MLLMs' ability to detect perceptually obvious abnormalities that are often missed due to language priors.

Abstract

Synthetic video generation has gained significant attention for its realism and broad applications, but remains prone to violations of common sense and physical laws. This highlights the need for reliable abnormality detectors that understand such principles and are robust to hallucinations. To address this, we introduce VideoHallu, a benchmark of over 3,000 video QA pairs built from synthetic videos generated by models like Veo2, Sora, and Kling, paired with expert-crafted counterintuitive QA to evaluate the critical thinking abilities of Multi-modal Large Language Models (MLLMs) on abnormalities that are perceptually obvious to humans but often hallucinated due to language priors. VideoHallu evaluates MLLMs' abnormality detection abilities with examples across alignment, consistency, commonsense, and physics. We benchmark SoTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen-2.5-VL, Video-R1, and VideoChat-R1. Although these models perform well on many real-world benchmarks such as MVBench and MovieChat, they still struggle with basic physics-based and commonsense reasoning in synthetic videos. We further show that post-training with Group Relative Policy Optimization (GRPO), using curriculum learning on datasets that combine video QA with counterintuitive commonsense and physics reasoning over real and synthetic videos, improves MLLMs' abnormality detection and critical thinking, demonstrating the value of targeted training for improving their understanding of commonsense and physical laws.

Benchmark

We design our benchmark, VideoHallu, with four question categories to probe hallucinations in synthetic video understanding, spanning basic perceptual understanding to abstract reasoning (a hypothetical sketch of a single QA item follows the category overview below):


(a) Alignment checks if the model correctly identifies and understands entities using visual and textual cues.

(b) Spatial-temporal Consistency examines whether the model can track entity motion across frames.

(c) Common Sense Reasoning tests if the model can reason about what it sees using everyday commonsense knowledge.

(d) Physics assesses if the model applies physical laws to entity motion and procedural understanding in the video.


[Benchmark overview figure]
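To make the QA format concrete, below is a minimal sketch of how a single benchmark item might be represented. The field names and the example question/answer are illustrative assumptions, not the benchmark's released schema or annotations.

```python
from dataclasses import dataclass

# Illustrative sketch only: field names and the example below are assumptions,
# not the benchmark's released schema or actual annotations.
@dataclass
class VideoHalluItem:
    video_path: str   # synthetic video (e.g., generated by Veo2, Sora, or Kling)
    gen_prompt: str   # text prompt used to generate the video
    category: str     # "alignment" | "spatial_temporal" | "commonsense" | "physics"
    question: str     # expert-crafted, often counterintuitive question
    answer: str       # ground-truth answer describing what actually happens on screen

example = VideoHalluItem(
    video_path="videos/feather_vs_rock.mp4",
    gen_prompt="A feather and a heavy rock are released at the same height "
               "and begin to fall to the ground on Earth.",
    category="commonsense",
    question="Which object reaches the ground first in the video?",
    answer="Hypothetical ground truth: describe what the generated video shows, "
           "even if it violates everyday physics (e.g., both land at the same time).",
)
```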

The Dawn of MLLMs in Synthetic Videos

We present selected cases from SoTA MLLM evaluations across each category. Hallucinations in model answers, common sense or physics violations in the videos, and other notable cues in the videos, questions, or ground truth are highlighted to aid the reader's understanding. More examples can be found in the Appendix of our paper.


Note: The legend below explains all the symbols used to represent the State-of-the-Art (SoTA) MLLMs featured in our showcases for synthetic video generation and video question-answering.

[Legend figure]

Alignment

Video Generation Prompt: A young male athlete is playing basketball on an outdoor court, performing impressive dribbling and slam dunks.


Synthetic Video:

[video]

Video Question-Answering by MLLMs:

[MLLM responses]

Spatial-temporal Consistency

Video Generation Prompt: Generate a quail and a rooster celebrating New Year.


Synthetic Video:

[video]

Video Question-Answering by MLLMs:

[MLLM responses]

Common Sense Reasoning

Video Generation Prompt: A feather and a heavy rock are released at the same height and begin to fall to the ground on Earth.


Synthetic Video:

[video]

Video Question-Answering by MLLMs:

[MLLM responses]

Physics

Video Generation Prompt: Generate the sequence showing a bullet being shot into a watermelon.


Synthetic Video:

[video]

Video Question-Answering by MLLMs:

[MLLM responses]

Evaluation of SoTA MLLMs

We evaluate a diverse set of SoTA models spanning different sizes and training strategies, reporting both overall and sub-category accuracies. Qwen2.5-VL-32B achieves the highest overall performance among all evaluated models.

[Overall results]

We evaluate SoTA MLLMs on VideoHallu, with results broken down by sub-category. From left to right, we show: (a) models under 7B parameters; (b) models with 7B–38B parameters; (c) R1 fine-tuned models; and (d) large black-box MLLMs. While many perform well on alignment tasks, they remain prone to hallucinations in reasoning-heavy tasks, with notably weaker performance on physics and commonsense reasoning.
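As a rough sketch of how the reported numbers can be aggregated, the snippet below computes overall and per-category accuracy. Here `ask_mllm` and `is_correct` are hypothetical stand-ins for the model under test and the grading rule, which may differ from the paper's actual evaluation protocol.

```python
from collections import defaultdict

def evaluate(items, ask_mllm, is_correct):
    """Compute overall and per-category accuracy over hypothetical QA records.

    `items` are dicts with "video_path", "question", "answer", and "category";
    `ask_mllm(video_path, question)` and `is_correct(prediction, answer)` are
    placeholders for the benchmarked model and the grading rule.
    """
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for item in items:
        prediction = ask_mllm(item["video_path"], item["question"])
        hit = is_correct(prediction, item["answer"])
        per_cat[item["category"]][0] += int(hit)
        per_cat[item["category"]][1] += 1

    accuracies = {cat: correct / total for cat, (correct, total) in per_cat.items()}
    total_correct = sum(correct for correct, _ in per_cat.values())
    total = sum(total for _, total in per_cat.values())
    accuracies["overall"] = total_correct / total
    return accuracies
```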

[Results by sub-category]

Fine-tuning Results

We evaluate models fine-tuned on either domain-specific sub-datasets or curriculum-based composite datasets. The results show that fine-tuning only on general real-world videos yields little to no gain on synthetic video understanding. Incorporating general physics data improves physics reasoning, and a curriculum that starts with real-world physics data and then moves to synthetic data yields a 2.8% performance boost.
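As a rough illustration of this curriculum, training data can be staged from general real-world video QA, to real-world physics and commonsense QA, to counterintuitive synthetic-video QA. The stage names, file paths, and `trainer` interface below are placeholders assumed for illustration, not the exact recipe or code used in the paper.

```python
# Hypothetical curriculum: stages ordered from general real-world QA to
# counterintuitive synthetic-video QA. Names and paths are placeholders.
CURRICULUM = [
    ("stage1_general_video_qa",       "data/real_world_video_qa.jsonl"),
    ("stage2_physics_commonsense_qa", "data/real_world_physics_qa.jsonl"),
    ("stage3_synthetic_video_qa",     "data/synthetic_counterintuitive_qa.jsonl"),
]

def run_curriculum(trainer, curriculum=CURRICULUM):
    """Fine-tune sequentially, one stage at a time (general -> targeted)."""
    for stage_name, data_path in curriculum:
        # `trainer` is a placeholder for any RL fine-tuning loop (e.g., GRPO-based).
        trainer.train(data_path=data_path, run_name=stage_name)
```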

[Fine-tuning results]

We show results for (a) previous SoTA MLLMs, (b) models fine-tuned on sub-datasets, and (c) models fine-tuned on the full dataset via curriculum learning. Compared to the baseline (Qwen2.5-VL-7B), reinforcement fine-tuning on commonsense and physics data improves models' reasoning and overall performance in synthetic video understanding.
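For reference, the core of GRPO is a group-relative advantage: several answers are sampled for the same video-question pair, and each answer's reward is normalized against the group's mean and standard deviation before the policy update. The snippet below sketches only that normalization step (the full objective also includes a clipped policy-ratio loss and a KL penalty), and the example rewards are made up.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center each sampled answer's reward by the
    group mean and scale by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up rewards for 4 answers sampled for one video-question pair
# (e.g., 1.0 if the answer matches the ground truth, else 0.0).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> approximately [1.0, -1.0, -1.0, 1.0]
```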

[Fine-tuning results by sub-category]

BibTeX

@misc{li2025videohalluevaluatingmitigatingmultimodal,
      title={VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos},
      author={Zongxia Li and Xiyang Wu and Yubin Qin and Guangyao Shi and Hongyang Du and Dinesh Manocha and Tianyi Zhou and Jordan Lee Boyd-Graber},
      year={2025},
      eprint={2505.01481},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.01481},
}