VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos

¹University of Maryland, College Park, ²University of Southern California
*Equal Contribution
[Figure: teaser image]


VideoHallu is a synthetic video benchmark with QA pairs requiring human-level reasoning. Evaluating SoTA MLLMs on it, and post-training them on commonsense/physics data, shows that such data improves model reasoning.

Abstract

Synthetic video generation using foundation models has gained significant attention due to its realism and broad applications. However, while these models excel at generating visually coherent, high-quality video frames, they often ignore commonsense and physical laws, producing abnormal content. Existing score-based evaluations such as VideoScore focus mainly on general video quality: they do not account for these abnormalities and offer no explanation of their results. A more promising approach is to use multi-modal large language models (MLLMs) as interpretable video evaluators, following the approach of FActScore. However, how well MLLMs can detect these abnormalities in synthetic videos remains underexplored.

Motivated by the need for more interpretable evaluation of video generation, we introduce VideoHallu, a benchmark built from synthetic videos produced by popular models such as Sora, Veo2, and Kling, paired with expert-crafted question-answer pairs that are easily solvable with human-level perception and reasoning across multiple categories. We evaluate several State-of-the-Art (SoTA) MLLMs on our benchmark, including GPT-4o, Gemini-2.5-Pro, Qwen2.5-VL, and cutting-edge models such as Video-R1 and VideoChat-R1. Despite the strong performance of R1 MLLMs on real-world video benchmarks like MVBench and MovieChat, these models still struggle and hallucinate on basic commonsense and physics reasoning tasks in synthetic videos, highlighting synthetic video hallucination as an underexplored challenge.

Moreover, we post-train a current SoTA MLLM, Qwen2.5-VL-7B, with Group Relative Policy Optimization (GRPO) using both real-world and synthetic commonsense/physics datasets. The post-trained model improves overall accuracy over the base model and achieves the highest performance among all models, highlighting the importance of integrating high-quality counterexamples to enhance commonsense and physics reasoning in MLLMs' language priors.
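
As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several sampled answers to the same question against their group's mean and standard deviation. It is not the training code used for VideoHallu: the exact-match reward, the function names, and the `eps` constant are all assumptions made for this example.

```python
from statistics import mean, pstdev


def correctness_reward(answer: str, ground_truth: str) -> float:
    """Hypothetical reward: 1.0 for a case-insensitive exact match, else 0.0."""
    return 1.0 if answer.strip().lower() == ground_truth.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar per sampled answer to the same question.
    `eps` (an assumed constant) avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four sampled answers to one yes/no question (made-up data).
sampled_answers = ["Yes", "no", "No", "yes"]
rewards = [correctness_reward(a, "yes") for a in sampled_answers]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.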

Benchmark

We design our benchmark, VideoHallu, around four question categories aimed at probing hallucinations in synthetic video understanding, organized by the level of reasoning required from MLLMs to perform video-question answering in practice. The benchmark spans from perceptual understanding to high-level abstract reasoning.


(a) Alignment checks if the model correctly identifies and understands entities using visual and textual cues.

(b) Spatial-temporal Consistency examines whether the model can track entity motion across frames.

(c) Common Sense Reasoning tests if the model can reason based on its knowledge.

(d) Physics assesses if the model applies physical laws to entity motions and procedural understanding.
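
One way to picture a benchmark entry under these four categories is the record sketched below. The released data format may differ: the field names, the file path, and the question/answer strings are placeholders invented for illustration, and only the category names and the generation prompt are taken from this page.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The four question categories (a)-(d) described above."""
    ALIGNMENT = "alignment"
    SPATIAL_TEMPORAL_CONSISTENCY = "spatial_temporal_consistency"
    COMMON_SENSE_REASONING = "common_sense_reasoning"
    PHYSICS = "physics"


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical schema for one video question-answering example."""
    video_path: str          # location of the synthetic video (assumed field)
    generation_prompt: str   # prompt given to the video generator
    question: str            # expert-crafted question
    ground_truth: str        # reference answer
    category: Category


item = BenchmarkItem(
    video_path="videos/example.mp4",  # placeholder path
    generation_prompt=(
        "A feather and a heavy rock are released at the same height "
        "and begin to fall to the ground on Earth."
    ),
    question="<question text>",      # placeholder, not a real benchmark question
    ground_truth="<answer text>",    # placeholder
    category=Category.COMMON_SENSE_REASONING,
)
```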


[Figure: VideoHallu question categories]

The Dawn of MLLMs in Synthetic Videos

We present selected cases from SoTA MLLM evaluations across each category. Hallucinations in model answers, common sense or physics violations in videos, and other notable cues in the video, questions, or ground truth are highlighted to assist the reader's understanding. More examples can be found in the Appendix of our paper.


Note: The legend below explains all the symbols used to represent the State-of-the-Art (SoTA) MLLMs featured in our showcases for synthetic video generation and video question-answering.

[Figure: legend of the symbols used for each MLLM]

Alignment

Video Generation Prompt: A young male athlete is playing basketball on an outdoor court, performing impressive dribbling and slam dunks.


Synthetic Video:

[Video: synthetic video for the Alignment example]

Video Question-Answering by MLLMs:

[Figure: MLLM answers for the Alignment example]

Spatial-temporal Consistency

Video Generation Prompt: Generate a quail and a rooster celebrating New Year.


Synthetic Video:

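[Video: synthetic video for the Spatial-temporal Consistency example]

Video Question-Answering by MLLMs:

[Figure: MLLM answers for the Spatial-temporal Consistency example]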

Common Sense Reasoning

Video Generation Prompt: A feather and a heavy rock are released at the same height and begin to fall to the ground on Earth.


Synthetic Video:

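[Video: synthetic video for the Common Sense Reasoning example]

Video Question-Answering by MLLMs:

[Figure: MLLM answers for the Common Sense Reasoning example]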

Physics

Video Generation Prompt: Generate the sequence showing a bullet being shot into a watermelon.


Synthetic Video:

[Video: synthetic video for the Physics example]

Video Question-Answering by MLLMs:

[Figure: MLLM answers for the Physics example]

Evaluation over SoTA MLLMs

We evaluate diverse SoTA models across sizes and training strategies, reporting both overall and sub-category accuracies. Qwen2.5-VL-32B achieves the highest overall performance among all models.

[Figure: overall and sub-category accuracies of the evaluated models]

We evaluate SoTA MLLMs on VideoHallu, with results broken down by sub-category. From left to right, we show: (a) models under 7B parameters; (b) models with 7B to 38B parameters; (c) R1 fine-tuned models; and (d) large black-box MLLMs. While many models perform well on alignment tasks, they remain prone to hallucinations in reasoning-heavy tasks, with notably weaker performance on physics and commonsense reasoning.

[Figure: VideoHallu results by sub-category for each model group]
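
For readers who want to reproduce this kind of breakdown, the sketch below computes an overall accuracy and per-category accuracies from a list of graded answers. It assumes each answer has already been judged correct or incorrect and that the overall score is a plain average over all questions. Neither assumption is confirmed here, and the records are toy data, not benchmark results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall_accuracy, {category: accuracy}).

    `records` is an iterable of (category, is_correct) pairs. Overall accuracy
    is the fraction of all questions answered correctly (an assumed definition).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    per_category = {c: correct[c] / totals[c] for c in totals}
    n = sum(totals.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_category


# Toy graded answers (made-up, for illustration only).
toy_records = [
    ("alignment", True),
    ("alignment", True),
    ("physics", False),
    ("physics", True),
]
overall, per_category = accuracy_by_category(toy_records)
```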

Fine-tuning Results

We evaluate models fine-tuned on either domain-specific sub-datasets or curriculum-based composite datasets. Results show that models trained only on general real-world videos yield little to no gains on synthetic video understanding. Incorporating general physics data improves physics reasoning, and a curriculum starting with real-world physics followed by synthetic data leads to a 2.8% performance boost.

[Figure: fine-tuning results]
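
The curriculum ordering mentioned above (real-world physics data first, synthetic data second) can be sketched as a staged training schedule. This is only a schematic: the stage names, the example identifiers, and the choice to shuffle within each stage are assumptions for illustration, not details of the actual training setup.

```python
import random


def curriculum_schedule(stages, seed=0):
    """Build a training order that finishes each stage before starting the next.

    `stages` is a list of (stage_name, examples) pairs in curriculum order.
    Examples are shuffled within a stage (an assumed choice) but never mixed
    across stages.
    """
    rng = random.Random(seed)
    schedule = []
    for name, examples in stages:
        shuffled = list(examples)  # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        schedule.extend((name, ex) for ex in shuffled)
    return schedule


# Hypothetical stage names and example identifiers.
stages = [
    ("real_world_physics", ["rw_0", "rw_1"]),
    ("synthetic", ["syn_0", "syn_1"]),
]
schedule = curriculum_schedule(stages)
stage_order = [name for name, _ in schedule]
```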

We show results for (a) previous SoTA MLLMs, (b) models fine-tuned on sub-datasets, and (c) models fine-tuned on the full dataset via curriculum learning. Compared to the baseline (Qwen2.5-VL-7B), reinforcement fine-tuning on commonsense and physics data improves models' reasoning and overall performance in synthetic video understanding.

[Figure: comparison of previous SoTA MLLMs and fine-tuned models]

BibTeX

@misc{li2025videohalluevaluatingmitigatingmultimodal,
      title={VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos},
      author={Zongxia Li and Xiyang Wu and Yubin Qin and Guangyao Shi and Hongyang Du and Dinesh Manocha and Tianyi Zhou and Jordan Lee Boyd-Graber},
      year={2025},
      eprint={2505.01481},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.01481},
}