Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples
Journal
Interspeech 2025
Series/Report No.
Proceedings of the Annual Conference of the International Speech Communication Association Interspeech
Start Page
2073
End Page
2077
ISSN
2308-457X
Date Issued
2025-08-17
Author(s)
Kuan, Chun-Yi
Abstract
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. To address this, we propose LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds using synthesized data from the backbone LLM. Unlike prior approaches, our method requires no modification to LLM parameters and efficiently integrates audio representations via a lightweight adapter. Experiments show that LISTEN effectively mitigates hallucinations while maintaining impressive performance on existing audio question answering and reasoning benchmarks. At the same time, it is more efficient in both data and computation.
Event(s)
26th Interspeech Conference 2025
Subjects
audio hallucination
audio understanding
audio-aware large language models
Publisher
ISCA
Type
conference paper
