Learning to recognize transient sound events using attentional supervision
Journal
IJCAI International Joint Conference on Artificial Intelligence
Journal Volume
2018-July
Pages
3336 - 3342
Date Issued
2018
Author(s)
Abstract
Making sense of the surrounding context and ongoing events through not only the visual inputs but also acoustic cues is critical for various AI applications. This paper presents an attempt to learn a neural network model that recognizes more than 500 different sound events from the audio part of user generated videos (UGV). Aside from the large number of categories and the diverse recording conditions found in UGV, the task is challenging because a sound event may occur only for a short period of time in a video clip. Our model specifically tackles this issue by combining a main subnet that aggregates information from the entire clip to make clip-level predictions, and a supplementary subnet that examines each short segment of the clip for segment-level predictions. As the labeled data available for model training are typically on the clip level, the latter subnet learns to pay attention to segments selectively to facilitate attentional segment-level supervision. We call our model the M&mnet, for it leverages both “M”acro (clip-level) supervision and “m”icro (segment-level) supervision derived from the macro one. Our experiments show that M&mnet works remarkably well for recognizing sound events, establishing a new state-of-the-art for DCASE17 and AudioSet data sets. Qualitative analysis suggests that our model exhibits strong gains for short events. In addition, we show that the micro subnet is computationally light and we can use multiple micro subnets to better exploit information in different temporal scales. © 2018 International Joint Conferences on Artificial Intelligence. All right reserved.
Event(s)
27th International Joint Conference on Artificial Intelligence, IJCAI 2018
Other Subjects
Audio acoustics; AI applications; Model training; Neural network model; Qualitative analysis; Short segments; State of the art; Temporal scale; User-generated video; Artificial intelligence
Type
conference paper