Shan-Yu Chou; Jyh-Shing Jang; Yi-Hsuan Yang (2018). Learning to recognize transient sound events using attentional supervision. Conference paper. DOI: 10.24963/ijcai.2018/463. Scopus EID: 2-s2.0-85055700674. https://doi.org/10.24963/ijcai.2018/463 https://www.scopus.com/inward/record.uri?eid=2-s2.0-85055700674&doi=10.24963%2fijcai.2018%2f463&partnerID=40&md5=9fde5bff469c847a3e5a51e52fac564b

Abstract: Making sense of the surrounding context and ongoing events through not only visual inputs but also acoustic cues is critical for various AI applications. This paper presents an attempt to learn a neural network model that recognizes more than 500 different sound events from the audio track of user-generated videos (UGV). Aside from the large number of categories and the diverse recording conditions found in UGV, the task is challenging because a sound event may occur only for a short period of time in a video clip. Our model tackles this issue by combining a main subnet, which aggregates information from the entire clip to make clip-level predictions, with a supplementary subnet, which examines each short segment of the clip for segment-level predictions. As the labeled data available for model training are typically at the clip level, the latter subnet learns to attend selectively to segments, enabling attentional segment-level supervision. We call our model the M&mnet, for it leverages both “M”acro (clip-level) supervision and “m”icro (segment-level) supervision derived from the macro one. Our experiments show that M&mnet works remarkably well for recognizing sound events, establishing a new state of the art on the DCASE17 and AudioSet data sets. Qualitative analysis suggests that our model yields especially strong gains for short events. In addition, we show that the micro subnet is computationally light, so multiple micro subnets can be used to better exploit information at different temporal scales. © 2018 International Joint Conferences on Artificial Intelligence. All rights reserved.

Keywords: Audio acoustics; AI applications; Model training; Neural network model; Qualitative analysis; Short segments; State of the art; Temporal scale; User-generated video; Artificial intelligence
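The macro/micro design described in the abstract can be made concrete with a small sketch. The PyTorch code below is a minimal illustration of the idea, not the authors' implementation: a shared encoder produces per-segment features; a macro head classifies the globally pooled clip; and a micro head scores each segment while a learned attention head weights the segments, so that clip-level labels can also supervise the segment-level outputs. All module names, layer sizes, and the choice of log-mel input are assumptions made for illustration (the default of 527 classes matches AudioSet's label set).

```python
import torch
import torch.nn as nn

class MacroMicroNet(nn.Module):
    """Illustrative sketch of a dual-subnet sound-event recognizer.

    A shared front end yields per-segment features. The macro subnet pools
    over the whole clip for clip-level predictions; the micro subnet scores
    each segment, and an attention head weights segments so that clip-level
    labels supervise segment-level outputs. All sizes are hypothetical.
    """

    def __init__(self, n_mels: int = 64, n_classes: int = 527, d: int = 128):
        super().__init__()
        # Shared front end: log-mel frames -> segment embeddings.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, d, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Macro subnet: clip-level prediction from globally pooled features.
        self.macro_head = nn.Linear(d, n_classes)
        # Micro subnet: per-segment predictions plus attention weights.
        self.micro_head = nn.Linear(d, n_classes)
        self.attention = nn.Linear(d, n_classes)

    def forward(self, x):
        # x: (batch, n_mels, time)
        h = self.encoder(x)          # (batch, d, time)
        h = h.transpose(1, 2)        # (batch, time, d)

        # Macro path: average over time, then classify the whole clip.
        clip_logits = self.macro_head(h.mean(dim=1))

        # Micro path: per-segment logits, aggregated by learned attention
        # over the time axis, so clip labels can supervise segment scores.
        seg_logits = self.micro_head(h)                   # (batch, time, n_classes)
        attn = torch.softmax(self.attention(h), dim=1)    # attention over time
        attended_logits = (attn * seg_logits).sum(dim=1)  # (batch, n_classes)

        return clip_logits, attended_logits, seg_logits

if __name__ == "__main__":
    model = MacroMicroNet()
    clips = torch.randn(2, 64, 1000)  # two clips of 64-bin log-mel frames
    clip_logits, attended_logits, seg_logits = model(clips)
    print(clip_logits.shape, attended_logits.shape, seg_logits.shape)
```

Under these assumptions, training would apply a binary cross-entropy loss to both clip_logits and attended_logits against the same clip-level labels; the per-segment scores in seg_logits then inherit supervision through the attention weights, without any frame-level annotation. Because the micro path here is only a pair of linear layers on shared features, several such heads pooling at different temporal scales could be attached to one encoder, in the spirit of the multi-scale variant mentioned in the abstract.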