https://scholars.lib.ntu.edu.tw/handle/123456789/581259
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Lin Y.-B | en_US |
dc.contributor.author | Wang Y.-C.F. | en_US |
dc.contributor.author | YU-CHIANG WANG | zz |
dc.creator | Lin Y.-B;Wang Y.-C.F. | - |
dc.date.accessioned | 2021-09-02T00:08:03Z | - |
dc.date.available | 2021-09-02T00:08:03Z | - |
dc.date.issued | 2021 | - |
dc.identifier.issn | 03029743 | - |
dc.identifier.uri | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85103251788&doi=10.1007%2f978-3-030-69544-6_17&partnerID=40&md5=6546863b7b373c56a9156850d979cfb3 | - |
dc.identifier.uri | https://scholars.lib.ntu.edu.tw/handle/123456789/581259 | - |
dc.description.abstract | Audio-visual event localization requires one to identify the event label across video frames by jointly observing visual and audio information. To address this task, we propose a deep learning framework of cross-modality co-attention for video event localization. Our proposed audiovisual transformer (AV-transformer) is able to exploit intra- and inter-frame visual information, with audio features jointly observed to perform co-attention over the above three modalities. With visual, temporal, and audio information observed across consecutive video frames, our model achieves promising capability in extracting informative spatial/temporal features for improved event localization. Moreover, our model is able to produce instance-level attention, which identifies image regions at the instance level that are associated with the sound/event of interest. Experiments on a benchmark dataset confirm the effectiveness of our proposed framework, with ablation studies performed to verify the design of our proposed network model. © 2021, Springer Nature Switzerland AG. | - |
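The abstract describes audio features guiding attention over visual information within each frame. As a rough illustration of that idea (not the authors' AV-transformer; the function name, feature shapes, and scaled dot-product formulation are assumptions for this sketch), audio-guided spatial attention over per-frame region features can be sketched as:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_guided_attention(visual, audio):
    """Attend over each frame's spatial regions using that frame's audio feature.

    visual: (T, R, d) array — T frames, R spatial regions, d-dim features
    audio:  (T, d) array    — one audio feature per frame
    Returns the attended visual feature per frame (T, d) and weights (T, R).
    """
    T, R, d = visual.shape
    # scaled dot-product score between each audio vector and its frame's regions
    scores = np.einsum("trd,td->tr", visual, audio) / np.sqrt(d)
    weights = softmax(scores, axis=-1)                 # (T, R), rows sum to 1
    attended = np.einsum("tr,trd->td", weights, visual)  # weighted sum of regions
    return attended, weights

# toy usage: 10 frames, 7x7 = 49 regions, 128-dim features
rng = np.random.default_rng(0)
V = rng.normal(size=(10, 49, 128))
A = rng.normal(size=(10, 128))
out, w = audio_guided_attention(V, A)
```

The attention weights `w` can be visualized per frame to highlight which image regions respond to the audio, which is the intuition behind the instance-level attention the abstract mentions.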
dc.relation.ispartof | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | - |
dc.subject | Audiovisual; Deep learning; Audio features; Audio information; Benchmark datasets; Cross modality; Event localizations; Learning frameworks; Network modeling; Visual information; Computer vision | - |
dc.title | Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization | en_US |
dc.type | conference paper | en |
dc.identifier.doi | 10.1007/978-3-030-69544-6_17 | - |
dc.identifier.scopus | 2-s2.0-85103251788 | - |
dc.relation.pages | 274-290 | - |
dc.relation.journalvolume | 12627 LNCS | - |
item.cerifentitytype | Publications | - |
item.fulltext | no fulltext | - |
item.openairecristype | http://purl.org/coar/resource_type/c_5794 | - |
item.openairetype | conference paper | - |
item.grantfulltext | none | - |
crisitem.author.dept | Electrical Engineering | - |
crisitem.author.dept | Communication Engineering | - |
crisitem.author.dept | FinTech Center | - |
crisitem.author.dept | Center for Artificial Intelligence and Advanced Robotics | - |
crisitem.author.orcid | 0000-0002-2333-157X | - |
crisitem.author.parentorg | College of Electrical Engineering and Computer Science | - |
crisitem.author.parentorg | College of Electrical Engineering and Computer Science | - |
crisitem.author.parentorg | Others: University-Level Research Centers | - |
crisitem.author.parentorg | Others: University-Level Research Centers | - |
Appears in Collections: | Department of Electrical Engineering |
Items in the IR system are protected by copyright, with all rights reserved, unless otherwise indicated in their copyright terms.