https://scholars.lib.ntu.edu.tw/handle/123456789/581259
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Lin Y.-B | en_US |
dc.contributor.author | Wang Y.-C.F. | en_US |
dc.contributor.author | YU-CHIANG WANG | zz |
dc.creator | Lin Y.-B;Wang Y.-C.F. | - |
dc.date.accessioned | 2021-09-02T00:08:03Z | - |
dc.date.available | 2021-09-02T00:08:03Z | - |
dc.date.issued | 2021 | - |
dc.identifier.issn | 03029743 | - |
dc.identifier.uri | https://www.scopus.com/inward/record.uri?eid=2-s2.0-85103251788&doi=10.1007%2f978-3-030-69544-6_17&partnerID=40&md5=6546863b7b373c56a9156850d979cfb3 | - |
dc.identifier.uri | https://scholars.lib.ntu.edu.tw/handle/123456789/581259 | - |
dc.description.abstract | Audio-visual event localization requires one to identify the event label across video frames by jointly observing visual and audio information. To address this task, we propose a deep learning framework of cross-modality co-attention for video event localization. Our proposed audiovisual transformer (AV-transformer) is able to exploit intra- and inter-frame visual information, with audio features jointly observed to perform co-attention over the above three modalities. With visual, temporal, and audio information observed across consecutive video frames, our model achieves promising capability in extracting informative spatial/temporal features for improved event localization. Moreover, our model is able to produce instance-level attention, which identifies image regions at the instance level that are associated with the sound/event of interest. Experiments on a benchmark dataset confirm the effectiveness of our proposed framework, with ablation studies performed to verify the design of our proposed network model. © 2021, Springer Nature Switzerland AG. | - |
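The abstract describes audio features guiding attention over visual information within each frame. As a rough illustration of that idea (not the authors' AV-transformer; the function name, feature shapes, and scaled dot-product formulation are assumptions for this sketch), audio-guided spatial attention over per-frame region features can be sketched as:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_guided_attention(visual, audio):
    """Attend over each frame's spatial regions using that frame's audio feature.

    visual: (T, R, d) array — T frames, R spatial regions, d-dim features
    audio:  (T, d) array    — one audio feature per frame
    Returns the attended visual feature per frame (T, d) and weights (T, R).
    """
    T, R, d = visual.shape
    # scaled dot-product score between each audio vector and its frame's regions
    scores = np.einsum("trd,td->tr", visual, audio) / np.sqrt(d)
    weights = softmax(scores, axis=-1)                 # (T, R), rows sum to 1
    attended = np.einsum("tr,trd->td", weights, visual)  # weighted sum of regions
    return attended, weights

# toy usage: 10 frames, 7x7 = 49 regions, 128-dim features
rng = np.random.default_rng(0)
V = rng.normal(size=(10, 49, 128))
A = rng.normal(size=(10, 128))
out, w = audio_guided_attention(V, A)
```

The attention weights `w` can be visualized per frame to highlight which image regions respond to the audio, which is the intuition behind the instance-level attention the abstract mentions.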
dc.relation.ispartof | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | - |
dc.subject | Audiovisual; Deep learning; Audio features; Audio information; Benchmark datasets; Cross modality; Event localizations; Learning frameworks; Network modeling; Visual information; Computer vision | - |
dc.title | Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization | en_US |
dc.type | conference paper | en |
dc.identifier.doi | 10.1007/978-3-030-69544-6_17 | - |
dc.identifier.scopus | 2-s2.0-85103251788 | - |
dc.relation.pages | 274-290 | - |
dc.relation.journalvolume | 12627 LNCS | - |
item.cerifentitytype | Publications | - |
item.fulltext | no fulltext | - |
item.openairecristype | http://purl.org/coar/resource_type/c_5794 | - |
item.openairetype | conference paper | - |
item.grantfulltext | none | - |
crisitem.author.dept | Electrical Engineering | - |
crisitem.author.dept | Communication Engineering | - |
crisitem.author.dept | FinTech Center | - |
crisitem.author.dept | Center for Artificial Intelligence and Advanced Robotics | - |
crisitem.author.orcid | 0000-0002-2333-157X | - |
crisitem.author.parentorg | College of Electrical Engineering and Computer Science | - |
crisitem.author.parentorg | College of Electrical Engineering and Computer Science | - |
crisitem.author.parentorg | Others: University-Level Research Centers | - |
crisitem.author.parentorg | Others: University-Level Research Centers | - |
Appears in Collections: | Department of Electrical Engineering |
Items in the IR system are protected by copyright, with all rights reserved, unless otherwise indicated in their copyright terms.