https://scholars.lib.ntu.edu.tw/handle/123456789/634354
Title: | Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection | Authors: | Chen, Xuanjun; Wu, Haibin; Meng, Helen; HUNG-YI LEE; JYH-SHING JANG |
Keywords: | adversarial robustness | Audio-visual active speaker detection | multi-modal adversarial attack | Issue Date: | 1-Jan-2023 | Source: | 2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings | Abstract: | Audio-visual active speaker detection (AVASD) is well-developed, and now is an indispensable front-end for several multi-modal applications. However, to the best of our knowledge, the adversarial robustness of AVASD models hasn't been investigated, not to mention the effective defense against such attacks. In this paper, we are the first to reveal the vulnerability of AVASD models under audio-only, visual-only, and audio-visual adversarial attacks through extensive experiments. What's more, we also propose a novel audio-visual interaction loss (AVIL) for making attackers difficult to find feasible adversarial examples under an allocated attack budget. The loss aims at pushing the inter-class embeddings to be dispersed, namely non-speech and speech clusters, sufficiently disentangled, and pulling the intra-class embeddings as close as possible to keep them compact. Experimental results show the AVIL outperforms the adversarial training by 33.14 mAP (%) under multi-modal attacks. |
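The abstract describes the loss only in words: push the speech and non-speech clusters apart, and pull embeddings of the same class together. This record does not give the AVIL formula, so the block below is a minimal sketch of a generic centroid-based push-pull objective, not the authors' loss. The function name `push_pull_loss`, the `margin` parameter, the use of class centroids, and the 0/1 label convention are all assumptions made for illustration.

```python
import math


def centroid(vectors):
    """Mean vector of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def push_pull_loss(embeddings, labels, margin=1.0):
    """Hypothetical push-pull objective (NOT the paper's AVIL formula).

    labels: 1 = speech, 0 = non-speech (assumed convention).
    pull: mean squared distance of each embedding to its own class centroid.
    push: hinge penalty when the two centroids are closer than `margin`.
    """
    speech = [e for e, y in zip(embeddings, labels) if y == 1]
    non_speech = [e for e, y in zip(embeddings, labels) if y == 0]
    if not speech or not non_speech:
        raise ValueError("both classes must be present in the batch")

    c_speech = centroid(speech)
    c_non_speech = centroid(non_speech)

    # Intra-class term: keep each cluster compact.
    pull = (
        sum(sq_dist(e, c_speech) for e in speech)
        + sum(sq_dist(e, c_non_speech) for e in non_speech)
    ) / len(embeddings)

    # Inter-class term: penalize centroids that sit within the margin.
    push = max(0.0, margin - math.sqrt(sq_dist(c_speech, c_non_speech)))

    return pull + push
```

Under these assumptions, a batch whose two clusters are tight and farther apart than `margin` yields a loss of zero, while a batch whose classes overlap is penalized by both terms.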
URI: | https://scholars.lib.ntu.edu.tw/handle/123456789/634354 | ISBN: | 9798350396904 | DOI: | 10.1109/SLT54892.2023.10022646 |
Appears in Collections: | Department of Computer Science and Information Engineering |
Items in the IR system are protected by copyright, with all rights reserved, unless otherwise indicated.