Video Question Generation via Semantic Rich Cross-Modal Self-Attention Networks Learning

Wang, Y.-S.; Su, H.-T.; Chang, C.-H.; Liu, Z.-Y.; WINSTON HSU

doi:10.1109/ICASSP40776.2020.9053476

Journal

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Journal Volume

2020-May

Pages

2423-2427

Date Issued

2020

Author(s)

Wang, Y.-S.

Su, H.-T.

Chang, C.-H.

Liu, Z.-Y.

WINSTON HSU

DOI

10.1109/ICASSP40776.2020.9053476

URI

https://www.scopus.com/inward/record.url?eid=2-s2.0-85089224063&partnerID=40&md5=bf4cba337e99a6b850d459203fad34a1

https://scholars.lib.ntu.edu.tw/handle/123456789/559297

Abstract

We introduce a novel task, Video Question Generation (Video QG). A Video QG model automatically generates questions given a video clip and its corresponding dialogues. Video QG requires a range of skills: sentence comprehension, understanding of temporal relations, modeling the interplay between vision and language, and the ability to ask meaningful questions. To address this, we propose a novel semantic rich cross-modal self-attention (SR-CMSA) network to aggregate the diverse multi-modal features. More precisely, we enrich the semantics of the video frames by integrating object-level information, and we jointly consider cross-modal attention for the video question generation task. Our proposed model improves the BLEU-4 score on the TVQA dataset from the baseline's 7.58 to 14.48. We arguably pave a novel path toward understanding challenging video input, and we provide a detailed analysis in terms of diversity, which opens avenues for future investigation. © 2020 IEEE.
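
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how a BLEU-4 score can be computed for one generated question against one reference question. It is not the evaluation code used in the paper: the function names `ngrams` and `bleu4` and the example sentences are hypothetical, and the sketch assumes whitespace tokenization, a single reference, and no smoothing. Scores such as 7.58 and 14.48 are presumably on the 0 to 100 scale, that is, the value below multiplied by 100 and computed over a whole test set rather than one sentence.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 against a single reference, without smoothing.

    Illustrative only (hypothetical helper, not the paper's evaluation code).
    Returns a value in [0, 1]; multiply by 100 for the percentage scale.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # unsmoothed BLEU is zero if any n-gram precision is zero
        log_precision_sum += math.log(matched / total)
    # The brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the 1- to 4-gram precisions, scaled by the penalty.
    return bp * math.exp(log_precision_sum / 4)


# Made-up example questions, not taken from TVQA.
generated = "what is the man holding in his pocket"
reference = "what is the man holding in his hand"
print(round(100 * bleu4(generated, reference), 2))
```

Because the score is a geometric mean of clipped n-gram precisions, a generated question that shares no 4-gram with its reference scores zero under this unsmoothed variant, which is one reason corpus-level aggregation or smoothing is commonly used in practice.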

Subjects

Cross-Modal Attention; Video Question Generation

SDGs

SDG4

Other Subjects

Semantic Web; Semantics; Speech communication; Cross-modal; Diverse features; Multi-modal; Networks learning; Novel task; Temporal relation; Video clips; Video frame; Audio signal processing

Type

conference paper
