Authors: Wei, Wen-Li; Lin, Jen-Chun; Liu, Tyng-Luh; Yang, Yi-Hsuan; Wang, Hsin-Min; Tyan, Hsiao-Rong; Liao, Hong-Yuan Mark
Date accessioned: 2023-10-19
Date available: 2023-10-19
Date issued: 2018-10-08
ISBN: 9781538617373
ISSN: 1945-7871
URI: https://scholars.lib.ntu.edu.tw/handle/123456789/636289
Abstract: Types of shots in the language of film are considered the key elements a director uses for visual storytelling. In filming a musical performance, manipulating shots can produce desired effects such as manifesting emotion or deepening the atmosphere. However, while this visual storytelling technique is often employed in creating professional recordings of a live concert, audience recordings of the same event typically lack such sophisticated manipulation. Thus, it would be useful to have a versatile system that can perform video mashup to create a refined video from such amateur clips. To this end, we propose to translate the music into a near-professional shot (type) sequence by learning the relation between music and the visual storytelling of shots. The resulting shot sequence can then be used to better portray the visual storytelling of a song and to guide the concert video mashup process. Our method introduces a novel probabilistic fusion approach, named multi-resolution fused recurrent neural networks (MF-RNNs) with film-language, which integrates multi-resolution fused RNNs and a film-language model to boost translation performance. Results from objective and subjective experiments demonstrate that MF-RNNs with film-language can generate an appealing shot sequence with a better viewing experience.
Keywords: language of film | live concert | recurrent neural networks | types of shots
Title: Seethevoice: Learning from Music to Visual Storytelling of Shots
Type: conference paper
DOI: 10.1109/ICME.2018.8486496
Scopus ID: 2-s2.0-85061443477
Scopus URL: https://api.elsevier.com/content/abstract/scopus_id/85061443477
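Note: The abstract names an architecture (multi-resolution RNNs whose outputs are fused probabilistically, plus a film-language model over shot types) without giving implementation details. The sketch below is not the authors' implementation; it only illustrates the general idea under several assumptions: per-frame audio features (e.g., log-mel), a fixed set of shot-type labels, two GRUs at different temporal resolutions with averaged log-domain posteriors as the fusion, and a bigram shot-transition prior decoded with Viterbi as a stand-in for the film-language model. All names, dimensions, and weights are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiResShotTagger(nn.Module):
    """Two GRUs at different temporal resolutions; their per-frame
    shot-type posteriors are fused by averaging in the log domain
    (an assumed, simplified stand-in for the paper's fusion)."""

    def __init__(self, feat_dim=128, hidden=256, num_shot_types=5, pool=4):
        super().__init__()
        self.pool = pool
        self.fine_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.coarse_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fine_head = nn.Linear(hidden, num_shot_types)
        self.coarse_head = nn.Linear(hidden, num_shot_types)

    def forward(self, x):                       # x: (batch, time, feat_dim)
        fine_out, _ = self.fine_rnn(x)
        fine_logp = F.log_softmax(self.fine_head(fine_out), dim=-1)

        # Coarser resolution: average-pool features along time, run the
        # second GRU, then upsample its outputs back to the fine time axis.
        xc = F.avg_pool1d(x.transpose(1, 2), self.pool).transpose(1, 2)
        coarse_out, _ = self.coarse_rnn(xc)
        coarse_logp = F.log_softmax(self.coarse_head(coarse_out), dim=-1)
        coarse_logp = F.interpolate(
            coarse_logp.transpose(1, 2), size=x.size(1), mode="nearest"
        ).transpose(1, 2)

        # Fused log-domain scores over shot types, one vector per frame.
        return 0.5 * fine_logp + 0.5 * coarse_logp


def viterbi_decode(fused_logp, log_trans):
    """Combine fused frame scores with a bigram shot-transition prior
    (an assumed stand-in for the film-language model) and return the
    most likely shot-type sequence."""
    T, K = fused_logp.shape
    score = fused_logp[0].clone()
    back = torch.zeros(T, K, dtype=torch.long)
    for t in range(1, T):
        cand = score.unsqueeze(1) + log_trans   # (prev state, next state)
        score, back[t] = cand.max(dim=0)
        score = score + fused_logp[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))


if __name__ == "__main__":
    model = MultiResShotTagger()
    audio_feats = torch.randn(1, 64, 128)       # 64 frames, 128-dim features
    fused = model(audio_feats)[0]               # (64, num_shot_types)
    trans = torch.full((5, 5), 0.1).fill_diagonal_(0.6)  # favors holding a shot
    shot_sequence = viterbi_decode(fused, trans.log())
    print(shot_sequence[:10])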