Mandarin Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling

Yen M.-C; Huang W.-C; Kobayashi K; Peng Y.-H; Tsai S.-W; Tsao Y; Toda T; Jang J.-S.R; Wang H.-M.

Title:	Mandarin Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling
Authors:	Yen M.-C; Huang W.-C; Kobayashi K; Peng Y.-H; Tsai S.-W; Tsao Y; Toda T; JYH-SHING JANG; Wang H.-M.
Keywords:	electrolaryngeal speech; pretraining; sequence-to-sequence learning; transformer; voice conversion
Publication date:	2021
Pages:	650-657
Source publication:	2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
Abstract:	Electrolaryngeal speech (EL speech) is typically produced with an electrolarynx device that generates excitation signals to substitute for human vocal fold vibrations. Because the excitation signals cannot perfectly characterize the sound sources generated by the vocal folds, the naturalness and intelligibility of EL speech are inevitably worse than those of natural speech (NL speech). To improve speech naturalness, statistical models, such as Gaussian mixture models and deep-learning-based models, have been employed for EL speech voice conversion (ELVC), which aims to convert EL speech into NL speech. Implementing a frame-wise ELVC system requires accurate feature alignment for model training. However, the abnormal acoustic characteristics of EL speech cause misalignments and accordingly limit ELVC performance. To address this issue, we propose a novel ELVC system based on sequence-to-sequence (seq2seq) modeling with text-to-speech (TTS) pretraining. The seq2seq model uses an attention mechanism to perform representation learning and alignment concurrently. Meanwhile, TTS pretraining enables efficient training with limited data. Experimental results show that the proposed ELVC system yields notable improvements in terms of standardized evaluation metrics and subjective listening tests over a well-known frame-wise ELVC system. © 2021 IEEE.
URI:	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85126824776&doi=10.1109%2fASRU51503.2021.9687908&partnerID=40&md5=bc70906cb2770c39bf1793c32bdface9 https://scholars.lib.ntu.edu.tw/handle/123456789/632437
DOI:	10.1109/ASRU51503.2021.9687908
SDG/Keywords:	Petroleum reservoir evaluation; Speech intelligibility; Conversion systems; Electrolaryngeal speech; Excitation signals; Natural speech; Pre-training; Sequence learning; Sequence-to-sequence learning; Text to speech; Transformer; Voice conversion; Deep learning
Appears in:	Department of Computer Science and Information Engineering
