FragmentVC: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention

Lin Y.Y; Chien C.-M; Lin J.-H; Lee H.-Y; Lee L.-S.

doi:10.1109/ICASSP39728.2021.9413699

Journal

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Journal Volume

2021-June

Pages

5939-5943

Date Issued

2021

Author(s)

Lin Y.Y

Chien C.-M

Lin J.-H

HUNG-YI LEE

LIN-SHAN LEE

DOI

10.1109/ICASSP39728.2021.9413699

URI

https://www.scopus.com/inward/record.uri?eid=2-s2.0-85111438144&doi=10.1109%2fICASSP39728.2021.9413699&partnerID=40&md5=cd0e19d069322b478325a5ab88472748

https://scholars.lib.ntu.edu.tw/handle/123456789/607157

Abstract

Any-to-any voice conversion aims to convert speech from and to any speaker, including speakers unseen during training. This is much more challenging than one-to-one or many-to-many conversion, but much more attractive in real-world scenarios. In this paper we propose FragmentVC, in which the latent phonetic structure of the source speaker's utterance is obtained from Wav2Vec 2.0, while the spectral features of the target speaker's utterance(s) are obtained from log mel-spectrograms. By aligning the hidden structures of the two feature spaces with a two-stage training process, FragmentVC is able to extract fine-grained voice fragments from the target speaker's utterance(s) and fuse them into the desired utterance. The whole process is based on the attention mechanism of the Transformer, as verified by analysis of attention maps, and is accomplished end-to-end. The approach is trained with a reconstruction loss only, without any explicit disentanglement of content and speaker information, and does not require parallel data. Objective evaluation based on speaker verification and subjective evaluation with MOS both show that this approach outperforms state-of-the-art approaches such as AdaIN-VC and AUTOVC. © 2021 IEEE
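The fragment extraction and fusion described in the abstract can be illustrated with a minimal cross-attention sketch. This is not the authors' implementation: it assumes a single attention head, random projection matrices (the hypothetical `w_q` and `w_k`), toy feature sizes, and no value projection, so that each output frame is simply a weighted mix of target-speaker mel frames. Queries stand in for source-utterance content features (Wav2Vec 2.0 in the paper), and keys and values stand in for target-speaker log mel-spectrogram frames.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng, scale=1.0):
    """Gaussian random matrix; a stand-in for real features or learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attention(source_feats, target_mels, w_q, w_k):
    """Fuse target mel frames using attention weights from source frames.

    source_feats: (T_src x src_dim) content features of the source utterance
    target_mels:  (T_tgt x n_mels) mel frames of the target speaker
    Returns (fused, weights): fused is (T_src x n_mels), weights is (T_src x T_tgt).
    """
    queries = matmul(source_feats, w_q)          # (T_src x d)
    keys = matmul(target_mels, w_k)              # (T_tgt x d)
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]   # (d x T_tgt)
    scores = matmul(queries, keys_t)             # (T_src x T_tgt)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    # Assumption: values are the raw target mel frames (no value projection).
    fused = matmul(weights, target_mels)         # (T_src x n_mels)
    return fused, weights


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
T_SRC, T_TGT = 5, 12
SRC_DIM, N_MELS, ATT_DIM = 16, 8, 4

source_feats = random_matrix(T_SRC, SRC_DIM, rng)
target_mels = random_matrix(T_TGT, N_MELS, rng)
w_q = random_matrix(SRC_DIM, ATT_DIM, rng, scale=0.1)
w_k = random_matrix(N_MELS, ATT_DIM, rng, scale=0.1)

fused, weights = cross_attention(source_feats, target_mels, w_q, w_k)
```

Each row of `weights` is the kind of attention map the abstract refers to: it shows which target-speaker frames contribute to a given output frame. In this sketch the output has one frame per source frame, and every frame is a convex combination of target frames.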

Subjects

Any-to-any

Attention mechanism

Concatenative

Transformer

Voice conversion

Speech analysis

Speech recognition

Attention mechanisms

Hidden structures

Objective evaluation

Phonetic structure

Real-world scenario

Speaker verification

Subjective evaluations

Signal processing

SDGs

SDG4

Type

conference paper
