Utilizing self-supervised representations for MOS prediction

Tseng W.-C; Huang C.-Y; Kao W.-T; Lin Y. Y.; Lee H.-Y.

doi:10.21437/Interspeech.2021-2013

Journal

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Journal Volume

5

Pages

3521-3525

Date Issued

2021

Author(s)

Tseng W.-C

Huang C.-Y

Kao W.-T

Lin Y. Y.

HUNG-YI LEE

DOI

10.21437/Interspeech.2021-2013

URI

https://www.scopus.com/inward/record.uri?eid=2-s2.0-85117824559&doi=10.21437%2fInterspeech.2021-2013&partnerID=40&md5=e7feaac120637d3b7076227417997a18

https://scholars.lib.ntu.edu.tw/handle/123456789/607152

Abstract

Speech quality assessment has been a critical issue in speech processing for decades. Existing automatic evaluations usually require clean references or parallel ground truth data, which becomes infeasible as the amount of data grows. Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception. However, such tests are expensive and time-consuming because crowd work is necessary. It is thus highly desirable to develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data. In this paper, we use self-supervised pre-trained models for mean opinion score (MOS) prediction. We show that their representations can distinguish between clean and noisy audio. Then, we fine-tune these pre-trained models followed by simple linear layers in an end-to-end manner. The experimental results show that our framework outperforms the two previous state-of-the-art models by a significant margin on Voice Conversion Challenge 2018 and achieves comparable or superior performance on Voice Conversion Challenge 2016. We also conduct an ablation study to further investigate how each module benefits the task. The experiments are implemented with publicly available toolkits and are reproducible. © 2021 ISCA
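
Illustrative note (not part of the published abstract): the "pre-trained model followed by simple linear layers" design can be pictured as frame-level features that are pooled into one utterance-level vector and mapped to a scalar score. The sketch below is a minimal, standard-library illustration under stated assumptions. Mean pooling, the 8-dimensional features, the random stand-in for encoder output, and the names mean_pool, linear_head and predict_mos are all hypothetical; the abstract does not specify them. Only the forward pass of the head is shown; the joint fine-tuning of encoder and head described in the abstract is not.

```python
import random


def mean_pool(frames):
    """Average frame-level feature vectors into one utterance-level vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def linear_head(vec, weights, bias):
    """Single linear layer mapping a pooled vector to a scalar MOS estimate."""
    return sum(w * x for w, x in zip(weights, vec)) + bias


def predict_mos(frames, weights, bias):
    """Pool the frame features, then apply the linear head (assumed design)."""
    return linear_head(mean_pool(frames), weights, bias)


# Hypothetical stand-in for the encoder output: 50 frames of 8-dim features.
# A real system would take these from a self-supervised speech model.
random.seed(0)
frames = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]

# Untrained head: zero weights, bias at the midpoint of the 1-5 MOS scale.
weights = [0.0] * 8
bias = 3.0

score = predict_mos(frames, weights, bias)
print(f"predicted MOS: {score:.2f}")
```

With zero weights the head ignores its input and returns the bias; training would adjust the weights (and, in an end-to-end setup, the encoder) so that predictions track human ratings.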

Subjects

MOS prediction

Self-supervised learning

Speech quality assessment

Machine learning

Petroleum reservoir evaluation

Speech communication

Speech processing

Automatic evaluation

Critical issues

Evaluation approach

Ground truth data

Human perception

Parallel data

Voice conversion

Forecasting

Type

conference paper
