Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features

Zezario, Ryandhimas E.; Fu, Szu Wei; Chen, Fei; Fuh, Chiou-Shann; Wang, Hsin Min; Tsao, Yu

doi:10.1109/TASLP.2022.3205757

Journal

IEEE/ACM Transactions on Audio Speech and Language Processing

Journal Volume

31

Pages

54

Date Issued

2023-01-01

Author(s)

Zezario, Ryandhimas E.

Fu, Szu Wei

Chen, Fei

Fuh, Chiou-Shann

Wang, Hsin Min

Tsao, Yu

DOI

10.1109/TASLP.2022.3205757

URI

https://scholars.lib.ntu.edu.tw/handle/123456789/633088

URL

https://api.elsevier.com/content/abstract/scopus_id/85139456461

Abstract

This study proposes a cross-domain multi-objective speech assessment model, called MOSA-Net, which simultaneously estimates the speech quality, intelligibility, and distortion assessment scores of an input speech signal. MOSA-Net comprises a convolutional neural network and a bidirectional long short-term memory architecture for representation extraction, followed by a multiplicative attention layer and a fully connected layer for predicting each assessment metric. Additionally, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned (SSL) models are used as inputs, combining rich acoustic information to obtain more accurate assessments. Experimental results show that, in both seen and unseen noise environments, MOSA-Net improves the linear correlation coefficient (LCC) scores in perceptual evaluation of speech quality (PESQ) prediction compared to Quality-Net, an existing single-task model for PESQ prediction, and improves LCC scores in short-time objective intelligibility (STOI) prediction compared to STOI-Net, an existing single-task model for STOI prediction. Moreover, MOSA-Net can be used as a pre-trained model and effectively adapted into an assessment model that predicts subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net improves LCC scores in mean opinion score (MOS) prediction compared to MOS-SSL, a strong single-task model for MOS prediction. We further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach. Experimental results show that QIA-SE achieves higher PESQ scores than a baseline SE model in both seen and unseen noise environments.
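The abstract reports its results as linear correlation coefficient (LCC) scores between predicted and reference assessment scores. As a minimal sketch of what that metric measures, the following computes the Pearson correlation between two score lists. The function name `lcc` and the PESQ values are hypothetical and chosen only for illustration; they are not taken from the paper or its evaluation code.

```python
import math


def lcc(predicted, reference):
    """Pearson linear correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least two scores")
    mean_p = sum(predicted) / len(predicted)
    mean_r = sum(reference) / len(reference)
    # Covariance and variances (unnormalized; the 1/N factors cancel in the ratio).
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("LCC is undefined when either sequence is constant")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical PESQ scores for four utterances (illustrative values only).
reference_pesq = [1.0, 2.0, 3.0, 4.0]
predicted_pesq = [1.5, 2.5, 3.5, 4.5]
score = lcc(predicted_pesq, reference_pesq)
```

Because LCC measures linear agreement rather than absolute error, predictions that differ from the reference by a constant offset, as in this example, still yield an LCC of 1. A higher LCC therefore indicates that the predicted scores track the reference scores more closely in trend, not necessarily in value.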

Subjects

Deep learning | multi-objective learning | non-intrusive speech assessment models | speech enhancement

SDGs

SDG4

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Type

journal article
