https://scholars.lib.ntu.edu.tw/handle/123456789/633088
Title: | Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features | Authors: | Zezario, Ryandhimas E.; Fu, Szu Wei; Chen, Fei; CHIOU-SHANN FUH; Wang, Hsin Min; Tsao, Yu |
Keywords: | Deep learning | multi-objective learning | non-intrusive speech assessment models | speech enhancement | Publication date: | 1-Jan-2023 | Publisher: | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC | Volume: | 31 | Start (end) page: | 54 | Source publication: | IEEE/ACM Transactions on Audio Speech and Language Processing | Abstract: | This study proposes a cross-domain multi-objective speech assessment model, called MOSA-Net, which can simultaneously estimate the speech quality, intelligibility, and distortion assessment scores of an input speech signal. MOSA-Net comprises a convolutional neural network and bidirectional long short-term memory architecture for representation extraction, and a multiplicative attention layer and a fully connected layer for predicting each assessment metric. Additionally, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned (SSL) models are used as inputs to combine rich acoustic information and obtain more accurate assessments. Experimental results show that in both seen and unseen noise environments, MOSA-Net can improve the linear correlation coefficient (LCC) scores in perceptual evaluation of speech quality (PESQ) prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC scores in short-time objective intelligibility (STOI) prediction, compared to STOI-Net, an existing single-task model for STOI prediction. Moreover, MOSA-Net can be used as a pre-trained model and effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC scores in mean opinion score (MOS) predictions, compared to MOS-SSL, a strong single-task model for MOS prediction. We further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach. Experimental results show that QIA-SE outperforms the baseline SE system, with improved PESQ scores in both seen and unseen noise environments. |
URI: | https://scholars.lib.ntu.edu.tw/handle/123456789/633088 | ISSN: | 23299290 | DOI: | 10.1109/TASLP.2022.3205757 |
Appears in collections: | Department of Computer Science and Information Engineering |
Items in the IR system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.
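
The abstract reports its comparisons as linear correlation coefficient (LCC) scores between predicted and reference assessment scores. As a minimal sketch of that metric only, assuming LCC means the standard Pearson correlation, the snippet below computes it with the Python standard library. The function name `lcc` and the sample scores are hypothetical and are not taken from the paper.

```python
import math


def lcc(predicted, reference):
    """Pearson linear correlation between two equal-length score sequences."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical scores for illustration only (not results from the paper).
reference_scores = [1.2, 2.0, 2.7, 3.4, 4.1]
predicted_scores = [1.4, 1.9, 2.9, 3.2, 4.0]
print(round(lcc(predicted_scores, reference_scores), 3))
```

A value near 1 means the predicted scores track the reference scores almost linearly, which is the sense in which a higher LCC indicates a better assessment model.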