Noise-Robust Bandwidth Expansion for 8K Speech Recordings

Authors: Lin, Yin Tse; Su, Bo Hao; Lin, Chi Han; Kuo, Shih Chan; JYH-SHING JANG; Lee, Chi Chun
Dates: 2023-10-18; 2023-01-01
ISSN: 2308-457X
Handle: https://scholars.lib.ntu.edu.tw/handle/123456789/636153
Type: conference paper
DOI: 10.21437/Interspeech.2023-857
Scopus ID: 2-s2.0-85171536109
Scopus API: https://api.elsevier.com/content/abstract/scopus_id/85171536109

Abstract: Speech recordings in call centers are narrowband and mixed with various noises. Developing a bandwidth expansion (BWE) model is important for narrowing the automated speech recognition (ASR) performance gap between low and high sampling rate speech data. To further address the in-the-wild noise in call center settings, we propose an Embedding-Polished Wave-U-Net (EP-WUN) that includes an additional speech quality classifier to handle noise and bandwidth expansion of 8k audio simultaneously. With 33% fewer parameters than the current state-of-the-art noise-robust BWE model, our framework improves on that model's speech quality metrics on a well-known BWE dataset (Valentini-Botinhao corpus). It also achieves an 11.71% word error rate reduction when evaluated on a real-world interactive voice response system from E.SUN Bank.

Keywords: Automated speech recognition | Bandwidth expansion | Robust speech representation learning