Toward unsupervised model-based spoken term detection with spoken queries without annotated data

Chan, C.-A.; Chung, C.-T.; Kuo, Y.-H.; Lee, L.-S.

doi:10.1109/ICASSP.2013.6639334

Journal

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Pages

8550 - 8554

Date Issued

2013

Author(s)

Chan, C.-A.

Chung, C.-T.

Kuo, Y.-H.

Lee, Lin-Shan

DOI

10.1109/ICASSP.2013.6639334

URI

https://scholars.lib.ntu.edu.tw/handle/123456789/498582

https://www.scopus.com/inward/record.uri?eid=2-s2.0-84890526441&doi=10.1109%2fICASSP.2013.6639334&partnerID=40&md5=cd1d9f75c9431093c16f8c89dec15998

Abstract

We present a two-stage model-based approach to unsupervised query-by-example spoken term detection (STD) that requires no annotated data. Compared with the prevailing dynamic time warping (DTW) approaches to unsupervised STD, the hidden Markov models (HMMs) used in model-based approaches better capture the signal distributions and time trajectories of speech, with a more global view of the spoken archive; matching against model states also significantly reduces the computational load. The utterances in the spoken archive are first decoded offline into acoustic patterns that are discovered automatically, in an unsupervised way, from the archive itself. In the first stage, we propose a document state matching (DSM) approach, in which query frames are matched to the HMM state sequences of the spoken documents. For this matching, a novel duration-constrained Viterbi (DC-Vite) algorithm is proposed to avoid unrealistic speaking-rate distortion. In the second stage, the pseudo-relevant and pseudo-irrelevant examples retrieved in the first stage are used to construct query and anti-query HMMs, respectively. Each spoken term hypothesis is then rescored with the likelihood ratio between these two HMMs. Experimental results on a Mandarin broadcast news corpus show an absolute improvement of 11.8% in mean average precision, with a reduction of more than 50% in computation time, compared with the segmental DTW approach. © 2013 IEEE.
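
The abstract does not give the DC-Vite recursion, so the Python sketch below only illustrates the general idea of a duration-constrained Viterbi alignment and should not be read as the authors' algorithm. It assumes a precomputed matrix `loglik[t][s]` holding the log-likelihood of query frame `t` under document state `s`, and it assumes the query must cover a contiguous run of document states, each occupying between `dmin` and `dmax` frames. The function name, the parameter names, and the hard duration bounds are all hypothetical.

```python
import math

def dc_viterbi(loglik, dmin, dmax):
    """Align T query frames to a contiguous run of document states.

    Assumption (not from the paper): every visited state must absorb
    between dmin and dmax consecutive query frames.
    Returns (best_score, first_state, last_state).
    """
    T = len(loglik)
    S = len(loglik[0])
    # prefix[s][t] = sum of loglik[0..t-1][s], so a segment score is O(1).
    prefix = [[0.0] * (T + 1) for _ in range(S)]
    for s in range(S):
        for t in range(T):
            prefix[s][t + 1] = prefix[s][t] + loglik[t][s]

    neg_inf = -math.inf
    # best[t][s]: best score for the first t frames, with state s ending at t.
    best = [[neg_inf] * S for _ in range(T + 1)]
    # start[t][s]: first document state of the path that achieves best[t][s].
    start = [[-1] * S for _ in range(T + 1)]

    for t in range(1, T + 1):
        for s in range(S):
            for d in range(dmin, min(dmax, t) + 1):
                seg = prefix[s][t] - prefix[s][t - d]
                if t == d:
                    # The path may begin at any document state.
                    cand, first = seg, s
                elif s > 0 and best[t - d][s - 1] > neg_inf:
                    # Otherwise continue from the preceding state.
                    cand = best[t - d][s - 1] + seg
                    first = start[t - d][s - 1]
                else:
                    continue
                if cand > best[t][s]:
                    best[t][s] = cand
                    start[t][s] = first

    end = max(range(S), key=lambda s: best[T][s])
    return best[T][end], start[T][end], end
```

Under these assumptions, a score of `-math.inf` means that no alignment satisfies the duration bounds, for example when the query has more frames than the document states can absorb at `dmax` frames each. Bounding the number of frames per state is one simple way to rule out the extreme stretching or compression that an unconstrained alignment would allow.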

Event(s)

2013 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013

Subjects

query-by-example; speech pattern discovery; Unsupervised spoken term detection; zero-resource

SDGs

SDG4

Other Subjects

Model based approach; Precision improvement; Query-by-example; Signal distribution; Speech patterns; Spoken Term Detection (STD); Spoken term detections; zero-resource; Signal processing; Speech recognition; Viterbi algorithm; Query processing

Type

conference paper
