Unsupervised Spoken Term Detection with Spoken Queries
Date Issued
2012
Date
2012
Author(s)
Chan, Chun-an
Abstract
Unsupervised spoken term detection (STD) with spoken queries is a new and important topic in multimedia retrieval. Because unsupervised approaches need no annotated data, they bypass various problems in speech recognition, particularly recognition errors under different acoustic and linguistic conditions. Such approaches even make searching for spoken terms possible in low-resource languages and in languages without a writing system. In this dissertation, we propose several techniques for unsupervised STD with spoken queries.
We propose two improved approaches based on dynamic time warping (DTW) to address the speaking-rate distortion and computational efficiency issues of the conventional segmental DTW approach. The Slope-Constrained Dynamic Time Warping (SC-DTW) approach is developed to handle speaking-rate distortion. The segment-based DTW approach is devised to reduce the computational burden. The concatenation of these two approaches and the Weighted Pseudo Similarity of SC-DTW approach in the Pseudo Relevance Feedback (PRF) framework show significant improvements in both detection performance and efficiency.
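To make the idea of a slope constraint concrete, the following is a minimal sketch of DTW with a textbook local step pattern that keeps the warping slope between 1/2 and 2. It is not the SC-DTW formulation of the dissertation, whose details are not given here; the function name `slope_constrained_dtw`, the scalar toy features, and the absolute-difference distance are assumptions made for illustration only.

```python
import math


def slope_constrained_dtw(query, doc, dist=lambda a, b: abs(a - b)):
    """Alignment cost of two sequences with local slope limited to [1/2, 2].

    Illustrative step pattern (an assumption, not the thesis's SC-DTW):
    a cell (i, j) may be reached from (i-1, j-1), from (i-2, j-1) via
    (i-1, j), or from (i-1, j-2) via (i, j-1).
    """
    n, m = len(query), len(doc)
    # cost[i][j]: best cost of a path from (0, 0) ending at cell (i, j), 1-based
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(query[i - 1], doc[j - 1])
            # Diagonal step: one query frame matched to one document frame.
            best = cost[i - 1][j - 1] + d
            if i >= 2:
                # Two query frames per document frame (slope 2).
                best = min(best,
                           cost[i - 2][j - 1] + dist(query[i - 2], doc[j - 1]) + d)
            if j >= 2:
                # Two document frames per query frame (slope 1/2).
                best = min(best,
                           cost[i - 1][j - 2] + dist(query[i - 1], doc[j - 2]) + d)
            cost[i][j] = best
    return cost[n][m]
```

With this step pattern, a sequence spoken at half speed still aligns at zero cost, while a length ratio above 2 has no admissible path and returns infinity. This is how a slope constraint rules out implausible speaking-rate distortions.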
We also propose two model-based approaches for unsupervised STD. We design procedures to construct a set of Acoustic Segment Models (ASMs) that describe the patterns and structures of the target language. In this way, signal trajectory modeling techniques can be leveraged through the ASMs. Using the ASMs, we propose the Document State Matching (DSM) approach, which matches spoken queries to the ASM states in the documents. The Duration-Constrained Viterbi algorithm is developed for the DSM approach. In addition, a Pseudo Likelihood Ratio approach is proposed to verify the hypotheses in the PRF framework. Experimental results show that the model-based approaches achieve comparable detection performance with much less computation time. Our migration from DTW-based approaches to model-based approaches opens up the possibility of leveraging well-developed model-based speech processing techniques in unsupervised STD.
Finally, we tested various configurations for integrating these approaches in our system. With the combined model-based and DTW-based approaches, an absolute Mean Average Precision improvement of 14.2% was achieved using only 23% of the CPU time on the Mandarin broadcast news corpus.
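For readers unfamiliar with the metric, the sketch below shows how Mean Average Precision (MAP) is conventionally computed from ranked detection lists. An absolute improvement is the plain difference between two MAP values, not a ratio. The function names and the toy relevance lists in the usage note are hypothetical and are not taken from the dissertation's experiments.

```python
def average_precision(ranked_relevance, num_relevant):
    """Average precision of one query.

    ranked_relevance: booleans, one per returned item in rank order,
    True where the item is a correct detection.
    num_relevant: total number of correct items for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / num_relevant if num_relevant else 0.0


def mean_average_precision(queries):
    """Mean of the average precision over (ranked_relevance, num_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)
```

As a toy illustration, `mean_average_precision([([True], 1), ([False, True], 1)])` averages the per-query values 1.0 and 0.5. The absolute gain of one system over another is then the difference of their MAP values, expressed in percentage points.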
Subjects
spoken term detection
information retrieval
Type
thesis
File(s)
Name
ntu-101-F95942047-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum
(MD5):deaef7a13c7aa2449c1d73466f289672