Identifying Weak Segmentation on the Fly to Improve Chinese Word Segmentation
Date Issued
2015
Author(s)
Teng, Jun
Abstract
We propose a new method to improve Chinese word segmentation systems by identifying weak segmentations on the fly; the idea can be implemented in any kind of statistical learning model. The method can be described in three steps. First, we use the segmentation results produced by our baseline model to identify weak segmentations, which we denote as error candidate tokens. Second, we use Internet resources to correct the error candidate tokens and collect the new words we find into a correction word dictionary. Third, we use this dictionary to relabel the test data, rather than retraining the model, to improve overall performance. We implement this idea with a CRF model and propose three ways to find error candidate tokens; we also design a mechanism that uses Internet search counts and the titles of Wikipedia pages to discover new words that may have been wrongly segmented in the first place. In experiments on the SIGHAN 2005 Chinese word segmentation Bakeoff, our method improves the baseline model's F-measure and also yields a significant improvement in OOV recall. On the SIGHAN 2014 and NLPCC 2015 datasets, which consist of Weibo data, the baseline model's performance improves substantially after applying our method. This shows that our method is effective on social media data, which contains many new words.
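The third step, relabeling test data with a correction word dictionary instead of retraining, could be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: it assumes the baseline output is a list of tokens and greedily merges adjacent tokens whose concatenation appears in the dictionary, preferring longer matches.

```python
def relabel(tokens, correction_dict, max_span=4):
    """Merge adjacent tokens whose concatenation appears in the
    correction dictionary (greedy longest match, left to right)."""
    out = []
    i = 0
    n = len(tokens)
    while i < n:
        merged = False
        # Try the longest possible span first, down to two tokens.
        for span in range(min(max_span, n - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in correction_dict:
                out.append(candidate)
                i += span
                merged = True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical example: the baseline wrongly splits the new word
# "微博" (Weibo) into two single-character tokens.
baseline = ["微", "博", "很", "有趣"]
corrections = {"微博"}
print(relabel(baseline, corrections))  # ['微博', '很', '有趣']
```

Because only the output tokens are rewritten, this step leaves the trained model untouched, which is what makes the correction inexpensive to apply at test time.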
Subjects
segmentation
new word detection
identify weak segmentation
Type
thesis
