Pause and stop labeling for Chinese sentence boundary detection
Journal
International Conference on Recent Advances in Natural Language Processing, RANLP
Pages
146-153
Date Issued
2011
Author(s)
Huang H.-H.
Abstract
The fuzziness of Chinese sentence boundaries makes discourse analysis more challenging. Moreover, many articles posted on the Internet lack punctuation marks altogether. In this paper, we collect documents written by masters as a reference corpus and propose a model to label the punctuation marks for a given text. Conditional random field (CRF) models trained on the corpus determine the correct delimiter (a comma or a full stop) between each pair of successive clauses. Different tagging schemes and various features from different linguistic levels are explored. The results show that our segmenter achieves an accuracy of 77.48% for plain text, which is close to the human performance of 81.18%. For richly formatted text, our segmenter achieves an even better accuracy of 82.93%.
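The labeling task described in the abstract can be pictured as assigning one delimiter label to each clause. The following is only a minimal sketch of how punctuated reference text might be turned into clause/label pairs for training a sequence labeler. The label names PAUSE and STOP, the function name, and the sample sentence are illustrative assumptions; the tagging schemes and features actually used in the paper are not given in this record.

```python
import re

# Hypothetical label names; the paper's real tagging schemes are not stated here.
DELIMITERS = {"，": "PAUSE", "。": "STOP"}


def clauses_with_labels(text):
    """Split punctuated text into (clause, label) pairs.

    Each clause is labeled by the delimiter that follows it.
    Trailing text with no delimiter is dropped, since it has no label.
    """
    pairs = []
    # Each match is a run of non-delimiter characters followed by one delimiter.
    for match in re.finditer(r"([^，。]+)([，。])", text):
        clause = match.group(1).strip()
        mark = match.group(2)
        if clause:
            pairs.append((clause, DELIMITERS[mark]))
    return pairs


# Illustrative sentence, not taken from the paper's corpus.
sample = "今天天气很好，我们去公园散步。回来以后，大家都很开心。"
for clause, label in clauses_with_labels(sample):
    print(label, clause)
```

A sequence model such as a CRF would then be trained to predict the label column from features of each clause and its neighbors; that step is omitted here because the record does not specify it.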
Description
8th International Conference on Recent Advances in Natural Language Processing, RANLP 2011, 12 September 2011 through 14 September 2011, Hissar
Type
conference paper
