Huang H.-H.
Chen H.-H.
2019-07-10
2019-07-10
2011
13138502
https://www.scopus.com/inward/record.uri?eid=2-s2.0-84858321116&partnerID=40&md5=78033f1f8d5a56b23c8ee30263e9f0e5
8th International Conference on Recent Advances in Natural Language Processing, RANLP 2011, 12 September 2011 through 14 September 2011, Hissar
The fuzziness of Chinese sentence boundaries makes discourse analysis more challenging. Moreover, many articles posted on the Internet even lack punctuation marks. In this paper, we collect documents written by masters as a reference corpus and propose a model to label the punctuation marks for a given text. Conditional random field (CRF) models trained with the corpus determine the correct delimiter (a comma or a full stop) between each pair of successive clauses. Different tagging schemes and various features from different linguistic levels are explored. The results show that our segmenter achieves an accuracy of 77.48% for plain text, which is close to the human performance of 81.18%. For richly formatted text, our segmenter achieves an even better accuracy of 82.93%.
Pause and stop labeling for Chinese sentence boundary detection
conference paper
2-s2.0-84858321116
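
The labeling task described in the abstract (choosing a comma or a full stop between successive clauses, then scoring by accuracy) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the label names PAUSE and STOP, and the choice to treat only the fullwidth comma and the ideographic full stop as delimiters are all assumptions made for illustration, and the CRF model itself is not shown.

```python
import re

# Assumed delimiter set: fullwidth comma (pause) and ideographic full stop (stop).
PAUSE_MARK = ","
STOP_MARK = "。"


def clauses_and_labels(text):
    """Split punctuated text into clauses and the label of the delimiter after each.

    Returns two parallel lists: the clause strings and, for each clause,
    "PAUSE" if it is followed by a comma or "STOP" if followed by a full stop.
    Trailing text with no delimiter after it is ignored.
    """
    # The capturing group keeps the delimiters in the split result,
    # so clauses sit at even indices and delimiters at odd indices.
    parts = re.split(f"([{PAUSE_MARK}{STOP_MARK}])", text)
    clauses, labels = [], []
    for clause, delim in zip(parts[0::2], parts[1::2]):
        if clause:  # skip empty clauses produced by consecutive delimiters
            clauses.append(clause)
            labels.append("PAUSE" if delim == PAUSE_MARK else "STOP")
    return clauses, labels


def accuracy(gold, predicted):
    """Fraction of clause boundaries whose predicted label matches the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


if __name__ == "__main__":
    clauses, gold = clauses_and_labels("今天下雨,我没出门。明天再去。")
    print(clauses, gold)
    # A hypothetical prediction that mislabels the second boundary.
    print(accuracy(gold, ["PAUSE", "PAUSE", "STOP"]))
```

In this framing, a reference corpus supplies the gold labels, and a sequence model such as a CRF would be trained to predict one label per clause from features of the clause sequence; the accuracy function corresponds to the kind of per-boundary score the abstract reports.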