Pause and stop labeling for Chinese sentence boundary detection

Huang H.-H.; Chen H.-H.

Journal

International Conference Recent Advances in Natural Language Processing, RANLP

Pages

146-153

Date Issued

2011

Author(s)

Huang H.-H.

Chen H.-H.

URI

https://www.scopus.com/inward/record.uri?eid=2-s2.0-84858321116&partnerID=40&md5=78033f1f8d5a56b23c8ee30263e9f0e5

Abstract

The fuzziness of Chinese sentence boundaries makes discourse analysis more challenging. Moreover, many articles posted on the Internet even lack punctuation marks. In this paper, we collect documents written by masters as a reference corpus and propose a model to label the punctuation marks for a given text. Conditional random field (CRF) models trained on the corpus determine the correct delimiter (a comma or a full stop) between each pair of successive clauses. Different tagging schemes and various features from different linguistic levels are explored. The results show that our segmenter achieves an accuracy of 77.48% for plain text, which is close to the human performance of 81.18%. For richly formatted text, our segmenter achieves an even better accuracy of 82.93%.
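
The labeling task described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows how punctuated reference text might be turned into clause/label pairs for a sequence labeler such as a CRF, and how labeling accuracy could be scored. The label names PAUSE and STOP and the function names clauses_with_labels and accuracy are assumptions made for this illustration, and the sketch handles only the full-width comma and full stop.

```python
import re

# Hypothetical label names chosen for this sketch; the paper's tag set may differ.
PAUSE, STOP = "PAUSE", "STOP"
DELIMITERS = {"，": PAUSE, "。": STOP}


def clauses_with_labels(text):
    """Split punctuated text into (clause, label) pairs.

    Each clause is labeled by the delimiter that follows it.
    A trailing clause with no delimiter is dropped because it has no label.
    """
    # A capturing group makes re.split keep the delimiters in the result,
    # so clauses sit at even indices and delimiters at odd indices.
    parts = re.split("([，。])", text)
    pairs = []
    for clause, delimiter in zip(parts[0::2], parts[1::2]):
        clause = clause.strip()
        if clause:
            pairs.append((clause, DELIMITERS[delimiter]))
    return pairs


def accuracy(gold, predicted):
    """Fraction of clause boundaries whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


if __name__ == "__main__":
    sample = "今天天氣很好，我們去公園。他沒有來。"
    pairs = clauses_with_labels(sample)
    gold = [label for _, label in pairs]
    # A made-up prediction, used only to exercise the accuracy function.
    predicted = [PAUSE, PAUSE, STOP]
    print(pairs)
    print(accuracy(gold, predicted))
```

In a setup like this, the clause sequence with its delimiters removed would serve as the model input, and the labels recovered from the reference corpus would serve as training targets.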

Description

8th International Conference on Recent Advances in Natural Language Processing, RANLP 2011, 12 September 2011 through 14 September 2011, Hissar

Type

conference paper
