An Approach of Using Multiple Dictionaries and Conditional Random Field in Chinese Segmentation and Part of Speech Tagging
Date Issued
2008
Date
2008
Author(s)
Lo, Yong-Sheng
Abstract
This paper proposes a combined dictionary and CRF approach to Chinese word segmentation and part-of-speech tagging. The approach generates all probable candidate sentences by looking up dictionaries and selects the best one using a CRF model. It can incorporate as many dictionaries as needed to address the new-term problem without retraining the model. Moreover, a practical method is proposed for adding terms to the system's dictionary without causing any inconsistency in the segmentation rules. Most usefully, the approach can select dictionaries and segmentation settings according to the document type. The training and test collections of the first SIGHAN bakeoff and a medical document collection are used in the experiments. The approach achieves an F-score of 0.964 in segmentation and 0.922 in part-of-speech tagging, which is satisfactory. Moreover, training uses only 7,229 lines of the training file, which shows that the model is easy to build from a small amount of training data. The approach still achieves an F-score of 0.954 in segmentation and 0.939 in part-of-speech tagging when only 10 simplified parts of speech are used for training. The strengths of this approach are its simplicity, practicality, and flexibility.
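The two-stage idea in the abstract (propose candidates from dictionaries, then select one with a model) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the `max_word_len` limit, the rule that single characters are always allowed, and the example dictionaries are all assumptions made for illustration. In particular, `fewest_words_score` is a toy stand-in for the CRF model, and part-of-speech tagging is left out.

```python
from typing import Callable, Dict, Iterable, List, Set


def candidate_segmentations(
    sentence: str,
    dictionaries: Iterable[Set[str]],
    max_word_len: int = 4,  # hypothetical limit, not taken from the thesis
) -> List[List[str]]:
    """Enumerate every split of `sentence` into dictionary words.

    Single characters are always accepted (an assumption) so that at
    least one segmentation exists for any input.
    """
    vocab: Set[str] = set().union(*dictionaries)
    # memo[i] holds all segmentations of sentence[i:]
    memo: Dict[int, List[List[str]]] = {len(sentence): [[]]}

    def split_from(start: int) -> List[List[str]]:
        if start in memo:
            return memo[start]
        results: List[List[str]] = []
        last_end = min(len(sentence), start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = sentence[start:end]
            if len(word) == 1 or word in vocab:
                for rest in split_from(end):
                    results.append([word] + rest)
        memo[start] = results
        return results

    return split_from(0)


def fewest_words_score(words: List[str]) -> float:
    """Toy scorer standing in for the CRF: prefer fewer, longer words."""
    return -len(words)


def select_best(
    candidates: List[List[str]],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Return the highest-scoring candidate segmentation."""
    return max(candidates, key=score)


# Hypothetical dictionaries: one general, one medical.
general = {"患有"}
medical = {"糖尿病"}
sentence = "他患有糖尿病"

without_medical = candidate_segmentations(sentence, [general])
with_medical = candidate_segmentations(sentence, [general, medical])
best = select_best(with_medical, fewest_words_score)
```

In this sketch, adding the medical dictionary only enlarges the candidate set and leaves the scorer untouched, which mirrors the abstract's claim that new dictionaries can be incorporated without retraining the model.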
Subjects
Chinese word segmentation
part of speech tagging
dictionaries
conditional random field
CRF
linguistic rules
SIGHAN
Type
thesis
File(s)
Name
ntu-97-R95922009-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum
(MD5):f9d9d4881ae7711253cbbc58772cd822