The Analysis of Identification of Chinese Stop Characters with Support Vector Machine
Date Issued
2012
Date
2012
Author(s)
Wang, Wei-Chiang
Abstract
In Chinese linguistics, vocabulary is classified into content words and function words. Function words serve to attach to or connect other words. They cannot form a sentence on their own; they only combine with content words to complete a grammatical structure. Because of this grammatical role, function words are frequently studied by linguists, and their identification is an important research topic in Chinese language processing. In this thesis, the identification of function words is limited to single Chinese characters that act as function words.
In this thesis, we propose a method that combines two types of SVM (Support Vector Machine), a one-class SVM and a two-class SVM, to identify Chinese function words. Using function characters curated by humans, we trained a machine learning model to build an automatic identification tool for function characters. For every sample character, we generated a feature vector of 45 features. The LIBSVM tool is applied in three parts: the one-class SVM, the two-class SVM, and feature selection.
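As a rough illustration of how a 45-feature sample could be handed to LIBSVM, the sketch below serializes one feature vector into LIBSVM's sparse text format (`<label> <index>:<value> ...`, with 1-based indices and zero values omitted). The function name `to_libsvm_line`, the label convention (+1 for a function character, -1 otherwise), and the example feature values are assumptions made for illustration; the abstract does not describe the actual features or file layout.

```python
from typing import Sequence

NUM_FEATURES = 45  # feature count stated in the abstract


def to_libsvm_line(label: int, features: Sequence[float]) -> str:
    """Serialize one sample into LIBSVM's sparse text format.

    The label convention (+1 function character, -1 otherwise) is an
    assumption for this sketch, not something the abstract specifies.
    """
    if len(features) != NUM_FEATURES:
        raise ValueError(f"expected {NUM_FEATURES} features, got {len(features)}")
    parts = [f"{label:+d}"]
    # LIBSVM feature indices start at 1; zero-valued features may be omitted.
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Hypothetical sample: only the first and last features are non-zero.
example = [0.0] * NUM_FEATURES
example[0] = 0.25
example[44] = 1.0
line = to_libsvm_line(+1, example)
print(line)
```

A file of such lines is the usual input to LIBSVM's command-line tools; for one-class training the label column is still present but is not used to separate classes.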
The training and test data are Buddhist scriptures selected from the FaHua division of the CBETA corpus. The training data contains 3660 characters, of which 289 are function characters. The test data contains 3228 single-character words, of which 223 are function characters. In the leave-one-out cross-validation experiment with the optimization process, precision and recall reach 0.947 and 0.920, respectively. However, in the independent test experiment, precision and recall drop to 0.311 and 0.318, respectively.
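For readers less familiar with the metrics quoted above, the following minimal sketch shows how precision and recall are computed from gold and predicted labels. The toy label lists are invented for illustration and are not thesis data. The F1 score (the harmonic mean of precision and recall) is not reported in the abstract; it is derived here from the stated precision and recall values only to summarize the gap between the two experiments.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall for the positive class (here: 'is a function character')."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)      # true positives
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)  # false positives
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy labels (hypothetical, not from the thesis).
gold = [True, True, False, False, True]
predicted = [True, False, True, False, True]
toy_precision, toy_recall = precision_recall(gold, predicted)

# F1 derived from the precision/recall figures stated in the abstract.
cv_f1 = f1_score(0.947, 0.920)    # leave-one-out cross-validation
test_f1 = f1_score(0.311, 0.318)  # independent test set
print(f"toy P={toy_precision:.3f} R={toy_recall:.3f}")
print(f"F1 (cross-validation)={cv_f1:.3f}, F1 (independent test)={test_f1:.3f}")
```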
We discuss two possible causes of the performance gap between the leave-one-out cross-validation experiment and the independent test experiment. One is that the training and test data differ in writing style and in usage across dynasties; the other is insufficient training data.
Subjects
function word
support vector machine
Chinese Buddhist Electronic Text Association
natural language processing
Type
thesis
File(s)
Name
ntu-101-R99525082-1.pdf
Size
23.54 KB
Format
Adobe PDF
Checksum
(MD5):b0c3f8123bcb97a96461456163df75ac
