A Word-Clip algorithm for Named Entity Recognition -by example of historical documents
Date Issued
2006
Date
2006
Author(s)
Chang, Shan-Pin
Language
zh-TW
Abstract
Chinese characters can in principle be combined into an unbounded number of phrases, which no existing resource, dictionaries included, can enumerate completely. As a result, attempts to identify proper nouns (PNs) in a document suffer from both false detections and misses. In this thesis, we propose a method based on the notion of a word-clip to identify proper nouns in documents from a specific domain.
Methods for PN recognition fall into three categories: rule-based, corpus-based, and machine-learning methods. Corpus-based methods are the most widely used, but they usually require a large dictionary to be built first, and this is where the bulk of the work lies. The word-clip method does not need a dictionary, which makes our algorithm more efficient.
The main idea of the word-clip method is to exploit regular relationships between a PN and the phrase that surrounds it. For example, the abbreviation "Mr." is usually followed by the name of a person (with a few exceptions such as "Mr. President"). A typical word-clip is thus formed by combining a "leading phrase", a "PN prefix", a "PN postfix", and an "ending phrase". Our algorithm uses a set of initial sample PNs plus a set of training documents to generate word-clips. These word-clips are then used to identify new PNs for the next training cycle. The process is iterated to generate candidate PNs.
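The iteration described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say how clips are generated, scored, or filtered, so the function names (`extract_clips`, `apply_clips`, `bootstrap`), the one-character context window, the use of the first and last character of a PN as prefix and postfix, the maximum PN length, and the toy sentences are all assumptions made for illustration.

```python
import re

def extract_clips(docs, known_pns, ctx=1):
    """Collect (leading, prefix, postfix, ending) clips around known PNs.

    Assumption: prefix/postfix are the first/last character of the PN and
    leading/ending are the `ctx` characters on either side of it.
    """
    clips = set()
    for doc in docs:
        for pn in known_pns:
            if len(pn) < 2:
                continue  # need distinct prefix and postfix positions
            for m in re.finditer(re.escape(pn), doc):
                lead = doc[max(0, m.start() - ctx):m.start()]
                end = doc[m.end():m.end() + ctx]
                if len(lead) == ctx and len(end) == ctx:
                    clips.add((lead, pn[0], pn[-1], end))
    return clips

def apply_clips(docs, clips, max_len=4):
    """Return every string that fits one of the clips (hypothetical matcher)."""
    found = set()
    for lead, pre, post, end in clips:
        pattern = re.compile(
            re.escape(lead)
            + "(" + re.escape(pre) + ".{0,%d}?" % (max_len - 2)
            + re.escape(post) + ")"
            + re.escape(end)
        )
        for doc in docs:
            found.update(pattern.findall(doc))
    return found

def bootstrap(docs, seeds, rounds=3):
    """Alternate clip generation and PN discovery until nothing new appears."""
    known = set(seeds)
    for _ in range(rounds):
        new = apply_clips(docs, extract_clips(docs, known)) - known
        if not new:
            break
        known |= new
    return known

# Toy sentences invented for this sketch, not taken from the corpora.
docs = ["原告王大明稱", "原告王小明稱", "被告王小明到案", "被告王阿明到案"]
candidates = bootstrap(docs, {"王大明"})
```

In this toy run the seed yields one clip, which finds a second name; that name occurs in a different context, which yields a second clip and a third name. A real system would also need a way to score clips and reject unreliable ones, which this sketch omits.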
We tested our method on two large sets of historical documents: 33,025 court documents from the Ming and Qing Dynasties, and 21,575 old land deeds. For the court documents, the method generated 74,825 person names with a precision of 56.1% and a recall of 77.1%, and 6,306 location names with a precision of 87.0% and a recall of 87.9%. For the land deeds, it generated 28,358 person names with a precision of 45.6% and a recall of 72.9%, and 4,132 location names with a precision of 77.6% and a recall of 80.3%.
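To compare the four results on a single scale, one option is the F1 score, the harmonic mean of precision and recall. The abstract does not report F1; the sketch below only derives it from the published precision and recall figures, and the dictionary keys and the function name `f1` are labels chosen here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in the abstract.
reported = {
    "court documents, persons": (0.561, 0.771),
    "court documents, locations": (0.870, 0.879),
    "land deeds, persons": (0.456, 0.729),
    "land deeds, locations": (0.776, 0.803),
}

# Derived values, not results stated by the author.
f1_scores = {name: f1(p, r) for name, (p, r) in reported.items()}

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.3f}")
```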
Subjects
word-clip (詞夾子)
candidate (候選詞)
named entity recognition (專有名詞辨識)
Type
thesis
File(s)
Name
ntu-95-R93922127-1.pdf
Size
23.31 KB
Format
Adobe PDF
Checksum
(MD5):6723e0039dfef21b6ec1dfd86a65eb18