A Word-Clip algorithm for Named Entity Recognition - by example of historical documents

Chang, Shan-Pin

Date Issued

2006

Author(s)

Chang, Shan-Pin

Language

zh-TW

URI

http://ntur.lib.ntu.edu.tw//handle/246246/53830

Abstract

In principle, Chinese characters can be combined into an unbounded number of phrases, which no existing resource, including dictionaries, can enumerate completely. As a result, attempts to identify proper nouns (PNs) in a document suffer from both false detections and misses. In this thesis, we propose a method based on the notion of a word-clip to identify proper nouns in documents from a specific domain. Methods for PN recognition fall into three categories: rule-based, corpus-based, and machine-learning methods. Corpus-based methods are the most widely used, but they usually require building a large dictionary, which is where the bulk of the work lies. The word-clip method needs no dictionary, which makes our algorithm more efficient. Its main idea is to exploit existing relationships between a PN and the phrase that contains it. For example, the abbreviation "Mr." is usually followed by a person's name (with a few exceptions such as "Mr. President"). A typical word-clip is thus formed by combining a "leading phrase", a "PN prefix", a "PN postfix", and an "ending phrase". Our algorithm uses a set of initial sample PNs plus a set of training documents to generate word-clips. These word-clips are then used to identify new PNs for the next training cycle, and the process is iterated to generate candidate PNs.
We tested our method on two large sets of historical documents: 33,025 court documents from the Ming and Qing Dynasties, and 21,575 old land deeds. For the former, we generated 74,825 person names with a precision of 56.1% and a recall of 77.1%, and 6,306 location names with a precision of 87.0% and a recall of 87.9%. For the latter, we generated 28,358 person names with a precision of 45.6% and a recall of 72.9%, and 4,132 location names with a precision of 77.6% and a recall of 80.3%.
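
The iterative cycle described in the abstract (seed PNs generate word-clips, word-clips yield new PNs, repeat) can be illustrated with a minimal sketch. It is not the thesis implementation and rests on several simplifying assumptions: text is split on whitespace, every PN is a single token, and a clip is reduced to a (leading token, ending token) pair, omitting the "PN prefix" and "PN postfix" components. The function names `learn_clips`, `apply_clips`, and `bootstrap`, and the toy English sentences, are hypothetical.

```python
# Simplified, hypothetical sketch of a word-clip style bootstrapping loop.
# Assumption: a clip is only (leading token, ending token); the thesis also
# uses a PN prefix and a PN postfix, which are left out here for brevity.

Clip = tuple[str, str]


def learn_clips(docs: list[list[str]], names: set[str]) -> set[Clip]:
    """Collect the (previous token, next token) context of every known name."""
    clips = set()
    for tokens in docs:
        for i in range(1, len(tokens) - 1):
            if tokens[i] in names:
                clips.add((tokens[i - 1], tokens[i + 1]))
    return clips


def apply_clips(docs: list[list[str]], clips: set[Clip]) -> set[str]:
    """Return every token that sits inside one of the learned contexts."""
    found = set()
    for tokens in docs:
        for i in range(1, len(tokens) - 1):
            if (tokens[i - 1], tokens[i + 1]) in clips:
                found.add(tokens[i])
    return found


def bootstrap(docs: list[list[str]], seeds: set[str], max_rounds: int = 10) -> set[str]:
    """Alternate clip learning and clip matching until no new candidate appears."""
    names = set(seeds)
    for _ in range(max_rounds):
        clips = learn_clips(docs, names)
        new_names = apply_clips(docs, clips) - names
        if not new_names:
            break
        names |= new_names
    return names


# Toy corpus (invented for illustration only).
docs = [s.split() for s in [
    "Mr. Chang said hello",
    "Mr. Lin said goodbye",
    "then Mr. Wang said nothing",
    "Judge Lin ruled today",
    "Judge Chen ruled yesterday",
]]

candidates = bootstrap(docs, {"Chang"})
print(sorted(candidates))  # ['Chang', 'Chen', 'Lin', 'Wang']
```

Starting from the single seed "Chang", the first round learns the context ("Mr.", "said") and finds "Lin" and "Wang". The second round learns ("Judge", "ruled") from "Lin" and finds "Chen". The sketch accepts every match as a candidate, so a context that also surrounds non-names would lower precision.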

Subjects

word-clip (詞夾子)

candidate (候選詞)

named entity recognition (專有名詞辨識)

Type

thesis

File(s)

Name

ntu-95-R93922127-1.pdf

Size

23.31 KB

Format

Adobe PDF

Checksum

(MD5):6723e0039dfef21b6ec1dfd86a65eb18
