https://scholars.lib.ntu.edu.tw/handle/123456789/114947
Title: | A Study of Personal Name Disambiguation (人名歧義性分析之研究) |
Author: | 魏煜娟 Wei, Yu-Chuan |
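Keywords: | Name Disambiguation (人名解歧); Information Retrieval (資訊檢索) |
Issue date: | 2006 |
Abstract (translated from the Chinese): | This thesis studies the problem of personal name ambiguity. Just as a word can have several senses, one personal name may be shared by many people; the main goal of this research is to decide whether occurrences of the same name in different documents refer to the same person. Personal name disambiguation has received growing attention in recent years, with applications including personal profile construction, personal web page search, expert finding, and social network analysis. We propose two types of disambiguation methods, both of which aim to cluster the documents mentioning a name so that all documents in a cluster refer to the same person. The multiple-classifier approach chains five classifiers to cluster documents. The five classifiers correspond to five features extracted from the documents, which serve as the basis for distinguishing individuals. The first two classifiers cluster on personal titles and communities respectively and are expected to achieve high precision; the term, time, and URL classifiers are then applied to raise recall and thereby improve overall performance. We also propose alternative algorithms for three of the classifiers to examine their effect. The single-classifier approach is another way to disambiguate names: it considers multiple features at once and clusters the documents directly. Here we examine the clustering results under different clustering algorithms and different features. The experimental data use three real personal names and take into account the effect of name awareness (celebrities vs. ordinary people), data type (news vs. web pages), and data source (Taiwan vs. Mainland China). The results show that, in the multiple-classifier approach, clustering directly on personal titles works better than the more complex two-stage decision method, and full-text analysis introduces more noise and lowers system performance; in the single-classifier approach, considering all features together works better than using terms alone; and expanding communities from the Web benefits both approaches. The best result of the multiple-classifier approach reaches an F-score of 70%, a relative improvement of about 40% over the single classifier that uses only terms as features (the most basic name disambiguation method). Finally, the conclusion outlines directions for future work on this topic. |
Abstract (author's English version): | In this thesis, we study the problem of personal name disambiguation. Many individuals share the same name. The objective of our work is to identify the different individuals in a set of documents and to cluster the documents so that each group relates to one person. Two types of approaches are proposed and compared. In the multiple-classifier approach, several classifiers are integrated to disambiguate the referents of personal names. Each classifier is built on one feature. Alternative algorithms are proposed and substituted into three of the classifiers. In the single-classifier approach, documents are clustered in a single step. Different clustering algorithms and different features are considered and compared in this approach. In the test data sets, we address the degree of awareness of an entity (household name vs. general name), the sources of materials (newswire vs. web pages), and web pages from different areas (Mainland China vs. Taiwan). The experimental results for the multiple-classifier approach show that personal titles and communities are two strong cues for clustering. The first two classifiers achieve very high precision, and the last three classifiers improve recall at only some expense of precision. The average F-score increases gradually from the first classifier to the last. The results for several alternatives show that clustering personal titles directly performs better than the two-step strategy, and that terms extracted from the full text seem to introduce considerable noise for name disambiguation. In the single-classifier approach, high performance is achieved when all types of features are applied. Expanding communities from the Web improves performance in both approaches. The best alternative in the multiple-classifier approach achieves an F-score of 70%, about a 40% increase over the general name disambiguation method. We close with a discussion comparing the two proposed clustering algorithms and a conclusion. |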
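The chained design described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Doc` fields, the Jaccard similarity, the thresholds, the single-link merging via union-find, and the pairwise F1 measure are all assumptions chosen for illustration, and only three of the five features (titles, communities, terms) are shown. Time and URL stages would follow the same pattern.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Doc:
    """One document mentioning the ambiguous name (all fields are hypothetical)."""
    doc_id: str
    titles: set = field(default_factory=set)     # personal titles found near the name
    community: set = field(default_factory=set)  # other person names that co-occur
    terms: set = field(default_factory=set)      # content words


def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cascade_cluster(docs, stages):
    """Apply (feature, threshold) stages in order; each stage may only merge clusters."""
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for feature, threshold in stages:
        for i, j in combinations(range(len(docs)), 2):
            if find(i) == find(j):
                continue  # already in the same cluster
            sim = jaccard(getattr(docs[i], feature), getattr(docs[j], feature))
            if sim >= threshold:
                parent[find(j)] = find(i)  # single-link merge (an assumption)

    clusters = {}
    for i, doc in enumerate(docs):
        clusters.setdefault(find(i), []).append(doc.doc_id)
    return sorted(sorted(c) for c in clusters.values())


def pairwise_f1(predicted, gold):
    """Pairwise F1 over same-cluster document pairs (assumed metric, not the thesis's)."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(c, 2)}

    pred, true = pairs(predicted), pairs(gold)
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(true)
    return 2 * precision * recall / (precision + recall)


# Toy data: two people sharing one name, four documents (invented for illustration).
docs = [
    Doc("d1", titles={"professor"}, community={"Chen", "Lin"}, terms={"retrieval", "corpus"}),
    Doc("d2", titles={"professor"}, community={"Chen"}, terms={"lecture"}),
    Doc("d3", titles={"pitcher"}, community={"Wang"}, terms={"baseball", "season"}),
    Doc("d4", terms={"baseball", "season", "inning"}),
]
gold = [["d1", "d2"], ["d3", "d4"]]
stages = [("titles", 1.0), ("community", 0.5), ("terms", 0.5)]  # strict first, looser later

print(cascade_cluster(docs, stages[:1]))  # title stage only
print(cascade_cluster(docs, stages))      # full cascade
print(pairwise_f1(cascade_cluster(docs, stages), gold))
```

In this toy setup the strict title stage merges only documents with identical titles, and the later term stage links the remaining documents. This mirrors, in a simplified way, the abstract's description of early classifiers favoring precision and later ones adding recall.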
URI: | http://ntur.lib.ntu.edu.tw//handle/246246/53610 |
Other identifiers: | en-US |
Appears in collections: | Department of Computer Science and Information Engineering (資訊工程學系) |
File | Description | Size | Format | |
---|---|---|---|---|
ntu-95-R92922129-1.pdf | | 23.31 kB | Adobe PDF | View/Open |
Items in the IR system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.