A Study of Personal Name Disambiguation (人名歧義性分析之研究)

Contributor: 陳信希
Affiliation: National Taiwan University, Graduate Institute of Computer Science and Information Engineering (臺灣大學：資訊工程學研究所)
Author: 魏煜娟 (Wei, Yu-Chuan)
Record dates: 2007-11-26; 2018-07-05
Year: 2006
URI: http://ntur.lib.ntu.edu.tw//handle/246246/53610
Title: 人名歧義性分析之研究 (A Study of Personal Name Disambiguation)
Type: thesis
Keywords: Name Disambiguation (人名解歧); Information Retrieval (資訊檢索)
Language: en-US
Format: application/pdf
File size: 1213180 bytes
File: http://ntur.lib.ntu.edu.tw/bitstream/246246/53610/1/ntu-95-R92922129-1.pdf

Abstract (translated from the Chinese)

This thesis investigates the problem of personal name ambiguity. Just as a word can have several senses, a personal name may be shared by many people. The main goal of this study is to decide whether occurrences of the same name in different documents refer to the same person. Personal name disambiguation has received growing attention in recent years; related applications include personal profile construction, personal web page search, expert search, and social network analysis.

We propose two types of personal name disambiguation methods. Both aim to cluster the documents that mention a name so that all documents in a cluster refer to the same person. The multiple-classifier method chains five classifiers to cluster documents. The five classifiers correspond to five features extracted from the documents, which serve as the basis for distinguishing individuals. The first two classifiers cluster by personal title and by community and are expected to achieve high precision. The term, temporal, and URL classifiers are then applied to raise recall and thereby improve overall performance. We also propose alternative algorithms for three of the classifiers and examine their effect. The single-classifier method is another approach to name disambiguation: it considers multiple features at once and clusters the documents directly. For this method we examine the clustering results obtained with different clustering algorithms and different features.

The experimental data cover three real personal names and take into account how well known the person is (celebrities vs. ordinary people), the type of data (news vs. web pages), and the data source (Taiwan vs. Mainland China). The results show that, in the multiple-classifier method, clustering directly by personal title works better than the more complex two-stage decision method, and that full-text analysis introduces more noise and lowers system performance. For the single-classifier method, considering all features together gives better results than using terms alone. Expanding communities with the Web benefits both methods. The best multiple-classifier result reaches an F-score of 70%, a relative improvement of about 40% over the single classifier that uses only terms as features (the most basic name disambiguation method). Finally, the conclusion discusses directions for future work on this topic.

Abstract (English)

In this thesis, we study the problem of personal name disambiguation. Many individuals share the same name. The objective of our work is to identify the different individuals mentioned in a set of documents and to cluster the documents into groups such that each group relates to one person. Two types of approaches are proposed and compared. In the multiple-classifier approach, several classifiers are integrated to disambiguate the denotations of personal names. Each classifier is built on one feature. Alternative algorithms are proposed for three of the classifiers. In the single-classifier approach, documents are clustered directly in one step. Different clustering algorithms and different features are considered and compared in this approach. In the test data sets, we address how well known a person is (household names vs. ordinary people), the source of the material (newswire vs. web pages), and web pages from different areas (Mainland China vs. Taiwan).

The experimental results for the multiple-classifier approach show that personal titles and communities are two strong cues for clustering. The first two classifiers achieve very high precision, and the last three classifiers improve recall at some cost in precision. The average F-score increases gradually from the first classifier to the last. The results for several alternatives show that clustering by personal titles directly performs better than the two-step strategy, and that terms extracted from the full text seem to introduce considerable noise for name disambiguation. In the single-classifier approach, high performance is achieved when all types of features are applied. Expanding communities from the Web improves performance in both approaches. The best alternative in the multiple-classifier approach achieves an F-score of 70%, a relative improvement of about 40% over the general name disambiguation method. We close by comparing the two proposed clustering approaches and drawing conclusions.

Table of Contents

Chapter 1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Related Work
1.4 Main Issues
1.5 The Organization of This Thesis
Chapter 2 Evaluation Corpora
2.1 Selection Strategies of Testing Names
2.2 Description of Resources
2.2.1 Newswire
2.2.2 Web Pages
2.3 Comparison of Three Materials
2.3.1 Newswire vs. Web Pages
2.3.2 Web Pages in Taiwan vs. Web Pages in China
Chapter 3 Multiple-Classifier Approach
3.1 Overview
3.2 Data Preprocessing
3.2.1.1 Data Extraction, Code Translation, Context Extraction, and POS Tagging
3.2.2 Feature Extraction
3.2.2.1 Personal Title Extraction
3.2.2.2 Community Extraction
3.2.2.3 Term Extraction
3.2.2.4 Temporal Expression Extraction
3.2.2.5 URL Extraction
3.3 Five Classifiers in the Multiple-Classifier Approach
3.3.1 A Classifier Using Personal Titles (C1)
3.3.1.1 Dividing by Title Keywords and Organization Names (C11)
3.3.1.2 Merging by Organization Names (C12)
3.3.2 A Classifier Using Communities (C2)
3.3.2.1 Disambiguating by Communities (C21)
3.3.2.2 Self-Dividing by Communities (C22)
3.3.3 A Classifier Using Term Vectors (C3)
3.3.3.1 Disambiguating by Term Vectors (C31)
3.3.3.2 Merging by Term Vectors (C32)
3.3.4 A Classifier Using Temporal Expressions (C4)
3.3.5 A Classifier Using URLs of Documents (C5)
3.4 Cluster Labeling
Chapter 4 Experiments of Multiple-Classifier Approach
4.1 Evaluation Metrics
4.2 Baseline Models
4.3 Experimental Results
4.3.1 Performance of Personal Title Classifier
4.3.2 Performance of Community Classifier
4.3.3 Performance of Term Vector Classifier
4.3.4 Performance of Temporal Expression and URLs of Documents Classifiers
4.3.5 Overall Performance and Discussion
4.4 Alternative Approaches
4.4.1 Personal Title Classifier
4.4.1.1 Directly Clustering by Personal Titles
4.4.1.2 Merging by Ratio
4.4.1.3 Merging by Chi-square
4.4.2 Community Classifier
4.4.2.1 Community Expansion
4.4.2.1.1 Building an NE ontology
4.4.2.1.2 Setting up a Community Chain from Two Ontologies
4.4.2.1.3 Web Search with Double Checking Model
4.4.2.1.4 Community Expansion from the Web
4.4.2.2 Expansion in Community Classifier
4.4.3 Term Vector Classifier
4.5 Results of Alternative Approaches
4.5.1 Personal Title Classifier
4.5.2 Community Classifier
4.5.3 Term Vector Classifier
Chapter 5 Single-Classifier Approach
5.1 Agglomerative Clustering Algorithms
5.2 Two Alternatives
5.3 Experimental Results
5.3.1 Three Agglomerative Clustering Algorithms
5.3.2 Two Alternative Single-Classifiers
5.4 Comparison between Multiple-Classifiers and Single-Classifiers
5.5 Dynamic Threshold Setting
5.5.1 Average-link with Dynamic Threshold
5.5.2 Experiments
5.6 Visualization of Results
Chapter 6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
Appendix
I Statistics of “Chien-Ming Wang” in UDN, TW, and CN
D Performances of Multiple-Classifier Approach
M Test Data and Scores in Dynamic Threshold Setting
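
Illustrative sketch (not taken from the thesis)

The abstract describes a term-only single classifier as the most basic name disambiguation method, and the table of contents lists average-link agglomerative clustering with a threshold. The Python sketch below shows one way such a baseline could look. It is a minimal illustration under stated assumptions, not the author's implementation: the whitespace tokenization, the raw term-frequency weighting, the cosine similarity, the fixed threshold of 0.3, the function names, and the sample documents are all hypothetical choices made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_link(c1, c2, vectors) -> float:
    """Mean pairwise similarity between the documents of two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_documents(docs, threshold=0.3):
    """Average-link agglomerative clustering over term vectors.

    The threshold is an assumed stopping value, not one from the thesis.
    Returns clusters as sorted lists of document indices.
    """
    # Assumption: whitespace tokenization and raw term counts.
    vectors = [Counter(doc.lower().split()) for doc in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        # Find the most similar pair of clusters (a < b by construction).
        (a, b), best = max(
            (((x, y), average_link(clusters[x], clusters[y], vectors))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical documents that all mention the same surname.
docs = [
    "wang pitcher yankees baseball game",
    "yankees pitcher wang wins baseball game",
    "professor wang publishes paper on physics",
    "physics professor wang gives lecture on paper",
]
print(cluster_documents(docs))  # [[0, 1], [2, 3]]
```

In this toy input the two baseball documents and the two physics documents end up in separate clusters, each standing for one person. A fixed threshold is the simplest stopping rule; the thesis also lists a dynamic threshold setting (Section 5.5), which this sketch does not attempt to reproduce.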