A Study on Multiple Document Summarization Systems (多文件文章摘要系統之研究)

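Contributor: 陳信希
Institution: National Taiwan University, Graduate Institute of Computer Science and Information Engineering (臺灣大學：資訊工程學研究所)
Author: 郭俊桔 (Kuo, June-Jei)
Record dates: 2007-11-26; 2018-07-05
Year: 2006
Handle: http://ntur.lib.ntu.edu.tw//handle/246246/53758

Abstract (translated from the Chinese):
To help online users retrieve the news information they need from the Internet quickly and effectively, this dissertation studies multi-document summarization systems. It examines related issues such as sentence selection, detection and removal of redundant content, and sentence ordering, proposes solutions to them, and then presents a new summarization system that consists mainly of two modules: event clustering and summary generation. In addition to conventional syntactic features such as part of speech and term frequency, semantic features including informative words, event words, and co-reference chains are proposed to address these issues. In the event clustering module, to make clustering more efficient, we first use co-reference chains to generate a summary of each document and then cluster the summaries instead of the full documents. Besides introducing co-reference chains, we also propose algorithms for dynamic thresholding and controlled vocabulary generation. In the summary generation module, to resolve inconsistent noun vocabulary across documents and temporal references, we use the controlled vocabulary above and propose a temporal tagging algorithm based on reference time. To avoid the errors introduced by conventional sentence clustering, latent semantic analysis (LSA) is applied to the selection of candidate sentences. To fit more highly informative sentences into a summary, a sentence reduction algorithm that uses event constituent words and informative words is proposed. To extract basic events and obtain event constituent words, a noun phrase chunker (NP-chunker) that draws on Web corpora is also proposed. For arranging candidate sentences within a summary, a sentence ordering algorithm based on sentence time is proposed. Experimental results show that the proposed system improves on a conventional multi-document summarization system in both content and readability, and that the improvement is statistically significant. To verify the effectiveness of these semantic features, they are also applied to multi-document headline generation and multilingual multi-document summarization; algorithms for headline reorganization and for matching documents (sentences) across languages are proposed, with satisfactory results. Finally, to overcome the bottleneck of manual summary evaluation, we propose an automatic summary evaluation method based on question answering (QA). Experiments on both small and large corpora confirm that automatic evaluation is feasible in terms of time and objectivity.

Abstract (English):
To provide generic summaries that help online readers absorb news from multiple sources, this dissertation studies the main issues in multi-document summarization (event clustering, sentence selection, redundancy avoidance, sentence ordering, and summary evaluation) and focuses on two major modules: event clustering and summary generation. Beyond conventional features such as lexical information and part of speech, the term frequency, document frequency, and paragraph dispersion of a word are used to identify informative words, which represent the corresponding document. In the event clustering module, we introduce semantic features, namely event words and co-reference chains, to capture more of a document's meaning. Controlled vocabulary mined from co-reference chains is proposed to unify named entities across documents, and a novel dynamic threshold model is proposed to improve event clustering performance.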
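As a rough illustration of how term frequency, document frequency, and paragraph dispersion might be combined to pick informative words, the sketch below ranks the words of one document. The record does not give the thesis's actual scoring formula, so the product `tf * idf * dispersion`, the smoothed inverse document frequency, and the names `informative_words`, `doc_paragraphs`, and `corpus_docs` are all assumptions made for this example.

```python
import math
from collections import Counter


def informative_words(doc_paragraphs, corpus_docs, top_k=5):
    """Rank words of one document by an ASSUMED score: tf * idf * dispersion.

    doc_paragraphs: the target document as a list of paragraphs,
                    each paragraph a list of tokens.
    corpus_docs:    the collection as a list of documents, each a flat
                    list of tokens (used only for document frequency).
    """
    if not doc_paragraphs:
        return []
    # Term frequency over the whole target document.
    tf = Counter(word for para in doc_paragraphs for word in para)
    doc_sets = [set(doc) for doc in corpus_docs]
    n_docs = len(doc_sets)
    scores = {}
    for word, freq in tf.items():
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for tokens in doc_sets if word in tokens)
        # Smoothed inverse document frequency (an assumption, not from the thesis).
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        # Paragraph dispersion: share of the document's paragraphs containing the word.
        dispersion = sum(1 for para in doc_paragraphs if word in para) / len(doc_paragraphs)
        scores[word] = freq * idf * dispersion
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Under this assumed score, a word that recurs in every paragraph of a document but is rare in the rest of the collection ranks highest, which matches the idea that informative words should represent their document.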
In the summary generation module, we propose a temporal tagger that resolves temporal expressions and provides sentence dates for sentence ordering. We apply latent semantic analysis (LSA) to sentence selection. To address the summary length issue, we propose a sentence reduction algorithm that uses both event constituent words and informative words. Experimental results for the generated summaries are promising in both content and readability. To investigate the performance of the proposed semantic features, headline generation and multi-lingual multi-document summarization are also studied. Finally, we address automatic summary evaluation by introducing question answering (QA), again with promising results.

Table of contents:
Abstract
摘要 (Chinese Abstract)
誌謝 (Acknowledgements)
Contents
Illustrations
Tables
1. Introduction
  1.1. Document Summarization
  1.2. Headline Generation
  1.3. Multi-lingual Multi-document Summarization
  1.4. Summary Evaluation
  1.5. The Event Words and Co-reference Chains
  1.6. The Goal of the Study
2. Multi-document Summarization Using Informative Words
  2.1. System Architecture
  2.2. Issues of Basic Multi-document Summarization System
  2.3. Generating Summaries with Informative Words
  2.4. Experiment
    2.4.1. Experimental Results
    2.4.2. Observation
  2.5. Discussion
3. Evaluation Model Using Question Answering
  3.1. Modeling Using Question Answering
  3.2. Evaluation
    3.2.1. Data Set and Evaluation Method
    3.2.2. Experimental Results and Observation
  3.3. Experiments Using Large Documents and Results
    3.3.1. Data Set
    3.3.2. Experimental Results
  3.4. Discussion
4. Headline Generation
  4.1. Introduction
  4.2. Selection of Informative Words
    4.2.1. Paragraph Dispersion
    4.2.2. Informative Words
  4.3. Headline Generation Using Informative Words
    4.3.1. Bag Generation Method Using Informative Words
    4.3.2. Sentence Selection Using Statistical Information and Density
  4.4. Evaluation
    4.4.1. Evaluation and Method
    4.4.2. Results
  4.5. Discussion
5. Clustering and Visualization in a Multi-lingual Multi-document Summarization System
  5.1. Introduction
  5.2. Basic Architecture
  5.3. Similarity Measurement
    5.3.1. Methods
    5.3.2. Experiments
  5.4. Event Clustering
    5.4.1. Clustering Models
    5.4.2. Experiments
  5.5. Sentence Clustering
    5.5.1. Clustering Models
    5.5.2. Experiments
  5.6. Visualization
    5.6.1. Focusing Model
    5.6.2. Browsing Model
  5.7. Discussion
6. Multi-document Summarization Using both Informative Words and Knowledge Mining from Co-reference Chains
  6.1. Introduction
  6.2. System Architecture
  6.3. Document Summarization Using Co-reference Chains
  6.4. Creating Controlled Vocabulary from Individual Co-reference Chains
    6.4.1. Normalized Chain Edit Distance
    6.4.2. Creating Controlled Vocabulary
    6.4.3. Evaluation
      6.4.3.1. Data Set
      6.4.3.2. Experimental Results
  6.5. Event Clustering
  6.6. Experimental Results
    6.6.1. Data Sets
    6.6.2. Evaluation Metrics
    6.6.3. Experimental Results
  6.7. Experiments Using Co-reference Chains from Co-reference Resolution System
    6.7.1. Flow of a Chinese Co-reference Resolution System
    6.7.2. Experimental Results of Using Noisy Co-Reference Chains
    6.7.3. Co-reference Chains Filter
    6.7.4. Performance of Event Clustering Using Clearer Co-Reference Chains
  6.8. Discussion
7. Event-based Summary Generation
  7.1. Introduction
    7.1.1. Similarity Model
    7.1.2. Sentence Extraction and Ordering
    7.1.3. Experimental Results
    7.1.4. Discussion
  7.2. Processing Chinese Temporal Expression
    7.2.1. Representation of Time and Date
    7.2.2. Temporal Resolution Using Focus Time and Co-Reference Chains
    7.2.3. Experiments
    7.2.4. Discussion
  7.3. System Architecture of News Summarizer
  7.4. Event Extraction and NP-Chunker
    7.4.1. NP-Chunker Using Significance Estimation Function and Web Corpora
      7.4.1.1. Observation
      7.4.1.2. NP-Chunker Using Web Corpora and Association Rules
      7.4.1.3. Experiment
    7.4.2. Event Extraction
  7.5. Sentence Selection
    7.5.1. Latent Semantic Analysis
    7.5.2. Sentence Extraction Using Latent Semantic Analysis
  7.6. Summary Generation Using Sentence Date
    7.6.1. Sentence Reduction Using Both Informative Words and Event Constituent Words
    7.6.2. Summary Generation Using Sentence Date
  7.7. Experiment
    7.7.1. Data Set and Evaluation Metrics
    7.7.2. Experimental Results
  7.8. Discussion
8. Conclusions and Future Works
  8.1. Achievements
    8.1.1. Event Clustering
    8.1.2. Summary Generation
  8.2. Future Work
References
Appendices
  Appendix A. The Evaluation File for Headline Generation
  Appendix B. An Example of Chi-square Test for a Term Pair
  Appendix C. Controlled Vocabulary Before/After Employing Chain Filter
  Appendix D. Example of 8 Type Generated Summaries

File size: 4140206 bytes
Format: application/pdf
Language: en-US
Keywords: multiple document summarization system (多文件文章摘要系統); event clustering (事件分群); summary generation (摘要產生); automatic summary evaluation (自動摘要評估)
Title: 多文件文章摘要系統之研究 / A Study on Multiple Document Summarization Systems
Type: thesis
Full text: http://ntur.lib.ntu.edu.tw/bitstream/246246/53758/1/ntu-95-D88526001-1.pdf
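The abstract states that a temporal tagger provides sentence dates for sentence ordering. A minimal sketch of such a date-based ordering step might look like the following. The `Sentence` record, the fallback to the source document's publication date when no sentence date was resolved, and the tie-breaking by document date and position are assumptions for illustration, not details taken from the thesis.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Sentence:
    text: str
    doc_date: date                        # publication date of the source document
    position: int                         # index of the sentence within its document
    sentence_date: Optional[date] = None  # date resolved by a temporal tagger, if any


def order_sentences(selected: List[Sentence]) -> List[Sentence]:
    """Order selected sentences chronologically for the final summary.

    Assumed policy: use the resolved sentence date when available, otherwise
    fall back to the document's publication date; break ties by document
    date and then by the sentence's original position.
    """
    def sort_key(s: Sentence):
        effective = s.sentence_date or s.doc_date
        return (effective, s.doc_date, s.position)

    return sorted(selected, key=sort_key)
```

A sentence reporting an earlier event is placed first even when it comes from a later article, which is the behavior that ordering by sentence date rather than by publication date is meant to achieve.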