Advisor: 蔡益坤
Institution: National Taiwan University: Graduate Institute of Information Management
Author: 黃俊誌 (Huang, Chun-Chih)
Record dates: 2007-11-26; 2018-06-29
Year: 2005
URI: http://ntur.lib.ntu.edu.tw//handle/246246/54346

Title: Improving the Effectiveness and Scalability of a Sequence-Based Text Retrieval System
Original title: 改善以序列為基礎之文件檢索系統之有效性與彈性

Abstract

The purpose of a text retrieval system is to locate documents in a large textual collection that meet a user's needs. The SIR system is one such system, based on the sequence model. Because it was designed and implemented as a sequential rather than a parallel application, it becomes less efficient as the data collection grows. Another drawback of the SIR system is that the index must be rebuilt entirely whenever the data collection is modified. In addition, compared with other models, query evaluation in the sequence model is time-consuming.

In this thesis, we seek to make improvements that address these problems. To facilitate parallel query processing, we implement three kinds of index partitioning schemes in the system and evaluate their load-balancing characteristics. To improve the scalability of index building, we design and implement a mechanism that allows the SIR system to support incremental index updates. We also make other improvements that make the system more flexible, such as support for queries with homophones and support for more token types.

Contents

1 Introduction
  1.1 Background
    1.1.1 Text Retrieval
    1.1.2 Chinese IR
    1.1.3 Effectiveness and Scalability of IR systems
  1.2 Motivation and Objectives
  1.3 Thesis Outline
2 Related Work
  2.1 Inverted Indices
  2.2 Parallel Computing
  2.3 Parallel Processing in IR
  2.4 Incremental Index Updating
  2.5 The Indexing and Retrieval Process of SIR
    2.5.1 Indexing Process
    2.5.2 Retrieval Process
3 Scalability and Effectiveness of SIR
  3.1 Retrieval Process Analysis
  3.2 Index Partitioning Mechanism
  3.3 Index Updating Mechanism
  3.4 Other Modifications to the SIR system
4 The New Components of SIR
  4.1 The Broker Module
    4.1.1 Indexing Building
    4.1.2 Query Processing
  4.2 The Coordinated Collection Management Component
5 Experimental Results and Analysis
  5.1 The Effectiveness of the SIR system
    5.1.1 Topic Containing Continuous, Related Semantic Blocks
    5.1.2 Topics Containing Distant Semantic Blocks
    5.1.3 Adjusting the Weight of the Three Scoring
    5.1.4 The Effect of Multi-phase Query
    5.1.5 Conclusion
  5.2 The Effect of Parallel Processing
    5.2.1 The Load Balancing Analysis
    5.2.2 The Scalability Analysis of the SIR System
6 Conclusion
  6.1 Contributions
  6.2 Future Work
Bibliography

File size: 534385 bytes
Format: application/pdf
Language: en-US
Keywords: Incremental Update; Index Partitioning Schemes; Information Retrieval; Parallel Inverted Index; Parallel Processing; Text Retrieval
Type: other
Full text: http://ntur.lib.ntu.edu.tw/bitstream/246246/54346/1/ntu-94-R92725027-1.pdf