A Machine Learning Approach for Result Fusion in Multilingual Information Retrieval (以機器學習方法處理跨語言檢索合併問題)

王昱婷; Wang, Yu-Ting

DC Field	Value	Language
dc.contributor	陳信希	zh-TW
dc.contributor	臺灣大學:資訊工程學研究所	zh-TW
dc.contributor.author	王昱婷	zh-TW
dc.contributor.author	Wang, Yu-Ting	en
dc.creator	王昱婷	zh-TW
dc.creator	Wang, Yu-Ting	en
dc.date	2008	en
dc.date.accessioned	2010-05-18T09:55:33Z	-
dc.date.accessioned	2018-07-05T01:42:31Z	-
dc.date.available	2010-05-18T09:55:33Z	-
dc.date.available	2018-07-05T01:42:31Z	-
dc.date.issued	2008	-
dc.identifier.other	U0001-1906200818011700	en
dc.identifier.uri	http://ntur.lib.ntu.edu.tw//handle/246246/183615	-
dc.description.abstract	多語言檢索主要是允許使用者給予一種語言的查詢，檢索出多種語言的相關文件。一般而言，處理多語言檢索，首先利用查詢，在各個語言的語料庫中找出在該語言中的相關文件；利用合併的方法，將此些不同語言的相關文件合併成最終多語言的相關文件集。在此論文中的主要議題是如何使用最佳的合併方法，來達到不錯的效能。此研究中，我們使用機器學習的方法去建立一個跨語言的合併模型；透過此合併模型去調整每篇文件的合併分數。首先，探討處理跨語言檢索問題過程中，有哪些是可能影響跨語言檢索效能的因素。我們從三個層面做探討；翻譯層面、文件本身的層面以及較為一般性層面的特徵。在翻譯層面，過去有不少研究顯示，跨語言檢索時，翻譯品質的好壞對檢索結果的效能佔有很大程度的影響性；除此之外，我們將查詢中的每一個字給予分類成一個類別，類別則由人為的方式下去做定義。發現有幾個類別在資料檢索過程中，佔有較大程度的影響性，甚至發現不同類別之間亦存在著某些程度的相關連；其中佔有一定影響性的類別，其翻譯品質好壞，對跨語言檢索更為重大。在文件本身層面，利用文件本身以及文件標題的長度來做為此文件所含有的資訊量指標。從此些層次取出特徵，利用機器學習的方法，不只學習出跨語言的合併模型，亦學習出在機器學習過程中哪些特徵是較具影響性的。實驗結果顯示，利用機器學習的方法，所達到的檢索效能較傳統合併的方法效能佳；且發現翻譯品質的好壞，包含組織名稱，事件名稱，抽象名詞以及專業名詞的翻譯品質對跨語言檢索最有影響性。	zh-TW
dc.description.abstract	Multilingual information retrieval (MLIR) aims to enable users to enter a query in one language and access relevant documents in various languages. MLIR is usually implemented by first retrieving from each language collection separately, which yields one bilingual result list per collection; how to merge these bilingual lists is the main issue of this work. We use a machine learning approach, FRank, to build a merge model, and merge the multiple bilingual lists using the merge model score and the retrieval score. First, we identify factors that may influence the MLIR process at three levels: the general level, the translation level, and the document level. At the translation level, previous studies have shown that translation quality is crucial for cross-language information retrieval. In addition, we classify each query term into one of a set of manually predefined categories. Our experiments show that some categories play more important roles in a query during retrieval, and that there are relationships between categories. The translation quality of those influential categories is crucial for MLIR. At the document level, we use document length and document title length as indicators of the amount of information a document contains. In total, we extract 62 features across these levels; using these features, we not only train a merge model but also identify which features are effective for the MLIR merging process. In our experiments, the merge model outperforms all the traditional merging strategies we compare against, including raw-score merging, round-robin merging, normalized-by-top-K merging, logistic regression, and the 2-step re-indexing merging method. Moreover, from the features that FRank selects as weak learners, we find that the translation quality of certain query term categories, the translatable query terms, and the degree of ambiguity in translation are effective for MLIR merging.	en
dc.description.tableofcontents	Oral Defense Committee Certification i; Chinese Abstract ii; ABSTRACT iii; LIST OF FIGURES vi; LIST OF TABLES vii; Chapter 1 INTRODUCTION 1; 1.1. MOTIVATION 1; 1.2. MERGING PROBLEM 2; 1.3. THESIS STRUCTURE 4; Chapter 2 TRADITIONAL MERGING STRATEGIES 5; 2.1. HEURISTIC MERGING STRATEGY 5; 2.1.1. RAW SCORE 6; 2.1.2. ROUND ROBIN 7; 2.1.3. NORMALIZED BY TOP K 8; 2.2. LEARNING BASED MERGING STRATEGY 9; 2.2.1. LOGISTIC REGRESSION 9; 2.3. RETRIEVAL BASED MERGING STRATEGY 11; 2.3.1. 2-STEP RETRIEVAL STATUS VALUE METHOD 11; Chapter 3 ANALYZE THE INFLUENTIAL FEATURES FOR MLIR 13; 3.1. TRANSLATION LEVEL 13; 3.1.1. THE IMPORTANCE OF SOME QUERY TERMS 16; 3.1.2. QUERY TERM CATEGORY 17; 3.1.3. IMPORTANCE AND RELATIONS OF QUERY TERM CATEGORY 18; 3.1.3.1. EXPERIMENT SETTING 19; 3.1.3.2. RETRIEVE EXCEPT ONE QUERY TERM CATEGORY 21; 3.1.3.3. RETRIEVE EXCEPT TWO QUERY TERM CATEGORY 26; 3.1.4. PROMOTE PERFORMANCE IN CLIR 30; 3.1.4.1. EXPERIMENT SETTING 33; 3.1.4.2. EXPERIMENTAL RESULT 34; 3.1.5. CONCLUSION ON TRANSLATION LEVEL 35; 3.2. DOCUMENT LEVEL 36; 3.3. GENERAL LEVEL 37; Chapter 4 USING FRANK TO BUILD A MERGE MODEL 38; 4.1. SYSTEM OVERVIEW 38; 4.2. FEATURE SELECTION 40; 4.3. USING FRANK APPROACH TO MERGE 46; 4.4. EXPERIMENT 47; 4.4.1. EXPERIMENT SETTING AND PREPROCESSING 47; 4.4.2. LEARNING A MERGE MODEL 49; 4.4.3. EXPERIMENT RESULT 50; 4.4.4. DISCUSSION 52; Chapter 5 CONCLUSION AND FUTURE WORK 57; 5.1. CONCLUSION 57; 5.2. FUTURE WORK 58; REFERENCES 59; APPENDIX; NTCIR3 ENGLISH QUERY DATA SET WITH LABELED CATEGORY 61; NTCIR4 ENGLISH QUERY DATA SET WITH LABELED CATEGORY 65; NTCIR5 ENGLISH QUERY DATA SET WITH LABELED CATEGORY 70	en
dc.format	application/pdf	en
dc.format.extent	531014 bytes	-
dc.format.mimetype	application/pdf	-
dc.language	en	en
dc.language.iso	en_US	-
dc.subject	跨語言檢索	zh-TW
dc.subject	結果合併	zh-TW
dc.subject	機器學習	zh-TW
dc.subject	Multilingual Information Retrieval	en
dc.subject	Data Fusion	en
dc.subject	Machine Learning	en
dc.title	以機器學習方法處理跨語言檢索合併問題	zh-TW
dc.title	A Machine Learning Approach for Result Fusion in Multilingual Information Retrieval	en
dc.type	thesis	en
dc.identifier.uri.fulltext	http://ntur.lib.ntu.edu.tw/bitstream/246246/183615/1/ntu-97-R95922066-1.pdf	-
item.openairetype	thesis	-
item.openairecristype	http://purl.org/coar/resource_type/c_46ec	-
item.fulltext	with fulltext	-
item.grantfulltext	open	-
item.languageiso639-1	en_US	-
item.cerifentitytype	Publications	-
Appears in Collections:	Department of Computer Science and Information Engineering (資訊工程學系)

Files in This Item:

File	Description	Size	Format
ntu-97-R95922066-1.pdf		23.32 kB	Adobe PDF
