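Contributors: 許聞廉; 項潔
Institution: 臺灣大學 (National Taiwan University)
Author: 蔡宗翰 (Tsai, Richard Tzong-Han)
Record dates: 2007-11-26; 2018-07-05
Year: 2006
URI: http://ntur.lib.ntu.edu.tw//handle/246246/54052

Abstract (translated from the Chinese):
Automating the processing of biomedical literature is extremely important for the design and analysis of large-scale experiments. To this end, many information extraction (IE) systems with natural language processing (NLP) capabilities have appeared. This dissertation examines in depth two of the most fundamental techniques, named entity recognition (NER) and semantic role labeling (SRL), and their application to question answering (QA) systems.

For the first problem, NER, we need to add feature functions with multiple conjoined conditions to the model in order to raise recognition performance further. However, because memory is limited and many features do not help recognition accuracy, it is unnecessary to include every feature in the model. We therefore use sequential forward search to find the most useful combination of feature groups. In addition, the high variability of biomedical names causes data sparseness, and their numerical parts vary especially easily, producing many unnecessary features. We apply numerical normalization to address this problem. Furthermore, the tag of each word is related not only to neighboring words but possibly also to information outside the context window. We therefore use automatically generated global patterns to record such structures and use them to correct the output of the CRF model. After applying these three methods in sequence, the system's F-score is 3.28% higher than that of the baseline system, reaching 72.98%. This result also surpasses all other systems currently reported in the academic community.

For the second problem, SRL, we build an automatic semantic role labeling system for the biomedical domain, which can be used to extract relations specific to that domain. The construction has three parts. First, following the PropBank annotation specification developed at the University of Pennsylvania, we annotate semantic roles on the GENIA Treebank provided by the Tsujii Laboratory at the University of Tokyo. We annotate the semantic frames and semantic roles of the thirty most frequently used and most important verbs in the biomedical domain. Next, we use this annotated corpus to train an automatic SRL system based on a maximum entropy model. Finally, we use automatically generated argument type templates to improve role classification. In our experiments, when a model trained on newswire data is used to label biomedical literature, the F-score drops sharply from 86.29% on newswire text to 64.64%. With a model trained on our annotated biomedical corpus, BioProp, the F-score improves by 22.46%, reaching 87.10%. After template features are added, the F-score for important adjunct roles improves significantly by a further 1.57%.

Finally, we apply the two techniques, NER and SRL, to a biomedical QA system. Researchers in biomedicine urgently need to obtain information relevant to their research quickly. A QA system lets them ask questions conveniently in natural language and automatically extracts answers from a large body of literature. The QA system developed in this dissertation, BeQA, is designed specifically to answer questions about molecular events. Using the SRL system to label the semantic roles of question sentences and candidate answer sentences yields significant gains in both Top-1 accuracy and Top-5 MRR. In addition, BeQA uses Google as its information retrieval engine to supply candidate answer sentences. The best configuration of BeQA reaches 51.9% Top-1 accuracy and 57.7% Top-5 MRR, making it the first system in biomedical literature processing to undergo a complete QA performance evaluation. In future work, we will continue to strengthen the NER, SRL, and QA systems and apply these techniques to the extraction of biomedical relations such as protein-protein interactions and gene-disease relations.

Abstract (English):
Processing biomedical literature automatically would be invaluable for both the design and interpretation of large-scale experiments.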
To this end, many information extraction (IE) systems that use natural language processing (NLP) techniques have been developed for the biomedical field. In this dissertation, we study two main tasks, named entity recognition (NER) and semantic role labeling (SRL), and their application to biomedical question answering (QA).

In the first task, adding conjunction features is necessary, but including all conjunction feature groups in an NER model is infeasible because memory is limited and some groups are ineffective. We employ sequential forward search to select the most effective feature groups. In addition, the variety of biomedical terms causes data sparseness and generates many redundant features, mostly because of variation in their numerical parts. We apply numerical normalization to deal with this problem. Furthermore, the assignment of NE tags depends not only on the closest neighbors but possibly also on words beyond the context window. We use automatically generated global patterns to capture such structures and to correct the output of the CRF tagger. Employing these three techniques sequentially raises the F-score to 72.98%, which is 3.28% better than the baseline system and also outperforms the state-of-the-art systems.

In the second task, we construct a biomedical semantic role labeling (SRL) system that can be used to facilitate relation extraction. This task is divided into three steps. First, we construct a proposition bank on top of the popular biomedical GENIA treebank, following the PropBank annotation scheme. We annotate only the predicate-argument structures (PASs) of thirty frequently used biomedical predicates and their corresponding arguments. Second, we use our proposition bank to train a biomedical SRL system based on a maximum entropy (ME) model. Third, we automatically generate argument-type templates that can be used to improve the classification of biomedical argument roles.
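As an illustration of the feature-group selection step in the NER task, the following is a minimal sketch of a generic sequential forward search. It is not the dissertation's implementation: the group names, the toy gains, the baseline of 70.0, and the per-group cost are made-up values, and `toy_score` is a hypothetical stand-in for training the tagger with a given set of feature groups and measuring its F-score on held-out data.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def sequential_forward_search(
    candidates: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Greedily add the group that most improves `score`; stop when none helps."""
    selected: List[str] = []
    remaining = list(candidates)
    best = score(frozenset(selected))
    while remaining:
        # Score each remaining group when added to the current selection.
        trials = [(score(frozenset(selected + [g])), g) for g in remaining]
        trial_score, trial_group = max(trials)
        if trial_score <= best:
            break  # no remaining group improves the score
        selected.append(trial_group)
        remaining.remove(trial_group)
        best = trial_score
    return selected, best


# Hypothetical feature groups with made-up gains (illustration only).
TOY_GAIN = {"word+pos": 2.0, "word+shape": 1.2, "pos+chunk": 0.3, "affix+chunk": 0.1}
COST_PER_GROUP = 0.25  # made-up penalty so that weak groups hurt the score


def toy_score(groups: FrozenSet[str]) -> float:
    """Stand-in for 'train with these groups, return held-out F-score'."""
    return 70.0 + sum(TOY_GAIN[g] for g in groups) - COST_PER_GROUP * len(groups)


chosen, final_score = sequential_forward_search(TOY_GAIN, toy_score)
print(chosen, round(final_score, 2))
```

In this toy setting the search keeps adding groups while each addition raises the score and stops before the weakest group, whose gain is smaller than its cost. In practice the expensive part is the scoring call, since every trial means retraining the model.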
Our experimental results show that a newswire SRL system that achieves an F-score of 86.29% in the newswire domain attains only 64.64% when ported to the biomedical domain. Training on our annotated corpus, BioProp, improves the F-score by 22.46%, to 87.10%. Adding template features significantly improves the F-score of adjunct arguments, such as temporal and locational arguments, by a further 1.57%.

Finally, we present a biomedical question answering (QA) system built on the NER and SRL systems. Biologists have a pressing need to retrieve biological information related to their research efficiently. A QA system enables biologists to ask questions conveniently in natural language and to retrieve specific answers from a large number of documents. We introduce our Biomedical Question Answering system (BeQA), which is designed to answer questions related to molecular events. By using the SRL system to label the semantic arguments of questions and answers and to help map questions to answers, we improve both Top-1 accuracy and Top-5 MRR. In addition, we employ Google as our page retrieval module to find passages containing answers. The best BeQA configuration achieves a Top-1 accuracy of 51.9% and a Top-5 MRR of 57.7%.
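For readers unfamiliar with the two QA measures, the sketch below computes Top-1 accuracy and Top-5 mean reciprocal rank (MRR) under their usual definitions. It assumes the dissertation uses these standard definitions. The function names and the four toy questions are hypothetical and are not taken from BeQA.

```python
from typing import List, Optional, Sequence, Set, Tuple

Run = Tuple[Sequence[str], Set[str]]  # (ranked candidate answers, gold answers)


def first_correct_rank(ranked: Sequence[str], gold: Set[str], k: int) -> Optional[int]:
    """Return the 1-based rank of the first correct answer in the top k, else None."""
    for rank, answer in enumerate(ranked[:k], start=1):
        if answer in gold:
            return rank
    return None


def top1_accuracy(runs: List[Run]) -> float:
    """Fraction of questions whose top-ranked answer is correct."""
    hits = sum(1 for ranked, gold in runs if first_correct_rank(ranked, gold, 1) == 1)
    return hits / len(runs)


def top_k_mrr(runs: List[Run], k: int = 5) -> float:
    """Mean of 1/rank of the first correct answer in the top k (0 if none)."""
    total = 0.0
    for ranked, gold in runs:
        rank = first_correct_rank(ranked, gold, k)
        if rank is not None:
            total += 1.0 / rank
    return total / len(runs)


# Four made-up questions: correct at rank 1, rank 2, never, and rank 4.
runs: List[Run] = [
    (["a1", "x", "y"], {"a1"}),
    (["x", "a2", "y"], {"a2"}),
    (["x", "y", "z"], {"a3"}),
    (["w", "x", "y", "a4", "z"], {"a4"}),
]
acc = top1_accuracy(runs)
mrr = top_k_mrr(runs, k=5)
print(acc, mrr)
```

Top-1 accuracy rewards only a correct first answer, while Top-5 MRR gives partial credit for a correct answer lower in the list, which is why the two figures reported above differ.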
In future work, we will not only improve the NER, SRL, and biomedical QA systems but also apply them to build a relation extraction system for protein-protein and gene-disease relations.

Table of contents:
1 Introduction
2 Biomedical Named Entity Recognition
2.1 Method
2.1.1 Formulation
2.1.2 Classifier-based Approaches
2.1.3 Sequence Models
2.1.4 Conditional Random Fields
2.2 Feature Set
2.2.1 Word Features
2.2.2 Orthographical Features
2.2.3 Part-of-speech Features
2.2.4 Word Shape Features
2.2.5 Affix Features
2.2.6 Chunk Features
2.2.7 Conjunction Features
2.3 Feature Selection
2.4 Numerical Normalization
2.5 Using Global Pattern to Improve CRF
2.5.1 Weakness of Sequence Models
2.5.2 Global Pattern Induction and Filtering
2.5.3 Complexity Analysis
2.5.4 Error Correction
2.6 Experiment
2.6.1 Datasets
2.6.2 Evaluation Methodology
2.7 Results
2.8 Analysis and Discussion
2.9 Conclusion
3 Biomedical Semantic Role Labeling
3.1 Background
3.2 The Biomedical Proposition Bank--BioProp
3.2.1 Corpus Selection
3.2.2 Verb Selection
3.2.3 PAS standard-Proposition Bank
3.2.4 Framesets of Biomedical Verbs
3.2.5 Annotation of BioProp
3.2.6 Related Work
3.3 Method
3.3.1 Formulation of Semantic Role Labeling
3.3.2 Maximum Entropy Model
3.3.3 Baseline Features
3.3.4 Named Entity Features
3.3.5 Biomedical Template Features
3.3.6 Template Generation (TG) and Filtering
3.3.7 Applying Generated Templates
3.4 Experiments
3.4.1 Datasets
3.4.2 SMILE and BIOSMILE
3.4.3 Experiment Design
3.5 Results
3.6 Discussion
4 Biomedical Question Answering System
4.1 Background and Related Work
4.1.1 Related Work
4.1.2 Question Types
4.1.3 Web-based Search Engine and Techniques Involved
4.2 Corpus Creation
4.3 Method
4.3.1 Evaluation Measurement
4.3.2 System Architecture
4.4 Experiment
4.4.1 Experiment Design
4.4.2 Experiment Results
4.5 Analysis and Discussion
4.5.1 The Effects of Using fARGM
4.5.2 The Effects of Using fARGS
4.5.3 Error Analysis
5 Conclusion and Future Work
6 References

File size: 802274 bytes
Format: application/pdf
Language: en-US
Keywords: Biomedical literature mining; natural language processing; named entity recognition; semantic role labeling; question answering; relation extraction; information extraction
Title (Chinese): 生物醫學名詞辨識、語意角色自動標註及問答系統上之應用
Title (English): BIOMEDICAL NAMED ENTITY RECOGNITION, SEMANTIC ROLE LABELING AND THEIR APPLICATION TO QUESTION ANSWERING
Type: thesis
Full text: http://ntur.lib.ntu.edu.tw/bitstream/246246/54052/1/ntu-95-D90922013-1.pdf