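Interpretation of Chinese Discourse Markers, Discourse Relation Recognition, and their Relationships with Sentiment Polarity (中文語篇標記解釋與語篇關係辨識及其在意見極性分析之研究)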

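Advisor: 陳信希
Institution: National Taiwan University, Graduate Institute of Computer Science and Information Engineering
Author: 黃瀚萱 (Huang, Hen-Hsen)
Record dates: 2014-11-26; 2018-07-05
Year: 2014
URI: http://ntur.lib.ntu.edu.tw//handle/246246/261422

Abstract (translated from the Chinese original):
A discourse relation is the rhetorical relation between discourse units (such as clauses, sentences, or groups of sentences). Common discourse relations include Temporal, Contingency, Comparison, and Expansion. Discourse relations convey the logic by which sentences connect, and they shape how meaning is expressed and interpreted. Automatic detection of discourse relations is an emerging research area. With the release of corpora such as the Rhetorical Structure Theory Discourse Treebank (RST-DT) and the Penn Discourse Treebank (PDTB), English discourse relation analysis has produced results that have been applied to automatic summarization, opinion analysis, textual entailment, and event recognition. For Chinese, by contrast, the scarcity of corpora and the complexity of the language itself make discourse relation research more challenging.
This dissertation gives a comprehensive treatment of Chinese discourse relation recognition, Chinese discourse markers, and the association between discourse relations and sentiment polarity. We develop a learning model that recognizes discourse relations at both the intra-sentential and inter-sentential levels, and we also address discourse parsing. Discourse parsing resolves the hierarchy and scope among discourse units into a tree structure, extracting more information from complex sentences. Long Chinese sentences in particular, with more than three or four clauses, are hard to interpret as a whole without discourse structure information. To this end, we develop a preliminary statistical learning model for intra-sentential discourse parsing of Chinese sentences.
In our recognition and parsing experiments, we find that discourse markers (connectives and similar words that carry discourse information, such as 「因為」 "because" and 「但是」 "but") are important clues for discourse relation recognition. In Chinese, however, discourse markers are often ambiguous, with one word carrying several senses, which in turn degrades the recognition model. We use very large-scale data with a semi-supervised machine learning method to study this ambiguity, estimating for each discourse marker its distribution over the four major classes of discourse relations. Used as features for discourse relation recognition, the distributions learned from data outperform an expert-compiled dictionary.
We also examine the association between discourse relations and sentiment polarity. In a Comparison relation, for example, the two discourse units often carry opposing polarities, and the relation is used more often to express negative opinions. By contrast, content stated through Temporal and Expansion relations tends to be neutral and less often expresses sentiment. Because of this close association, the output of discourse relation recognition can serve as a clue for opinion analysis.
This dissertation handles the four most basic classes of discourse relations: Temporal, Contingency, Comparison, and Expansion. In future work we hope to study finer-grained discourse relations and to extend discourse parsing to the intra-sentential, inter-sentential, and sentence-group levels.

Abstract (English original):
A discourse relation is the rhetorical relation between two discourse units (e.g., clauses, sentences, or blocks of sentences). Well-known discourse relations include Temporal, Contingency, Comparison, and Expansion. A discourse relation indicates how its two discourse units cohere, and this information influences the meaning of the text. Discourse relations are an important clue for many applications, such as summarization, opinion mining, textual entailment, and event recognition. Research on automatic English discourse relation recognition has grown rapidly since the release of corpora such as the Rhetorical Structure Theory Discourse Treebank (RST-DT) and the Penn Discourse Treebank (PDTB). Chinese discourse relation recognition is more challenging than its English counterpart because resources are scarce and Chinese raises issues of its own. In this dissertation, we give an in-depth study of Chinese discourse relation analysis.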
We propose a statistical algorithm that recognizes discourse relations at both the inter-sentential and intra-sentential levels. We also report preliminary results on Chinese discourse parsing at the sentence level. In Chinese, many long sentences contain more than two clauses and form complex discourse structures. Discourse parsing recovers the hierarchical structure and the relations among the clauses of a given sentence. Discourse markers are key clues for discourse processing, but Chinese discourse markers are inherently ambiguous. To interpret them, we propose a semi-supervised framework that estimates the distribution of each Chinese discourse marker from a large corpus, ClueWeb09. The estimated distributions improve the performance of Chinese discourse relation recognition. Discourse relations and sentiment polarities interact in text, and we investigate their correlation with ClueWeb09. A moderate-sized human-annotated dataset is analyzed and compared with a much larger dataset labeled heuristically by machine. As a result, the association between sentiment and discourse is validated. In this dissertation, we focus on four-way discourse relation classification. In future work, we will investigate finer-grained classification of discourse relations and tackle Chinese discourse parsing at the paragraph and document levels.

Table of contents:
Oral Examination Committee Certification
Acknowledgements
Chinese Abstract
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1. Introduction
  1.1. Discourse Relation Analysis
  1.2. Types of Discourse Relations
  1.3. Discourse Markers
  1.4. Chinese Discourse Relations
  1.5. Research Goals
  1.6. Organization
Chapter 2. Related Work
  2.1. Resources
  2.2. Discourse Relation Recognition in English
  2.3. Discourse Relation Recognition in Chinese
  2.4. Discourse Relations and Sentiment Polarities
  2.5. Discourse Parsing
Chapter 3. Discourse Relation Recognition
  3.1. Dataset
  3.2. Method
  3.3. Experiments and Discussion
    3.3.1 Results of Inter-sentential Discourse Relation Recognition
    3.3.2 Results of Intra-sentential Discourse Relation Recognition
  3.4. Summary
Chapter 4. Discourse Relation and Parsing
  4.1. Dataset
  4.2. Methods
  4.3. Experiments and Discussion
  4.4. Summary
Chapter 5. Discourse Relation and Sentiment Polarity
  5.1. Linguistic Resources
  5.2. Analysis on Human Annotated Data
    5.2.1 Annotation
    5.2.2 Overview of the Annotated Corpus
    5.2.3 Frequent Discourse Markers
    5.2.4 Association between Discourse Relation and Sentiment Polarity
  5.3. Analysis on Large-scale Data
    5.3.1 A Lexicon-based Method for Sentiment Analysis
    5.3.2 Evaluation
    5.3.3 Results and Discussion
  5.4. Summary
Chapter 6. Interpretation of Discourse Markers
  6.1. Types of Discourse Markers
  6.2. Dataset
  6.3. Ambiguity of Chinese Discourse Markers
    6.3.1 Performance of Using Discourse Marker Dictionary
    6.3.2 Thesaurus Alignment
  6.4. A Semi-Supervised Method
    6.4.1 Linguistic Features
    6.4.2 A Semi-supervised Learning Algorithm
  6.5. Experimental Results
  6.6. Further Analyses on a Big Dataset
  6.7. Summary
Chapter 7. Conclusion
REFERENCES

File size: 1045266 bytes
Format: application/pdf
Public release date: 2014/08/11
Usage rights: royalty-free license granted
Keywords: natural language processing; Chinese discourse analysis; discourse relation recognition; discourse markers; sentiment polarity
English title: Interpretation of Chinese Discourse Markers, Discourse Relation Recognition, and their Relationships with Sentiment Polarity
Type: thesis
File: http://ntur.lib.ntu.edu.tw/bitstream/246246/261422/1/ntu-103-D97922036-1.pdf