Contributor: 高成炎
Department: National Taiwan University, Graduate Institute of Computer Science and Information Engineering (臺灣大學:資訊工程學研究所)
Author: 楊允言 (Iunn, Un-Gian)
Record dates: 2010-06-09, 2018-07-05
Year: 2009
Identifier: U0001-0602200918150400
URI: http://ntur.lib.ntu.edu.tw//handle/246246/185357

Abstract

Taiwanese (Taiwan Southern Min) is an important world language, yet it has not received the attention it deserves. In some respects, written Taiwanese differs considerably from written Mandarin or English. This dissertation focuses on processing techniques for written Taiwanese.

Peh-oe-ji (POJ, 白話字, romanized Taiwanese) is an important writing system for Taiwanese. We first introduce the character encoding of POJ and use numbered POJ, in which tones are written as digits, as the internal representation (interchange code) across the various POJ encodings. For POJ text search, we propose a two-stage search strategy and an approximate syllable search method based on query expansion. We also describe a display method for POJ, POJ text processing utilities, and a word segmentation method for Han-Romanization (HR) mixed script.

We propose a rule-based algorithm for the tone sandhi problem. Each Taiwanese word is first translated into a Mandarin word to obtain its part-of-speech (POS) information; the POS tags and the tone sandhi rules then determine the post-sandhi tone of each syllable. We implemented a Taiwanese tone sandhi system based on this algorithm. It achieves a tone sandhi accuracy of 97.4% on the training data and 89.0% on the test data.

In addition, we propose a POS tagging method. We first develop a word alignment checker that aligns, word by word, two Taiwanese texts in different scripts that are already aligned paragraph by paragraph. We then select the most suitable corresponding Mandarin word with a hidden Markov model (HMM) and finally choose the POS tag with a maximum-entropy Markov model (MEMM) classifier.
We achieve an accuracy rate of 91.5% on Taiwanese POS tagging.

Over the past several years, we have built a number of useful online tools for written Taiwanese. We hope that these tools and our preliminary research results will help research on written Taiwanese processing flourish.

Table of Contents

Preface
Acknowledgments
Abstract (Chinese)
Abstract
Abbreviations
Chapter 1 Introduction
  1.1 Background
    1.1.1 Language Population in Taiwan
    1.1.2 Southern Min Language Population
    1.1.3 Another Investigation: the Taiwan Southern Min Viewers
    1.1.4 The Confusing Name of This Language
  1.2 Different Types of Written Taiwanese Scripts
    1.2.1 The Han Characters Script
    1.2.2 The Romanized Scripts
    1.2.3 The Han-Romanization Mixed Script
    1.2.4 Other Scripts
    1.2.5 Target Scripts in This Dissertation
  1.3 Issues Related to Written Taiwanese Processing
  1.4 Organization of This Dissertation
Chapter 2 Resources and Survey of Written Taiwanese Processing
  2.1 Digital Resources for Written Taiwanese
    2.1.1 Fonts
    2.1.2 Dictionary
    2.1.3 Text Corpora
    2.1.4 Electronic Books
  2.2 Survey of Written Taiwanese Processing Techniques
    2.2.1 Input Method
    2.2.2 Word Segmentation
    2.2.3 POS Tagging
    2.2.4 Scripts Conversion
    2.2.5 Text-to-Speech
    2.2.6 Translation
    2.2.7 Parsing
  2.3 Summary
Chapter 3 Coding, I/O for POJ, and Text Processing
  3.1 Character Code of POJ
  3.2 Two Kinds of POJ Representation
  3.3 Search Problem with POJ Text
    3.3.1 Issues with POJ Text Search
    3.3.2 Two-Stage Search Method: String Matching Then Filtering
    3.3.3 Query Expansions: Toneless, Glottal Stop, Checked Syllable, and Vowel
    3.3.4 Examples of Search Results
  3.4 POJ Text Display
    3.4.1 Issues with POJ Text Display
    3.4.2 POJ and Numbered POJ Conversion Method
    3.4.3 POJ Graph Display
    3.4.4 Examples of Display Results
  3.5 Some Text Processing Utilities for POJ
    3.5.1 POJ Phoneme Segmentation and Spelling Checker
    3.5.2 POJ Syllable/Word/Sentence Count
  3.6 Word Segmentation for HR Mixed Script
Chapter 4 Tone Sandhi Problem and Algorithm
  4.1 Tone Sandhi Problem of the Taiwanese Language
    4.1.1 Types of the Taiwanese Language Tone Sandhi
    4.1.2 Boundary of Tone Sandhi Group
  4.2 Implementation of the Taiwanese Pronunciation System
    4.2.1 System Diagram
    4.2.2 Observation Data and Test Data
    4.2.3 POS Tagging Set
    4.2.4 Tone Sandhi Marks
  4.3 Rule-based Tone Sandhi Algorithm
  4.4 Results, Accuracy Rate and Discussion
    4.4.1 Experiment Results
    4.4.2 Accuracy Rate and Related Analysis
    4.4.3 Discussion
  4.5 Summary and Possible Direction
Chapter 5 POS Tagging Method
  5.1 Problems of POS Tagging
  5.2 POS Tagging Methods
    5.2.1 Origin of the Corpus
    5.2.2 Word for Word Alignment
    5.2.3 Searching for the Corresponding Mandarin Candidate Words
    5.2.4 Selecting the Best Mandarin Translation
    5.2.5 Selecting the Most Appropriate POS According to the Corresponding Mandarin Word
  5.3 Results
  5.4 Error Analysis
    5.4.1 Incorrect Corresponding Mandarin Word Selection
    5.4.2 Absence of Appropriate Mandarin Words in the OTMD
    5.4.3 Unknown Words from the Viewpoint of Mandarin
    5.4.4 Propagation Error
    5.4.5 Other Cases
    5.4.6 Summary of Error Conditions
  5.5 Discussion
    5.5.1 Is Improvement Possible?
    5.5.2 Hyphen Problems, Distinction between Taiwanese and Mandarin
    5.5.3 The Distinction between Different Eras or Different Genres
  5.6 Summary
Chapter 6 Conclusion and Future Work
  6.1 Our Contributions to Written Taiwanese Resources and Processing
  6.2 Future Work and Prospects for Written Taiwanese Processing Research
Reference
Appendix
  A.1 Brief Introduction to The Phoneme of Taiwanese
    A.1.1 Initials
    A.1.2 Vowels
    A.1.3 Tones
    A.1.4 Compared with Mandarin
  A.2 Examples of Written Taiwanese
  A.3 Terminologies
  A.4 Webpages Made by Author
  A.5 Differences between POJ and TL

File size: 2296472 bytes
Format: application/pdf
Language: en-US
Keywords: Written Taiwanese; Tone Sandhi; POS Tagging; Peh-Oe-Ji; Natural Language Processing
Title: Processing Techniques for Written Taiwanese: Tone Sandhi and POS Tagging (台語文處理技術:以變調及詞性標記為例)
Type: thesis
File: http://ntur.lib.ntu.edu.tw/bitstream/246246/185357/1/ntu-98-D93922001-1.pdf
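
Illustrative note on numbered POJ. The abstract names numbered POJ as the interchange form between POJ encodings but the record does not give the conversion itself. The following is only a minimal sketch of one way such a conversion could work, not the thesis's method. It relies on the standard POJ tone diacritics (acute = tone 2, grave = tone 3, circumflex = tone 5, macron = tone 7, vertical line above = tone 8, with tones 1 and 4 unmarked). The function names poj_to_numbered and poj_word_to_numbered are hypothetical, and writing the dotted o as "oo", the superscript n as "nn", and the tone digit at the end of the syllable are assumptions made for this example.

```python
import unicodedata

# Combining marks that carry POJ tones; tones 1 and 4 have no mark.
TONE_MARKS = {
    "\u0301": "2",  # acute accent
    "\u0300": "3",  # grave accent
    "\u0302": "5",  # circumflex
    "\u0304": "7",  # macron
    "\u030d": "8",  # vertical line above
}
DOT_ABOVE_RIGHT = "\u0358"  # the dot of the dotted o
SUPERSCRIPT_N = "\u207f"    # nasalization mark


def poj_to_numbered(syllable: str) -> str:
    """Convert one POJ syllable to an ASCII form ending in a tone digit."""
    if not syllable:
        return ""
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = None
    letters = []
    coda = ""  # last letter, ignoring the nasalization mark
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        elif ch == DOT_ABOVE_RIGHT:
            letters.append("o")   # assumed ASCII spelling: dotted o -> "oo"
            coda = "o"
        elif ch == SUPERSCRIPT_N:
            letters.append("nn")  # assumed ASCII spelling for nasalization
        else:
            letters.append(ch)
            coda = ch
    if tone is None:
        # Unmarked syllables: tone 4 if checked (ends in p/t/k/h), else tone 1.
        tone = "4" if coda.lower() in ("p", "t", "k", "h") else "1"
    return "".join(letters) + tone


def poj_word_to_numbered(word: str) -> str:
    """Convert a hyphen-joined POJ word syllable by syllable."""
    return "-".join(poj_to_numbered(s) for s in word.split("-"))


# "Pe\u030dh-\u014de-j\u012b" is the word Peh-oe-ji written with its tone marks.
print(poj_word_to_numbered("Pe\u030dh-\u014de-j\u012b"))  # prints Peh8-oe7-ji7
```

Because every syllable ends in a plain digit, a toneless search key can be obtained in this sketch by simply stripping the final digit, which is one reason an ASCII intermediate form is convenient for text search.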