Processing Techniques for Written Taiwanese: Tone Sandhi and POS Tagging
Date Issued
2009
Author(s)
Iunn, Un-Gian
Abstract
Taiwan Southern Min (Taiwanese) is an important language that has received little attention worldwide. Written Taiwanese differs from Mandarin and English in several respects. This dissertation focuses on processing techniques for written Taiwanese.
POJ (Peh-Oe-Ji) is an important script for Taiwanese. We introduce the character codes used for POJ and describe numbered POJ as an interchange code among the various POJ encodings. We then propose a two-stage search strategy for POJ text search, together with POJ syllable query expansion. We also describe a display method for POJ, POJ word-processing utilities, and a word segmentation method for HR mixed script.
We propose a rule-based tone sandhi algorithm. We translate every word into Mandarin to obtain its POS information. Using the POS data and the tone sandhi rules, we then tag each syllable with its post-sandhi tone marker. Finally, we implement a Taiwanese tone sandhi processing system, which achieves an accuracy of 97.4% on the training data and 89.0% on the test data.
Additionally, we propose a POS tagging method. We develop a word alignment checker to assist word alignment between the two Taiwanese scripts, select the most suitable Mandarin word using a hidden Markov probabilistic model, and finally tag each word using a maximum entropy Markov model classifier. We achieve an accuracy of 91.5% on Taiwanese POS tagging.
Over the past several years we have built a number of useful online tools for written Taiwanese. We hope that these tools and our preliminary results will help promote research on written Taiwanese processing.
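To make the tagging step concrete, here is a minimal sketch of how a syllable in numbered POJ might be rewritten with its post-sandhi tone. It is not the dissertation's rule set. The tone mappings are the commonly taught general sandhi pattern and are an assumption here, as is the choice of 5 to 7 (some dialects use 5 to 3). The names `sandhi_tone` and `apply_sandhi` and the `keeps_citation_tone` flags are hypothetical. In the thesis the decision about which syllables keep their citation tone comes from POS data and rules, whereas this sketch takes that decision as given input.

```python
import re

# Commonly taught general sandhi pattern for unchecked tones (an assumption,
# not the thesis's rule set; tone 5 becomes 3 rather than 7 in some dialects).
OPEN_SANDHI = {"1": "7", "7": "3", "3": "2", "2": "1", "5": "7"}
# Checked syllables ending in -p/-t/-k swap tones 4 and 8.
PTK_SANDHI = {"4": "8", "8": "4"}
# Checked syllables ending in -h: 4 becomes 2, 8 becomes 3.
H_SANDHI = {"4": "2", "8": "3"}

# A numbered-POJ syllable: letters followed by one tone digit, e.g. "tai5".
SYLLABLE = re.compile(r"^([A-Za-z]+)([1-8])$")


def sandhi_tone(syllable):
    """Return the syllable with its post-sandhi tone number."""
    m = SYLLABLE.match(syllable)
    if m is None:
        return syllable  # not a numbered syllable; leave untouched
    body, tone = m.groups()
    last = body[-1].lower()
    if last == "h":
        table = H_SANDHI
    elif last in "ptk":
        table = PTK_SANDHI
    else:
        table = OPEN_SANDHI
    # Tones with no entry in the chosen table are left unchanged.
    return body + table.get(tone, tone)


def apply_sandhi(syllables, keeps_citation_tone):
    """Apply sandhi to every syllable not flagged as keeping its citation tone."""
    return [
        s if keep else sandhi_tone(s)
        for s, keep in zip(syllables, keeps_citation_tone)
    ]


if __name__ == "__main__":
    # Only the final syllable of this three-syllable word keeps its citation tone.
    print(apply_sandhi(["tai5", "oan5", "lang5"], [False, False, True]))
```

Keeping the boundary decision separate from the tone mapping means the mapping tables can be swapped per dialect without touching whatever component decides where sandhi applies.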
Subjects
Written Taiwanese
Tone Sandhi
POS Tagging
Peh-Oe-Ji
Natural Language Processing
Type
thesis
File(s)
Name
ntu-98-D93922001-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum
(MD5):cf0aef9db5fe01288932e83ec5c98345
