Automatic Processing of Languages with Small-Scaled Corpus: Part-of-Speech Tagging and Partial Parsing SaiSiyat and Applications

Author: Lin, Zhe-Min

Lin, Zhe-Min; 宋麗梅

Date Issued

2005

Author(s)

Lin, Zhe-Min

URI

http://ntur.lib.ntu.edu.tw//handle/246246/59388

Abstract

This thesis demonstrates an effective method for tagging and parsing a corpus of no more than twenty thousand words, together with three applications that make use of the processed corpus. The corpus chosen is the NTU corpus of Austronesian languages, which is based on intonation units (IUs). Chapter 1 introduces current problems in the automatic processing of Austronesian languages. Because small corpora limit the use of statistical natural language processing, we are compelled to find an alternative method for handling Austronesian corpora. Chapter 2 defines a new tag set that reflects the linguistic particularities of SaiSiyat, the object language of this thesis. Two methods of assigning part-of-speech tags are evaluated and reported to be robust: the gloss-based approach (accuracy rate 75%) and transformation-based error-driven learning (TBL, accuracy rate 85%). The difficulty of distinguishing SaiSiyat nominative and accusative case markers is given particular attention. A partial parser is useful for preparing a corpus for noun-phrase extraction and further analysis. In Chapter 3, the tagged corpus is parsed into binary trees by two methods: a statistical approach based on Kullback-Leibler divergence, and TBL. The performance of the former declines quickly as IU length increases, and it requires a large amount of computation time, while the accuracy rate of the latter is slightly below 70%. Chapter 4 shows how an annotated corpus relates to linguistic research, to native speakers of the object language, and to the public. Machine-aided annotation helps linguists quickly reorganize collected data. The chapter also presents the design of an integrated platform for multimedia online corpora that serves both linguists and the public. The last chapter discusses natural language processing from the viewpoints of the early and the late Wittgenstein. We agree with the idea that a word has as many meanings as it has actual uses. Thus, the computer cannot go beyond the boundary of the microcosm formed by the texts given in a corpus.

Subjects

Formosan Austronesian languages

tag set

transformation-based error-driven learning

online corpus design

fieldwork process

Wittgenstein

Type

book
