Repository logo
  • English
  • 中文
Log In
Have you forgotten your password?
  1. Home
  2. College of Liberal Arts / 文學院
  3. Linguistics / 語言學研究所
  4. Automatic Processing of Languages with Small-Scaled Corpus: Part-of-Speech Tagging and Partial Parsing SaiSiyat and Applications
 
  • Details

Automatic Processing of Languages with Small-Scaled Corpus: Part-of-Speech Tagging and Partial Parsing SaiSiyat and Applications

Date Issued
2005
Date
2005
Author(s)
Lin, Zhe-Min
DOI
en-US
URI
http://ntur.lib.ntu.edu.tw//handle/246246/59388
Abstract
This thesis demonstrates an effective method to tag and parse a corpus with no more than twenty thousand words, along with three useful applications which take advantage of the manipulated corpus. The NTU corpus of Austronesian languages, an intonation-unit (IU) based corpus, is chosen to be processed. In Chapter 1, we introduce current problems in automatic processing of Austronesian languages. As small-scaled corpora limit the usage of statistical natural language processing, we are urged to find an alternative method to deal with Austronesian corpora. A new tag set is defined in Chapter 2 to reflect linguistic particularity of the object language of this thesis, SaiSiyat. Two methods to label part-of-speech tags, the gloss-based approach (accuracy rate 75%) and transformation-based error-driven learning (TBL, accuracy rate 85%), are evaluated and reported robust. Difficulties to distinguish between SaiSiyat nominative and accusative case markers are especially discussed. A partial parser is useful in preparing a corpus for noun-phrase extraction and further analyses. In Chapter 3, the tagged corpus is parsed into binary trees by a statistical approach, Kullback-Leibler divergence, and the TBL method. The former method declines quickly as IU length increases and needs huge computation time, while the accuracy rate of the latter method is a little less than 70%. Chapter 4 shows how an annotated corpus is related to linguistic research, native speakers of the object language and the public. Machine-aided annotation helps linguists to quickly rearrange collected data. An integrated platform of multimedia online corpora is also designed in this chapter, in order to serve both linguists and the public. In the last chapter, the natural language processing is discussed in early and late Wittgenstein's points of view. We agree with the idea that the meaning of a word is as many as its actual use. Thus, the computer cannot go beyond the boundary of the micro-cosmos composed by texts given in a corpus.
Subjects
基於轉換的錯誤驅動學習
標記集
線上語料庫
維特根斯坦
Formosan Austronesian languages
tag set
transformation-based error-driven learning
onlinecorpus design
fieldwork process
Wittgenstein
Type
book

臺大位居世界頂尖大學之列,為永久珍藏及向國際展現本校豐碩的研究成果及學術能量,圖書館整合機構典藏(NTUR)與學術庫(AH)不同功能平台,成為臺大學術典藏NTU scholars。期能整合研究能量、促進交流合作、保存學術產出、推廣研究成果。

To permanently archive and promote researcher profiles and scholarly works, Library integrates the services of “NTU Repository” with “Academic Hub” to form NTU Scholars.

總館學科館員 (Main Library)
醫學圖書館學科館員 (Medical Library)
社會科學院辜振甫紀念圖書館學科館員 (Social Sciences Library)

開放取用是從使用者角度提升資訊取用性的社會運動,應用在學術研究上是透過將研究著作公開供使用者自由取閱,以促進學術傳播及因應期刊訂購費用逐年攀升。同時可加速研究發展、提升研究影響力,NTU Scholars即為本校的開放取用典藏(OA Archive)平台。(點選深入了解OA)

  • 請確認所上傳的全文是原創的內容,若該文件包含部分內容的版權非匯入者所有,或由第三方贊助與合作完成,請確認該版權所有者及第三方同意提供此授權。
    Please represent that the submission is your original work, and that you have the right to grant the rights to upload.
  • 若欲上傳已出版的全文電子檔,可使用Open policy finder網站查詢,以確認出版單位之版權政策。
    Please use Open policy finder to find a summary of permissions that are normally given as part of each publisher's copyright transfer agreement.
  • 網站簡介 (Quickstart Guide)
  • 使用手冊 (Instruction Manual)
  • 線上預約服務 (Booking Service)
  • 方案一:臺灣大學計算機中心帳號登入
    (With C&INC Email Account)
  • 方案二:ORCID帳號登入 (With ORCID)
  • 方案一:定期更新ORCID者,以ID匯入 (Search for identifier (ORCID))
  • 方案二:自行建檔 (Default mode Submission)
  • 方案三:學科館員協助匯入 (Email worklist to subject librarians)

Built with DSpace-CRIS software - Extension maintained and optimized by 4Science