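Academic Research Career Development Project - Laurel Research Project [Forward-Looking Modern Chinese Corpus Linguistics: Modularizing Knowledge of Corpus Collection, Processing, and Statistical Analysis]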

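2011-08-01
2024-05-17
https://scholars.lib.ntu.edu.tw/handle/123456789/680040

Summary: Corpus linguistics research in Taiwan once led the rest of Asia. Since its creation in the early 1990s, the Academia Sinica Balanced Corpus has had a profound influence on linguistics and language teaching in Taiwan. Twenty years on, however, the corpus has regrettably not kept pace with the times, and its synchronic validity and representativeness are widely questioned. Even in Europe and North America, where corpus development is most advanced, all but the few projects with sustained funding from governments or large institutions face the same problem. In response, recent developments in corpus linguistics have moved from traditional hand-built corpora to semi-automatically constructed mega-corpora, and on to using the web itself as a corpus. The availability of large-scale web data puts the empirical foundation of linguistic research on firmer ground, and the rich, heterogeneous information that these massive data reveal has had a major impact on the methods of linguistic theory. In our earlier pilot studies we proposed a web balanced corpus and implemented it for selected domains and social networking sites, which raised many questions of corpus methodology. For example, when a corpus grows so large that most of it can no longer be checked and corrected by hand, is the resulting web corpus as reliable as a traditional corpus, and can it serve as a good language sample? What metrics can we use to express that confidence? And what is the relationship between large-scale corpus data and linguistic theory? We envision future corpora whose texts are not closed and static and whose linguistic information is diverse and heterogeneous in type. Hosted in a continuously updated cloud, they would provide linguistic information along different dimensions (including geographic, temporal, gender, phonetic, and ontological information), allow ad hoc computation over the data for any purpose or perspective, and offer convenient tools and API interfaces so that language researchers can immediately build sub-corpora and theoretical analysis modules. Continuing our earlier work, this study proposes the following four closely interlinked new trends in corpus linguistics and, with the needs of linguistics researchers in mind, implements them as knowledge modules:
‧ Exploratory Linguistic Data Analysis
‧ Multi-dimensional automatic annotation and mashup techniques for corpus data
‧ Sustainable management, evaluation, and proofreading of large-scale corpora: a preliminary proposal for cloud and crowd computing
‧ Large-scale corpus computation and theoretical modeling
The aim of this study is to achieve these goals over a four-year project and so carry forward the development of modern Chinese corpora in Taiwan. We believe that these corpus knowledge modules will be a core part of forward-looking linguistics, and we are confident that they can bring corpus research in Taiwan to another peak. The resulting corpora and tools will be released under Creative Commons terms, for the sake of resource sharing and knowledge accumulation.

Abstract: The development of corpus linguistics has played a crucial role in both theoretical and applied linguistics. However, the well-developed and once pioneering Balanced Corpus at Academia Sinica now faces questions about its synchronic representativeness, having been inactive for the past twenty years. In response, research on Web as Corpus (WaC) has emerged rapidly in recent years. Corpus construction based on web data has become a focus of corpus linguistics, and the concept of ‘Big Data’ is also challenging theory construction. Our previous work aimed to fill the research gap in Chinese WaC and to tackle some of the methodological issues that WaC raises. For example, how, and to what extent, can evaluation metrics be used to establish that a WaC is a robust and good sample, in comparison with a traditional manually controlled corpus?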
And what would be the proper relation between scaled corpus data and theoretical modeling, in a situation where heterogeneous linguistic data and meta-information can be extracted by innovative web mining techniques such as mashup programming, social networks, and GIS web services? Following our previous work, this project aims to propose forward-looking solutions for modularizing Chinese corpus data processing and (statistical) analysis. Four modules tailored to the scaled Chinese WaC will be explored and implemented in R:
‧ Exploratory Linguistic Data Analysis
‧ Multi-dimensional annotation and mashup
‧ Scaled corpus sustainability and evaluation: a cloud and crowd computing approach
‧ Scaled corpus computation and semantic modeling
We believe that this project will fill a gap in current corpus linguistics in Taiwan, and we hope that, with the completion and free release of these resources, we can once again be pioneers in the development of corpus linguistics.

Keywords (translated from the Chinese): corpus linguistics; web corpus; mashup programming; cloud computing; R and statistical applications; language and web applications
Keywords: Corpus linguistics; Web as corpus; Mashup; Cloud computing; R; Crowd sourcing