2020-08-012024-05-17https://scholars.lib.ntu.edu.tw/handle/123456789/679065摘要:質性研究學者通常會從訪談內容或是文本中萃取出關鍵訊息,加以分析並探究一些 現象;在分析大量的文字資料時,藉由文字探勘的技術往往可以以較有效率的方式 獲取文字中隱含的訊息。文字探勘的技術,可以將非結構化的文字資料轉換成結構 化的數值資料,透過斷字斷詞、特徵選取、分群與分類等方式解讀文字中傳達的訊 息。本計畫預計執行三年,將著重於文字資料的分群與分類,提出以貝氏統計推論 為架構之分群與分類方法。文字資料經轉換成結構化資料後,將會產生高維度稀疏 的詞頻矩陣,在這樣的矩陣中,大部分的統計方法都無法直接用於分析,且這樣的 矩陣中除了可能帶有許多無意義的詞彙,在進行後續分析前需先被剔除之外,許多 詞彙間可能有存在相關性,因此,本計畫預計提出以漢明距離為基礎之貝氏分群法 ,用以找出詞彙間的關聯性,並利用後驗分佈進行詞彙群的篩選與推論。在文本的 分類問題上,一則文本往往會帶有數個標籤類別,不是單一標籤類別可以解釋的 ,而過去針對多標籤分類的研究,多是立基於機器學習的概念之下,所提出的演算 法較少具有統計意涵,因此,本計畫預計以統計的角度出發,將多標籤視為多變量 分佈,利用貝氏統計的架構,提出多標籤分類的方法並探討其統計性質。<br> Abstract: In qualitative study, researchers usually would like to extract key information from interviews, texts or documents to figure out phenomena. When analyzing large- scale text data, using text mining techniques, which transform unstructured text data into structured numeric data, to explore information is usually more efficient. The techniques include word segmentation, feature selection, cluster and classification. In this three-years proposal, we will focus on text clustering and document classification, and will propose a clustering method and a multi-label classification method based on Bayesian framework. After text data transform to structured numeric data, the term matrix will be high-dimensional sparse matrix. In such a matrix, most of the statistical methods cannot be directly used for analysis, some meaningless terms in the term matrix need to be removed before further analysis, and correlation among terms should be considered in further analysis. Hence, in this proposal, we will proposed a Hamming distance-based Bayesian clustering statistical method to figure out the relationship among terms and the term clusters. Then, use the posteriors of term clusters to select important terms and to make inference. In addition, a document often have multi- label rather than only one label. In the past, multi-label classification algorithms were usually proposed by researchers in the field of machine learning resulting in these algorithms without statistical sense. Thus, we will propose a Bayesian multi-label classification method based on assuming the label vectors with multivariate distribution and discuss its related statistical properties.貝氏分群方法貝氏推論貝氏多變量羅吉斯迴歸文件分類漢明距離多標籤分類文字探勘Bayesian clustering methodBayesian inferenceBayesian multivariate logistic regressiondocument classificationHamming distancemulti-label classificationtext miningBayesian Model for Information Integration, Selection and Classification Based on Large-Scale Text Data and Its Application on Qualitative Research