Use Context Information to Improve the Performance of Latent Dirichlet Allocation

林哲毅; Lin, Che-Yi

Title:	Use Context Information to Improve the Performance of Latent Dirichlet Allocation (以前後文增進主題模型之效能)
Author:	林哲毅 Lin, Che-Yi
Keywords:	topic model; Latent Dirichlet Allocation; context; concept vector; machine learning; latent topic
Date issued:	2014
Abstract:	Latent Dirichlet Allocation (LDA) is a widely used topic model for discovering the latent topics in documents. In some situations, such as corpora with too few documents or words whose meaning can only be determined from context, conventional LDA produces poor results. The main cause is word-sense ambiguity in natural language: a single word can carry several meanings, and without its context it is hard to tell which one is intended. Previous work addresses this by dropping LDA's "bag of words" assumption that words are independent of each other and adding dependencies between words to the topic model. We take a different approach and propose a topic model called context LDA (CLDA). The CLDA model first converts the original text into concept vectors that carry the context information at each position and uses these vectors to identify equivalence relationships between words; a topic model then takes the vectors and their equivalence relationships as input and assigns the words to latent topics. The CLDA model not only overcomes the word-sense ambiguity problem but can also be easily parallelized and extended. Even when the corpus is too small, CLDA with a slight modification and some extra knowledge about relationships between words still obtains good results, which addresses the sparse data problem. We conduct several experiments on the 20Newsgroup dataset; the results show that our model improves on the performance of the original LDA and fixes the imbalanced topic problem by using the vectors and equivalence relationships. Finally, we show examples of the latent topics found by the LDA model and by our model on the same corpus.
URI:	http://ntur.lib.ntu.edu.tw//handle/246246/261418
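Rights:	Thesis release date: 2019/08/08. Thesis usage permission: paid licensing granted (royalties returned to the university)
Appears in:	Department of Computer Science and Information Engineering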
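
The first stage described in the abstract (per-position vectors built from context, then equivalence relationships between word occurrences) can be illustrated with a minimal Python sketch. This is not the thesis's method: the window-based neighbour counts, the cosine similarity measure, the threshold, the restriction to occurrences of the same word, and the function names `context_vectors`, `cosine`, and `equivalent_pairs` are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def context_vectors(tokens, window=2):
    """Return one bag-of-neighbours Counter per token position (assumed representation)."""
    vectors = []
    for i in range(len(tokens)):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Neighbours on both sides of position i, excluding the token itself.
        neighbours = tokens[lo:i] + tokens[i + 1:hi]
        vectors.append(Counter(neighbours))
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def equivalent_pairs(tokens, window=2, threshold=0.5):
    """Pairs of positions that hold the same word in similar contexts.

    Treating such pairs as "equivalent" is an assumption of this sketch.
    """
    vecs = context_vectors(tokens, window)
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if tokens[i] == tokens[j] and cosine(vecs[i], vecs[j]) >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    tokens = "river bank water flows money bank loan river bank water".split()
    # The two "bank" occurrences next to "river"/"water" are linked;
    # the one next to "money"/"loan" is kept separate.
    print(equivalent_pairs(tokens, window=1, threshold=0.9))
```

In this toy input, the word "bank" appears three times; only the two occurrences that share the same neighbours are paired, which is the kind of sense distinction the abstract attributes to context information. A topic model would then consume these pairs as extra input, a step this sketch does not cover.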

Files in this item:

File	Description	Size	Format
ntu-103-R01922027-1.pdf		23.32 kB	Adobe PDF
