Use Context Information to Improve the Performance of Latent Dirichlet Allocation
Date Issued
2014
Author(s)
Lin, Che-Yi
Abstract
Latent Dirichlet Allocation (LDA) is a widely used topic model for discovering the topics in documents; however, it suffers from problems such as the lack of dependency between words and data sparsity. A main cause of these problems is word-sense ambiguity in natural language. Previous works drop the "bag of words" assumption and add dependencies between words; we take a different approach. To solve these problems, we propose a topic model called the context LDA (CLDA) model.
The CLDA model first builds concept vectors from the context information at each position and uses these vectors to identify equivalence relationships between words; it then models the words into latent topics with a topic model that takes these relationships as input. The CLDA model not only overcomes the word-sense ambiguity problem but is also easily parallelized and extended. With some extra knowledge and a slight modification, we show that our model can also ease the sparse-data problem. We conduct several experiments on the 20 Newsgroups dataset; the results show that our model improves the performance of the original LDA and fixes the imbalanced-topic problem by using the vectors and equivalence relationships. Finally, we show examples of the latent topics produced by the LDA model and by our model.
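The abstract's first step (building context vectors per word and grouping words whose vectors are nearly identical) can be sketched as follows. This is a minimal illustration, not the thesis's actual algorithm: the window size, the cosine-similarity threshold, and the pairwise merge rule are all assumptions for demonstration purposes.

```python
import numpy as np
from itertools import combinations

def context_vectors(docs, vocab, window=2):
    """Build a co-occurrence (context) vector for each vocabulary word,
    counting neighbors within `window` positions. Window size is an
    illustrative assumption."""
    idx = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        for i, w in enumerate(doc):
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    vecs[idx[w], idx[doc[j]]] += 1
    return vecs

def equivalent_pairs(vecs, vocab, threshold=0.9):
    """Treat two words as equivalent when their context vectors have
    cosine similarity above `threshold` (a hypothetical criterion)."""
    norms = np.linalg.norm(vecs, axis=1)
    pairs = []
    for a, b in combinations(range(len(vocab)), 2):
        if norms[a] and norms[b]:
            cos = vecs[a] @ vecs[b] / (norms[a] * norms[b])
            if cos >= threshold:
                pairs.append((vocab[a], vocab[b]))
    return pairs

docs = [["bank", "river", "water"], ["shore", "river", "water"]]
vocab = ["bank", "river", "water", "shore"]
pairs = equivalent_pairs(context_vectors(docs, vocab), vocab)
print(pairs)  # "bank" and "shore" share identical contexts here
```

Equivalence classes found this way could then replace individual word tokens before topic inference, which is the general idea the abstract describes.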
Subjects
Topic model
Latent Dirichlet Allocation
Context
Sense vector
Machine learning
Latent topic
Type
thesis
File(s)
Name
ntu-103-R01922027-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum (MD5)
cf10dd75f59504cbea235660c5da8a8c
