Recommendation on Commercial Intention in Dialogs (對話過程廣告標的推薦之研究)

Contributor: 陳信希
Institution: National Taiwan University, Graduate Institute of Computer Science and Information Engineering (臺灣大學：資訊工程學研究所)
Author: 黃宏吉 (Huang, Hung-Chi)
Dates: 2007-11-26, 2018-07-05
Year: 2007
URI: http://ntur.lib.ntu.edu.tw//handle/246246/54075

Abstract

Instant messaging (IM) is among the most popular applications on the Internet. Users converse by typing natural-language text or symbols, and during a conversation the system displays advertisement links that are chosen at random and are unrelated to what is being said. The goal of this thesis is to build a dialog analysis system that proposes advertisement links highly relevant to the dialog content, thereby raising users' interest in the ads and increasing the click rate.

The proposed model uses the Yahoo! Directory tree as its data-comparison source and classifies each dialog into one of the 14 categories of Yahoo! Directory, such as Recreation & Sports, Art, and Science. Each term in a dialog is weighted by its document frequency in the Yahoo! Directory documents. Depending on the model, the frequency of the term within the dialog or within each category is also taken into account, which yields a TFIDF-like method. After terms are extracted, the model parameters determine which of them enter the weight calculation, based on properties such as part of speech (for example, verb or noun) and word length. The system also uses hypernyms, hyponyms, and synonyms when computing term weights.

To improve the data source, an expansion algorithm enlarges the downloaded Yahoo! Directory tree using the HTML description files attached to its nodes, which contain related web pages with titles, links, and snippets. The experimental results show that expansion improves performance. For convenience, all of these data sources are referred to as corpora.

In the best model and parameter setting, using nouns together with the expanded corpus gives an F-value of 90%. In the thesis's terms, the model correctly identifies the category of about 90 dialogs out of a given set of 100 and can then retrieve advertisements from the matching category. The thesis also proposes a special statistic, hit speed: the round of a dialog at which the system correctly guesses the dialog's category. Current results suggest that, with the best model, the correct advertisement category can be identified by the middle round of a conversation, at which point meaningful ad links can be delivered.

The thesis also defines a decision tree that decides whether a term from a dialog is a new term and retrieves its definition. After some refinement, it can additionally identify transliterated geographical names and people's names. Finally, the thesis summarizes the experimental results, explains how to implement a complete IM system that recommends advertisements based on dialog analysis, and discusses related topics and applications for future research.

Table of Contents

Oral Examination Committee Certification
Acknowledgements
Chinese Abstract
Abstract
Chapter 1 Introduction
Chapter 2 Model Design and Implementation
  2.1 Corpus
  2.2 Corpus Expansion
  2.3 Scoring Algorithm
  2.4 Language Features
  2.5 Sponsor Search
  2.6 Dictionary Expansion
    2.6.1 Transliteration Search
    2.6.2 People Search
  2.7 Model Summary
Chapter 3 Experimental Results
  3.1 Annotated Category Distribution
  3.2 Performance Evaluation
    3.2.1 Language Features
    3.2.2 TFIDF Evaluation
    3.2.3 Hypernym, Hyponym, and Synonym
    3.2.4 Removal of Redundant Category
    3.2.5 Threshold
    3.2.6 Original VS Expanded Corpus
    3.2.7 Hit Speed
  3.3 Overall Evaluation
  3.4 Discussion
Chapter 4 Conclusion and Future Work
Reference
Appendix
  I Details of the 30 Selected Dialogs
  II Sample of MSN Messenger XML Log File
  III Sample of Yahoo! Directory Tree HTML File
  IV Selected Words from the New Dictionary

List of Figures

Fig. 1: Sponsor links in instant messaging applications
Fig. 2: System architecture
Fig. 3: Yahoo! Directory tree (partial)
Fig. 4: Sample HTML in Yahoo directory tree
Fig. 5: Tree expansion results in all categories
Fig. 6: Nodes containing the word ‘care’ in the corpus
Fig. 7: Two-way decision tree for generating new dictionary
Fig. 8: Annotated category distribution
Fig. 9: Precision results at different word length level in YahooO
Fig. 10: Recall results at different word length level in YahooO
Fig. 11: Precision results at different word length level in YahooO with non-stopword and noun only
Fig. 12: Recall results at different word length level in YahooO with non-stopword and noun only
Fig. 13: Term frequency of all words in sample dialogs
Fig. 14: Recall in TFIDF
Fig. 15: Precision in TFIDF
Fig. 16: Recall in hypernym, hyponym, and synonym in DF mode
Fig. 17: Precision in hypernym, hyponym, and synonym in DF mode
Fig. 18: Recall in hypernym, hyponym, and synonym in TFIDF mode
Fig. 19: Precision in hypernym, hyponym, and synonym in TFIDF mode
Fig. 20: Threshold with word length 3
Fig. 21: Threshold with word length 3 and all features
Fig. 22: Precision results at different word length level in YahooS
Fig. 23: Recall results at different word length level in YahooS
Fig. 24: All tree precision
Fig. 25: All tree precision with noun only
Fig. 26: All tree recall
Fig. 27: All tree recall with noun only
Fig. 28: All tree recall in TFIDF
Fig. 29: All tree precision in TFIDF
Fig. 30: AvgHitSpeed-correct in YahooO for baseline and noun
Fig. 31: AvgHitSpeed-all YahooO for baseline and noun
Fig. 32: AvgHitSpeed-correct in YahooS for baseline and noun
Fig. 33: AvgHitSpeed-all YahooS for baseline and noun
Fig. 34: Precision results at different hit speed threshold
Fig. 35: Recall results at different hit speed threshold
Fig. 36: AvgHitSpeed-correct in all trees with noun
Fig. 37: AvgHitSpeed-all in all trees with noun

File size: 7301720 bytes
Format: application/pdf
Language: en-US
Keywords: Instant messaging; dialog analysis; commercial intention; advertisement suggestion; corpus expansion; on-line directory
English title: Recommendation on Commercial Intention in Dialogs
Type: thesis
Full text: http://ntur.lib.ntu.edu.tw/bitstream/246246/54075/1/ntu-96-P93922003-1.pdf
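
Illustrative sketch: the abstract describes weighting dialog terms by their document frequency in a categorized corpus and measuring "hit speed" as the round at which the category is guessed correctly. The Python below is only a minimal sketch of that idea under stated assumptions, not the thesis's implementation. The function names, the toy corpus, the plain summation of document frequencies, the alphabetical tie-break, and the reading of hit speed as the first correct round over cumulative dialog text are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def build_df_index(corpus):
    """Map each term to its per-category document frequency.

    corpus: {category: [list of terms per document, ...]} (toy stand-in
    for directory documents; the real corpus format is not shown here).
    """
    index = defaultdict(Counter)
    for category, documents in corpus.items():
        for terms in documents:
            for term in set(terms):  # count a document once per term
                index[term][category] += 1
    return index


def score_dialog(terms, index):
    """Sum, per category, the document frequency of every dialog term.

    Assumption: plain summation; the thesis also evaluates TFIDF-like
    variants and language-feature filters that are not modeled here.
    """
    scores = Counter()
    for term in terms:
        for category, df in index.get(term, {}).items():
            scores[category] += df
    return scores


def classify(terms, index):
    """Return the highest-scoring category, or None if no term matches."""
    scores = score_dialog(terms, index)
    if not scores:
        return None
    # Assumption: ties are broken alphabetically to keep results deterministic.
    return max(sorted(scores), key=scores.get)


def hit_speed(rounds, index, true_category):
    """1-based round at which the cumulative dialog is first classified
    as true_category, or None if that never happens (assumed definition)."""
    seen = []
    for round_number, terms in enumerate(rounds, start=1):
        seen.extend(terms)
        if classify(seen, index) == true_category:
            return round_number
    return None


# Toy data, invented purely for this example.
corpus = {
    "Recreation & Sports": [["tennis", "match", "ticket"], ["tennis", "racket"]],
    "Science": [["physics", "experiment"], ["match", "physics"]],
}
rounds = [["hello"], ["physics", "homework"], ["tennis", "ticket", "tennis"]]

index = build_df_index(corpus)
print(classify(["tennis", "ticket"], index))
print(hit_speed(rounds, index, "Recreation & Sports"))
```

In this toy run the dialog is only recognized as Recreation & Sports once the third round adds enough sports terms to outweigh the earlier Science evidence, which is the kind of behavior a hit speed statistic is meant to capture.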