Title: 利用布隆過濾器建構具隱私保護資料比對之框架 (A Framework for Privacy-preserving Record Linkage Using Bloom Filter)
Author: 張立中 (Chang, Li-Chung)
Advisor: 雷欽隆
Affiliation: College of Electrical Engineering and Computer Science, Graduate Institute of Electrical Engineering
Type: thesis
Year: 2015
Record dates: 2017-03-06; 2018-07-06
URI: http://ntur.lib.ntu.edu.tw//handle/246246/276639
File: http://ntur.lib.ntu.edu.tw/bitstream/246246/276639/1/ntu-104-R02921035-1.pdf (application/pdf, 3073574 bytes)
Thesis release date: 2015/8/4
Thesis usage rights: paid licensing granted (royalties returned to the university)
Keywords: 資料比對; 隱私保護; 布隆過濾器; 權重; record linkage; privacy-preserving; bloom filter; field weighting

Abstract (translated from Chinese):
Because personal records are often scattered across different places, merging data requires finding the records that describe the same person, which improves data completeness and removes duplicates. A unique personal ID would make this easy, but in most cases we cannot guarantee that both records carry it. Instead, we must judge by several attributes together, such as name, gender, and street; if these attributes all match, we treat the two records as describing the same person. Moreover, in today's information age, many laws explicitly regulate the protection of personal data, and patient privacy is taken especially seriously in medicine. For these reasons, privacy-preserving record matching has been studied and discussed continually in recent years.

Research has shown that record linkage based on the Bloom filter technique outperforms other methods in both accuracy and efficiency, yet a third party can easily break the encoding through frequency-analysis attacks. To address this problem, the RBF and CLK algorithms were developed, each with its own strengths and weaknesses. In this thesis we combine the two methods, keep their strengths, improve on their weaknesses, and propose a new approach: WCLK. We also use entropy to compute a weight ratio for each attribute; giving each attribute a different weight effectively improves the final matching accuracy. Finally, in this linkage method the user must set a threshold to separate different similarity values. Since in practice we cannot compute this threshold from the accuracy of the merged results, we use clustering to estimate a suitable threshold accurately. In the experiments, we use the Febrl tool to generate the data sources and verify that our technique can effectively replace existing methods and even achieve better results.

Abstract (English):
Record linkage is the task of identifying records from multiple datasets that refer to the same individual. This is not easy, because unique identifiers are unavailable most of the time, so a set of attributes, such as Forename, Gender, and Street, is used as quasi-identifiers. In addition, various privacy regulations and policies prohibit the disclosure of identities, especially in the medical domain. Therefore, many methods of privacy-preserving record linkage (PPRL) have been developed to integrate datasets without revealing the identities associated with the records. A recent evaluation has shown that a transformation based on Bloom filters is superior to other approaches, but the encoding may be compromised through frequency-based cryptanalysis. Two methods, RBF and CLK, have been proposed to solve this problem, and each has its own strengths and weaknesses. In this thesis, we merge these two methods and propose an advanced one, which we call WCLK. In addition, entropy is used to determine field weights. By giving a different weight to each field, we can improve the accuracy of the linkage results. Finally, because linkage quality and completeness cannot be measured in practice, threshold determination is a major challenge. We therefore propose a clustering-based method to find a suitable threshold, which also leads to accurate record linkage results. Using datasets generated by Febrl, we conduct several empirical experiments to show that our work can perform better than previous methods.
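
The abstract refers to Bloom filter encodings of record fields and to entropy-based field weights. As a rough illustration only, the sketch below encodes each field's character bigrams into a Bloom filter, compares two filters with the Dice coefficient, and combines the per-field similarities using weights proportional to each field's Shannon entropy. It is not the thesis's WCLK method, nor RBF or CLK. The filter length `m`, the hash count `k`, the `secret` key, the SHA-256 hashing scheme, the weight normalization, and all function names are assumptions made for this example.

```python
import hashlib
import math
from collections import Counter


def bigrams(value):
    """Split a padded, lower-cased string into its set of overlapping 2-grams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, m=256, k=4, secret="demo-key"):
    """Return the set of bit positions (0..m-1) set by hashing each bigram k times.

    m, k, secret and the keyed SHA-256 scheme are illustrative assumptions.
    """
    bits = set()
    for gram in bigrams(value):
        for i in range(k):
            digest = hashlib.sha256(f"{secret}|{i}|{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % m)
    return bits


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_weights(records, fields):
    """Weight each field by its share of the total entropy (assumed normalization)."""
    h = {f: entropy([r[f] for r in records]) for f in fields}
    total = sum(h.values())
    if total == 0:
        # All fields are constant: fall back to equal weights.
        return {f: 1 / len(fields) for f in fields}
    return {f: h[f] / total for f in fields}


def weighted_similarity(r1, r2, weights):
    """Weighted sum of per-field Dice similarities between two records."""
    return sum(
        w * dice(bloom_encode(r1[f]), bloom_encode(r2[f]))
        for f, w in weights.items()
    )


# Toy records; field names follow the quasi-identifiers named in the abstract.
records = [
    {"forename": "peter", "gender": "m", "street": "main street"},
    {"forename": "petra", "gender": "f", "street": "station road"},
    {"forename": "john", "gender": "m", "street": "main street"},
    {"forename": "joan", "gender": "f", "street": "park avenue"},
]
weights = entropy_weights(records, ["forename", "gender", "street"])
score = weighted_similarity(records[0], records[1], weights)
```

In this sketch a field with more distinct values, such as `forename`, receives a larger weight than a low-entropy field such as `gender`, so agreement on it counts for more. A pair would then be classified as a match when `score` exceeds a chosen threshold; how to choose that threshold is a separate step not shown here.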