National Taiwan University (臺灣大學), Graduate Institute of Electrical Engineering (電機工程學研究所)
陳銘憲; Lin, Keng-Pei (林耕霈)
2011
http://ntur.lib.ntu.edu.tw//handle/246246/254126

Data classification is a widely used data mining technique that learns classifiers from labeled data to predict the labels of unlabeled instances. Among data classification algorithms, the support vector machine (SVM) shows state-of-the-art performance. Data privacy is a critical concern in applying data mining techniques. In this dissertation, we study how to achieve privacy preservation in utilizing the SVM, as well as how to generate the SVM classifier efficiently.

Outsourcing has become popular with the current trend toward cloud computing.
Since the training algorithm of the SVM involves intensive computation, outsourcing it to an external service provider can benefit a data owner who possesses only limited computing resources. In outsourcing, data privacy is a critical concern, since the data may contain sensitive information. In addition to the data themselves, the classifier generated from the data is also private to the data owner. Existing privacy-preserving SVM outsourcing techniques are weak in security. In Chapter 2, we propose a secure privacy-preserving SVM outsourcing scheme in which the data are perturbed by a random linear transformation, providing stronger security than existing works. The service provider generates the SVM classifier from the perturbed data; the resulting classifier is also in perturbed form and cannot be accessed by the service provider.

In Chapter 3, we study the privacy violation inherent in the SVM classifier itself. The SVM trains a classifier by solving an optimization problem that decides which instances of the training dataset become support vectors, the instances that carry the information necessary to form the SVM classifier. Since support vectors are intact tuples taken from the training dataset, releasing the SVM classifier to the public or to other parties discloses their private content. We propose an approach that post-processes the SVM classifier into a privacy-preserving SVM classifier which does not disclose the private content of the support vectors: it precisely approximates the decision function of the Gaussian kernel SVM classifier without exposing any individual support vector. The privacy-preserving SVM classifier can thus release the prediction ability of the SVM classifier without violating individual data privacy.

The efficiency of the SVM is also an important issue, since for large-scale data the SVM solver converges slowly.
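The key property behind the Chapter 2 scheme is that a suitably chosen random linear transformation hides the data while leaving the quantities the SVM solver needs intact. As a minimal sketch (not the dissertation's exact construction), a random orthogonal matrix Q preserves Euclidean distances, so Gaussian kernel values computed on the perturbed data equal those on the original data; the function names `random_orthogonal` and `gaussian_kernel` are illustrative, not from the dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(d, rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal Q;
    # the sign fix makes the distribution uniform (Haar) over orthogonal matrices
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def gaussian_kernel(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

d = 5
X = rng.standard_normal((8, d))   # private data
Q = random_orthogonal(d, rng)     # secret key held only by the data owner
Xp = X @ Q.T                      # perturbed data sent to the service provider

# Q preserves distances, so every Gaussian kernel value is unchanged
print(np.allclose(gaussian_kernel(X, X), gaussian_kernel(Xp, Xp)))  # True
```

Only the data owner holds Q, so the service provider can run SVM training on the perturbed data without seeing an original instance, yet obtain exactly the same kernel matrix. The actual transformation and security analysis in Chapter 2 are more elaborate than this sketch.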
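The principle behind the Chapter 3 classifier can be mimicked by any approximation that writes the kernel as a dot product of explicit features, k(x, y) ≈ z(x)·z(y). The dissertation develops its own precise Gaussian-kernel approximation; the sketch below substitutes random Fourier features (Rahimi and Recht) purely to illustrate the idea: the decision function f(x) = Σᵢ αᵢyᵢ k(xᵢ, x) + b collapses to w·z(x) + b with w = Σᵢ αᵢyᵢ z(xᵢ), so releasing only (w, b) exposes no individual support vector. All names below (`f_exact`, `f_private`, the toy support vectors) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, d, D = 0.5, 3, 2000  # kernel parameter, input dim, feature dim

# Random Fourier features: z(x).z(y) approximates exp(-gamma * ||x - y||^2)
W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
b = rng.uniform(0, 2 * np.pi, size=D)
z = lambda X: np.sqrt(2.0 / D) * np.cos(X @ W + b)

# pretend these came out of a trained Gaussian-kernel SVM
SV = rng.standard_normal((10, d))                         # support vectors (private)
alpha_y = rng.uniform(0.1, 1.0, 10) * rng.choice([-1, 1], 10)
bias = 0.3

def f_exact(x):
    # standard Gaussian-kernel SVM decision function: needs the raw SVs
    k = np.exp(-gamma * ((SV - x) ** 2).sum(1))
    return alpha_y @ k + bias

w = alpha_y @ z(SV)  # one aggregated weight vector; individual SVs not recoverable

def f_private(x):
    # privacy-preserving form: only (w, bias) is released
    return w @ z(x[None, :])[0] + bias

x = rng.standard_normal(d)
print(abs(f_exact(x) - f_private(x)))  # small approximation error
```

The same explicit-feature view is what Chapter 4 exploits for efficiency: since kernel values become dot products of low-dimensional features, a fast linear SVM solver can train the nonlinear classifier directly on z(x).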
In Chapter 4, we design an efficient SVM training algorithm based on the kernel approximation technique developed in Chapter 3. The kernel function brings powerful classification ability to the SVM, but it incurs additional computational cost in the training process; in contrast, much faster solvers exist for training the linear SVM. We capitalize on the kernel approximation technique to compute kernel evaluations as dot products of explicit low-dimensional features, thereby leveraging an efficient linear SVM solver to train a nonlinear kernel SVM. Besides being an efficient training scheme, this approach directly yields a privacy-preserving SVM classifier, i.e., a classifier that does not disclose any individual instance.

We conduct extensive experiments on the proposed methods. The results show that the privacy-preserving SVM outsourcing scheme, the privacy-preserving SVM classifier, and the kernel-approximation-based efficient SVM training scheme all achieve classification accuracy similar to a normal SVM classifier while attaining the properties of privacy preservation and efficiency.

Title: Efficient Data Classification with Privacy-Preservation (隱私保存的高效率資料分類方法)
Type: thesis
Keywords: data mining; classification; support vector machines; kernel function; privacy-preserving; outsourcing
Format: application/pdf, 5569230 bytes, en-US
Full text: http://ntur.lib.ntu.edu.tw/bitstream/246246/254126/1/ntu-100-F94921026-1.pdf