Relationship of Jaccard and edit distance in malware clustering and online identification (Extended abstract)

Dolev, S.; Ghanayim, M.; Binun, A.; Frenkel, S.; Sun, Y.S.

doi:10.1109/NCA.2017.8171380


Journal

2017 IEEE 16th International Symposium on Network Computing and Applications, NCA 2017

Journal Volume

2017-January

Pages

1-5

Date Issued

2017

Author(s)

Dolev, S.

Ghanayim, M.

Binun, A.

Frenkel, S.

YEALI SUN

DOI

10.1109/NCA.2017.8171380

URI

https://scholars.lib.ntu.edu.tw/handle/123456789/456051

URL

https://www.scopus.com/inward/record.uri?eid=2-s2.0-85046549864&doi=10.1109%2fNCA.2017.8171380&partnerID=40&md5=c12c42b2da487df48db166a0119f7568

Abstract

In this paper, we examine whether the well-known approximations of the Jaccard metric can be used to reduce the computational cost of estimating the Edit Distance metric. Our analytical results apply to the representing strings rather than to the original (raw) textual data; in practice, however, we obtained a solid indication that the results also apply to raw strings with few repeated n-grams. We formulate inequalities between the Jaccard metric and the Edit Distance that impose upper and lower bounds on the Edit Distance in terms of the Jaccard value. We validate these inequalities on strings of API call traces, where the (small) clusters obtained are refined by applying Edit Distance. Jaccard measures the similarity between two sets, whereas Edit Distance is a measure defined on two strings, such as traces of API calls. Creating n-grams and computing Jaccard similarity is much more efficient than computing Edit Distance (linear versus quadratic time complexity), so our new bounds on the Edit Distance given the Jaccard value are of practical interest. Another new aspect addressed in our research is the inherent imbalance between malicious and benign API traces harvested from the system, as most of the traces are benign. We performed clustering only on the malware traces, so that each cluster groups malware sharing some specific common essence. The resulting clustering is used with great success to classify new query traces as either benign or malware. The traces for our research were obtained from the KVM hypervisor Runtime Execution Introspection and Profiling (REIP) system, which is based on Virtual Machine Introspection (VMI) techniques and profiles hooked Windows API calls. © 2017 IEEE.
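
To make the two measures compared in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It computes an exact set-based Jaccard distance over n-grams (the paper refers to approximations of Jaccard, which are not shown here) and a standard dynamic-programming Edit Distance. The function names, the choice of bigrams, and the two short API traces are hypothetical and chosen only for illustration; the inequalities derived in the paper are not reproduced, since the abstract does not state them.

```python
def ngrams(seq, n=2):
    """Return the set of contiguous n-grams (as tuples) of a sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def jaccard_distance(a, b, n=2):
    """Exact Jaccard distance between the n-gram sets of two sequences.

    With hashed sets this takes expected time linear in the input lengths.
    """
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # two sequences without n-grams are treated as identical
    return 1.0 - len(grams_a & grams_b) / len(grams_a | grams_b)


def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical API call traces (illustrative only, not data from the paper).
trace_a = ["CreateFileW", "WriteFile", "CloseHandle", "RegOpenKeyExW"]
trace_b = ["CreateFileW", "ReadFile", "CloseHandle", "RegOpenKeyExW"]

print(jaccard_distance(trace_a, trace_b))  # bigram-set Jaccard distance
print(edit_distance(trace_a, trace_b))     # number of single-call edits
```

In a pipeline of the kind the abstract describes, the cheaper Jaccard computation would presumably be used first to form or shortlist clusters, with the quadratic Edit Distance reserved for refining the small clusters that remain.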

SDGs

SDG16

Other Subjects

Complex networks; Computer crime; Analytical results; Extended abstracts; Measure of similarities; On-line identification; Quadratic time; Run-time execution; Upper and lower bounds; Virtual machine introspection; Malware

Type

conference paper
