Study on Web Page Similarity Based on Layout Feature and Content Structure
Date Issued
2016
Author(s)
Lin, Chung-Yi
Abstract
In recent years, mobile devices have become the most common tool for accessing the Internet. In response to their proliferation, Ethan Marcotte proposed a design approach called Responsive Web Design (RWD) [36] to make mobile services user-friendly. According to Google statistics, 74 percent of mobile device users prefer user-friendly websites because of their readability on mobile devices. The Google Search Guide [31] also reports that a user-friendly website can improve its search ranking and attract users. However, 85 percent of websites are still not user-friendly, so their search rankings keep falling. The main obstacle is that websites are usually rebuilt manually, which is time-consuming and inefficient. A system that automatically selects an appropriate website template would significantly improve the efficiency of website rebuilding. The critical problem for such a system is how to select a template that offers more of the features its customer needs. The study in [2] indicates that drastic changes in the visual appearance of Web pages have a negative influence on readers. To avoid degrading the user experience, this thesis proposes a system that efficiently identifies templates that are similar in layout features and visual appearance. Our experimental results indicate that the approach is effective and finds similar templates precisely. This thesis focuses on issues related to the similarity of Web pages. Since a website consists of multiple Web pages, the proposed method can be extended to measure the similarity between “Websites”.
Subjects
DOM (Document Object Model) tree
Web page similarity
tree edit distance
Web page segmentation
visual information of web page layout
Type
thesis
File(s)
Name
ntu-105-R03525086-1.pdf
Size
23.54 KB
Format
Adobe PDF
Checksum
(MD5): 51554c182e158405287196ff6821eae1