An efficient crawling algorithm for large-scale real-time social stream data collection based on popularity prediction
Date Issued
2015
Date
2015
Author(s)
Chou, Shih-En
Abstract
Social media has greatly changed the way we communicate, and a huge amount of social behavior data is recorded and accumulated as a result. This data is now widely applied to many emerging research issues in combination with social behavior analysis. More recently, time-domain analysis has become especially popular for investigating behavior change: researchers take snapshots of a particular subject in the network at regular intervals, and hot messages (posts) urgently need to be snapshotted so that changes in user behavior over time can be learned precisely. Scraping social networking sites such as Twitter and Facebook is not an easy task for the data acquisition departments of most institutions, since these sites often have complex structures and also restrict the amount and frequency of data they release to common crawlers. To get more snapshots, groups often consume more computation power and network resources, and may even increase the load on OSN (Online Social Network) sites. In addition, current privacy control policies do not allow different groups to share data with one another. These constraints make it challenging for an individual research group to collect sufficient data, whether by using existing crawl scheduling algorithms or by collaborating with other partners. In this paper, we propose the “Novel Crawling Ordering Algorithm”, which allows our crawlers to focus on popular content by collecting and analyzing user behaviors. The designed crawler can also handle large-scale vertical crawling and dynamic web pages. The performance of our crawling ordering algorithm is evaluated with a set of designed metrics, and the experimental results show that the algorithm can save up to 40% of requests when crawling the top 99.5% of popular social streams.
Subjects
Social Network
Crawler Design
Information Retrieval
Behavior Analysis
Type
thesis
File(s)
Name
ntu-104-R01525052-1.pdf
Size
23.54 KB
Format
Adobe PDF
Checksum
(MD5): 6ca23f25de6c218c44e91aa1aa6737c9