Profile-Based Focused Crawler for Social Media-Sharing Websites

Zhang, Zhiyong; Nasraoui, Olfa

doi:10.1109/ictai.2008.119

Cited by 9 publications

(11 citation statements)

References 11 publications

“…Web 2.0. The advent of the user-generated content philosophy and the participatory culture that was brought by Web 2.0 sites such as blogs, forums and social media, formed a new generation of specialized crawlers that focused on forum [29][30][31][32][33][34], blog/microblog [35,36], and social media [37][38][39][40] spidering. The need for specialized crawlers for these websites emerged from the quality and creation rate of content usually found in forums/blogs, the well-defined structure that is inherent in forums/blogs that makes it possible to even develop frameworks for creating blog crawlers [41], and the implementation particularities that make other types of crawlers inappropriate or inefficient for the task.…”

Section: Usage Typology (mentioning)

confidence: 99%

inTIME: A Machine Learning-Based Framework for Gathering and Leveraging Web Data to Cyber-Threat Intelligence

et al. 2021

In today’s world, technology has become deep-rooted and more accessible than ever over a plethora of different devices and platforms, ranging from company servers and commodity PCs to mobile phones and wearables, interconnecting a wide range of stakeholders such as households, organizations and critical infrastructures. The sheer volume and variety of the different operating systems, the device particularities, the various usage domains and the accessibility-ready nature of the platforms create a vast and complex threat landscape that is difficult to contain. Staying on top of these evolving cyber-threats has become an increasingly difficult task that presently relies heavily on collecting and utilising cyber-threat intelligence before an attack (or at least shortly after, to minimize the damage) and entails the collection, analysis, leveraging and sharing of huge volumes of data. In this work, we put forward inTIME, a machine learning-based integrated framework that provides a holistic view of the cyber-threat intelligence process and allows security analysts to easily identify, collect, analyse, extract, integrate, and share cyber-threat intelligence from a wide variety of online sources including clear/deep/dark web sites, forums and marketplaces, popular social networks, trusted structured sources (e.g., known security databases), or other datastore types (e.g., pastebins).
inTIME is a zero-administration, open-source, integrated framework that enables security analysts and security stakeholders to (i) easily deploy a wide variety of data acquisition services (such as focused web crawlers, site scrapers, domain downloaders, social media monitors), (ii) automatically rank the collected content according to its potential to contain useful intelligence, (iii) identify and extract cyber-threat intelligence and security artifacts via automated natural language understanding processes, (iv) leverage the identified intelligence to actionable items by semi-automatic entity disambiguation, linkage and correlation, and (v) manage, share or collaborate on the stored intelligence via open standards and intuitive tools. To the best of our knowledge, this is the first solution in the literature to provide an end-to-end cyber-threat intelligence management platform that is able to support the complete threat lifecycle via an integrated, simple-to-use, yet extensible framework.

“…To this end, we proposed a DOM path string-based method for page classification that was reported elsewhere [12]. This paper is organized as follows.…”

Section: Introduction (mentioning)

confidence: 99%

Exploiting Tags and Social Profiles to Improve Focused Crawling

Zhang, Nasraoui, Zwol 2009

2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology

Self Cite

Recent years have transformed the Web from a Web of content to a Web of applications and social content. Thus, it has become crucial to be able to tap into this social aspect of the Web whenever possible, in addition to its content, particularly for focused crawling. In this paper, we present a novel profile-based focused crawling system for dealing with the increasingly popular social media-sharing web sites without assuming any privileged access to the internal private databases of such websites, nor any requirement for the existence of APIs for the extraction of social data. Our experiments prove the robustness of our profile-based focused crawler, as well as a significant improvement in harvest ratio, compared to breadth-first and OPIC crawlers, when crawling the Flickr web site for two different topics.

“…Knowing this structure enables a crawler to prioritize some types of pages (e.g., recent user-generated content) over others, or to spread its effort evenly to obtain a representative sample [4,6,15], or to avoid downloading the same page via different URLs [26]. Information about how the site is organized is provided manually, recognized by heuristics, or learned by recognizing consistent patterns in the site [4,13,15,25,28].…”

Section: Introduction (mentioning)

confidence: 99%

A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites

Xu, Gao, Callan 2018

Preprint

Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structures to learn how to crawl a social media site efficiently. SOUrCe consists of two stages. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages based on their structural similarity and generates a navigation table that describes how the different types of pages in the site are linked together. During its harvesting phase, it uses the navigation table and a crawling policy to guide the choice of which links to crawl next. Experiments show that this architecture supports different styles of crawling efficiently, and does a better job of staying focused on user-created contents than baseline methods.
