Focused Crawl of Web Archives to Build Event Collections

Klein, Martin; Balakireva, Lyudmila; Sompel, Herbert Van de

doi:10.1145/3201064.3201085

Cited by 15 publications

(9 citation statements)

References 16 publications

“…The authors of [14] evaluated the feasibility of utilizing topical crawlers for building event collections on the existing web archives to reduce the time of the crawling process. They reported good results mostly for events that happened in the past.…”

Section: Topical Crawling Methods (mentioning)

confidence: 99%

Cost-Sensitive Topical Data Acquisition From the Web

Naghibi, Anvari, Forghani, et al. 2019

IJDKP

The cost of acquiring training data instances for the induction of data mining models is one of the main concerns in real-world problems. The web is a comprehensive source of many types of data that can be used for data mining tasks, but its distributed and dynamic nature dictates the use of solutions that can handle these characteristics. In this paper, we introduce an automatic method for topical data acquisition from the web. We propose a new type of topical crawler that uses a hybrid link context extraction method to acquire on-topic web pages with minimum bandwidth usage and at the lowest cost. The new link context extraction method, called Block Text Window (BTW), combines a text window method with a block-based method and overcomes the challenges of each method using the advantages of the other. Experimental results show that BTW outperforms state-of-the-art automatic topical web data acquisition methods on standard metrics.

“…The problem of knowing what to collect from the web has also been treated in the digital library research community as a focused crawling problem. In focused crawling the goal is to collect content about particular topics (Risse et al., 2012), events (Klein, Balakireva, & Van de Sompel, 2018; Yang, Chitturi, Wilson, Magdy, & Fox, 2012), or to collect content that has a particular characteristic such as popularity (Page, Brin, Motwani, & Winograd, 1999), importance (Baeza-Yates, Marin, Castillo, & Rodriguez, 2005), or social engagement (Gossen, Demidova, & Risse, 2015; Milligan, Ruest, & Lin, 2016; Nwala, Weigle, & Nelson, 2018). Generally speaking, these approaches take the focus to be a topic, event, person, or organization that can be qualified by the types of media (documents, audio, video).…”

Section: Digital Libraries (mentioning)

confidence: 99%

Appraisal Practices in Web Archives

Summers

2019

Preprint

This paper explores the art and science of deciding what web archives collect by reviewing the literature of archival appraisal through the theoretical lens of Science and Technology Studies. I suggest that our anxieties around what web archives remember and forget get embodied in dreams (and nightmares) of Big Data and The Cloud. These notions are best understood by attending to the specific material practices of people working with memory and machines. The disciplinary perspective of software studies can provide insight into how these material practices of appraisal operate in response to, and outside of, traditional conceptions of the archive, and also as an instrument of governmentality.

“…Most focused crawling is performed on the live Web. Unfortunately, the live Web is plagued by link rot and content drift. Consequently, Klein et al. [19] demonstrated that focused crawling on the archived Web results in more relevant collections than focused crawling on the live Web, for events that occurred in the distant past. Additionally, Klein et al. proposed extracting seeds from external references contained in the Wikipedia page of an event.…”

Section: Related Work (mentioning)

confidence: 99%

“…In some other cases, archived collections are initiated months or years after the precipitating event. This could have serious consequences since Web archive collections that start late could omit webpages that address the early stages of events [19,24]. Consequently, it is important to start collecting seeds for Web archive collections early.…”

Section: Introduction (mentioning)

confidence: 99%

Using Micro-Collections in Social Media to Generate Seeds for Web Archive Collections

Nwala, Weigle, Nelson

2019

2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)

In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events ranging from elections to disease outbreaks. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high quality seeds by removing non-relevant URIs and adding URIs from credible and authoritative sources, but it is time consuming to collect these seeds. Two main strategies adopted by curators for discovering seeds include scraping Web (e.g., Google) Search Engine Result Pages (SERPs) and social media (e.g., Twitter) SERPs. In this work, we studied three social media platforms in order to provide insight on the characteristics of seeds generated from different sources. First, we developed a simple vocabulary for describing social media posts across different platforms. Second, we introduced a novel source for generating seeds from URIs in the threaded conversations of social media posts created by single or multiple users. Users on social media sites routinely create and share posts about news events consisting of hand-selected URIs of news stories, tweets, videos, etc. In this work, we call these posts micro-collections, and we consider them as an important source for seeds because the effort taken to create micro-collections is an indication of editorial activity, and a demonstration of domain expertise. Third, we generated 23,112 seed collections with text and hashtag queries from 449,347 social media posts from Reddit, Twitter, and Scoop.it. We collected in total 120,444 URIs from the conventional scraped SERP posts and micro-collections. We characterized the resultant seed collections across multiple dimensions including the distribution of URIs, precision, ages, diversity of webpages, etc. We showed that seeds generated by scraping SERPs had a higher median probability (0.63) of producing relevant URIs than micro-collections (0.5). 
However, micro-collections were more likely to produce seeds with a higher precision than conventional SERP collections for Twitter collections generated with hashtags. Also, micro-collections were more likely to produce older webpages and more non-HTML documents.

