Extracting Event-Centric Document Collections from Large-Scale Web Archives

Gossen, Gerhard; Demidova, Elena; Risse, Thomas

doi:10.1007/978-3-319-67008-9_10

Cited by 13 publications

(34 citation statements)

References 22 publications

(27 reference statements)

“…In our previous work we developed methods to infer missing categorical information in noisy and sparse Web markup data [31], increasing usefulness of this data for event-centric applications. Furthermore, we consider event-centric focused crawls from the Web [7] and Web archives [8], [9], as well as Twitter data regarding events and traffic. Another source of event-centric information is the recently proposed EventKG knowledge graph [10,11].…”

Section: Integration Of Web-based Mobility Data (mentioning)

confidence: 99%

Data4UrbanMobility: Towards Holistic Data Analytics for Mobility Applications in Urban Regions

Tempelmeier, Rietz, Lishchuk, et al. 2019

Companion Proceedings of the 2019 World Wide Web Conference

Self Cite

With the increasing availability of mobility-related data, such as GPS traces, Web queries and climate conditions, there is a growing demand to utilize this data to better understand and support urban mobility needs. However, data available from the individual actors, such as providers of information, navigation and transportation systems, is mostly restricted to isolated mobility modes, whereas holistic data analytics over integrated data sources is not sufficiently supported. In this paper we present our ongoing research in the context of holistic data analytics to support urban mobility applications in the Data4UrbanMobility (D4UM) project. First, we discuss challenges in urban mobility analytics and present the D4UM platform we are currently developing to facilitate holistic urban data analytics over integrated heterogeneous data sources along with the available data sources. Second, we present the MiC app, a tool we developed to complement available datasets with intermodal mobility data (i.e., data about journeys that involve more than one mode of mobility) using a citizen science approach. Finally, we present selected use cases and discuss our future work.

“…However, in this work, we leverage the judgment of humans on social media. In a similar work, Gossen et al [13] adapted some portions of the topic and event focused sub-collection in a method to extract event-centric documents from Web archives based on a specialized focused extraction algorithm. They defined two broad kinds of events based on time: planned and unexpected.…”

Section: Acronym Post Count Author Count (mentioning)

confidence: 99%

“…Therefore, the choice of queries was not arbitrary. Instead, we developed a temporal classification system (partly informed by Gossen et al [13]) of real world stories and events based on three temporal ( For the expectation attribute, an event may be expected or unexpected. For example, the Ebola outbreak event was unexpected.…”

Section: Topic Selection (mentioning)

confidence: 99%

Using Micro-Collections in Social Media to Generate Seeds for Web Archive Collections

Nwala, Weigle, Nelson 2019

2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)

In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events ranging from elections to disease outbreaks. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high quality seeds by removing non-relevant URIs and adding URIs from credible and authoritative sources, but it is time-consuming to collect these seeds. Two main strategies adopted by curators for discovering seeds include scraping Web (e.g., Google) Search Engine Result Pages (SERPs) and social media (e.g., Twitter) SERPs. In this work, we studied three social media platforms in order to provide insight on the characteristics of seeds generated from different sources. First, we developed a simple vocabulary for describing social media posts across different platforms. Second, we introduced a novel source for generating seeds from URIs in the threaded conversations of social media posts created by single or multiple users. Users on social media sites routinely create and share posts about news events consisting of hand-selected URIs of news stories, tweets, videos, etc. In this work, we call these posts micro-collections, and we consider them as an important source for seeds because the effort taken to create micro-collections is an indication of editorial activity, and a demonstration of domain expertise. Third, we generated 23,112 seed collections with text and hashtag queries from 449,347 social media posts from Reddit, Twitter, and Scoop.it. We collected in total 120,444 URIs from the conventional scraped SERP posts and micro-collections. We characterized the resultant seed collections across multiple dimensions including the distribution of URIs, precision, ages, diversity of webpages, etc. We showed that seeds generated by scraping SERPs had a higher median probability (0.63) of producing relevant URIs than micro-collections (0.5). However, micro-collections were more likely to produce seeds with a higher precision than conventional SERP collections for Twitter collections generated with hashtags. Also, micro-collections were more likely to produce older webpages and more non-HTML documents.

“…Previous work by Gossen et al [9] inspired this work. They developed a focused extraction (not web crawling) system to create event-centric collections from a large static archival collection stored on a server under their control.…”

Section: Related Work (mentioning)

confidence: 99%

“…Moreover, to the best of our knowledge, focused crawling across web archives has never been attempted. Inspired by previous work by Gossen et al [9], in this paper, we present a framework to build event-specific collections by focused crawling of web archives. We utilize the Memento protocol [20] and the associated cross-web-archive infrastructure [3] to crawl Mementos in 22 web archives.…”

Section: Introduction (mentioning)

confidence: 99%

Focused Crawl of Web Archives to Build Event Collections

Klein, Balakireva, Sompel 2018

Proceedings of the 10th ACM Conference on Web Science

Event collections are frequently built by crawling the live web on the basis of seed URIs nominated by human experts. Focused web crawling is a technique where the crawler is guided by reference content pertaining to the event. Given the dynamic nature of the web and the pace with which topics evolve, the timing of the crawl is a concern for both approaches. We investigate the feasibility of performing focused crawls on the archived web. By utilizing the Memento infrastructure, we obtain resources from 22 web archives that contribute to building event collections. We create collections on four events and compare the relevance of their resources to collections built from crawling the live web as well as from a manually curated collection. Our results show that focused crawling on the archived web can be done and indeed results in highly relevant collections, especially for events that happened further in the past.
