Web archive profiling through CDX summarization

Alam, Sawood; Nelson, Michael L.; Van de Sompel, Herbert; Balakireva, Lyudmila; Shankar, Harihar; Rosenthal, David S. H.

doi:10.1007/s00799-016-0184-4

Cited by 19 publications

(24 citation statements)

References 15 publications


“…In previous work [8,9], we explored the middle ground where archive profiles are neither as minimal as storing just the TLD (which results in many false positives) nor as detailed as collecting every URI-R present in every archive (which goes stale very quickly and is difficult to maintain). We first defined various profiling policies, summarized CDX files according to those policies, evaluated associated costs and benefits, and prepared gold standard datasets [8,9]. In our experiments, we correctly identified about 78% of the URIs that were or were not present in the archive with less than 1% relative cost as compared to the complete knowledge profile and identified 94% URIs with less than 10% relative cost without any false negatives.…”

Section: Related Work (mentioning)

confidence: 99%
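The trade-off described in this statement, between a TLD-only profile and a full listing of every URI-R, can be illustrated with a minimal sketch. The function names, the reversed-host key format, and the `host_segments` parameter are assumptions made for this illustration; they are not the profiling policies defined in the cited papers.

```python
from urllib.parse import urlsplit


def profile_key(uri, host_segments):
    """Reduce a URI to a reversed-host key truncated to `host_segments` labels.

    Hypothetical key format for illustration only, e.g. "org,example,news".
    """
    host = (urlsplit(uri).hostname or "").lower()
    labels = host.split(".")[::-1]  # TLD first
    return ",".join(labels[:host_segments])


def build_profile(uris, host_segments):
    """Summarize an archive's URIs as the set of keys at one level of detail."""
    return {profile_key(u, host_segments) for u in uris}


def maybe_present(profile, uri, host_segments):
    """True if the profile cannot rule the URI out (may be a false positive)."""
    return profile_key(uri, host_segments) in profile


archived = ["https://example.com/a", "https://news.example.org/b"]
tld_profile = build_profile(archived, 1)   # coarse: TLDs only
host_profile = build_profile(archived, 3)  # finer: up to three host labels
```

With these sample URIs, the TLD-only profile cannot rule out `https://other.com/x` (a false positive), while the finer profile can. Neither profile rejects a URI whose host is actually archived, which is why coarser keys cost less but route more requests unnecessarily.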

“…We were able to make routing decisions of 80% of the requests correctly while maintaining about 90% Recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile. MementoMap is a continuation of this effort to make it more flexible and portable by eliminating the need for rigid profiling policies we defined earlier [8,9] (which are still good for baseline evaluation purposes) and replacing them with an adaptive approach in which the level of detail is dynamically controlled with a number of parameters.…” [Footnote 6 in the citing paper: https://groups.google.com/forum/#!topic/memento-dev/YE4rt6L5ICg]

Section: Related Work (mentioning)

confidence: 99%

“…Table 4 and Figure 5 summarize the distribution of URI-Ms over URI-Rs in Arquivo.pt. Almost 2M unique URI-Rs in Arquivo.pt have an average of 2.46 mementos per URI-R (γ value [9]), but this distribution is not uniform. The top 30% URI-Rs account for 70% of the mementos, for a Gini Coefficient of 0.42 [41].…”

Section: Archived Vs Accessed Resources (mentioning)

confidence: 99%
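The two statistics quoted in this statement, the mean number of mementos per URI-R and the Gini coefficient of their distribution, can be computed as in the following sketch. It uses the standard rank-weighted formula for the Gini coefficient of a finite sample; whether the citing paper used exactly this estimator is an assumption, and the function names are hypothetical.

```python
def mean_mementos(counts):
    """Average number of mementos per URI-R, given one count per URI-R."""
    return sum(counts) / len(counts)


def gini(counts):
    """Gini coefficient of a list of non-negative counts.

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    where x_i are the counts sorted ascending and i starts at 1.
    """
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

A perfectly uniform distribution such as `[1, 1, 1, 1]` gives a coefficient of 0, while a distribution in which one URI-R holds every memento approaches 1 as the number of URI-Rs grows.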


MementoMap Framework for Flexible and Adaptive Web Archive Profiling

Alam, Weigle, Nelson, et al. 2019

2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)

Self Cite


In this work we propose MementoMap, a flexible and adaptive framework to efficiently summarize holdings of a web archive. We described a simple, yet extensible, file format suitable for MementoMap. We used the complete index of Arquivo.pt, comprising 5B mementos (archived web pages/files), to understand the nature and shape of its holdings. We generated MementoMaps with varying amounts of detail from its HTML pages that have an HTTP status code of 200 OK. Additionally, we designed a single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a small one and an in-file binary search method for efficient lookup. We analyzed more than three years of MemGator (a Memento aggregator) logs to understand the response behavior of 14 public web archives. We evaluated MementoMaps by measuring their Accuracy using 3.3M unique URIs from MemGator logs. We found that a MementoMap of less than 1.5% Relative Cost (as compared to the comprehensive listing of all the unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive while maintaining 100% Recall (i.e., zero false negatives).
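The abstract mentions an in-file binary search for lookup. The sketch below shows the general technique on a sorted text file, without loading the file into memory. It assumes a hypothetical layout of one `key value` pair per line, sorted by the byte order of the key, and it performs exact-match lookup only; it is not the MementoMap file format or the authors' lookup algorithm.

```python
import os


def _line_at_or_after(f, pos):
    """Return the first complete line starting at byte offset >= pos (b"" if none)."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # consume the remainder of the line containing byte pos-1
    return f.readline()


def lookup(path, key):
    """Binary-search a sorted 'key value' file on disk; return the value or None."""
    target = key.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.fstat(f.fileno()).st_size
        # Find the smallest offset whose following line has a key >= target.
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if not line or line.split(b" ", 1)[0] >= target:
                hi = mid
            else:
                lo = mid + 1
        line = _line_at_or_after(f, lo)
    fields = line.rstrip(b"\n").split(b" ", 1)
    if len(fields) == 2 and fields[0] == target:
        return fields[1].decode("utf-8")
    return None
```

Each probe costs one seek and at most two line reads, so a lookup touches a number of lines logarithmic in the file size rather than scanning the whole file.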



“…This prevents simply referencing all aggregated archives' CDX files for a URI to determine the non-redirecting count of mementos. In this work, we utilize the aggregated holdings of multiple Web archives as well as the CDXJ format [2], an extension of CDX. MemGator's CDXJ generation is derived from the archives' Memento endpoints, specifically their Link formatted TimeMaps, and transformed into the CDXJ format that allows quicker, more reliable parsing of the datetime that the included URI-Ms represent.…”

Section: Archival Indexing (mentioning)

confidence: 99%

“…We leveraged MemGator's CDXJ [2] interface (example output in Figure 3) for simple datetime extraction, structured JSON-formatted metadata of each memento's attributes, and more human readable output compared to the conventional Link (Figure 1) or JSON formatted TimeMaps. Collection was run on a late 2013 MacBook Pro running OS X version 10.11.4 with a 2.4 GHz Intel i5 processor, 8 GB of RAM, and a 250 GB SSD disk.…”

Section: Data Collection (mentioning)

confidence: 99%

Impact of URI Canonicalization on Memento Count

Kelly, Alkwai, Nelson, et al. 2017

2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)

Self Cite


Quantifying the captures of a URI over time is useful for researchers to identify the extent to which a Web page has been archived. Memento TimeMaps provide a format to list mementos (URI-Ms) for captures along with brief metadata, like Memento-Datetime, for each URI-M. However, when some URI-Ms are dereferenced, they simply provide a redirect to a different URI-M (instead of a unique representation at the datetime), often also present in the TimeMap. This implies that confidently obtaining an accurate count quantifying the number of non-forwarding captures for a URI-R is not possible using a TimeMap alone and that the magnitude of a TimeMap is not equivalent to the number of representations it identifies. In this work we discuss this particular phenomenon in depth. We also perform a breakdown of the dynamics of counting mementos for a particular URI-R (google.com) and quantify the prevalence of the various canonicalization patterns that exacerbate attempts at counting using only a TimeMap. For google.com we found that 84.9% of the URI-Ms result in an HTTP redirect when dereferenced. We expand on and apply this metric to TimeMaps for seven other URI-Rs of large Web sites and thirteen academic institutions. Using a ratio metric DI for the number of URI-Ms without redirects to those requiring a redirect when dereferenced, five of the eight large web sites' and two of the thirteen academic institutions' TimeMaps had a ratio of less than one, indicating that more than half of the URI-Ms in these TimeMaps result in redirects when dereferenced.
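The ratio metric described in this abstract can be worked through with the one figure it reports. The sketch below takes the abstract's wording literally (URI-Ms without redirects divided by URI-Ms requiring a redirect); the function name is hypothetical, and the exact definition of DI in the paper may differ in detail.

```python
def redirect_ratio(non_redirecting, redirecting):
    """Ratio of URI-Ms served directly to URI-Ms that redirect when dereferenced."""
    if redirecting == 0:
        return float("inf")  # no redirects at all
    return non_redirecting / redirecting


# google.com: 84.9% of URI-Ms redirect, so 15.1% do not.
google_ratio = redirect_ratio(100 - 84.9, 84.9)  # roughly 0.18
```

A ratio below one means redirecting URI-Ms outnumber non-redirecting ones, which is the condition the abstract reports for five of the eight large web sites.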


The Past Web

2021


Table of Contents

0.1 Dedication
0.2 Foreword
0.3 Preface

Part 1 The era of information abundance and memory scarcity
Chapter 1.0 Part introduction
Chapter 1.1 The problem of web ephemera
Chapter 1.2 Web archives preserve our digital collective memory

Part 2 Collecting before it vanishes
Chapter 2.0 Part introduction
Chapter 2.1 National web archiving in Australia - representing the comprehensive
Chapter 2.2 Web Archiving Singapore: The Realities of National Web Archiving
Chapter 2.3 Archiving the social media - the Twitter case
Chapter 2.4 Creating Event-Centric Collections from Web Archives

Part 3 Access methods to analyse the Past web
Chapter 3.0 Part introduction
Chapter 3.1 Full-text and URL search
Chapter 3.2 A holistic view on Web archives
Chapter 3.3 Interoperability for Accessing Versions of Web Resources with the Memento Protocol
Chapter 3.4 Linking Twitter archives with TV archives
Chapter 3.5 Image analytics in web archives

Part 4 Researching the past Web
Chapter 4.0 Part introduction
Chapter 4.1 Digital archaeology in the web of links: reconstructing a late-90s web sphere
Chapter 4.2 Quantitative approaches to the Danish web archive
Chapter 4.3 Critical Web Archive Research
Chapter 4.4 Exploring Online Diasporas: London's French and Latin American Communities in the UK Web Archive
Chapter 4.5 Platform and app histories: Assessing source availability in web archives and app repositories

Part 5 Web archives as infrastructures to develop innovative tools
Chapter 5.0 Part introduction
Chapter 5.1 The need for infrastructures for the study of web-archived material
Chapter 5.2 Automatic generation of timelines for past events
Chapter 5.3 Political opinions of the past Web
Chapter 5.4 Framing web archives with browsers contemporary to a website's creation
Chapter 5.5 Big Data analytics over past web data

Part 6 The Past Web: a look into the future

This book is dedicated to Vitalino Gomes who taught me that the value of a Man is in his Integrity.

