2016
DOI: 10.1007/s00799-016-0184-4

Web archive profiling through CDX summarization

Cited by 19 publications (24 citation statements)
References 15 publications
“…In previous work [8,9], we explored the middle ground where archive profiles are neither as minimal as storing just the TLD (which results in many false positives) nor as detailed as collecting every URI-R present in every archive (which goes stale very quickly and is difficult to maintain). We first defined various profiling policies, summarized CDX files according to those policies, evaluated associated costs and benefits, and prepared gold standard datasets [8,9]. In our experiments, we correctly identified about 78% of the URIs that were or were not present in the archive with less than 1% relative cost as compared to the complete knowledge profile and identified 94% URIs with less than 10% relative cost without any false negatives.…”
Section: Related Work (mentioning)
confidence: 99%
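The idea of a profile that sits between "TLD only" and "every URI-R" can be illustrated with a minimal sketch. It assumes a policy is defined by how many reversed host labels and how many path segments are kept per key; the function names, parameters, and SURT-like key layout here are hypothetical and are not taken from the cited paper.

```python
from collections import Counter
from urllib.parse import urlsplit


def profile_key(uri, host_segments=None, path_segments=0):
    """Build a SURT-like key, truncated according to a hypothetical policy.

    host_segments: number of reversed host labels to keep (None keeps all).
    path_segments: number of leading path segments to keep.
    """
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    labels = (parts.hostname or "").lower().split(".")[::-1]  # TLD first
    if host_segments is not None:
        labels = labels[:host_segments]
    segments = [s for s in parts.path.split("/") if s][:path_segments]
    return ",".join(labels) + ")/" + "/".join(segments)


def summarize(uris, host_segments=None, path_segments=0):
    """Count how many URI-Rs fall under each truncated key."""
    return Counter(profile_key(u, host_segments, path_segments) for u in uris)


if __name__ == "__main__":
    sample = [
        "http://example.com/a/b",
        "http://example.com/a/c",
        "http://news.example.org/x",
    ]
    # Coarse profile: TLD only (small, but prone to false positives).
    print(summarize(sample, host_segments=1))
    # Finer profile: registered domain plus one path segment.
    print(summarize(sample, host_segments=2, path_segments=1))
```

Keeping fewer segments yields a smaller profile that matches more lookups it should not; keeping more segments moves toward the complete-knowledge profile and its maintenance cost.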
“…We were able to make routing decisions of 80% of the requests correctly while maintaining about 90% Recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile. MementoMap is a continuation of this effort to make it more flexible and portable by eliminating the need for rigid profiling policies we defined earlier [8,9] (which are still good for baseline evaluation purposes) and replacing them with an adaptive approach in which the level of detail is dynamically controlled with a number of parameters. [Footnote 6: https://groups.google.com/forum/#!topic/memento-dev/YE4rt6L5ICg]…”
Section: Related Work (mentioning)
confidence: 99%
“…This prevents simply referencing all aggregated archives' CDX files for a URI to determine the non-redirecting count of mementos. In this work, we utilize the aggregated holdings of multiple Web archives as well as the CDXJ format [2], an extension of CDX. MemGator's CDXJ generation is derived from the archives' Memento endpoints, specifically their Link formatted TimeMaps, and transformed into the CDXJ format that allows quicker, more reliable parsing of the datetime that the included URI-Ms represent.…”
Section: Archival Indexing (mentioning)
confidence: 99%
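The transformation from a Link-formatted TimeMap to CDXJ-style lines described in this statement can be sketched roughly as follows. This is an illustration only, not MemGator's implementation: the function name, the regular expressions, and the exact JSON field names are assumptions, and the example archive host is made up.

```python
import json
import re
from email.utils import parsedate_to_datetime

# One Link entry: <URI> followed by its ;-separated parameters.
LINK_ENTRY = re.compile(r'<(?P<uri>[^>]+)>(?P<params>[^<]*)')
PARAM = re.compile(r'(\w+)="([^"]*)"')


def link_to_cdxj(timemap):
    """Turn memento entries of a Link TimeMap into sorted CDXJ-like lines.

    Each output line is a 14-digit datetime key, a space, and a JSON object.
    """
    lines = []
    for match in LINK_ENTRY.finditer(timemap):
        attrs = dict(PARAM.findall(match.group("params")))
        rel = attrs.get("rel", "")
        # Skip non-memento links (original, timegate, self, ...).
        if "memento" not in rel.split() or "datetime" not in attrs:
            continue
        key = parsedate_to_datetime(attrs["datetime"]).strftime("%Y%m%d%H%M%S")
        record = {"uri": match.group("uri"), "rel": rel,
                  "datetime": attrs["datetime"]}
        lines.append(f"{key} {json.dumps(record)}")
    return sorted(lines)  # a fixed-width key sorts chronologically


if __name__ == "__main__":
    timemap = (
        '<http://example.com/>; rel="original", '
        '<http://archive.example/20160102030405/http://example.com/>; '
        'rel="memento"; datetime="Sat, 02 Jan 2016 03:04:05 GMT"'
    )
    for line in link_to_cdxj(timemap):
        print(line)
```

Putting the datetime first as a fixed-width key is what makes the lines easy to sort and the datetime easy to extract without parsing an RFC 1123 date string on every read.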
“…We leveraged MemGator's CDXJ [2] interface (example output in Figure 3) for simple datetime extraction, structured JSON-formatted metadata of each memento's attributes, and more human readable output compared to the conventional Link (Figure 1) or JSON formatted TimeMaps. Collection was run on a late 2013 MacBook Pro running OS X version 10.11.4 with a 2.4 GHz Intel i5 processor, 8 GB of RAM, and a 250 GB SSD disk.…”
Section: Data Collection (mentioning)
confidence: 99%