2022
DOI: 10.1080/23257962.2022.2100336

Creating order from the mess: web archive derivative datasets and notebooks

Cited by 3 publications
(5 citation statements)
References 13 publications
“…Together these examples demonstrate the significant work required—conceptual re-framing as well as human and technical resources—to re-order WARC data to accommodate researchers’ information needs. I contrast my findings with recent work that positions the WARC as an “unaccessioned collection,” analogous to a “shipping container full of banker boxes holding dozens of jumbled records” which has not yet been processed by archivists (Ruest et al, 2022: 2). I argue here that WARC data is in fact an organized set of records, but ordered according to the logic of the crawler rather than the logics of human users or curators.…”
Section: Discussion: Materials Complications For Data As Collections
Citation type: contrasting
confidence: 62%
“…I examine here how the WARC file format has come to play an essential role in web archiving, and the actors driving its development as a standard. While prior work briefly addresses the WARC format's structure and resulting challenges for research (Ruest et al, 2022), my investigation here traces both the format's origins and undertakes a close reading of its design. By drawing parallels to the case of the MVZ, I identify the WARC's role in standardizing collection methods for the web archiving community.…”
Section: Deconstructing the Warc
Citation type: mentioning
confidence: 99%
“…Some organizations like the Library of Congress [9] and the UK Web Archive [15] have found a way to offer more access to web archives without providing the primary datasets. These organizations offer access to derivative datasets, which are made of sampled data taken from a web archive [12]. Derivative datasets can include things like a collection of PDF files from a web archive, a Geoindex, or crawled URL indexes.…”
Section: Related Work
Citation type: mentioning
confidence: 99%