2023
DOI: 10.1177/20539517231163172

All WARC and no playback: The materialities of data-centered web archives research

Abstract: This paper examines the Web ARChive (WARC) file format, revealing how the format has come to play a central role in the development and standardization of interoperable tools and methods for the international web archiving community. In the context of emerging big data approaches, I consider the sociotechnical relationships between material construction of data and information infrastructures for collecting and research. Analysis is inspired by Star and Griesemer's historical case of the Museum of Vertebrate Z…

Cited by 6 publications (3 citation statements)
References 37 publications (48 reference statements)
“…The material analysis reveals several ordering principles embedded in web archiving's central artefacts, demonstrating how processes of crawling and web-resource discovery are rooted in a vision of the web which has not yet accounted for the configurations of actors and materials that comprise proprietary platforms. For instance, a core finding (as explored in greater depth in Maemura 2023) is that the WARC's design pre-configures the URL as the central object of analysis for work with web archives. This design choice must also be read in historical context, since the WARC was developed alongside the Heritrix web crawler, whose development dates to 2004, and is based on older crawler technology from web analytics company Alexa Internet, which was founded by Internet Archive's Brewster Kahle in 1995.…”
Section: Emily Maemura University Of Illinois Urbana-Champaign (mentioning)
confidence: 99%
“…While Alexa Internet's crawlers were instrumental to support URL-based discovery and analytics for a Web that had evolved beyond the use of directory listings, that vision of web discovery and use is inherently tied to a specific period of the Web. While the Web has continued to evolve through periods of platformisation, and an emerging era of decentralisation, the older logics of discovery continue to persist in the foundational tools and technologies of web archiving today: examining the WARC as a data artefact reveals how it is ordered according to the logic of the crawler, and does not capture the broader range of curation choices and decisions made by archivists (Maemura 2023). In effect, the core tools, workflows, and practices of institutional web archiving programs have inherited logics of material ordering from a 'pre-platform' web.…”
Section: Emily Maemura University Of Illinois Urbana-Champaign (mentioning)
confidence: 99%
“…Despite the multiple software implementations, the Wayback Machine model of archival replay, using WARC files as input, is the predominant model in public web archives; of the 17 web archives listed in Table 1, only archive.is does not use WARC files and Wayback Machine modalities. While the Wayback Machine model of replay has clearly been successful, researchers have begun to consider how the standardization of WARC files and the Wayback Machine model itself have shaped the field of web archiving (e.g., [53,54]). One of the key assumptions in the Wayback Machine model is that HTTP responses are stored in WARC files, and then replayed through the web archive with a mixture of client-side and server-side transformations to both recreate the past (e.g., rewrite links to point back into the web archive) and "brand" the replayed content as the past web (e.g., archival banners).…”
Section: Extract Any New Uris That (mentioning)
confidence: 99%
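
The server-side link rewriting mentioned in the last citation statement can be illustrated with a minimal sketch. It assumes the common Wayback-style URL pattern of archive prefix, capture timestamp, then original URL. The archive host `archive.example.org` and the function name `rewrite_links` are hypothetical, chosen only for illustration, and the regular expression handles only quoted `href` and `src` attributes rather than everything a production replay system would rewrite.

```python
import re
from urllib.parse import urljoin

# Hypothetical archive prefix; real web archives use their own hosts and paths.
ARCHIVE_PREFIX = "https://archive.example.org/web"


def rewrite_links(html: str, base_url: str, timestamp: str) -> str:
    """Point quoted href/src attributes back into the (hypothetical) archive.

    base_url is the original URL of the captured page, and timestamp is the
    14-digit capture time (YYYYMMDDhhmmss) used in Wayback-style URLs.
    """

    def to_archive_url(match: re.Match) -> str:
        attr, quote, target = match.group(1), match.group(2), match.group(3)
        # Resolve relative links against the original page URL first.
        absolute = urljoin(base_url, target)
        return f"{attr}={quote}{ARCHIVE_PREFIX}/{timestamp}/{absolute}{quote}"

    # Simplified pattern: attribute name, opening quote, value, same quote.
    return re.sub(r'\b(href|src)=(["\'])(.*?)\2', to_archive_url, html)


if __name__ == "__main__":
    page = '<a href="/about">About</a>'
    print(rewrite_links(page, "http://example.com/index.html", "20230101000000"))
```

Because the rewritten link embeds the original URL, the sketch also reflects the point made in the citation statements above: in this model of replay, the URL remains the key by which archived content is addressed.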