Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (2020)
DOI: 10.1145/3383583.3398542

The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle

Abstract: The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest in reusing these archives as big data sources for statistical and analytical research, the speed at which these data can be turned into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for batch processing workloads. We trace the root cause of these penalties to its data structure, encoding, and addre…

Citations: cited by 7 publications (5 citation statements)
References: 37 publications
“…7) Experiment with column-oriented formats for archive indexes - In addition to formats that are common within the web archiving field, like WARC, WAT, WET, and CDX, other formats such as Parquet, which is used in cloud-based or big-data oriented environments, are making inroads into the field of web archiving [16]. Apache Parquet files are a column-oriented file type that is used to store and retrieve data more efficiently.…”
Section: Methods (mentioning)
confidence: 99%
“…Interoperability between datasets, formats, and software tools is essential for integrating datasets from multiple sources, and is a continuing challenge for knowledge infrastructures. The FITS format and tools layered upon these standards are showing their age, raising concerns among astronomers for future interoperability (Mink, 2015; Thomas et al., 2015; Wang & Xie, 2020). New tools such as Jupyter Notebooks provide scientists the ability to release executable packages of data, pipelines, and workflows.…”
Section: Theme 4: Using and Reusing Data Products (mentioning)
confidence: 99%
“…The study proceeds from an experimental assessment of two formats in the absence of a specific task of choosing alternatives. The authors of [27] pursue the goal of finding an alternative for the WARC format when developing web services. Apache Parquet and Apache Avro are also alternatives in this study.…”
Section: E Analysis Of Data Storage Formats (mentioning)
confidence: 99%
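
Several of the statements above describe Parquet as column-oriented. As a rough illustration of why a columnar layout can suit batch analytics over archive index records, here is a minimal sketch in plain Python. It does not use Parquet or WARC, and it is not the method of the cited paper. The record fields (`url`, `status`, `length`) and the helper `to_columns` are hypothetical names chosen for the example.

```python
# Hypothetical CDX-like index records; the field names are illustrative only.
records = [
    {"url": "http://example.org/", "status": 200, "length": 5120},
    {"url": "http://example.org/a", "status": 404, "length": 312},
    {"url": "http://example.org/b", "status": 200, "length": 2048},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row has the same keys, as in a fixed-schema index.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(records)

# Row-oriented access: every record is visited whole, even though
# the aggregate only needs one field.
total_row_oriented = sum(row["length"] for row in records)

# Column-oriented access: only the "length" column is read.
total_column_oriented = sum(columns["length"])
```

Both totals are equal. The difference is in how much data has to be touched: in an on-disk columnar format, an aggregate over one field can skip the bytes of all other fields, whereas a record-oriented layout generally has to read through each record.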