Yahoo mail servers have received an enormous number of messages each day for the past 17 years. The vast majority of today's messages (about 90%) are machine-generated from boilerplate templates with a small number of recipient-specific changes. We show that the popular Zlib compression to the gzip format fails to fully exploit the high similarity between these machine-generated messages. In this paper, we analyze the data redundancy in Yahoo mail and present methods that reduce its space requirements while still using the standard Zlib library. Our results show that we can further reduce the compressed data size by a factor of almost 2.5 compared with traditional gzip compression.
Data Compression Conference
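The abstract above does not say which methods the authors use, so the following is only an illustrative sketch of one way the standard Zlib library can exploit similarity between templated messages: a preset dictionary. The template text, the sample message, and the helper names (`TEMPLATE`, `compress_plain`, `compress_with_dictionary`, `decompress_with_dictionary`) are hypothetical. The sketch uses the zlib stream format rather than gzip, because preset dictionaries are not available with the gzip wrapper.

```python
import zlib

# Hypothetical boilerplate shared by many machine-generated messages.
TEMPLATE = (
    b"Hello {name}, your order {order_id} has shipped. "
    b"You can track it at https://example.com/track/{order_id}. "
    b"Thank you for shopping with us! If you have any questions, "
    b"reply to this message or visit our help center."
)


def compress_plain(message: bytes) -> bytes:
    """Compress one message on its own, with no knowledge of the template."""
    return zlib.compress(message, 9)


def compress_with_dictionary(message: bytes, dictionary: bytes) -> bytes:
    """Compress one message, priming the compressor with shared boilerplate."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(message) + comp.flush()


def decompress_with_dictionary(blob: bytes, dictionary: bytes) -> bytes:
    """Inverse of compress_with_dictionary; needs the same dictionary bytes."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()


# A hypothetical per-recipient instance of the template.
message = TEMPLATE.replace(b"{name}", b"Alice").replace(b"{order_id}", b"12345")

plain = compress_plain(message)
primed = compress_with_dictionary(message, TEMPLATE)
restored = decompress_with_dictionary(primed, TEMPLATE)

print(len(message), len(plain), len(primed))
```

Because most of the message already appears in the dictionary, the primed stream should be noticeably smaller than the standalone one. The trade-off is that the same dictionary must be stored and supplied at decompression time.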
This paper proposes a general optimization framework for allocating computing resources to the compression of massive, heterogeneous data sets arriving at a communication or storage system. The framework is formulated using abstract parameters and builds on rigorous tools from optimization theory. The outcome is a set of algorithms that together can reach an optimal compression allocation in a realistic scenario involving a multitude of content types and compression tools. This claim is demonstrated by running the optimization algorithms on publicly available data sets and showing up to a 25% size reduction under an equal compute-time budget, using standard compression tools.
Data Compression Conference
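To make the allocation problem concrete, here is a minimal sketch of a greedy heuristic for it. It is not the authors' framework, which the abstract does not detail, and it is not guaranteed to be optimal. The function name `allocate`, the content types, and every (time, size) figure are invented for illustration. The sketch assumes each content type offers a few compression options listed from cheapest to most expensive, and that the cheapest options together fit within the budget.

```python
from typing import Dict, List, Tuple

# One compression option: (compute_time, compressed_size). Units are arbitrary.
Option = Tuple[float, float]


def allocate(options: Dict[str, List[Option]], budget: float) -> Dict[str, int]:
    """Pick one option index per content type with a greedy heuristic.

    Start every content type at its cheapest option (index 0), then repeatedly
    apply the affordable upgrade with the best size saving per unit of extra time.
    """
    choice = {name: 0 for name in options}
    spent = sum(opts[0][0] for opts in options.values())

    while True:
        best = None  # (saving_per_time, content_type, option_index)
        for name, opts in options.items():
            cur_time, cur_size = opts[choice[name]]
            for j in range(choice[name] + 1, len(opts)):
                extra_time = opts[j][0] - cur_time
                saving = cur_size - opts[j][1]
                # Skip upgrades that do not help or do not fit the budget.
                if extra_time <= 0 or saving <= 0 or spent + extra_time > budget:
                    continue
                ratio = saving / extra_time
                if best is None or ratio > best[0]:
                    best = (ratio, name, j)
        if best is None:
            return choice
        _, name, j = best
        spent += options[name][j][0] - options[name][choice[name]][0]
        choice[name] = j


# Made-up example: text benefits a lot from extra effort, images barely do.
example_options = {
    "text": [(1.0, 100.0), (2.0, 60.0), (4.0, 50.0)],
    "images": [(1.0, 200.0), (3.0, 190.0)],
}
example_choice = allocate(example_options, budget=4.0)
print(example_choice)
```

With this toy input, the heuristic spends the spare compute time on text, where the saving per unit of time is largest, and leaves images at their cheapest setting.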