Sorting is one of the most fundamental algorithms in computer science and a common operation in databases, used not just for ordering query results but also as part of joins (e.g., sort-merge join) and indexing. In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data. Our algorithm uses the model to efficiently approximate the scaled empirical CDF for each record key and maps the key to the corresponding position in the output array. We then apply a deterministic sorting algorithm that works well on nearly sorted arrays (e.g., Insertion Sort) to establish a totally sorted order. We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally distributed double-precision keys. The results show that our approach yields an average 3.38× performance improvement over C++ STL sort, which is an optimized Quicksort hybrid, a 1.49× improvement over sequential Radix Sort, and a 5.54× improvement over a C++ implementation of Timsort, which is the default sorting function for Java and Python.
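The two phases described above (model-predicted placement, then a touch-up pass) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the model, so a sorted random sample queried by binary search stands in for the learned CDF here, and the names `fit_cdf_model`, `learned_sort`, and `sample_size` are hypothetical.

```python
import bisect
import random


def fit_cdf_model(keys, sample_size=1000, seed=0):
    """Approximate the empirical CDF from a sorted random sample.

    Assumption: this sample-based lookup is a stand-in for the learned model.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))

    def cdf(x):
        # Fraction of sampled keys <= x, a value in [0, 1].
        return bisect.bisect_right(sample, x) / len(sample)

    return cdf


def learned_sort(keys, sample_size=1000, seed=0):
    """Return a sorted copy of `keys` (a list of comparable numbers)."""
    n = len(keys)
    if n < 2:
        return list(keys)
    cdf = fit_cdf_model(keys, sample_size, seed)

    # Phase 1: scale the CDF estimate to a predicted output slot.
    # Keys that collide on a slot share a bucket.
    buckets = [[] for _ in range(n)]
    for x in keys:
        pos = min(n - 1, int(cdf(x) * n))
        buckets[pos].append(x)

    # The model is monotone, so concatenating buckets in slot order
    # gives an array that is only locally out of order.
    out = [x for bucket in buckets for x in bucket]

    # Phase 2: Insertion Sort repairs the remaining local disorder.
    for i in range(1, n):
        x = out[i]
        j = i - 1
        while j >= 0 and out[j] > x:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = x
    return out
```

Because the final pass is a complete Insertion Sort, the result is correct even when the CDF estimate is poor; the quality of the model only affects how much work that pass has to do.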
Today’s society shares a digital life, with an Internet population of 3.2 billion people. Although this colossal data infrastructure enables communication, information sharing, and collaboration, it favors a paradigm of continuous collection and storage of data, with little analysis of how that disrupts certain social norms and leads to violations of fundamental rights such as privacy, freedom, and protection from discrimination. In 2016, the European Union adopted the General Data Protection Regulation, which introduced a right for individuals to have their personal data erased. This opened a discussion on privacy and identity concerns in the context of perpetual stigmatization and discrimination due to obsolete data that remains on the web. Through analyses of selected cases in the U.S. and E.U., this paper investigates the challenges of importing a similar legal framework for the erasure of personal data into the U.S., while ensuring freedom of expression and maintaining the quality of search engines and the respective websites.