Large organizations like YouTube are dealing with exploding data volume and increasing demand for data-driven applications. Broadly, these applications fall into four categories: reporting and dashboarding, embedded statistics in pages, time-series monitoring, and ad-hoc analysis. Typically, organizations build specialized infrastructure for each of these use cases. This, however, creates silos of data and processing, and results in infrastructure that is complex, expensive, and hard to maintain. At YouTube, we solved this problem by building a new SQL query engine, Procella. Procella implements a superset of the capabilities required to address all four use cases above, with high scale and performance, in a single product. Today, Procella serves hundreds of billions of queries per day across all four workloads at YouTube and several other Google product areas.
Scientific data is often distributed through repositories that host a large number of files in formats such as NetCDF or HDF5. With recent and anticipated increases in the size of observational and simulation data, it is important to transport only the data of interest from a large distributed dataset. Unfortunately, existing portals provide limited querying interfaces, typically a set of predefined, hard-coded subsetting options, which limits users' querying flexibility. This paper describes a system that addresses this gap. Relational algebra is adapted for scientific array querying, which allows a subset of SQL to be adapted for this domain and enables nuanced subsetting conditions to be applied to a set of dataset files within a repository. A query processing algorithm extracts and collects data from the relevant datasets, based on metadata extracted earlier by an automatic metadata extraction engine. Finally, the system stitches together a new structured NetCDF file to be returned as the result set, allowing the returned data to be used and analyzed by existing tools. The system has been extensively evaluated to show its ability to handle increasing data volume and/or number of files.
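As a rough illustration of the metadata-driven step described in this abstract, the sketch below shows how a catalog of previously extracted per-file metadata could be used to decide which files a subsetting query needs to touch. It is not the paper's algorithm: `FileMetadata`, `select_files`, the file names, and the range-overlap test are all hypothetical, and the catalog is an in-memory stand-in rather than metadata read from real NetCDF or HDF5 files.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # inclusive (min, max)


@dataclass
class FileMetadata:
    """Hypothetical catalog entry, assumed to be extracted ahead of time."""
    path: str
    variables: Tuple[str, ...]
    ranges: Dict[str, Range]  # coordinate extent per dimension


def overlaps(a: Range, b: Range) -> bool:
    """True when two inclusive ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def select_files(catalog: List[FileMetadata],
                 variable: str,
                 conditions: Dict[str, Range]) -> List[str]:
    """Return paths of files that may hold data matching the query.

    A file is kept only if it stores the requested variable and its
    extent overlaps the requested range on every constrained dimension.
    Files with no recorded extent for a constrained dimension are skipped.
    """
    hits = []
    for meta in catalog:
        if variable not in meta.variables:
            continue
        if all(dim in meta.ranges and overlaps(meta.ranges[dim], wanted)
               for dim, wanted in conditions.items()):
            hits.append(meta.path)
    return hits


# Roughly: SELECT temperature WHERE lat BETWEEN 10 AND 20 AND time BETWEEN 0 AND 50
catalog = [
    FileMetadata("a.nc", ("temperature", "pressure"), {"lat": (0, 30), "time": (0, 100)}),
    FileMetadata("b.nc", ("temperature",), {"lat": (40, 60), "time": (0, 100)}),
    FileMetadata("c.nc", ("salinity",), {"lat": (0, 30), "time": (0, 100)}),
    FileMetadata("d.nc", ("temperature",), {"lat": (15, 45), "time": (40, 90)}),
]
relevant = select_files(catalog, "temperature", {"lat": (10, 20), "time": (0, 50)})
```

Only the files left in `relevant` would then need to be opened to extract the actual array slices and assemble the output file, which is the part this sketch leaves out.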
Modern computing devices and user interfaces have necessitated highly interactive querying. Some of these interfaces issue a large number of dynamically changing and continuous queries to the backend. In others, users expect to inspect results during the query formulation process, to guide them towards specifying a full-fledged query. Thus, users end up issuing a fast-changing workload to the underlying database. In such situations, the user's query intent can be thought of as being in flux. In this paper, we show that traditional query execution engines are not well suited to this new class of highly interactive workloads. We propose a novel model to interpret the variability of likely queries in a workload. We implement a cyclic scan-based approach to process queries from such workloads in an efficient and practical manner while reducing the overall system load. We evaluate and compare our methods with traditional systems and demonstrate the scalability of our approach, which enables thousands of queries to run simultaneously within interactive response times with low memory and CPU requirements.
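To make the idea of a cyclic scan concrete, here is a minimal sketch of the general shared-scan technique, under the assumption that queries are simple row filters over a small in-memory table. It is not the implementation evaluated in the paper; `ScanQuery`, `CyclicScan`, and their methods are hypothetical names. A single scan loops over the table continuously, each query joins at the current position, and a query is complete once the scan has wrapped around to where it joined.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, int]


@dataclass
class ScanQuery:
    """Hypothetical filter query that rides along with the shared scan."""
    name: str
    predicate: Callable[[Row], bool]
    start: int = 0   # scan position at which the query joined
    seen: int = 0    # number of rows this query has been shown so far
    results: List[Row] = field(default_factory=list)


class CyclicScan:
    """One scan cursor that cycles over the table and feeds all active queries."""

    def __init__(self, table: List[Row]) -> None:
        if not table:
            raise ValueError("this sketch assumes a non-empty table")
        self.table = table
        self.position = 0
        self.rows_read = 0
        self.active: List[ScanQuery] = []

    def attach(self, query: ScanQuery) -> None:
        """Register a query at the current cursor position."""
        query.start = self.position
        self.active.append(query)

    def step(self) -> List[ScanQuery]:
        """Read one row, evaluate every active query on it, return finished queries."""
        row = self.table[self.position]
        self.rows_read += 1
        finished = []
        for query in self.active:
            if query.predicate(row):
                query.results.append(row)
            query.seen += 1
            if query.seen == len(self.table):  # wrapped back to its start
                finished.append(query)
        self.active = [q for q in self.active if q.seen < len(self.table)]
        self.position = (self.position + 1) % len(self.table)
        return finished

    def run_until_idle(self) -> List[ScanQuery]:
        """Keep scanning until no query is waiting for rows."""
        done = []
        while self.active:
            done.extend(self.step())
        return done


table = [{"id": i, "value": i * 10} for i in range(6)]
scan = CyclicScan(table)
q1 = ScanQuery("high_value", lambda r: r["value"] >= 30)
scan.attach(q1)
scan.step()
scan.step()
q2 = ScanQuery("even_id", lambda r: r["id"] % 2 == 0)
scan.attach(q2)  # joins mid-scan, at position 2
scan.run_until_idle()
```

In this toy run the two queries together cost 8 row reads instead of the 12 that two independent full scans would need, because the rows read while both queries are active are shared. Results arrive in scan order from each query's start position, so `q2` sees row 2 before row 0.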