Abstract-In the last decade, the growth of social media, unconventional web technologies, and mobile applications has encouraged the development of a new breed of database models. NoSQL data stores target unstructured data, which by nature is dynamic and a key focus area for "Big Data" research. New-generation data can prove costly and impractical to administer with SQL databases because of its lack of structure and its high scalability and elasticity needs. NoSQL data stores such as MongoDB and Cassandra provide a desirable platform for fast and efficient data queries. This has increased their importance in areas such as cloud applications, e-commerce, social media, bioinformatics, and materials science. In an effort to combine the querying capabilities of conventional database systems and the processing power of the MapReduce model, this paper presents a thorough evaluation of the Cassandra NoSQL database when used in conjunction with the Hadoop MapReduce engine. We characterize the performance for a wide range of representative use cases, and then compare, contrast, and evaluate the results so that application developers can make informed decisions based upon data size, cluster size, replication factor, and partitioning strategy to meet their performance needs.
The progressive transition in the nature of both scientific and industrial datasets has been the driving force behind the development of, and research interest in, the NoSQL model. Loosely structured data poses a challenge to traditional data store systems, and when working with the NoSQL model, these systems are often considered impractical and costly. As the quantity and quality of unstructured data grow, so does the demand for a processing pipeline that is capable of seamlessly combining the NoSQL storage model and a "Big Data" processing platform such as MapReduce. Although MapReduce is the paradigm of choice for data-intensive computing, Java-based frameworks such as Hadoop require users to write MapReduce code in Java, whereas the Hadoop Streaming module allows users to define non-Java executables as map and reduce operations. When confronted with legacy C/C++ applications and other non-Java executables, there arises a further need to allow NoSQL data stores access to the features of Hadoop Streaming. We present approaches to solving the challenge of integrating NoSQL data stores with MapReduce under non-Java application scenarios, along with the advantages and disadvantages of each approach. We compare Hadoop Streaming alongside our own streaming framework, MARISSA, to show the performance implications of coupling NoSQL data stores like Cassandra with MapReduce frameworks that normally rely on file-system-based data stores. Our experiments also include Hadoop-C*, a setup in which a Hadoop cluster is co-located with a Cassandra cluster in order to process data using Hadoop with non-Java executables.
Abstract-The progressive transition in the nature of both scientific and industrial datasets has been the driving force behind the development of, and research interest in, the NoSQL data model. Loosely structured data poses a challenge to traditional data store systems, and when working with the NoSQL model, these systems are often considered impractical and expensive. As the quantity of unstructured data grows, so does the demand for a processing pipeline that is capable of seamlessly combining the NoSQL storage model and a "Big Data" processing platform such as MapReduce. Although MapReduce is the paradigm of choice for data-intensive computing, Java-based frameworks such as Hadoop require users to write MapReduce code in Java. Hadoop Streaming, on the other hand, allows users to define non-Java executables as map and reduce operations. Similarly, for legacy C/C++ applications and other non-Java executables, there is a need to allow NoSQL data stores access to the features of Hadoop Streaming. In this paper, we present approaches to solving the challenge of integrating NoSQL data stores with MapReduce for non-Java application scenarios, along with the advantages and disadvantages of each approach. We compare Hadoop Streaming alongside our own streaming framework, MARISSA, to show the performance implications of coupling NoSQL data stores like Cassandra with MapReduce frameworks that normally rely on file-system-based data stores.