Dawei Jiang scite author profile

MapReduce has been widely used for large-scale data analysis in the Cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance, although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open-source implementation of MapReduce, is slower than two state-of-the-art parallel database systems by a factor of 3.1 to 6.5 across a variety of analytical tasks. MapReduce can achieve better performance when more compute nodes are allocated from the cloud to speed up computation; however, this approach of "renting more nodes" is not cost-effective in a pay-as-you-go environment. Users want an economical, elastically scalable data processing system and are therefore interested in whether MapReduce can offer both elastic scalability and efficiency. In this paper, we conduct a performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. We identify five design factors that affect the performance of Hadoop and investigate alternative but known methods for each factor. We show that by carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5 for the same benchmark used in [19], and is thus more comparable to that of parallel database systems. Our results show that it is therefore possible to build a cloud data processing system that is both elastically scalable and efficient.


JUNO Conceptual Design Report

Adam

An

An

et al. 2015

Preprint


MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters

Jiang

Tung

Chen

2011

IEEE Trans. Knowl. Data Eng.



Efficient B-tree based indexing for cloud data processing

et al. 2010


A Cloud may be seen as a type of flexible computing infrastructure consisting of many compute nodes, where resizable computing capacities can be provided to different customers. To fully harness the power of the Cloud, efficient data management is needed to handle huge volumes of data and support a large number of concurrent end users. To achieve that, a scalable and high-throughput indexing scheme is generally required. Such an indexing scheme must not only incur a low maintenance cost but also support parallel search to improve scalability. In this paper, we present a novel, scalable B+-tree based indexing scheme for efficient data processing in the Cloud. Our approach can be summarized as follows. First, we build a local B+-tree index for each compute node which only indexes data residing on the node. Second, we organize the compute nodes as a structured overlay and publish a portion of the local B+-tree nodes to the overlay for efficient query processing. Finally, we propose an adaptive algorithm to select the published B+-tree nodes according to query patterns. We conduct extensive experiments on Amazon's EC2, and the results demonstrate that our indexing scheme is dynamic, efficient and scalable.
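The two-level idea in this abstract (a local index per node plus a published summary used for routing) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: a sorted list stands in for each local B+-tree, the hypothetical `Overlay` class is a plain in-memory list rather than a structured peer-to-peer overlay, each node publishes only its minimum and maximum key, and the adaptive selection of published nodes is omitted.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    """One compute node that indexes only the data residing on it."""
    name: str
    keys: list = field(default_factory=list)  # sorted list standing in for a local B+-tree

    def insert(self, key):
        bisect.insort(self.keys, key)

    def published_range(self):
        # Publish a coarse summary (min and max key) instead of the whole index.
        return (self.keys[0], self.keys[-1]) if self.keys else None

    def local_range_search(self, lo, hi):
        left = bisect.bisect_left(self.keys, lo)
        right = bisect.bisect_right(self.keys, hi)
        return self.keys[left:right]


class Overlay:
    """Hypothetical stand-in for the structured overlay: a list of published ranges."""

    def __init__(self):
        self.entries = []  # tuples of (lo, hi, node)

    def publish(self, node):
        rng = node.published_range()
        if rng is not None:
            self.entries.append((rng[0], rng[1], node))

    def range_query(self, lo, hi):
        results, contacted = [], []
        for node_lo, node_hi, node in self.entries:
            if node_lo <= hi and lo <= node_hi:  # published range overlaps the query
                contacted.append(node.name)
                results.extend(node.local_range_search(lo, hi))
        return sorted(results), contacted


def build_demo():
    """Build three nodes with disjoint key sets and publish their ranges."""
    data = {"a": (1, 5, 9), "b": (10, 14, 20), "c": (30, 35)}
    overlay = Overlay()
    for name, keys in data.items():
        node = ComputeNode(name)
        for key in keys:
            node.insert(key)
        overlay.publish(node)
    return overlay
```

In this sketch, `build_demo().range_query(8, 15)` consults only nodes `a` and `b`, because the range published by `c` does not overlap the query. A published range goes stale if keys are inserted after publishing; a real scheme would have to maintain the published entries.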


ES²: A cloud data storage system for supporting both OLTP and OLAP

Cao

Chen

Guo

et al. 2011


Cloud computing represents a paradigm shift driven by the increasing demand of Web-based applications for elastic, scalable and efficient system architectures that can support their ever-growing data volume and large-scale data analysis. A typical data management system has to deal with real-time updates by individual users as well as periodic large-scale analytical processing, indexing, and data extraction. While such operations may take place in the same domain, the design and development of the systems have evolved independently for transactional and periodic analytical processing. Such a system-level separation has resulted in problems with data freshness as well as serious data storage redundancy. Ideally, it would be more efficient to apply ad-hoc analytical processing on the same data directly. However, to the best of our knowledge, such an approach has not been adopted in real implementations. Intrigued by this observation, we have designed and implemented epiC, an elastic power-aware data-intensive Cloud platform for supporting both data-intensive analytical operations (referred to as OLAP) and online transactions (referred to as OLTP). In this paper, we present ES², the elastic data storage system of epiC, which is designed to support both functionalities within the same storage. We present the system architecture and the functions of each system component, and experimental results which demonstrate the efficiency of the system.


Deep Depthwise Separable Convolutional Network for Change Detection in Optical Aerial Images

Liu

Jiang

Zhang

et al. 2020

IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing


In this article, a remote sensing image change detection method based on depthwise separable convolution with U-Net is proposed, which omits the tedious steps of generating and analyzing the difference map in traditional remote sensing image change detection methods. First, two images with c channels each are stacked into a 2c-channel image, which converts change detection into an image segmentation problem; an improved fully convolutional network (FCN) called U-Net is then used to directly separate the changed regions. Because the capability of a deep convolutional network is proportional to its depth, and a deeper network means more training parameters, we replace the original convolution in the FCN with depthwise separable convolution, making the entire network lighter while the model performs slightly better than with traditional convolution. Another innovation of the proposed method is a preference control loss function that accommodates different requirements for precision and recall. Experimental results validate the effectiveness and robustness of the proposed method.
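To make the "lighter network" claim concrete, the following sketch compares the weight count of a standard convolution with that of a depthwise separable convolution, and shows the channel stacking step in its simplest form. The function names, the kernel size of 3, the 64 output channels, and the representation of an image as a list of channels are assumptions chosen for illustration; they are not taken from the paper, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def stack_channels(image_a, image_b):
    """Stack two images, each given as a list of c channels, into one 2c-channel image."""
    if len(image_a) != len(image_b):
        raise ValueError("both images must have the same number of channels")
    return image_a + image_b


# Two 3-channel images stacked into a 6-channel input, first layer with 64 filters.
c = 3
print(standard_conv_params(3, 2 * c, 64))        # 3456
print(depthwise_separable_params(3, 2 * c, 64))  # 438
```

For this hypothetical first layer the separable form needs roughly one eighth of the weights, which is the kind of saving the abstract refers to.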


Supporting Database Applications as a Service

Hui

Jiang

et al. 2009


Multi-tenant data management is a form of Software as a Service (SaaS), whereby a third-party service provider hosts databases as a service and provides its customers with seamless mechanisms to create, store and access their databases at the host site. One of the main problems in such a system, as we shall discuss in this paper, is scalability, namely the ability to serve an increasing number of tenants without significant degradation in query performance. A promising way to handle the scalability issue is to consolidate tuples from different tenants into the same shared tables. However, this approach introduces two problems: 1) The shared tables are too sparse. 2) Indexing on shared tables is not effective. To resolve these problems, we propose a multi-tenant database system called M-Store, which provides storage and indexing services for multiple tenants. To improve the scalability of the system, we develop two techniques in M-Store: Bitmap Interpreted Tuple (BIT) and Multi-Separated Index (MSI). BIT is efficient in that it does not store NULLs from unused attributes in the shared tables, and MSI provides flexibility since it only indexes each tenant's own data on frequently accessed attributes. We extended MySQL based on our proposed design and conducted extensive experiments. The experimental results show that our proposed approach is a promising multi-tenancy storage and indexing scheme which can be easily integrated into existing DBMSs.
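The idea behind BIT, not storing NULLs for attributes a tenant does not use, can be sketched as follows. This is only an illustration under assumptions: the abstract does not describe M-Store's physical layout, so the per-tenant bitmap as a list of booleans, the `SHARED_COLUMNS` schema, and the function names are all hypothetical.

```python
# Hypothetical shared-table schema used by every tenant.
SHARED_COLUMNS = ["id", "name", "email", "phone", "region"]


def make_bitmap(used_columns):
    """One flag per shared column: True if the tenant uses that attribute."""
    return [col in used_columns for col in SHARED_COLUMNS]


def encode_tuple(bitmap, row):
    """Store values only for the attributes the tenant uses, with no NULL placeholders."""
    return [row.get(col) for col, used in zip(SHARED_COLUMNS, bitmap) if used]


def decode_tuple(bitmap, stored):
    """Interpret a stored tuple through the tenant's bitmap to rebuild the full row."""
    values = iter(stored)
    return {
        col: (next(values) if used else None)
        for col, used in zip(SHARED_COLUMNS, bitmap)
    }
```

A tenant that uses only `id` and `email` stores two values per tuple instead of five, and `decode_tuple` restores the three unused attributes as `None` when the row is read.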


epiC: an extensible and scalable system for processing Big Data

Jiang

Chen

et al. 2015

The VLDB Journal


The Big Data problem is characterized by the so-called 3V features: Volume (a huge amount of data), Velocity (a high data ingestion rate), and Variety (a mix of structured, semi-structured, and unstructured data). The state-of-the-art solutions to the Big Data problem are largely based on the MapReduce framework (or its open-source implementation, Hadoop). Although Hadoop handles the data volume challenge successfully, it does not deal with data variety well, since its programming interfaces and associated data processing model are inconvenient and inefficient for handling structured data and graph data. This paper presents epiC, an extensible system to tackle the data variety challenge of Big Data. epiC introduces a general Actor-like concurrent programming model, independent of the data processing models, for specifying parallel computations. Users process multi-structured datasets with appropriate epiC extensions: the implementation of a data processing model best suited for the data type, and auxiliary code for mapping that data processing model into epiC's concurrent programming model. As in Hadoop, programs written in this way can be automatically parallelized, and the runtime system takes care of fault tolerance and inter-machine communications. We present the design and implementation of epiC's concurrent programming model. We also present two customized data processing models, an optimized MapReduce extension and a relational model, on top of epiC. Experiments demonstrate the effectiveness and efficiency of the proposed epiC system.




The performance of MapReduce
