M. Suhail Rehman scite author profile

MapReduce is one of the most successful realizations of large-scale, data-intensive cloud computing platforms. MapReduce automatically parallelizes computation by running multiple map and/or reduce tasks over distributed data across multiple machines. Hadoop is an open source implementation of MapReduce. When Hadoop schedules reduce tasks, it neither exploits data locality nor addresses the partitioning skew present in some MapReduce applications. This can lead to increased cluster network traffic. In this paper we investigate the problems of data locality and partitioning skew in Hadoop. We propose Center-of-Gravity Reduce Scheduler (CoGRS), a locality-aware, skew-aware reduce task scheduler for saving MapReduce network traffic. To exploit data locality, CoGRS schedules each reduce task at its center-of-gravity node, which is computed after taking partitioning skew into account as well. We implemented CoGRS in Hadoop-0.20.2 and tested it on a private cloud as well as on Amazon EC2. Compared to native Hadoop, our results show that CoGRS reduces off-rack network traffic by averages of 9.6% and 38.6% on our private cloud and on an Amazon EC2 cluster, respectively. This is reflected in job execution times, which improve by up to 23.8%.
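The abstract does not give the formula behind the center-of-gravity node, so the following is only a minimal sketch under stated assumptions: the node is taken to be the one with the lowest byte-weighted network distance to a reduce task's input partitions, and distances are assumed to be 0 within a node, 2 within a rack, and 4 across racks. The cluster layout, `distance`, and `center_of_gravity` are hypothetical names, not part of CoGRS or Hadoop.

```python
from typing import Dict

# Hypothetical cluster layout (assumption): node name -> rack name.
RACK_OF = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}


def distance(a: str, b: str) -> int:
    """Assumed distance: 0 on the same node, 2 in the same rack, 4 across racks."""
    if a == b:
        return 0
    return 2 if RACK_OF[a] == RACK_OF[b] else 4


def center_of_gravity(partition_bytes: Dict[str, int]) -> str:
    """Return the node with the lowest byte-weighted distance to the task's input.

    partition_bytes maps each node holding map output to the number of bytes
    of this reduce task's partition stored there, so skewed partitions pull
    the chosen node toward wherever most of the data lives.
    """
    def cost(candidate: str) -> int:
        return sum(size * distance(candidate, src)
                   for src, size in partition_bytes.items())

    # Sorting the candidates makes tie-breaking deterministic.
    return min(sorted(RACK_OF), key=cost)


# Skewed partition: most of this reduce task's input sits on n3.
skewed = {"n1": 10, "n2": 10, "n3": 500, "n4": 40}
print(center_of_gravity(skewed))  # prints n3
```

In this sketch, a scheduler that ignored skew could place the task on rackA and pull 540 bytes across racks, whereas placing it on n3 moves only 20 bytes across racks.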


Fast and scalable list ranking on the GPU

Rehman

Kothapalli

Narayanan

2009


Initial Findings for Provisioning Variation in Cloud Computing

Rehman

Sakr

2010


Towards Understanding Data Analysis Workflows using a Large Notebook Corpus

Rehman

2019


Teaching the cloud - experiences in designing and teaching an undergraduate-level course in cloud computing at the Carnegie Mellon University in Qatar

Rehman

Sakr

2011


A demonstration of RELIC

2021


The ad-hoc, heterogeneous process of modern data science typically involves loading, cleaning, and mutating dataset(s) into multiple versions recorded as artifacts by various tools within a single data science workflow. Lineage information, including the source datasets, data transformation programs or scripts, or manual annotations, is rarely captured, making it difficult to infer the relationships between artifacts in a given workflow retrospectively. We demonstrate Relic, a tool to retrospectively infer the lineage of data artifacts generated as a result of typical data science workflows, with an interactive demonstration that allows users to input artifact files and visualize the inferred lineage in a web-based setting.


FuzzyData

Rehman

Elmore

2022


Dataframes have become a popular means to represent, transform, and analyze data. This approach has gained traction and a large user base among data science practitioners, resulting in a new wave of systems that implement a dataframe API but add performance, efficiency, and distributed/parallel extensions to systems such as R and pandas. However, unlike relational databases and NoSQL systems, which have a variety of benchmarking, testing, and workload generation suites, there is an acute lack of similar tools for dataframe-based systems. This paper presents FuzzyData, a first step in providing an extensible workflow generation system that targets dataframe-based APIs. We present an abstract data processing workflow model, random table and workflow generators, and three clients implemented using our model. Using FuzzyData, we can encode a real-world workflow or randomly generate workflows using various parameters. These workflows can be scaled and replayed on multiple systems to provide stress testing, performance evaluation, and a breakdown of performance bottlenecks present in popular dataframe systems.
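The idea of an abstract workflow model with random generators and pluggable clients can be illustrated with a small sketch. This is not the FuzzyData API: every name here (`random_table`, `random_workflow`, `PURE_PYTHON_CLIENT`, `replay`) and the three operations are assumptions chosen for illustration, and tables are plain lists of dictionaries rather than real dataframes.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]
Table = List[Row]
Step = Tuple[str, str]  # (abstract operation name, column it applies to)


def random_table(rng: random.Random, n_rows: int, n_cols: int) -> Table:
    """Generate a table of random integers with columns c0, c1, ..."""
    cols = [f"c{i}" for i in range(n_cols)]
    return [{c: rng.randint(0, 100) for c in cols} for _ in range(n_rows)]


def random_workflow(rng: random.Random, columns: List[str], n_steps: int) -> List[Step]:
    """Pick a random sequence of abstract operations over the given columns."""
    ops = ["filter_ge_50", "sort", "drop_duplicates_on"]
    return [(rng.choice(ops), rng.choice(columns)) for _ in range(n_steps)]


def drop_duplicates_on(table: Table, col: str) -> Table:
    """Keep the first row seen for each distinct value of col."""
    seen = set()
    out = []
    for row in table:
        if row[col] not in seen:
            seen.add(row[col])
            out.append(row)
    return out


# One hypothetical "client": maps each abstract operation to concrete code.
# A second client could map the same names to another system's calls.
PURE_PYTHON_CLIENT: Dict[str, Callable[[Table, str], Table]] = {
    "filter_ge_50": lambda t, c: [r for r in t if r[c] >= 50],
    "sort": lambda t, c: sorted(t, key=lambda r: r[c]),
    "drop_duplicates_on": drop_duplicates_on,
}


def replay(workflow: List[Step], table: Table,
           client: Dict[str, Callable[[Table, str], Table]]) -> Table:
    """Run each abstract step through the client's implementation, in order."""
    for op, col in workflow:
        table = client[op](table, col)
    return table


rng = random.Random(0)  # a fixed seed makes the generated workflow repeatable
table = random_table(rng, n_rows=20, n_cols=3)
workflow = random_workflow(rng, list(table[0]), n_steps=4)
result = replay(workflow, table, PURE_PYTHON_CLIENT)
print(workflow, len(result))
```

Because the workflow is just data, the same generated sequence can be replayed against different clients and timed step by step, which is the kind of comparison the abstract describes.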



A performance prediction model for the CUDA GPGPU platform

Center-of-Gravity Reduce Task Scheduling to Lower MapReduce Network Traffic
