RABID -- A General Distributed R Processing Framework Targeting Large Data-Set Problems

Lin, Haoming; Yang, Shuo; Midkiff, Samuel P.

doi:10.1109/bigdata.congress.2013.67

Cited by 9 publications

(4 citation statements)

References 2 publications


“…Many emerging parallel R packages, such as RHIPE, SparkR [17], RABID [18], Snowfall, Rmpi and pbdMPI [19], can be used to parallelize R processes. RHIPE is a Hadoop MapReduce based R package that transforms R functions into MapReduce jobs.…”

Section: Methods (mentioning)

confidence: 99%

Optimising parallel R correlation matrix calculations on gene expression data using MapReduce

et al. 2014


Background: High-throughput molecular profiling data has been used to improve clinical decision making by stratifying subjects based on their molecular profiles. Unsupervised clustering algorithms can be used for stratification purposes. However, the current speed of the clustering algorithms cannot meet the requirement of large-scale molecular data due to poor performance of the correlation matrix calculation. With high-throughput sequencing technologies promising to produce even larger datasets per subject, we expect the performance of the state-of-the-art statistical algorithms to be further impacted unless efforts towards optimisation are carried out. MapReduce is a widely used high performance parallel framework that can solve the problem.

Results: In this paper, we evaluate the current parallel modes for correlation calculation methods and introduce an efficient data distribution and parallel calculation algorithm based on MapReduce to optimise the correlation calculation. We studied the performance of our algorithm using two gene expression benchmarks. In the micro-benchmark, our implementation using MapReduce, based on the R package RHIPE, demonstrates a 3.26-5.83 fold increase compared to the default Snowfall and 1.56-1.64 fold increase compared to the basic RHIPE in the Euclidean, Pearson and Spearman correlations. Though vanilla R and the optimised Snowfall outperforms our optimised RHIPE in the micro-benchmark, they do not scale well with the macro-benchmark. In the macro-benchmark the optimised RHIPE performs 2.03-16.56 times faster than vanilla R. Benefiting from the 3.30-5.13 times faster data preparation, the optimised RHIPE performs 1.22-1.71 times faster than the optimised Snowfall. Both the optimised RHIPE and the optimised Snowfall successfully performs the Kendall correlation with TCGA dataset within 7 hours. Both of them conduct more than 30 times faster than the estimated vanilla R.

Conclusions: The performance evaluation found that the new MapReduce algorithm and its implementation in RHIPE outperforms vanilla R and the conventional parallel algorithms implemented in R Snowfall. We propose that MapReduce framework holds great promise for large molecular data analysis, in particular for high-dimensional genomic data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new algorithm as a basis for optimising high-throughput molecular data correlation calculation for Big Data.
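The abstract does not spell out the data distribution scheme, so the following is only a minimal sketch of the general idea: split the rows of an expression matrix into blocks, let each block pair act as an independent map task that emits pairwise Pearson correlations, and merge the emitted values into a symmetric matrix. It is written in plain Python as a stand-in for R and RHIPE, and every name in it (`split_rows`, `map_block_pair`, `correlation_matrix`, `n_blocks`) is hypothetical rather than taken from the paper.

```python
from itertools import combinations_with_replacement
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_rows(n_rows, n_blocks):
    """Partition row indices 0..n_rows-1 into contiguous blocks (assumes n_rows > 0)."""
    size = -(-n_rows // n_blocks)  # ceiling division
    return [list(range(i, min(i + size, n_rows))) for i in range(0, n_rows, size)]


def map_block_pair(rows, block_a, block_b):
    """Map step: emit ((i, j), r) for each row pair i <= j covered by one block pair."""
    for i in block_a:
        for j in block_b:
            if i <= j:
                yield (i, j), pearson(rows[i], rows[j])


def correlation_matrix(rows, n_blocks=2):
    """Run every block-pair task, then merge the emitted values (reduce step)."""
    blocks = split_rows(len(rows), n_blocks)
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    # Each (a, b) block pair is independent, so a real framework could run them in parallel.
    for a, b in combinations_with_replacement(range(len(blocks)), 2):
        for (i, j), r in map_block_pair(rows, blocks[a], blocks[b]):
            matrix[i][j] = matrix[j][i] = r
    return matrix
```

Because each block pair touches only two blocks of rows, a task needs a fraction of the full matrix in memory, which is the property that makes this kind of decomposition attractive for large inputs. Here the tasks run sequentially in one process; no claim is made about how the cited implementation schedules or shuffles data.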


“…R demonstrates superiority in statistical computing, graphical plotting and data analysis compared with graphical user interface (GUI) software. Moreover, R excels in big data analysis [76][77][78], data mining [79] and visualisation [80] modelling, plotting and image processing, which are supported through the variety of its built-in packages.…”

Section: Related Work (mentioning)

confidence: 99%

Satellite Image Processing by Python and R Using Landsat 9 OLI/TIRS and SRTM DEM Data on Côte d’Ivoire, West Africa

Lemenkova, Debeir. 2022. J. Imaging


In this paper, we propose an advanced scripting approach using Python and R for satellite image processing and modelling terrain in Côte d’Ivoire, West Africa. Data include Landsat 9 OLI/TIRS C2 L1 and the SRTM digital elevation model (DEM). The EarthPy library of Python and ‘raster’ and ‘terra’ packages of R are used as tools for data processing. The methodology includes computing vegetation indices to derive information on vegetation coverage and terrain modelling. Four vegetation indices were computed and visualised using R: the Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index 2 (EVI2), Soil-Adjusted Vegetation Index (SAVI) and Atmospherically Resistant Vegetation Index 2 (ARVI2). The SAVI index is demonstrated to be more suitable and better adjusted to the vegetation analysis, which is beneficial for agricultural monitoring in Côte d’Ivoire. The terrain analysis is performed using Python and includes slope, aspect, hillshade and relief modelling with changed parameters for the sun azimuth and angle. The vegetation pattern in Côte d’Ivoire is heterogeneous, which reflects the complexity of the terrain structure. Therefore, the terrain and vegetation data modelling is aimed at the analysis of the relationship between the regional topography and environmental setting in the study area. The upscaled mapping is performed as regional environmental analysis of the Yamoussoukro surroundings and local topographic modelling of the Kossou Lake. The algorithms of the data processing include image resampling, band composition, statistical analysis and map algebra used for calculation of the vegetation indices in Côte d’Ivoire. This study demonstrates the effective application of the advanced programming algorithms in Python and R for satellite image processing.
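The map algebra behind two of the indices named above can be illustrated with their standard textbook definitions: NDVI = (NIR - Red) / (NIR + Red) and SAVI = (1 + L)(NIR - Red) / (NIR + Red + L), with the commonly used soil factor L = 0.5. The sketch below is plain Python, not the paper's R code; the function names, the zero-denominator guard, and the tiny reflectance grids are assumptions made for illustration.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    denominator = nir + red
    # Assumption: report 0.0 where both bands are zero instead of dividing by zero.
    return (nir - red) / denominator if denominator != 0 else 0.0


def savi(nir, red, soil_factor=0.5):
    """Soil-Adjusted Vegetation Index; soil_factor is the L term (0.5 is a common default)."""
    return (1 + soil_factor) * (nir - red) / (nir + red + soil_factor)


def map_algebra(index_fn, nir_band, red_band):
    """Apply a per-pixel index function cell by cell over two equally shaped 2-D bands."""
    return [
        [index_fn(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical 2x2 reflectance grids (values in 0..1), not data from the study.
nir_band = [[0.5, 0.4], [0.3, 0.0]]
red_band = [[0.1, 0.2], [0.3, 0.0]]

ndvi_grid = map_algebra(ndvi, nir_band, red_band)
savi_grid = map_algebra(savi, nir_band, red_band)
```

Dense vegetation reflects strongly in the near infrared and weakly in red, so both indices rise toward 1 over vegetated pixels; the L term in SAVI damps the influence of bare soil where cover is sparse.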


“…A number of academic (Ricardo [13], RHIPE [17], RABID [19]) and commercial (RHadoop [5], BigR [33]) projects have looked at integrating R with Apache Hadoop. SparkR follows a similar approach but inherits the functionality [23] and performance [3] benefits of using Spark as the execution engine.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, data analysis using R is limited by the amount of memory available on a single machine and further as R is single threaded it is often impractical to use R on large datasets. Prior research has addressed some of these limitations through better I/O support [35], integration with Hadoop [13,19] and by designing distributed R runtimes [28] that can be integrated with DBMS engines [25].…”

Section: Introduction (mentioning)

confidence: 99%

SparkR

Venkataraman, Yang, Liu, et al. 2016. Proceedings of the 2016 International Conference on Management of Data


R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.
