Parallel and distributed approach for processing large-scale XML datasets

Fadika, Zacharia; Head, Michael R.; Govindaraju, Madhusudhan

doi:10.1109/grid.2009.5353070

Cited by 18 publications

(17 citation statements)

References 17 publications

(15 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In fact, LEMO-MR and Twister have similar performance, even as Twister shows slightly faster runs than LEMO-MR. Hadoop, here proves to be slower in this CPU-intensive scenario. Hadoop uses a number of overheadprone operations such as data chunk replication, constant worker node pings along with speculative and redundant jobs, mainly for fault-tolerance reasons [20]. These measures can however impede performance in any application setting [3] However, in [4] this situation is shown to be further aggravated in CPU-intensive scenarios where the pre-dominance of CPU operations and the long processing nature of tasks make duplicating tasks and speculative jobs launched by Hadoop more costly than in data-intensive or memory-intensive cases.…”

Section: Grid and Cloud Computing Research Lab Cluster At Binghamton mentioning

confidence: 99%

Benchmarking MapReduce Implementations for Application Usage Scenarios

Fadika

Dede

Govindaraju

et al. 2011

2011 IEEE/ACM 12th International Conference on Grid Computing

Self Cite

View full text Add to dashboard Cite

Abstract-The MapReduce paradigm provides a scalable model for large scale data-intensive computing and associated fault-tolerance. With data production increasing daily due to ever growing application needs, scientific endeavors, and consumption, the MapReduce model and its implementations need to be further evaluated, improved, and strengthened. Several MapReduce frameworks with various degrees of conformance to the key tenets of the model are available today, each, optimized for specific features. HPC application and middleware developers must thus understand the complex dependencies between MapReduce features and their application. We present a standard benchmark suite for quantifying, comparing, and contrasting the performance of MapReduce platforms under a wide range of representative use cases. We report the performance of three different MapReduce implementations on the benchmarks, and draw conclusions about their current performance characteristics. The three platforms we chose for evaluation are the widely used Apache Hadoop implementation, Twister, which has been discussed in the literature, and LEMO-MR, our own implementation. The performance analysis we perform also throws light on the available design decisions for future implementations, and allows Grid researchers to choose the MapReduce implementation that best suits their application's needs.

show abstract

Section: Grid and Cloud Computing Research Lab Cluster At Binghamton mentioning

confidence: 99%

Benchmarking MapReduce Implementations for Application Usage Scenarios

Fadika

Dede

Govindaraju

et al. 2011

2011 IEEE/ACM 12th International Conference on Grid Computing

Self Cite

View full text Add to dashboard Cite

show abstract

“…The ability to provide more nodes, and thus more power to applications while they are running, has thusfar not been explored in the MapReduce context. In a corollary condition, the inability to predict before making a job live, what the optimal cluster configuration should be for that particular job can create a condition where extraneous nodes hinder the performance produced by the optimal number of nodes required for a job [14]. A user in such straights would be unable to remove the extra nodes, and let the cluster run efficiently at its best, but would instead have to suffer the loss of performance, and wait to correct the condition in subsequent runs.…”

Section: Motivations For a Dynamically Elastic Mapreduce Platformmentioning

confidence: 99%

“…We not only test the impact of early and late node additions, but also the impact of diverse cluster sizes on DELMA, as well as that of progressive node addition. Our prior work with MapReduce applications [14] has shown that for application turn-around time to be efficient, the overhead introduced by additional processing units must dwarf the time gained by the work produced by the participating nodes. In a traditional MapReduce context, this condition is encountered when input processing sizes are small, or when the processing per input element within the input itself is insignificant.…”

Section: Distributed Large-scale Data Processingmentioning

confidence: 99%

“…In a traditional MapReduce context, this condition is encountered when input processing sizes are small, or when the processing per input element within the input itself is insignificant. In [14], we showed that a cluster of 5 Hadoop nodes could be outstaged by a single system for input sizes below a given threshold of data. In brief, a MapReduce job would yield poor performance if the time used to provide the nodes the input is significantly greater than the time spent by the collection of nodes processing that input.…”

Section: Distributed Large-scale Data Processingmentioning

confidence: 99%

See 1 more Smart Citation

DELMA: Dynamically ELastic MapReduce Framework for CPU-Intensive Applications

Fadika

Govindaraju

2011

2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Self Cite

View full text Add to dashboard Cite

Abstract-Since its introduction, MapReduce implementations have been primarily focused towards static compute cluster sizes. In this paper, we introduce the concept of dynamic elasticity to MapReduce. We present the design decisions and implementation tradeoffs for DELMA, (Dynamically ELastic MApReduce), a framework that follows the MapReduce paradigm, just like Hadoop MapReduce, but that is capable of growing and shrinking its cluster size, as jobs are underway. In our study, we test DELMA in diverse performance scenarios, ranging from diverse node additions to node additions at various points in the application run-time with various dataset sizes. The applicability of the MapReduce paradigm extends far beyond its use with large-scale data intensive applications, and can also be brought to bear in processing long running distributed applications executing on small-sized clusters. In this work, we focus both on the performance of processing hierarchical data in distributed scientific applications, as well as the processing of smaller but demanding input sizes primarily used in small clusters. We run experiments for datasets that require CPU intensive processing, ranging in size from millions of input data elements to process, up to over half a billion elements, and observe the positive scalability patterns exhibited by the system. We show that for such sizes, performance increases with data and cluster size increases. We conclude on the benefits of providing MapReduce with the capability of dynamically growing and shrinking its cluster configuration by adding and removing nodes during jobs, and explain the possibilities presented by this model.

show abstract

“…They believed that a parallel XML processing model should be a cost-effective solution for the XML performance issue in the multicore era. [24] adapted the Hadoop implementation to determine the threshold data sizes and computation work required per node for a distributed solution to be effective. They also presented an analysis of parallelism using PIXIMAL toolkit for processing large-scale XML datasets that utilizes the capabilities for parallelism that were available in the emerging multi-core architectures.…”

Section: Abdul Nizar M and P Sreenivasa Kumar(2009)[1]mentioning

confidence: 99%

XML Parsing on Multicore Processors and Data Representation in . NET Tree Control

Kaur¹,

Sohal²

2014

IJCA

View full text Add to dashboard Cite

The purpose of this research is to optimize the parsing process of the XML files. There are several ways to parse the XML files. But to comply with the advanced multicore CPUs and their fast performance the XML parsing logics need to be refined and optimized with parallel processing approach. The parallel XML parsing is a step towards this approach. It makes the reading of XML data faster because parser runs on more engines to extract the data. There are several advantages of parallel XML parsing like fast execution, high throughput, time saving, proper CPU utilization and load balancing. To perform the parsing processes simultaneously, the XML files need to be split in small uniform portions. Now it will execute the parsing logic on multiple threads on each CPU's core to parse the each portion of XML file without interfering with each others. In other words, an each segment will be an input to the parser running on different threads on different CPU cores. To enhance the system performance the multicore processors based devices have been introduced. Such system's processing is much faster than conventional sequential processing systems especially when it does repetitive calculations on vast amounts of data. This technique becomes more important when a candidate system or development application is model based application which operates on the XML files. This approach plays a significant role to enhance the application's capability to process large amount of data, improve application performances by providing quick results and eventually expeditious the application processing and dependent operations.

show abstract

Parallel and distributed approach for processing large-scale XML datasets

Cited by 18 publications

References 17 publications

Benchmarking MapReduce Implementations for Application Usage Scenarios

Benchmarking MapReduce Implementations for Application Usage Scenarios

DELMA: Dynamically ELastic MapReduce Framework for CPU-Intensive Applications

XML Parsing on Multicore Processors and Data Representation in . NET Tree Control

Contact Info

Product

Resources

About