2009 10th IEEE/ACM International Conference on Grid Computing
DOI: 10.1109/grid.2009.5353070

Parallel and distributed approach for processing large-scale XML datasets

Abstract: An emerging trend is the use of XML as the data format for many distributed scientific applications, with the size of these documents ranging from tens of megabytes to hundreds of megabytes. Our earlier benchmarking results revealed that most of the widely available XML processing toolkits do not scale well for large-sized XML data. A significant transformation is necessary in the design of XML processing for scientific applications so that the overall application turn-around time is not negatively affected…

Cited by 18 publications (17 citation statements)
References 17 publications (15 reference statements)

“…In fact, LEMO-MR and Twister have similar performance, with Twister showing slightly faster runs than LEMO-MR; Hadoop proves to be slower in this CPU-intensive scenario. Hadoop uses a number of overhead-prone operations, such as data chunk replication, constant worker-node pings, and speculative and redundant jobs, mainly for fault-tolerance reasons [20]. These measures can, however, impede performance in any application setting [3]. In [4], this situation is shown to be further aggravated in CPU-intensive scenarios, where the predominance of CPU operations and the long-running nature of tasks make the duplicate tasks and speculative jobs launched by Hadoop more costly than in data-intensive or memory-intensive cases.…”
Section: Grid and Cloud Computing Research Lab Cluster at Binghamton
confidence: 99%
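The overheads singled out in the excerpt above (chunk replication, speculative and redundant tasks) correspond to standard Hadoop job-configuration knobs. The following is a minimal sketch only, not anything prescribed by the cited papers: it assumes a Hadoop 1.x-era deployment (hence the old-style property names), a ToolRunner-based driver that honours -D generic options, and hypothetical jar, class, and HDFS path names.

```python
# Sketch: build the command for a CPU-bound MapReduce job with Hadoop's
# overhead-prone features dialled down. Property names are the Hadoop 1.x-era
# ones; the jar, driver class, and HDFS paths are hypothetical placeholders.

def cpu_bound_job_command(jar, driver_class, input_path, output_path):
    tuning = {
        "mapred.map.tasks.speculative.execution": "false",     # no duplicate map attempts
        "mapred.reduce.tasks.speculative.execution": "false",  # no duplicate reduce attempts
        "dfs.replication": "1",                                # fewer chunk replicas to write
    }
    cmd = ["hadoop", "jar", jar, driver_class]
    for key, value in tuning.items():
        cmd += ["-D", f"{key}={value}"]  # generic options, parsed by ToolRunner-based drivers
    return cmd + [input_path, output_path]

cmd = cpu_bound_job_command("xmljob.jar", "org.example.XmlDriver",
                            "/data/xml-input", "/data/xml-output")
print(" ".join(cmd))  # pass cmd to subprocess.run() on a machine with Hadoop installed
```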
“…The ability to provide more nodes, and thus more power, to applications while they are running has thus far not been explored in the MapReduce context. A related problem is the inability to predict, before a job goes live, what the optimal cluster configuration for that particular job should be; extraneous nodes can then hinder the performance that the optimal number of nodes would deliver [14]. A user in such straits would be unable to remove the extra nodes and let the cluster run at its best; instead, the user would have to accept the loss of performance and wait to correct the condition in subsequent runs.…”
Section: Motivations for a Dynamically Elastic MapReduce Platform
confidence: 99%
“…We test not only the impact of early and late node additions but also the impact of diverse cluster sizes on DELMA, as well as that of progressive node addition. Our prior work with MapReduce applications [14] has shown that, for application turn-around time to be efficient, the time gained from the work produced by the participating nodes must dwarf the overhead introduced by the additional processing units. In a traditional MapReduce context, this condition is encountered when input sizes are small, or when the processing required per input element is insignificant.…”
Section: Distributed Large-scale Data Processing
confidence: 99%
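The break-even condition described in the excerpt above can be made concrete with a simple cost model. The sketch below is purely illustrative; the per-record cost and per-node overhead are assumed parameters, not measurements from the cited work.

```python
# Minimal break-even model for adding worker nodes to a data-parallel job.
# Assumed parameters (illustrative, not taken from the cited papers):
#   records         - number of input elements to process
#   cost_per_record - seconds of work per element
#   node_overhead   - fixed seconds of startup/coordination cost per node

def turnaround(records, cost_per_record, node_overhead, nodes):
    """Estimated job time: parallel work plus per-node fixed overhead."""
    work = records * cost_per_record
    return work / nodes + node_overhead * nodes

def best_cluster_size(records, cost_per_record, node_overhead, max_nodes=64):
    """Node count that minimises the modelled turnaround time."""
    times = {n: turnaround(records, cost_per_record, node_overhead, n)
             for n in range(1, max_nodes + 1)}
    return min(times, key=times.get)

# Small input: per-node overhead dominates, so few nodes win.
print(best_cluster_size(records=10_000, cost_per_record=0.001, node_overhead=5))
# Large input: work dominates, so many more nodes pay off.
print(best_cluster_size(records=10_000_000, cost_per_record=0.001, node_overhead=5))
```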
“…They believed that a parallel XML processing model should be a cost-effective solution to the XML performance issue in the multi-core era. The work in [24] adapted the Hadoop implementation to determine the threshold data sizes and the computation required per node for a distributed solution to be effective. The authors also presented an analysis of parallelism for processing large-scale XML datasets using the PIXIMAL toolkit, which exploits the parallelism available in emerging multi-core architectures.…”
Section: Abdul Nizar M and P Sreenivasa Kumar (2009) [1]
confidence: 99%
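As an illustration of the kind of data-parallel XML processing the paper targets, the following sketch (a stand-in, not the paper's or PIXIMAL's actual implementation) fans a set of large XML files out across worker processes, each streaming its file with a SAX parser so that no document is held fully in memory; the file names are hypothetical.

```python
# Sketch: data-parallel processing of a set of large XML files.
# Illustrative stand-in only, not the PIXIMAL or Hadoop-based implementation.
import multiprocessing
import xml.sax

class ElementCounter(xml.sax.ContentHandler):
    """Streams one document and counts its elements (stand-in for real per-record work)."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1

def process_file(path):
    handler = ElementCounter()
    xml.sax.parse(path, handler)  # streaming parse, roughly constant memory
    return path, handler.count

def process_dataset(paths, workers=4):
    # Fan the files out across worker processes; each parses its file independently.
    with multiprocessing.Pool(processes=workers) as pool:
        return dict(pool.map(process_file, paths))

if __name__ == "__main__":
    # Hypothetical input paths; replace with the real dataset.
    results = process_dataset(["part-0001.xml", "part-0002.xml", "part-0003.xml"])
    for path, count in results.items():
        print(f"{path}: {count} elements")
```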