Applying Twister to Scientific Applications

Zhang, Bingjing; Ruan, Yang; Wu, Tak-Lon; Qiu, Judy; Hughes, Adam L.; Fox, Geoffrey

doi:10.1109/cloudcom.2010.37

Cited by 27 publications

(11 citation statements)

References 20 publications

“…In the Reduce phase of MapReduce, we used Twister (see Table 1) [72,74,15]. In Twister, all communication avoids using intermediate disk and is built around ActiveMQ (see Table 1) in Java Twister and around Azure primitives in the Microsoft cloud.…”

Section: Methods (mentioning)

confidence: 99%

Visualizing the protein sequence universe

Stanberry, Higdon, Haynes, et al. 2012

Proceedings of the 3rd International Workshop on Emerging Computational Methods for the Life Sciences

Self Cite

Modern biology is experiencing a rapid increase in data volumes that challenges our analytical skills and existing cyberinfrastructure. Exponential expansion of the Protein Sequence Universe (PSU), the protein sequence space, together with the costs and complexities of manual curation creates a major bottleneck in life sciences research. Existing resources lack scalable visualization tools that are instrumental for functional annotation. Here, we describe a multidimensional scaling (MDS) implementation to create a 3D embedding of the PSU that allows visualizing the relationships between large numbers of proteins. To demonstrate the method, we use sequence similarity scores as a measure of proximity. An example of the prokaryotic PSU shows that the low-dimensional representation preserves important grouping features such as relative proximity of functionally similar clusters and clear structural separation between clusters with specific and general functions. The advantages of the method and its implementation include the ability to scale to large numbers of sequences, integrate different similarity measures with other functional and experimental data, and facilitate protein annotation. Transdisciplinary approaches akin to the one described in this paper are urgently needed to quickly and efficiently translate the influx of new data into tangible innovations and groundbreaking discoveries.

“…In future work, we will improve the Kmeans algorithm [8][9] [42] and apply the Map-Collective framework to other iterative applications [43] including Multi-Dimensional Scaling where the allgather primitive is needed. We will also extend current work to include an allreduce collective that is an alternative approach to Kmeans.…”

Section: Discussion (mentioning)

confidence: 99%

High performance clustering of social images in a map-collective programming model

Zhang, Qiu 2013

Proceedings of the 4th Annual Symposium on Cloud Computing

Self Cite

Large-scale iterative computations are common in many important data mining and machine learning algorithms needed in analytics and deep learning. In most of these applications, individual iterations can be specified as MapReduce computations, leading to the Iterative MapReduce programming model for efficient execution of data-intensive iterative computations interoperably between HPC and cloud environments. Further one needs additional communication patterns from those familiar in MapReduce and we base our initial architecture on collectives that integrate capabilities developed by the MPI and MapReduce communities. This leads us to the MapCollective programming model which here we develop based on requirements of a range of applications by extending our existing Iterative MapReduce environment Twister. This paper studies the implications of large scale Social Image clustering where large scale problems study 10-100 million images represented as points in a high dimensional (up to 2048) vector space which need to be divided into up to 1-10 million clusters. This Kmeans application needs 5 stages in each iteration: Broadcast, Map, Shuffle, Reduce and Combine, and this paper focuses on collective communication stages where large data transfers demand performance optimization. By comparing and combining ideas from MapReduce and MPI communities, we show that a topology-aware and pipeline-based broadcasting method gives better performance than other MPI and (Iterative) MapReduce systems.
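
The five stages named in the abstract above (Broadcast, Map, Shuffle, Reduce, Combine) can be illustrated with a minimal single-process sketch of one Kmeans iteration. This is not Twister's API or the authors' implementation: the function names (`map_partition`, `shuffle`, `reduce_centroid`, `kmeans_iteration`) are hypothetical, the partitions are plain in-memory lists, and the "broadcast" is simply passing the same centroid list to every map call.

```python
from collections import defaultdict

def map_partition(points, centroids):
    """Map: assign each local point to its nearest centroid and emit
    a partial (coordinate sum, count) per centroid index."""
    partial = {}
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        total, count = partial.get(nearest, ([0.0] * len(p), 0))
        partial[nearest] = ([a + b for a, b in zip(total, p)], count + 1)
    return partial

def shuffle(partials):
    """Shuffle: group the partial results from all map tasks by centroid index."""
    grouped = defaultdict(list)
    for partial in partials:
        for index, value in partial.items():
            grouped[index].append(value)
    return grouped

def reduce_centroid(values):
    """Reduce: merge the partial sums and counts into one new centroid."""
    total = [0.0] * len(values[0][0])
    count = 0
    for partial_sum, partial_count in values:
        total = [a + b for a, b in zip(total, partial_sum)]
        count += partial_count
    return [t / count for t in total]

def kmeans_iteration(partitions, centroids):
    # Broadcast (assumed): every map task sees the same centroid list.
    partials = [map_partition(part, centroids) for part in partitions]
    grouped = shuffle(partials)
    reduced = {index: reduce_centroid(vals) for index, vals in grouped.items()}
    # Combine: gather the reduce outputs into one centroid list for the next
    # iteration; a centroid with no assigned points keeps its old position.
    return [reduced.get(i, list(centroids[i])) for i in range(len(centroids))]
```

In a distributed setting the broadcast and shuffle steps are where the large data transfers occur, which is why the abstract singles out collective communication for optimization; here they cost nothing because everything lives in one process.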

“…Professor Fox developed an iterative MapReduce architecture software Twister. The manner of Twister MapReduce is "configure once, and run many time" [13,14]. In this paper, a parallel feature selection method based on MapReduce model is proposed.…”

Section: Introduction (mentioning)

confidence: 99%

Parallel Feature Selection Based on MapReduce

Sun 2013

Lecture Notes in Electrical Engineering

Feature selection is an important research topic in machine learning and pattern recognition. It is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. However, in recent years, data has become increasingly larger in both number of instances and number of features in many applications. Classical feature selection method is out of work in processing large-scale dataset because of expensive computational cost. For improving computational speed, parallel feature selection is taken as the efficient method. MapReduce is an efficient distributional computing model to process large-scale data mining problems. In this paper, a parallel feature selection method based on MapReduce model is proposed. Large-scale dataset is partitioned into sub-datasets. Feature selection is operated on each computational node. Selected feature variables are combined into one feature vector in Reduce job. The parallel feature selection method is scalable. The efficiency of the method is illustrated through example analysis.
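
The partition, select, and combine flow described in the abstract above can be sketched in a few lines. The abstract does not say which selection criterion or combination rule the paper uses, so both are assumptions here: each sub-dataset ranks features by variance and keeps the top `k`, and the reduce step takes the union of the selected indices. All function names are hypothetical.

```python
def select_features(rows, k):
    """Map: rank features of one sub-dataset by variance (an assumed
    criterion) and return the indices of the top k."""
    n = len(rows)
    dims = len(rows[0])
    scores = []
    for j in range(dims):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        scores.append(sum((x - mean) ** 2 for x in column) / n)
    ranked = sorted(range(dims), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

def combine_selected(selections):
    """Reduce: merge per-partition selections into one sorted feature
    index list (union is an assumed combination rule)."""
    return sorted(set().union(*selections))

def parallel_feature_selection(dataset, num_partitions, k):
    # Partition the dataset into sub-datasets by striding over the rows.
    partitions = [dataset[i::num_partitions] for i in range(num_partitions)]
    # Each non-empty sub-dataset would run on its own node; here they run in a loop.
    selections = [select_features(part, k) for part in partitions if part]
    return combine_selected(selections)
```

Because each map call touches only its own rows, the per-node work shrinks as partitions are added, which is the scalability property the abstract claims.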
