High performance clustering of social images in a map-collective programming model

Zhang, Bingjing; Qiu, Judy

doi:10.1145/2523616.2525952

Cited by 4 publications

(3 citation statements)

References 16 publications

“…Previous work shows many machine learning algorithms can be implemented in the MapReduce paradigm [7]; later on, model communication is improved by collective communication operations in iterative MapReduce [12,19,6]. How-…”

Section: Discussion (mentioning)

confidence: 99%

Model-centric computation abstractions in machine learning applications

Zhang

Peng

Qiu

2016

Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond

Self Cite

We categorize parallel machine learning applications into four types of computation models and propose a new set of model-centric computation abstractions. This work sets up parallel machine learning as a combination of training data-centric and model parameter-centric processing. The analysis uses Latent Dirichlet Allocation (LDA) as an example, and experimental results show that an efficient parallel model update pipeline can achieve similar or higher model convergence speed compared with other work.

“…We categorize these into three [32]; for (B) an MPI-based K-Means implementation [51]. We examine the following hybrid approaches: (C.1) Python Scripting implementation using Pilots [8] (Pilot-KMeans), (C.2) a Spark K-Means [52] and (C.3) a HARP implementation [50]. HARP introduces an abstraction for collective operations within Hadoop jobs [50].…”

Section: High-performance Big Data Stack: a Convergence Of Paradigms? (mentioning)

confidence: 99%

“…We examine the following hybrid approaches: (C.1) Python Scripting implementation using Pilots [8] (Pilot-KMeans), (C.2) a Spark K-Means [52] and (C.3) a HARP implementation [50]. HARP introduces an abstraction for collective operations within Hadoop jobs [50]. While (C.1) provides an interoperable implementation of the MapReduce programming model for HPC environments, (C.2) and (C.3) enhance Hadoop for efficient iterative computations and introduce collective operations to Hadoop environments.…”

Section: High-performance Big Data Stack: a Convergence Of Paradigms? (mentioning)

confidence: 99%

A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures

Jha

Qiu

Luckow

et al. 2014

2014 IEEE International Congress on Big Data

Self Cite

Abstract: Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigm. We propose a basis, common terminology and functional factors upon which to analyze the two approaches of both paradigms. We discuss the concept of "Big Data Ogres" and their facets as means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementation/approaches of these paradigms, shed light upon the reasons for their current "architecture" and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms, to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering), characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide an insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions.

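The citation statements above describe K-means implementations that pair a map phase with collective operations. As a rough illustration only, the following single-process Python sketch simulates one such iteration: each "worker" computes partial centroid sums over its own data partition, and a simulated allreduce combines them so that every worker would see the same updated centroids. The function names (`local_map`, `allreduce`, `kmeans_step`) are hypothetical and are not taken from HARP, Spark, MPI, or the cited papers; a real system would run the workers in parallel and exchange the partials over the network.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Tuple[List[List[float]], List[int]]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Return the index of the centroid closest to `point` (squared Euclidean)."""
    best, best_dist = 0, float("inf")
    for i, c in enumerate(centroids):
        dist = sum((p - q) ** 2 for p, q in zip(point, c))
        if dist < best_dist:
            best, best_dist = i, dist
    return best


def local_map(partition: Sequence[Point], centroids: Sequence[Point]) -> Partial:
    """Map step on one (simulated) worker: per-centroid coordinate sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for pt in partition:
        k = nearest(pt, centroids)
        counts[k] += 1
        for j, value in enumerate(pt):
            sums[k][j] += value
    return sums, counts


def allreduce(partials: Sequence[Partial]) -> Partial:
    """Simulated allreduce: element-wise sum of every worker's partial result.

    In a distributed run each worker would receive this same combined result.
    """
    n_clusters = len(partials[0][1])
    dim = len(partials[0][0][0])
    total_sums = [[0.0] * dim for _ in range(n_clusters)]
    total_counts = [0] * n_clusters
    for sums, counts in partials:
        for k in range(n_clusters):
            total_counts[k] += counts[k]
            for j in range(dim):
                total_sums[k][j] += sums[k][j]
    return total_sums, total_counts


def kmeans_step(partitions: Sequence[Sequence[Point]],
                centroids: Sequence[Point]) -> List[Point]:
    """One K-means iteration: local map on each partition, then allreduce."""
    partials = [local_map(part, centroids) for part in partitions]
    sums, counts = allreduce(partials)
    new_centroids: List[Point] = []
    for k, old in enumerate(centroids):
        if counts[k] == 0:
            new_centroids.append(tuple(old))  # empty cluster: keep old centroid
        else:
            new_centroids.append(tuple(s / counts[k] for s in sums[k]))
    return new_centroids
```

Because the allreduce only sums partial results, the updated centroids do not depend on how the points are split across partitions, which is what lets the map phase run independently on each worker.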