Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2012
DOI: 10.1145/2145816.2145819

A performance analysis framework for identifying potential benefits in GPGPU applications

Abstract: Tuning code for GPGPU and other emerging many-core platforms is a challenge because few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this paper, we present a performance analysis framework that can help shed light on such bottlenecks for GPGPU applications. Although a handful of GPGPU profiling tools exist, most of the traditional tools, unfortunately, simply provide programmers with a variety of measurements and metrics obtained by running applications, and it is often …

Citations: Cited by 154 publications (72 citation statements)
References: 19 publications
“…In some recent papers [34,35], the influence of implementation factors such as processor occupancy, thread synchronization, and the organization of memory accesses on execution time is analyzed.…”
Section: Introduction (confidence: 99%)
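The implementation factors named in this statement (occupancy, synchronization, memory-access organization) are the usual inputs to analytical GPU execution-time models. As a rough illustration of the occupancy factor alone, the following is a minimal host-side sketch of standard theoretical-occupancy arithmetic; the hardware limits and kernel parameters are assumed example values, not figures from the cited papers.

```cuda
// Minimal sketch of theoretical occupancy arithmetic.
// All limits and kernel parameters below are assumed example values.
#include <algorithm>
#include <cstdio>

int main()
{
    // Assumed per-SM hardware limits (illustrative only).
    const int max_warps_per_sm = 64;
    const int max_regs_per_sm  = 65536;
    const int max_smem_per_sm  = 98304;   // bytes

    // Assumed kernel resource usage.
    const int block_size      = 256;      // threads per block
    const int regs_per_thread = 40;
    const int smem_per_block  = 16384;    // bytes

    const int warps_per_block = block_size / 32;

    // Blocks per SM permitted by each resource.
    const int by_warps = max_warps_per_sm / warps_per_block;
    const int by_regs  = max_regs_per_sm  / (regs_per_thread * block_size);
    const int by_smem  = max_smem_per_sm  / smem_per_block;

    const int blocks_per_sm = std::min({by_warps, by_regs, by_smem});
    const double occupancy  = double(blocks_per_sm * warps_per_block) / max_warps_per_sm;

    printf("blocks/SM = %d, theoretical occupancy = %.2f\n", blocks_per_sm, occupancy);
    return 0;
}
```

Whichever resource yields the smallest block count is the occupancy limiter; reducing that kernel's register or shared-memory usage is the corresponding tuning lever.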
“…MARTE focuses only on real-time embedded systems. [7] helps shed light on bottlenecks of GPGPU applications and supports programmers with run-time measurements and metrics; it assumes that a memory instruction is always followed by consecutive dependent instructions, hence MLP is always one.…”
Section: Results (confidence: 99%)
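The MLP remark is easiest to see in code. Below is a hedged sketch of my own (kernel names and data layout are illustrative, not taken from [7] or the citing paper): when each load depends on the previous one, a thread can expose only one outstanding memory request (MLP = 1), whereas independent loads allow their latencies to overlap.

```cuda
// Illustrative kernels contrasting MLP = 1 with MLP = 2.
// Kernel names and data layout are assumptions for this sketch.

// MLP = 1: the second load's address depends on the first load's result,
// so a thread's two memory requests cannot overlap.
__global__ void mlp_one(float* out, const float* in, const int* idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j  = idx[i];   // load 1
        out[i] = in[j];    // load 2 depends on load 1
    }
}

// MLP = 2: the two loads are independent, so the hardware can issue both
// and overlap their latencies.
__global__ void mlp_two(float* out, const float* a, const float* b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];    // load 1
        float y = b[i];    // load 2, independent of load 1
        out[i]  = x + y;
    }
}
```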
“…This approach makes it possible to obtain performance measures such as throughput and response time throughout the software life-cycle. Moreover, Sim, Jaewoong, et al. [7] proposed a framework for performance analysis that helps shed light on bottlenecks of GPGPU applications. In addition, this framework complements GPGPU profiling tools and supports programmers with run-time measurements and metrics.…”
Section: Primary Studies (confidence: 99%)
“…Seventeen candidate features were assembled from a previous study of performance counters [34] and computed theoretical values [35]. For each candidate feature they compute its coarsening delta, reflecting the change in the feature value caused by coarsening, f_Δ = (f_after − f_before) / f_before, and add it to the feature set.…”
Section: Case Study B: OpenCL Thread Coarsening Factor (confidence: 99%)
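As a concrete reading of the coarsening-delta formula, here is a small host-side sketch that computes f_Δ for a handful of hypothetical feature values; the feature names and numbers are invented for illustration and are not taken from [34,35].

```cuda
// Sketch of the coarsening-delta computation f_delta = (f_after - f_before) / f_before.
// Feature names and values are hypothetical.
#include <cstdio>

int main()
{
    const char* names[] = { "branches", "divergent_loads", "instructions" };
    double before[]     = { 120.0, 30.0, 4000.0 };   // feature values before coarsening
    double after[]      = { 110.0, 15.0, 2600.0 };   // feature values after coarsening

    for (int k = 0; k < 3; ++k) {
        double f_delta = (after[k] - before[k]) / before[k];
        printf("%-16s f_delta = %+.3f\n", names[k], f_delta);
    }
    return 0;
}
```

A negative delta indicates that coarsening reduced the feature value (for example, fewer divergent loads), which is the kind of change these delta features are meant to capture.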