We present the Glasgow Parallel Reduction Machine (GPRM), a novel, flexible framework for parallel many-core programming based on task composition. The programmer structures programs into task code, written as C++ classes, and communication code, written in a restricted subset of C++ with functional semantics and parallel evaluation. In this paper we discuss the GPRM, the virtual machine framework that enables the parallel task-composition approach. We focus the discussion on GPIR, the functional language used as the intermediate representation of the bytecode running on the GPRM, and use examples in this language to show the flexibility and power of our task-composition framework. We demonstrate the potential using an implementation of a merge sort algorithm on a 64-core Tilera processor, as well as on a conventional Intel quad-core processor and a 48-core AMD processor system. We also compare our framework with OpenMP tasks on a parallel pointer-chasing algorithm running on the Tilera processor. Our results show that the GPRM programs outperform the corresponding OpenMP codes on all test platforms, and that GPRM can greatly facilitate the writing of parallel programs, in particular non-data-parallel algorithms such as reductions.
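The abstract does not show GPIR itself, but the reduction-tree shape of a task-composed merge sort can be sketched in plain C++. This is a hedged illustration only: `std::async` stands in for GPRM task spawning (the paper's actual tasks are GPIR-composed C++ classes), and the depth cutoff is a hypothetical tuning choice, not taken from the paper.

```cpp
#include <algorithm>
#include <future>
#include <iterator>
#include <vector>

// Task-parallel merge sort sketch. Each half of the array is an
// independent task whose results are merged -- the reduction-tree
// pattern the paper targets. std::async is a stand-in for GPRM
// task spawning; nothing here is GPIR syntax.
std::vector<int> merge_sort(std::vector<int> v, int depth) {
    if (v.size() < 2) return v;
    std::size_t mid = v.size() / 2;
    std::vector<int> left(v.begin(), v.begin() + mid);
    std::vector<int> right(v.begin() + mid, v.end());
    // Spawn the left half as a parallel task near the top of the
    // tree; deeper levels run deferred to bound the task count
    // (the cutoff of 3 is an illustrative choice).
    auto policy = depth < 3 ? std::launch::async : std::launch::deferred;
    auto lf = std::async(policy, merge_sort, left, depth + 1);
    right = merge_sort(std::move(right), depth + 1);
    left = lf.get();
    std::vector<int> out;
    std::merge(left.begin(), left.end(), right.begin(), right.end(),
               std::back_inserter(out));
    return out;
}
```

Because merge sort is a reduction rather than a data-parallel loop, the task tree above is exactly the kind of structure the paper argues is awkward to express with loop-based models.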
Abstract. Processors with large numbers of cores are becoming commonplace. In order to take advantage of the available resources in these systems, the programming paradigm has to move towards increased parallelism. However, increasing the level of concurrency in the program does not necessarily lead to better performance. Parallel programming models have to provide flexible ways of defining parallel tasks and, at the same time, efficient management of the created tasks. OpenMP is a widely accepted programming model for shared-memory architectures. In this paper we highlight some of the drawbacks of the OpenMP tasking approach, and propose an alternative model based on the Glasgow Parallel Reduction Machine (GPRM) programming framework. As the main focus of this study, we deploy our model to solve a fundamental linear algebra problem, LU factorisation of sparse matrices. We have used the SparseLU benchmark from the BOTS benchmark suite, and compared the results obtained from our model to those of the OpenMP tasking approach. The TILEPro64 system has been used to run the experiments. The results are very promising, not only because of the performance improvement for this particular problem, but also because they verify the task management efficiency, stability, and flexibility of our model, which can be applied to solve problems in future many-core systems.
Abstract. Systems with large numbers of cores have become commonplace. Accordingly, applications are shifting towards increased parallelism. In a general-purpose system, applications residing in the system compete for shared resources. Thread and task scheduling in such a multithreaded multiprogramming environment is a significant challenge. In this study, we have chosen the Intel Xeon Phi system as a modern platform to explore how popular parallel programming models, namely OpenMP, Intel Cilk Plus and Intel TBB (Threading Building Blocks) scale on manycore architectures. We have used three benchmarks with different features which exercise different aspects of the system performance. Moreover, a multiprogramming scenario is used to compare the behaviours of these models when all three applications reside in the system. Our initial results show that it is to some extent possible to infer multiprogramming performance from single-program cases.
SUMMARY. Intel's Xeon Phi is a highly parallel x86 architecture chip. It has a number of novel features which make it a particularly challenging target for the compiler writer. This paper describes the techniques used to port the Glasgow Vector Pascal Compiler (VPC) to this architecture and assesses its performance by comparing the Xeon Phi with three other machines running the same algorithms. CONTEXT. This work was done as part of the EU-funded CLOPEMA project, whose aim is to develop a cloth-folding robot using real-time stereo vision. At the start of the project we used a legacy Java software package, C3D [1], capable of performing the necessary ranging calculations. When processing the robot's modern high-resolution images it was prohibitively slow for real-time applications, taking about 20 minutes to process a single pair of images. To improve performance, a new Parallel Pyramid Matcher (PPM) was written in Vector Pascal [2] † , using the legacy software as a design basis. The new PPM exploits both SIMD and multi-core parallelism [3], and performs about 20 times faster than the legacy software on commodity PC chips such as the Intel Sandy Bridge. With the forthcoming release of the Xeon Phi, we anticipated further acceleration by running the same PPM code on it, taking advantage of more cores and wider SIMD registers while relying on the automatic parallelisation feature of the language. The key step was to modify the compiler to produce Xeon Phi code. However, the Xeon Phi turned out to be considerably more complex than previous Intel platforms; porting the Glasgow Vector Pascal compiler became an entirely new challenge and required a different porting approach than previous architectures.
PREVIOUS RELATED WORK. Vector Pascal [4,2] is an array language and as such shares features with other array languages such as APL [5], ZPL [6,7,8] and Single Assignment C [11,12]. The original APL and its descendant J were interpretive languages in which each application of a function to array arguments produced an array result. Whilst it is possible to build a compiler naively on the same approach, it is inefficient, as it leads to the formation of an unnecessary number of array temporaries. This reduces locality of reference and thus cache performance. The key innovation in efficient array-language compiler development was Budd's [13] principle of creating a single loop nest for each array assignment and holding temporaries as scalar results. This principle was subsequently rediscovered by other implementers of data-parallel languages or sub-languages [14]. It has been used in the Saarbrucken [15] ... Note that the # notation is not supported. Instead, index sets are usually elided, provided that the corresponding positions in the arrays are intended. If offsets are intended, the index sets can be explicitly referred to using the predeclared array of index sets iota: iota[0] ...
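Budd's principle can be made concrete with a small C++ sketch. This is an illustrative analogue, not compiler output: the naive functions mimic interpreter-style evaluation of the array expression `d := (a + b) * 2`, where every operator materialises a full array temporary, while the fused version compiles the whole assignment into one loop with a scalar temporary.

```cpp
#include <vector>

// Interpreter-style evaluation: each operator allocates and fills a
// whole array temporary, hurting locality of reference.
std::vector<double> add(const std::vector<double>& a,
                        const std::vector<double>& b) {
    std::vector<double> t(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) t[i] = a[i] + b[i];
    return t;
}
std::vector<double> scale(const std::vector<double>& a, double s) {
    std::vector<double> t(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) t[i] = a[i] * s;
    return t;
}
// d := (a + b) * 2, naively: two loops and one array temporary.
std::vector<double> naive(const std::vector<double>& a,
                          const std::vector<double>& b) {
    return scale(add(a, b), 2.0);
}
// Budd's principle: one loop nest per array assignment; the
// intermediate a[i] + b[i] lives in a scalar, not an array.
std::vector<double> fused(const std::vector<double>& a,
                          const std::vector<double>& b) {
    std::vector<double> d(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        double t = a[i] + b[i];  // scalar temporary
        d[i] = t * 2.0;
    }
    return d;
}
```

Both forms compute the same result; the fused form touches each input element once and writes each output element once, which is why a compiler following this principle preserves cache performance where an APL-style interpreter does not.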