Hervé Paulino scite author profile

Alexandre

et al. 2013

perpetual and without geographical limits, to archive and publish this dissertation through printed copies reproduced on paper or in digital form, or by any other means known or yet to be invented, and to disseminate it through scientific repositories and to allow its copying and distribution for non-commercial educational or research purposes, provided that credit is given to the author and publisher.
I dedicate this thesis to all those who, in one way or another, helped make it possible. This thesis is not only mine; it is also yours.
A huge Thank You to Soraia Assis, whose friendship got me through the rough times, and whose smile warmed my own. Without you this thesis would be but a dream. I thank my parents for all the sacrifices that they endured, so that I would be fortunate enough to study, and eventually reach this mark in my life. Thank you to all my family members and friends whose names have not been mentioned so far. They are the silent heroes, whose word is spoken throughout this thesis. Thank you all!
Abstract. The Graphics Processing Unit (GPU) is gaining popularity as a co-processor to the Central Processing Unit (CPU), due to its ability to surpass the latter's performance in certain application fields. Nonetheless, harnessing the GPU's capabilities is a non-trivial exercise that requires good knowledge of parallel programming. Thus, providing ways to extract such computational power has become an emerging research topic. In this context, there have been several proposals in the field of GPGPU (General-Purpose Computation on Graphics Processing Units) development. However, most of these still offer a low-level abstraction of the GPU computing model, forcing the developer to adapt application computations in accordance with the SPMD model, as well as to orchestrate the low-level details of the execution.
On the other hand, the higher-level approaches have limitations that prevent the full exploitation of GPUs when the purpose goes beyond the simple offloading of a kernel. To this end, our proposal builds on the recent trend of applying the notion of algorithmic patterns (skeletons) to GPU computing. We propose Marrow, a high-level algorithmic skeleton framework that expands the set of skeletons currently available in this field. Marrow's skeletons orchestrate the execution of OpenCL computations and introduce optimizations that overlap communication and computation, thus conjoining programming simplicity with performance gains in many application scenarios. Additionally, these skeletons can be combined (nested) to create more complex applications.
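The idea of nesting skeletons can be illustrated with a minimal sketch. It is not Marrow's API: the class names `Kernel`, `Pipeline`, and `Loop` are hypothetical, plain Python functions stand in for OpenCL kernels, and nothing here overlaps communication with computation. It only shows how composable nodes form a tree that is run as a whole.

```python
class Kernel:
    """Leaf node: wraps a plain function standing in for an OpenCL kernel."""
    def __init__(self, fn):
        self.fn = fn

    def run(self, data):
        return self.fn(data)


class Pipeline:
    """Feeds the output of each stage into the next stage."""
    def __init__(self, *stages):
        self.stages = stages

    def run(self, data):
        for stage in self.stages:
            data = stage.run(data)
        return data


class Loop:
    """Re-runs its body for as long as the condition holds on the data."""
    def __init__(self, body, condition):
        self.body = body
        self.condition = condition

    def run(self, data):
        while self.condition(data):
            data = self.body.run(data)
        return data


# Two hypothetical kernels, nested as Loop(Pipeline(double, increment)).
double = Kernel(lambda xs: [2 * x for x in xs])
increment = Kernel(lambda xs: [x + 1 for x in xs])

tree = Loop(Pipeline(double, increment), condition=lambda xs: max(xs) < 20)
output = tree.run([1, 2, 3])
```

Because every node exposes the same `run` method, any skeleton can serve as a stage or body of another, which is what makes nesting possible in this sketch.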

Execution of compound multi‐kernel OpenCL computations in multi‐CPU/multi‐GPU environments

Soldado

Alexandre

Concurrency and Computation

2015

SUMMARY. Current computational systems are heterogeneous by nature, featuring a combination of CPUs and graphics processing units (GPUs). As the latter are becoming an established platform for high-performance computing, the focus is shifting towards the seamless programming of these hybrid systems as a whole. The distinct nature of the architectural and execution models in place raises several challenges, as the best hardware configuration is behavior and workload dependent. In this paper, we address the execution of compound, multi-kernel, Open Computing Language (OpenCL) computations in multi-CPU/multi-GPU environments. We address how these computations may be efficiently scheduled onto the target hardware, and how the system may adapt itself to changes in the workload to process and to fluctuations in the CPU's load. An experimental evaluation attests to the performance gains obtained by the conjoined use of the CPU and GPU devices, when compared with GPU-only executions, and also by the use of data-locality optimizations in CPU environments.
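One simple way to picture workload distribution across devices is a proportional split that is revised as new measurements arrive. The sketch below is an assumption made for illustration, not the scheduling strategy of the paper: the function names, the device labels, and the moving-average update are all hypothetical.

```python
def split_workload(total_items, throughput):
    """Divide total_items among devices in proportion to their throughput."""
    total = sum(throughput.values())
    shares = {dev: int(total_items * t / total) for dev, t in throughput.items()}
    # Hand out the items lost to rounding down, fastest device first.
    leftover = total_items - sum(shares.values())
    for dev in sorted(throughput, key=throughput.get, reverse=True)[:leftover]:
        shares[dev] += 1
    return shares


def rebalance(throughput, device, observed, weight=0.5):
    """Blend a new measurement into one device's estimate (moving average)."""
    updated = dict(throughput)
    updated[device] = (1 - weight) * throughput[device] + weight * observed
    return updated


# Hypothetical relative throughputs for one CPU and two GPUs.
estimates = {"cpu": 1.0, "gpu0": 3.0, "gpu1": 4.0}
shares = split_workload(1000, estimates)

# The CPU becomes loaded by other work, so its estimate is lowered
# and the next split gives it a smaller share.
estimates = rebalance(estimates, "cpu", observed=0.5)
```

The second function stands in for adaptation to fluctuations in CPU load: a lower observed throughput shrinks the CPU's share the next time the workload is split.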

A Multi-threaded Asynchronous Language

Marques

Lopes

et al. 2003

Abstract. We describe a reference implementation of a multi-threaded run-time system for a core programming language based on a process calculus. The core language features processes running in parallel and communicating through asynchronous messages as the fundamental abstractions. The programming style is fully declarative, focusing on the interaction patterns between processes. The parallelism, implicit in the syntax of the programs, is effectively extracted by the language compiler and exploited by the run-time system.
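The combination of parallel processes and asynchronous messages can be sketched with threads and an unbounded queue. This is only an analogy in Python, not the language or run-time system described above: the `producer` and `consumer` names and the `None` end-of-stream marker are assumptions of the sketch.

```python
import queue
import threading


def producer(channel, items):
    for item in items:
        channel.put(item)  # asynchronous send: does not wait for the receiver
    channel.put(None)      # end-of-stream marker (an assumption of this sketch)


def consumer(channel, results):
    while True:
        message = channel.get()  # blocks until a message is available
        if message is None:
            break
        results.append(message * 2)


channel = queue.Queue()  # unbounded, so put() never blocks
results = []
threads = [
    threading.Thread(target=producer, args=(channel, [1, 2, 3])),
    threading.Thread(target=consumer, args=(channel, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The two threads share nothing but the channel, so their interaction is fully described by the messages they exchange.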

Session-Based Compilation Framework for Multicore Programming

Yoshida

Vasconcelos

et al. 2009

Abstract. This paper outlines a general picture of our ongoing work under the EU Mobius and Sensoria projects on a type-based compilation and execution framework for a class of multicore CPUs. Our focus is to harness the power of concurrency and asynchrony in one of the major forms of multicore CPUs based on distributed, non-coherent memory, through the use of type-directed compilation. The key idea is to regard explicit asynchronous data transfer among local caches as typed communication among processes. By typing imperative processes with a variant of session types, we obtain both type-safe and efficient compilation into processes distributed over multiple cores with local memories.

Single Operation Multiple Data - Data Parallelism at Subroutine Level

Marques

2012

Abstract. The parallel nature of the multi-core architectural design can only be fully exploited by concurrent applications. This status quo pushed productivity to the forefront of the language design concerns. The community is demanding new solutions in the design, compilation, and implementation of concurrent languages, making this research area one of great importance and impact. To that end, this paper proposes the expression of data parallelism at subroutine level. The calling of a subroutine in this context spawns several execution flows, each operating on distinct partitions of the input dataset. Such computations can be expressed by simply annotating sequential subroutines with data distribution and reduction policies, delegating the management of the parallel execution to a dedicated runtime system. The paper overviews the key concepts of the model, illustrating them with some small programming examples, and describes a Java implementation built on top of the X10 [1] runtime system. A performance evaluation attests that this approach can provide good performance gains without burdening the programmer with the writing of specialized code.
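The model of annotating a sequential subroutine with a distribution and a reduction policy can be approximated in a few lines. The sketch below is written in Python rather than the Java/X10 implementation the paper describes, and the `somd` decorator, `block_distribution`, and the default of four partitions are hypothetical names and choices made for illustration.

```python
import operator
from concurrent.futures import ThreadPoolExecutor
from functools import reduce as fold


def block_distribution(data, parts):
    """Split data into at most `parts` contiguous blocks of near-equal size."""
    size, extra = divmod(len(data), parts)
    blocks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            blocks.append(data[start:end])
        start = end
    return blocks


def somd(distribute, reduction, parts=4):
    """Hypothetical annotation: run the subroutine once per partition."""
    def wrap(subroutine):
        def call(data):
            blocks = distribute(data, parts)
            if not blocks:  # empty input: nothing to partition
                return subroutine(data)
            with ThreadPoolExecutor(max_workers=parts) as pool:
                partials = list(pool.map(subroutine, blocks))
            return fold(reduction, partials)
        return call
    return wrap


@somd(distribute=block_distribution, reduction=operator.add)
def sum_of_squares(values):
    # Plain sequential code; each invocation sees only one partition.
    return sum(v * v for v in values)


result = sum_of_squares(list(range(10)))
```

The body of `sum_of_squares` contains no parallel code. Partitioning, spawning the execution flows, and combining the partial results are all handled by the wrapper, which plays the role of the dedicated runtime system in this sketch.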