Modern chip-multiprocessors pack an increasing number of computational cores with each generation. Along with the new computational power comes the problem of managing a large pool of active threads. Traditional debuggers often deal with concurrency-style multi-threading, with emphasis on a single thread at a time. The problem of thread management when debugging parallel programs is analyzed and solutions are suggested. A related debugging framework for the massively multi-threaded, synchronous REPLICA architecture is proposed.
INTRODUCTION

A common trend in chip development across many application domains has been the drive for more parallelism by increasing the number of on-chip processor cores. For example, even the low-end server, desktop and mobile processor markets mainly offer two- to sixteen-core solutions, complemented by GPUs (graphics processing units) with hundreds to thousands of general-purpose computational cores. Along with the number of cores, the machines and their computational models need to scale to a massive number of threads. While there are many competing solutions for programming, compiling, optimizing and simulating applications on all these platforms, we believe the tools for debugging parallel applications can still improve. For instance, applications can benefit from improved abstractions that better fit the parallel nature of programming massively multi-core systems.

Debuggers attempt to tackle two main categories of concurrency bugs: synchronization bugs and performance bugs. Synchronization bugs include race conditions and deadlocks. Performance bugs arise from unnecessary thread overhead, due to thread creation or context switches, and from memory access patterns that are suboptimal for a given processor's memory hierarchy. Approaches to these issues include tracing concurrent events and resource usage with timestamps, and profiling memory and CPU usage [2].

Modern debuggers often make a few basic assumptions about the threading model. First, when debugging concurrent programs, either a single thread is considered at a time for execution, or "background" threads are preemptively scheduled, possibly with further restrictions to better control their effects. When stepping through code, executing the program in global lock-steps could change the execution semantics and would also be considered slow, especially as the number of threads grows. This per-thread approach makes it difficult to reason about the global machine state in a deterministic way.
In the case of statically scheduled data-parallel algorithms, this asynchronicity adds overhead without benefiting the algorithm. The traditional debugging style is better suited for independent concurrency and does not fully capture the abstractions of parallel execution, e.g. traditional data-parallel algorithms [9].

As a corollary to the assumption of non-deterministic scheduling of threads, it is hard to reason about the exact positions of, and relations between, threads. A reasonably designed program could be e.g. divided into parallel and sequential sections...