Modern chip-multiprocessors pack an increasing number of computational cores with each generation. Along with the new computational power comes the problem of managing a large pool of active threads. Traditional debuggers often deal with concurrency-style multi-threading, with emphasis on a single thread at a time. The problem of thread management when debugging parallel programs is analyzed and solutions are suggested. A related debugging framework for the massively multi-threaded, synchronous REPLICA architecture is proposed.
INTRODUCTION

A common trend in chip development across many application domains has been the drive for more parallelism by increasing the number of on-chip processor cores. For example, even the low-end server, desktop and mobile processor markets mainly offer two- to sixteen-core solutions, complemented by GPUs (graphics processing units) with hundreds to thousands of general-purpose computational cores. Along with the number of cores, the machines and their computational models need to scale to a massive number of threads. While there are many competing solutions for programming, compiling, optimizing and simulating applications on all these platforms, we believe the tools for debugging parallel applications can still improve. For instance, applications can benefit from improved abstractions that better fit the parallel nature of programming massively multi-core systems.

Debuggers attempt to tackle two main categories of concurrency bugs: synchronization bugs and performance bugs. Synchronization bugs include race conditions and deadlocks. Performance bugs arise from unnecessary thread overhead, due to thread creation or context switches, and from memory access patterns that are suboptimal for a given processor's memory hierarchy. Approaches to these issues include tracing concurrent events and resource usage with timestamps, and profiling memory and CPU usage [2].

Modern debuggers often make a few basic assumptions about the threading model. First, when debugging concurrent programs, either a single thread is considered at a time for execution, or "background" threads are preemptively scheduled, possibly with further restrictions to better control their effects. When stepping through code, executing the program in global lock-steps could change the execution semantics and would also be considered slow, especially as the number of threads grows. This per-thread approach makes it difficult to reason about the global machine state in a deterministic way.
In the case of statically scheduled data-parallel algorithms, this asynchronicity adds overhead without benefiting the algorithm. The traditional debugging style is better suited for independent concurrency and does not fully capture the abstractions of parallel execution, e.g. traditional data-parallel algorithms [9].

As a corollary to the assumption of non-deterministic scheduling of threads, it is hard to reason about the exact positions of, and relations between, threads. A reasonably designed program could be e.g. divided into parallel and sequential sections...