A hybrid hardware/software approach to efficiently determine cache coherence bottlenecks

Journal of Parallel and Distributed Computing

2010

Self Cite

Abstract: Non-uniform memory architectures with cache coherence (ccNUMA) are becoming increasingly common, not just for large-scale high-performance platforms but also in the context of multi-core architectures. Under ccNUMA, data placement may influence overall application performance significantly, as references resolved locally to a processor/core impose lower latencies than remote ones. This work develops a novel hardware-assisted page placement paradigm based on automated tracing of the memory references made by application threads. Two placement schemes, modeling both single-level and multi-level latencies, allocate pages near the processors that most frequently access them. These schemes leverage the performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. The method operates entirely in user space, is widely automated, and handles not only static but also dynamic memory allocation. Experiments show that this method, although based on lossy tracing, can efficiently and effectively improve page placement, leading to an average wall-clock execution time saving of over 20% for the tested benchmarks on the SGI Altix with a 2x remote access penalty and 12% on AMD Opterons with a 1.3-2.0x access penalty. This is accompanied by a one-time tracing overhead of 2.7% of the original program's overall wall-clock time.
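The affinity decision described in the abstract can be illustrated with a minimal sketch. It assumes the lossy trace is available as `(node_id, address)` samples and that pages are 4 KiB. The function name `decide_page_affinity`, the sample format, and the tie-breaking rule are hypothetical choices made for illustration, not the paper's implementation.

```python
from collections import Counter, defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes; real systems may differ


def decide_page_affinity(samples, page_size=PAGE_SIZE):
    """Map each page to the node that accessed it most often.

    samples: iterable of (node_id, address) pairs from a sampled trace.
    Ties go to the lowest node id (an arbitrary, illustrative choice).
    """
    # Count accesses per page, broken down by the accessing node.
    counts = defaultdict(Counter)
    for node, addr in samples:
        counts[addr // page_size][node] += 1

    # Pick the most frequent accessor of every page.
    affinity = {}
    for page, per_node in counts.items():
        best = max(per_node.values())
        affinity[page] = min(n for n, c in per_node.items() if c == best)
    return affinity


samples = [(0, 0), (1, 8), (1, 100), (0, 4096), (1, 4100)]
result = decide_page_affinity(samples)
```

This corresponds only to the single-level idea of binding a page to its most frequent accessor. A multi-level variant would presumably weight each node's count by its access latency to the candidate home node, which this sketch does not attempt.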

Section: Trace-guided Page Placement (mentioning)


Feedback-directed page placement for ccNUMA via hardware-generated memory traces

Marathe

Thakkar

Journal of Parallel and Distributed Computing

2010

Self Cite

“…RSDs were originally proposed to track inter-procedural side effects on common substructures of arrays to promote compiler-aided parallelization [15]. Marathe et al. adapted the RSD representation and proposed PRSDs for memory trace compression [21,20]. Budanur et al. further designed Extended-PRSDs to perform multilevel scalable parallel memory tracing in SCALAMEMTRACE [3].…”
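To make the idea of a regular section descriptor concrete, here is a minimal sketch that folds a constant-stride run of addresses into one `(start, stride, length)` triple. The greedy strategy and the function names `compress_to_rsds` and `expand_rsds` are assumptions made for illustration. They are not the algorithms used by METRIC or SCALAMEMTRACE, and the sketch does not model the nesting that PRSDs and Extended-PRSDs add.

```python
def compress_to_rsds(addresses):
    """Greedily fold an address sequence into (start, stride, length) triples."""
    rsds = []
    i = 0
    n = len(addresses)
    while i < n:
        start = addresses[i]
        if i + 1 < n:
            # The first two addresses of a run fix the stride.
            stride = addresses[i + 1] - start
            length = 2
            # Extend the run while consecutive differences match the stride.
            while (i + length < n
                   and addresses[i + length] - addresses[i + length - 1] == stride):
                length += 1
        else:
            # A lone trailing address becomes a run of length one.
            stride, length = 0, 1
        rsds.append((start, stride, length))
        i += length
    return rsds


def expand_rsds(rsds):
    """Reconstruct the original address sequence from the triples."""
    return [start + k * stride
            for start, stride, length in rsds
            for k in range(length)]


trace = [0, 8, 16, 24, 100, 104, 108, 7]
rsds = compress_to_rsds(trace)
```

Because `expand_rsds` reproduces the input exactly, this encoding is lossless. How well it compresses depends on how regular the access pattern is: a strided array sweep collapses to a single triple, while irregular accesses gain little.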

Section: Related Work (mentioning)


Elastic and scalable tracing and accurate replay of non-deterministic events

Proceedings of the 27th International ACM Conference on International Conference on Supercomputing

2013

Self Cite

SCALATRACE represents the state-of-the-art of parallel application tracing for high performance computing (HPC). This paper presents SCALATRACE II, a next generation tracer that delivers even higher trace compression capability, even when events are not always regular. In this work, we contribute a spectrum of novel compression and replay techniques that are fundamentally different from our past approaches. SCALATRACE II features a redesigned low-level encoding scheme of trace data such that data elements are elastic and self-explanatory. With this new encoding scheme, trace compression is enhanced by introducing innovative intra-node and inter-node trace compression algorithms that guarantee high compression rates in a loop structure agnostic fashion. In practice, the improved compression scheme is particularly efficient for scientific codes that demonstrate inconsistent behavior across time steps and nodes. A novel approach is further contributed to probabilistically replay sequences of non-deterministic events. To assess the compression efficacy of SCALATRACE II, we conduct experiments not only with computational kernels but also a real-world application, the Parallel Ocean Program (POP). Compared to the first generation SCALATRACE, we observe key improvements on trace compression for benchmarks with inconsistent time step behavior and diverging task level behavior while retaining timing accuracy even under probabilistic replay.

“…Our work differs in that it further develops concepts of in-situ compression from ScalaTrace [22] and METRIC [17,20,15,16,18,19]. ScalaTrace addresses intra-task and inter-process compression of communication traces, but not memory traces.…”

Section: Related Work (mentioning)


Memory Trace Compression and Replay for SPMD Systems using Extended PRSDs

Budanur

SIGMETRICS Perform. Eval. Rev.

Gamblin

2011

Self Cite

Concurrency levels in large-scale supercomputers are rising exponentially, and shared-memory nodes with hundreds of cores and non-uniform memory access latencies are expected within the next decade. However, even current petascale systems with tens of cores per node suffer from memory bottlenecks. As core counts increase, memory issues will become critical for the performance of large-scale supercomputers. Trace analysis tools are thus vital for diagnosing the root causes of memory problems. However, existing memory tracing tools are expensive due to prohibitively large trace sizes, or they collect only statistical summaries and omit potentially valuable information. In this paper, we present ScalaMemTrace, a novel technique for collecting memory traces in a scalable manner. ScalaMemTrace builds on prior trace methods with aggressive compression techniques to allow lossless representation of memory traces for dense algebraic kernels, with near-constant trace size irrespective of the problem size or the number of threads. We further introduce a replay mechanism for ScalaMemTrace traces, and discuss the results of our prototype implementation on the x86-64 architecture.