Abstract—Multi-tiered memory systems, such as those based on Intel Xeon Phi processors, are equipped with several memory tiers with different characteristics including, among others, capacity, access latency, bandwidth, energy consumption, and volatility. The proper distribution of the application data objects into the available memory layers is key to shortening the time-to-solution, but the way developers and end-users determine the most appropriate memory tier in which to place the application data objects has not been properly addressed to date. In this paper we present a novel methodology to build an extensible framework that automatically identifies and places the application's most relevant memory objects into the Intel Xeon Phi fast on-package memory. Our proposal works on top of in-production binaries by first exploring the application behavior and then substituting the dynamic memory allocations. This makes the proposal valuable even for end-users who cannot modify the application source code. We demonstrate the value of a framework based on our methodology for several relevant HPC applications, using different allocation strategies to help end-users improve performance with minimal intervention. The results of our evaluation reveal that our proposal identifies the key objects to promote into fast on-package memory in order to optimize performance, even surpassing hardware-based solutions.
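The placement decision sketched in the abstract can be illustrated with a toy greedy heuristic: given profiled objects, rank them by access density and fill the fast tier under its capacity budget. This is a minimal sketch under assumed inputs; the object names, sizes, and access counts below are invented illustrative data, and the ranking criterion is a plausible simplification, not the paper's actual algorithm.

```python
# Hypothetical sketch: greedy promotion of profiled memory objects into a
# limited fast-memory tier, ranked by accesses per byte. All profile data
# below is made up for illustration.

def select_for_fast_memory(objects, capacity_bytes):
    """Pick objects to place in fast memory under a capacity budget.

    objects: list of (name, size_bytes, access_count) tuples.
    Returns the set of names chosen for the fast tier.
    """
    # Rank by access density: accesses per byte, densest first.
    ranked = sorted(objects, key=lambda o: o[2] / o[1], reverse=True)
    chosen, used = set(), 0
    for name, size, _ in ranked:
        if used + size <= capacity_bytes:
            chosen.add(name)
            used += size
    return chosen

profile = [("A", 4 << 20, 9_000_000),    # hot and small
           ("B", 512 << 20, 1_000_000),  # large, rarely touched
           ("C", 64 << 20, 8_000_000)]   # hot, medium-sized
print(sorted(select_for_fast_memory(profile, 128 << 20)))  # → ['A', 'C']
```

A real framework would drive this from observed memory-access profiles and then interpose the corresponding dynamic allocations at load time, but the capacity-constrained ranking step is the same in spirit.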
With larger and larger systems being constantly deployed, trace-based performance analysis of parallel applications has become a daunting task. Even if the amount of performance data gathered per single process is small, traces rapidly become unmanageable when merging together the information collected from all processes. In general, an efficient analysis of such a large volume of data requires a prior filtering step that directs the analyst's attention towards what is meaningful to understand the observed application behavior. Furthermore, the iterative nature of most scientific applications usually ends up producing repetitive information. Discarding irrelevant data reduces both the size of the traces and the time required to perform the analysis and deliver results. In this paper, we present an on-line analysis framework that relies on clustering techniques to intelligently select the most relevant information to understand how the application behaves, while keeping the trace volume at a reasonable size.
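Density-based clustering is a natural fit for the selection step described above: bursts of computation with similar performance counters form dense groups, while one-off behavior shows up as noise. The following is a minimal, self-contained DBSCAN-style sketch over 2-D performance points (for instance, completed instructions vs. IPC per burst); the `eps` and `min_pts` values and the input points are illustrative assumptions, not the framework's actual implementation.

```python
# Minimal density-based clustering sketch (DBSCAN-style) over per-burst
# performance points, e.g. (instructions, IPC). Points in dense regions
# are grouped; isolated points are labeled -1 (noise) and can be discarded.

def dbscan(points, eps, min_pts):
    """Return a cluster label per point; -1 marks noise."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (x, y) in enumerate(points)
                if (x - xi) ** 2 + (y - yi) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reached from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:    # core point: keep expanding
                queue.extend(jn)
        cluster += 1
    return labels

# Two dense groups of bursts plus one outlier.
bursts = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
print(dbscan(bursts, eps=2, min_pts=3))
```

In an on-line setting, only one representative per cluster needs to be kept in the trace, which is what bounds the trace volume.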
Understanding the behavior of a parallel application is crucial if we are to tune it to achieve its maximum performance. Yet the behavior the application exhibits may change over time and depend on the actual execution scenario: particular inputs and program settings, the number of processes used, or hardware-specific problems. So beyond the details of a single experiment a far more interesting question arises: how does the application behavior respond to changes in the execution conditions? In this paper, we demonstrate that object tracking concepts from computer vision have huge potential to be applied in the context of performance analysis. We leverage tracking techniques to analyze how the behavior of a parallel application evolves through multiple scenarios where the execution conditions change. This method provides comprehensible insights into the influence of different parameters on the application behavior, enabling us to identify the most relevant code regions and their performance trends. Copyright 2013 ACM.
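A heavily simplified version of the tracking idea is to match the clusters found in one experiment to those found in the next by centroid proximity, so the same code region can be followed as execution conditions change. Vision-based trackers are far more elaborate than this; the greedy nearest-centroid matcher and the centroid coordinates below are illustrative assumptions only.

```python
# Illustrative simplification of tracking cluster identity across
# experiments: greedily pair clusters from consecutive runs by nearest
# centroid distance. Cluster ids and centroids are made-up examples.
import math

def match_clusters(prev, curr):
    """Greedy nearest-centroid matching; returns {prev_id: curr_id}.

    prev, curr: dicts mapping cluster id -> centroid tuple.
    """
    pairs = sorted(
        ((math.dist(p, c), pi, ci)
         for pi, p in prev.items() for ci, c in curr.items()),
        key=lambda t: t[0])
    mapping, used = {}, set()
    for _, pi, ci in pairs:
        if pi not in mapping and ci not in used:
            mapping[pi] = ci
            used.add(ci)
    return mapping

run1 = {"regionA": (1.0, 1.0), "regionB": (5.0, 5.0)}
run2 = {"c0": (1.2, 0.9), "c1": (4.8, 5.3)}
print(match_clusters(run1, run2))  # → {'regionA': 'c0', 'regionB': 'c1'}
```

Chaining such matchings across a series of experiments yields, per code region, a trajectory of its performance characteristics as the execution conditions vary.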
On the road to Exascale computing, both performance and power are meant to be tackled at different levels, from the system down to the processor. The processor itself is mainly responsible for the serial node performance and also for most of the energy consumed by the system. Thus, it is important to have tools that simultaneously analyze both performance and energy efficiency at the processor level.
Performance tools have allowed analysts to understand, and even improve, the performance of an application that runs on a system. With the advent of recent processor capabilities to measure their own power consumption, performance tools can extend their collection of metrics with those related to energy consumption, providing a correlation between the source code, its performance, and its energy efficiency.
In this paper, we present a performance tool that has been extended to gather such energy metrics. The results of this tool are passed to a mechanism called folding that produces detailed metrics and source code references by using coarse-grained sampling. We have used the tool with multiple serial benchmarks as well as parallel applications to demonstrate its usefulness by locating hot spots in terms of performance and power drained.
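The folding mechanism mentioned above can be sketched in a few lines: coarse-grained samples scattered across many repetitions of an iterative region are projected onto a single synthetic iteration by their relative position (phase) within the iteration they fall in, so sparse sampling accumulates into a detailed profile. The iteration boundaries and sample payloads below are illustrative assumptions, not output of the actual tool.

```python
# Hedged sketch of the "folding" idea: samples taken at a coarse rate over
# many iterations are mapped to their phase within [0, 1) of the iteration
# they belong to, densifying the combined view of one iteration.

def fold(samples, iteration_starts):
    """Project (time, value) samples onto a single normalized iteration.

    samples: list of (timestamp, value) pairs.
    iteration_starts: sorted timestamps delimiting iterations.
    Returns (phase, value) pairs sorted by phase.
    """
    folded = []
    for t, v in samples:
        # Find the iteration this sample falls in.
        for i in range(len(iteration_starts) - 1):
            begin, end = iteration_starts[i], iteration_starts[i + 1]
            if begin <= t < end:
                folded.append(((t - begin) / (end - begin), v))
                break
    return sorted(folded)

# Three samples spread over two iterations of length 10 fold into one
# synthetic iteration with three distinct phase points.
print(fold([(2, "a"), (13, "b"), (7, "c")], [0, 10, 20]))
```

In the tool described here the folded values are hardware counters and energy readings, and the resulting phase-ordered profile is what gets correlated back to source code references.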