Performance degradation due to nonuniform data access latencies has worsened on NUMA systems and can now be felt on-chip in manycore processors. Distributing data across NUMA nodes and manycore processor caches is necessary to reduce the impact of nonuniform latencies. However, techniques for distributing data are error-prone and fragile and require low-level architectural knowledge. Existing task scheduling policies favor quick load-balancing at the expense of locality and ignore NUMA node/manycore cache access latencies while scheduling. Locality-aware scheduling, in conjunction with or as a replacement for existing scheduling, is necessary to minimize NUMA effects and sustain performance. We present a data distribution and locality-aware scheduling technique for task-based OpenMP programs executing on NUMA systems and manycore processors. Our technique relieves the programmer of reasoning about NUMA system/manycore processor architecture details by delegating data distribution to the runtime system, and it uses task data dependence information to guide the scheduling of OpenMP tasks to reduce data stall times. We demonstrate our technique on a four-socket AMD Opteron machine with eight NUMA nodes and on the TILEPro64 processor and show that data distribution and locality-aware task scheduling improve performance by up to 69% for scientific benchmarks compared to default policies, while still providing an architecture-oblivious approach for programmers.
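The idea of using task data dependence information to guide placement can be illustrated with a small sketch. This is not the runtime system described in the abstract: the round-robin block placement, the "most dependence bytes wins" rule, the tie-breaking by load, and all names (`distribute_blocks`, `pick_node`, `schedule`) are assumptions chosen only for illustration.

```python
from collections import defaultdict


def distribute_blocks(block_sizes, num_nodes):
    """Hypothetical placement: block i is stored on node i % num_nodes."""
    return {block: i % num_nodes for i, block in enumerate(block_sizes)}


def pick_node(task_deps, placement, block_sizes, load):
    """Pick the node holding the most bytes of the task's dependences.

    Ties are broken by lower current load, then by lower node id.
    A task with no dependences goes to the least-loaded node.
    """
    if not task_deps:
        return min(range(len(load)), key=lambda n: (load[n], n))
    bytes_on_node = defaultdict(int)
    for block in task_deps:
        bytes_on_node[placement[block]] += block_sizes[block]
    return min(bytes_on_node, key=lambda n: (-bytes_on_node[n], load[n], n))


def schedule(tasks, block_sizes, num_nodes):
    """Assign each task (name -> list of blocks it depends on) to a node."""
    placement = distribute_blocks(block_sizes, num_nodes)
    load = [0] * num_nodes  # number of tasks already assigned to each node
    assignment = {}
    for name, deps in tasks.items():
        node = pick_node(deps, placement, block_sizes, load)
        assignment[name] = node
        load[node] += 1
    return assignment


if __name__ == "__main__":
    block_sizes = {"A": 4096, "B": 4096, "C": 8192, "D": 4096}
    tasks = {"t1": ["A", "B"], "t2": ["C", "D"], "t3": ["B", "D"]}
    print(schedule(tasks, block_sizes, num_nodes=2))
```

In this toy setup the programmer only names the blocks each task depends on; both the placement of blocks and the choice of node are left to the (hypothetical) scheduler, which mirrors the architecture-oblivious goal stated above.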
The TULIPP project aims to simplify development of embedded vision applications with low-power and real-time requirements by providing a complete image processing system package called the TULIPP Starter Kit. To achieve this, the chosen high-performance embedded vision platform needs to be extended with performance analysis and power measurement features. The lack of such features plagues most embedded vision platforms, and practitioners have adopted ad hoc methods to circumvent the problem. In this paper, we describe four generic utilities that complement and refine the capabilities of existing platforms for embedded vision applications. Concretely, we describe a novel power measurement and analysis utility, a platform-optimized image processing library, a dynamic partial reconfiguration utility, and a utility providing support for using the real-time OS HIPPEROS within Xilinx SDSoC. Collectively, these utilities enable efficient development of image processing applications on the TULIPP hardware platform. In future work, we will evaluate the relative benefit of these utilities on key embedded image processing metrics such as frame rate and power consumption.
Many industrial domains rely on vision-based applications that must comply with severe performance and embedded requirements. TULIPP will develop a reference platform, which consists of a hardware system, a tool chain and a real-time operating system. This platform defines implementation rules and interfaces to tackle power consumption issues while delivering high, energy-efficient and guaranteed computing performance for image processing applications. Using this reference platform will enable designers to develop a complete solution at a reduced cost to meet the typical embedded systems requirements: Size, Weight and Power. Moreover, for less constrained systems whose performance requirements cannot be fulfilled by one instance of the platform, the reference platform will also be scalable so that the resulting boards can be chained for higher processing power. The instance of the reference platform developed during the project will be use-case driven and split between the implementation of: a reference hardware architecture (a scalable low-power board); a low-power operating system and image processing libraries; a productivity-enhancing tool chain. It will lead to three proof-of-concept demonstrators across different application domains: a real-time and low-power medical image processing product prototype of a surgical X-ray system (mobile C-arm); embedded image processing systems within Unmanned Aerial Vehicles (UAVs); automotive real-time embedded systems for driver assistance. TULIPP will set up an ecosystem and will work closely with standardization organizations to propose new standards derived from its reference platform to the industry.
Programmers struggle to understand the performance of task-based OpenMP programs since profiling tools only report thread-based performance. Performance tuning also requires task-based performance information in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing exposed task parallelism and per-task instruction profiles of benchmarks in the widely used Barcelona OpenMP Tasks Suite. By using our method to characterize task-based performance, programmers can tune performance faster and understand performance tradeoffs more effectively than with existing tools.