Network processors (NPs) are widely used in many types of networking equipment due to their high performance and flexibility. In most NPs, a software-managed cache is used instead of a hardware cache because of chip-area, cost, and power constraints. Programmers must therefore take full responsibility for software cache management, which is neither intuitive nor easy for most of them. Without effective use of the software cache, long memory-access latency becomes a critical limiting factor for overall application performance. Prior techniques such as hardware multi-threading, wide-word accesses, and packet-access combination for caching have been applied to help programmers overcome this bottleneck. However, most of them do not sufficiently exploit the characteristics of packet-processing applications and often perform only intraprocedural optimizations. As a result, the binary code generated by those techniques often performs worse than hand-tuned assembly for some applications. In this paper, we propose an algorithm comprising two techniques, Critical Path Based Analysis (CPBA) and Global Adaptive Localization (GAL), to optimize the software cache performance of packet-processing applications. Packet-processing applications usually have several hot paths, and CPBA inserts localization instructions according to their execution frequencies. For further optimization, GAL eliminates redundant localization instructions through interprocedural analysis. We apply our algorithm to several representative applications; experimental results show an average speedup of 1.974x.
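The frequency-driven placement idea behind CPBA can be sketched as follows. This is a minimal illustration in our own notation, not the paper's implementation: path profiles and field names (`ip.dst`, etc.) are assumed for the example, and "localization" is abstracted as choosing which packet fields to copy into the software cache, hottest paths first.

```python
# Hedged sketch of critical-path-based localization planning (names and
# structure are ours, not the paper's): weight each accessed field by the
# execution frequency of the paths that touch it, then localize the
# highest-weight fields within the software-cache budget.
from collections import defaultdict

def plan_localization(paths, budget):
    """paths: list of (frequency, [accessed_fields]).
    budget: maximum number of fields the software cache can hold.
    Returns the set of fields worth localizing."""
    weight = defaultdict(int)
    for freq, fields in paths:
        for f in fields:
            weight[f] += freq          # hot paths dominate the ranking
    ranked = sorted(weight, key=weight.get, reverse=True)
    return set(ranked[:budget])

hot = plan_localization(
    [(900, ["ip.dst", "ip.ttl"]),      # hot forwarding path
     (90,  ["ip.dst", "tcp.flags"]),   # warm path
     (10,  ["ip.opts"])],              # cold options path
    budget=2)
assert hot == {"ip.dst", "ip.ttl"}
```

Interprocedural cleanup in the spirit of GAL would then remove localization of a field already resident from a caller's copy, which this local sketch does not model.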
Garbage collection (GC), especially full GC, can non-trivially impact overall application performance, particularly for memory-hungry applications handling large data sets. This paper presents an in-depth analysis of the full GC performance of Parallel Scavenge (PS), the state-of-the-art default garbage collector in the HotSpot JVM, using traditional and big-data applications running atop the JVM on CPUs (e.g., Intel Xeon) and many-integrated-core processors (e.g., Intel Xeon Phi). The analysis uncovers that unnecessary memory accesses and calculations during reference updating in the compaction phase are the main causes of lengthy full GC. To this end, this paper describes an incremental query model for reference calculation, which is further embodied in three schemes (namely optimistic, sort-based, and region-based) for different query patterns. Performance evaluation shows that the incremental query model yields an average 1.9X speedup (up to 2.9X) in full GC, a 19.3% (up to 57.2%) improvement in application throughput, and a 31.2% reduction in pause time over the vanilla PS collector on the CPU; the corresponding numbers on the Xeon Phi are 2.1X (up to 3.4X), 11.1% (up to 41.2%), and 34.9%.
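To make the reference-updating cost concrete, the region-based flavor of such a query model can be sketched as below. This is a simplified illustration under our own assumptions, not the paper's code: the heap is modeled as an array of words with a mark bitmap, each region precomputes the compaction destination of its first live word, and a query then only scans the bitmap within one region instead of the whole heap.

```python
# Hedged sketch of a region-based incremental address query for a
# mark-compact collector (simplified model, not the PS implementation).
REGION_SIZE = 16  # words; deliberately tiny for illustration

class Heap:
    def __init__(self, size):
        self.size = size
        self.marked = [False] * size   # mark bitmap, one flag per word
        self.region_dest = []          # per-region compaction destination

    def mark(self, addr, nwords=1):
        for i in range(addr, addr + nwords):
            self.marked[i] = True

    def summarize(self):
        """Phase 1: per-region prefix sums of live words (done once)."""
        dest = 0
        self.region_dest = []
        for base in range(0, self.size, REGION_SIZE):
            self.region_dest.append(dest)
            dest += sum(self.marked[base:base + REGION_SIZE])

    def new_address(self, addr):
        """Phase 2 query: region destination + live words before addr
        within that region only -- the incremental part."""
        region = addr // REGION_SIZE
        base = region * REGION_SIZE
        return self.region_dest[region] + sum(self.marked[base:addr])

heap = Heap(64)
heap.mark(3); heap.mark(20, 4); heap.mark(40)
heap.summarize()
assert heap.new_address(3) == 0    # first live word compacts to 0
assert heap.new_address(20) == 1   # after region 0's single live word
assert heap.new_address(40) == 5   # after 1 + 4 earlier live words
```

The point of the per-region summary is that reference updating, which issues one such query per pointer, never rescans memory outside the target region.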
With the rapid growth of multimedia and gaming, these applications put ever more pressure on the processing ability of modern processors. Multiple-SIMD architectures are widely used in multimedia processing as accelerators. Given power-consumption and chip-size considerations, shared-memory multiple-SIMD architectures are mainly used in embedded SoCs; to further fit the mobile environment, the number of registers is limited as well. Although a shared-memory multiple-SIMD architecture simplifies chip design, these constraints are major obstacles to mapping real multimedia applications onto such architectures. To the best of our knowledge, there has been little research on optimization techniques for shared-memory multiple-SIMD architectures. In this paper, we present a compiler framework that automatically generates high-performance code for shared-memory multiple-SIMD architectures. In this framework, we reduce contention on the shared data bus by increasing register locality, improve data-bus utilization through read-only data vector replication, and solve the problem of limited register count with a resource-allocation algorithm. The framework also handles issues concerning data transformation. Experimental results show that this framework successfully maps real multimedia applications onto shared-memory multiple-SIMD architectures, achieving an average speedup of 3.19x and an average utilization of 52.6% on an SM-SIMD architecture with 8 SIMD units.
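The payoff of read-only data vector replication can be illustrated with a simple bus-traffic model. This is our own back-of-the-envelope sketch, not the paper's cost model: it assumes each SIMD unit would otherwise re-read a shared read-only coefficient vector from the shared bus on every iteration, whereas replication fills each unit's local registers once.

```python
# Hedged sketch: bus-transaction model for read-only vector replication
# on a shared-memory multiple-SIMD machine (assumed cost model, ours).

def bus_reads_without_replication(units, iters):
    # every SIMD unit fetches the shared vector over the bus each iteration
    return units * iters

def bus_reads_with_replication(units, iters):
    # one bus read per unit to replicate the vector into local registers;
    # all later uses are register-local and never touch the shared bus
    return units

assert bus_reads_without_replication(8, 100) == 800
assert bus_reads_with_replication(8, 100) == 8
```

The trade-off, which the framework's resource-allocation algorithm must arbitrate, is that each replica consumes scarce registers in its unit.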
Phase analysis, which classifies execution intervals with similar execution behavior and resource requirements, has been widely used in a variety of dynamic systems, including dynamic cache reconfiguration, prefetching, and race detection. While phase granularity is a major factor in the accuracy of phase prediction, it has not been well investigated, and most dynamic systems adopt a fine-grained prediction scheme. However, such a scheme can only account for recent local phase information and can be frequently disturbed by temporary noise from instant phase changes, which may notably limit prediction accuracy. In this paper, we make the first investigation into the potential of multi-level phase analysis (MLPA), where phase analyses at different granularities are combined to improve overall accuracy. The key observation is that a coarse-grained interval, which usually consists of stably distributed fine-grained intervals, can be accurately identified from the fine-grained intervals at the beginning of its execution. Based on this observation, we design and implement an MLPA scheme: a coarse-grained phase is first identified from the fine-grained intervals at the beginning of its execution, and the subsequent fine-grained phases within it are then predicted from the sequence of fine-grained phases in the coarse-grained phase. Experimental results show that such a scheme notably improves prediction accuracy. Using a Markov fine-grained phase predictor as the baseline, MLPA improves prediction accuracy by 20%, 39%, and 29% for next-phase, phase-change, and phase-length prediction on SPEC2000, respectively, while incurring only about 2% time overhead and 40% space overhead (about 360 bytes in total). To demonstrate the effectiveness of MLPA, we apply it to a dynamic cache reconfiguration system that dynamically adjusts cache size to reduce the power consumption and access time of the data cache. Experimental results show that MLPA further reduces the average cache size by 15% compared to the fine-grained scheme.
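The two-level mechanism can be sketched as follows. This is a minimal structural illustration under our own assumptions, not the paper's implementation: a coarse-grained phase is recognized by matching its first few fine-grained intervals against trained prefixes, after which the remaining fine-grained phases are predicted by replaying the stored sequence.

```python
# Hedged sketch of multi-level phase analysis (structure is ours):
# identify the coarse phase from a short fine-grained prefix, then
# predict subsequent fine-grained phases from the recorded sequence.

class MLPA:
    def __init__(self, prefix_len=3):
        self.prefix_len = prefix_len
        self.table = {}            # fine-grained prefix -> full sequence

    def train(self, fine_sequence):
        """Record one coarse-grained phase's fine-grained sequence."""
        key = tuple(fine_sequence[:self.prefix_len])
        self.table[key] = list(fine_sequence)

    def predict(self, observed):
        """observed: fine-grained phases seen so far in the current
        coarse-grained interval; returns the predicted next phase."""
        if len(observed) < self.prefix_len:
            return None            # still identifying the coarse phase
        seq = self.table.get(tuple(observed[:self.prefix_len]))
        if seq is None or len(observed) >= len(seq):
            return None            # unknown or exhausted coarse phase
        return seq[len(observed)]  # replay the stored sequence

m = MLPA()
m.train(["A", "A", "B", "C", "C", "B"])    # one coarse phase, 6 intervals
assert m.predict(["A", "A"]) is None        # prefix too short to identify
assert m.predict(["A", "A", "B"]) == "C"    # identified; replay continues
assert m.predict(["A", "A", "B", "C"]) == "C"
```

A fine-grained Markov predictor (the paper's baseline) would serve as the fallback whenever the coarse phase cannot yet be identified, which is the case the `None` returns leave open here.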