The main memory system is a shared resource in modern multicore machines, and contention for it causes serious interference that degrades performance in terms of throughput and fairness. Numerous memory scheduling algorithms have been proposed to address this interference problem. However, these algorithms usually employ complex scheduling logic and require hardware modifications to the memory controller; as a result, industrial vendors have been hesitant to adopt them. This paper presents a practical software approach that effectively eliminates the interference without hardware modification. The key idea is to modify the OS memory management subsystem to adopt a page-coloring-based bank-level partition mechanism (BPM), which allocates specific DRAM banks to specific cores (threads). With BPM, memory controllers passively schedule memory requests in a core-cluster (or thread-cluster) manner. We implement BPM in the Linux 2.6.32.15 kernel and evaluate it on 4-core and 8-core real machines by running 20 randomly generated multiprogrammed workloads (each containing 4/8 benchmarks) and a multithreaded benchmark. Experimental results show that BPM improves overall system throughput by 4.7% on average (up to 8.6%) and reduces the maximum slowdown by 4.5% on average (up to 15.8%). Moreover, BPM saves 5.2% of the memory system's energy consumption.
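To make the page-coloring idea concrete, the following C sketch shows how a page's bank color could be derived from its physical frame number, and how the OS could then restrict each core to pages of certain colors. The bit positions (BPM_BANK_SHIFT, BPM_BANK_BITS) and the per-color free-list idea are illustrative assumptions, not the paper's exact address mapping or kernel changes.

/*
 * Minimal sketch of page-coloring-based bank partitioning (BPM).
 * The DRAM address mapping below (which physical-address bits select
 * the bank) is an assumption for illustration; real mappings are
 * chipset-specific.
 */
#include <stdint.h>

#define BPM_BANK_SHIFT 13  /* assumed: physical-address bit where bank bits start */
#define BPM_BANK_BITS   3  /* assumed: 8 banks per rank */

/* Bank color of a physical page frame number (pfn). */
static inline unsigned bank_color(uint64_t pfn, unsigned page_shift)
{
    uint64_t paddr = pfn << page_shift;
    return (unsigned)((paddr >> BPM_BANK_SHIFT) & ((1u << BPM_BANK_BITS) - 1));
}

/*
 * In the OS buddy allocator, free pages would be kept on per-color lists;
 * a core (or thread) bound to a set of colors only receives pages whose
 * bank_color() falls in that set, so its requests land in dedicated banks.
 */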
The memory system is often the main bottleneck in chip-multiprocessor (CMP) systems in terms of latency, bandwidth, and efficiency, and in the era of big data it additionally faces capacity and power problems. A large body of research addresses parts of these problems, such as photonics technology for bandwidth, 3D stacking for capacity, and NVM for power, as well as many micro-architecture-level innovations. Many of these require modifications to the current memory architecture, since the decades-old synchronous memory architecture (SDRAM) has become an obstacle to adopting such advances. However, to the best of our knowledge, none of them provides a universal memory interface that is scalable enough to cover all of these problems.

In this paper, we argue that a message-based interface should replace the traditional bus-based interface in the memory system, and we propose a novel message interface based memory system (MIMS). The key innovation of MIMS is that the processor and memory system communicate through a universal and flexible message interface. Each message packet can contain multiple memory requests or commands along with various semantic information. The memory system becomes more intelligent and active by being equipped with a local buffer scheduler, which is responsible for processing packets, scheduling memory requests, and executing specific commands with the help of semantic information. Simulation results show that, with accurate-granularity messages, MIMS improves performance by 53.21%, reduces the energy-delay product (EDP) by 55.90%, and improves effective bandwidth utilization by 62.42%. Furthermore, combining multiple requests in a packet reduces link overhead and provides an opportunity for address compression.

However, main memory, which acts as the bridge between high-level data and the low-level processor, has failed to scale, making the memory system a main bottleneck. Besides the well-known memory wall problem [53], the memory system faces several other challenges (walls), summarized as follows.

Memory wall (latency): The original "memory wall" referred to the memory access latency problem [53], which remained the main problem in memory systems until the mid-2000s, when the CPU frequency race slowed down and the multi-/many-core age began. The situation has changed in that queuing delays have become a major bottleneck and may contribute more than 70% of memory latency [50]. Future memory architectures should therefore place a higher priority on reducing queuing delays. Exploiting higher parallelism in the memory system can reduce queuing delays because requests can be de-queued faster [50].

Bandwidth wall: The increasing number of concurrent memory requests, along with the increasing amount of data, results in heavy bandwidth pressure. However, memory bandwidth is failing to scale because of the relatively slow growth of processor-module pin counts (about 10% per year [2]). This has been termed the bandwidth wall [46]. The average memory bandwidth for...
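As an illustration of the message interface idea, the following C sketch shows one possible packet layout in which a single message carries several variable-granularity requests plus semantic hints for the buffer scheduler. All field names and widths (mims_request, granularity, priority, the batch size of 8) are assumptions for exposition, not the wire format defined by MIMS.

#include <stdint.h>

enum mims_op { MIMS_READ, MIMS_WRITE, MIMS_CMD };

struct mims_request {
    uint64_t addr;        /* target address (could be delta-compressed within a packet) */
    uint16_t granularity; /* bytes actually needed, not a fixed cache-line size */
    uint8_t  op;          /* enum mims_op */
    uint8_t  priority;    /* semantic hint for the local buffer scheduler */
};

struct mims_packet {
    uint16_t n_requests;        /* multiple requests amortize one header */
    uint16_t thread_id;         /* semantic info: originating thread */
    struct mims_request req[8]; /* assumed maximum batch size */
};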
Graphs are widely used in many areas. Breadth-First Search (BFS), a key subroutine of many graph analysis algorithms, has become the primary benchmark for the Graph500 ranking. Because of the high communication cost of BFS, multi-socket nodes with large memory capacity (NUMA) are expected to reduce network pressure; however, the longer latency to remote memory can cause problems if not handled well. In this work, we first demonstrate that simply spawning and binding one MPI process per socket achieves the best performance for a hybrid MPI/OpenMP BFS algorithm, yielding 1.53X performance on 16 nodes. Nevertheless, we observe that one MPI process per socket may exacerbate the communication cost. We propose sharing some communication data structures among the processes within the same node to eliminate most of the intra-node communication. To fully utilize the network bandwidth, we let all processes in a node perform communication simultaneously. We further adjust the granularity of a key bitmap for better cache locality to speed up the computation. With the NUMA, communication, and computation optimizations combined, we achieve 2.44X performance on 16 nodes, which is 39.2 billion traversed edges per second for an R-MAT graph of scale 32 (4 billion vertices and 64 billion edges).
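One way to read the bitmap-granularity optimization is as a coarse summary bitmap in which a single bit covers a group of vertices, so the structure probed most often stays cache-resident and filters lookups before the full visited data is touched. The C sketch below illustrates that interpretation; the group size G and the helper names are assumptions, not the paper's actual data layout.

#include <stdint.h>

#define G 64   /* assumed: one summary bit covers 64 vertices */

/* Cheap, cache-friendly test: may the vertex's group contain visited vertices? */
static inline int maybe_visited(const uint64_t *summary, uint64_t v)
{
    uint64_t grp = v / G;
    return (int)((summary[grp >> 6] >> (grp & 63)) & 1);
}

/* Mark the vertex's group; a precise per-vertex structure is updated separately. */
static inline void mark_group(uint64_t *summary, uint64_t v)
{
    uint64_t grp = v / G;
    summary[grp >> 6] |= 1ULL << (grp & 63);
}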
DRAM cells must be refreshed (or rewritten) periodically to maintain data integrity, and as DRAM density grows, so do the refresh time and energy. Not all data need to be refreshed with the same frequency, though, and thus some refresh operations can safely be delayed. Tracking such information allows the memory controller to reduce refresh costs by judiciously choosing when to refresh different rows. Solutions that store imprecise information miss opportunities to avoid unnecessary refresh operations, but the storage for tracking complete information scales with memory capacity. We therefore propose a flexible approach to refresh management that tracks complete refresh information within the DRAM itself, where it incurs negligible storage costs (0.006% of total capacity) and can be managed easily in hardware or software. Completely tracking multiple types of refresh information (e.g., row retention time and data validity) maximizes refresh reduction and lets us choose the most effective refresh schemes. Our evaluations show that our approach saves 25-82% of the total DRAM energy over prior refresh-reduction mechanisms.
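A minimal sketch of the kind of per-row bookkeeping described above: a small table, itself stored in DRAM, records each row's retention-time bin and whether it holds valid data, and the refresh logic consults it to skip rows that are invalid or not yet due. The 2-bit encoding and the power-of-two refresh periods are illustrative assumptions, not the paper's exact format.

#include <stdint.h>
#include <stdbool.h>

struct row_info {
    unsigned retention_bin : 2; /* 0 = weakest cells (shortest period) ... 3 = strongest */
    unsigned valid         : 1; /* row currently holds live data */
};

/* Should this row be refreshed during refresh round `round`? */
static bool needs_refresh(struct row_info ri, uint64_t round)
{
    if (!ri.valid)
        return false;                               /* dead data: skip entirely */
    uint64_t period = 1ull << ri.retention_bin;     /* 1, 2, 4, or 8 base intervals */
    return (round % period) == 0;                   /* refresh only when due */
}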
Limited main memory size is considered one of the major bottlenecks in virtualization environments.
The main memory system is a shared resource in modern multicore machines that can result in serious interference leading to reduced throughput and unfairness. Many new memory scheduling mechanisms have been proposed to address the interference problem. However, these mechanisms usually employ relatively complex scheduling logic and need modifications to memory controllers (MCs), which incur expensive hardware design and manufacturing overheads. This article presents a practical software approach to effectively eliminate the interference without any hardware modifications. The key idea is to modify the OS memory management system and adopt a page-coloring-based Bank-level Partitioning Mechanism (BPM) that allocates dedicated DRAM banks to each core (or thread). By using BPM, memory requests from distinct programs are segregated across multiple memory banks to promote locality/fairness and reduce interference. We further extend BPM to BPM+ by incorporating channel-level partitioning, which we show yields additional gain over BPM in many cases. To achieve benefits in the presence of diverse application memory needs and to avoid performance degradation due to resource underutilization, we propose a dynamic mechanism on top of BPM/BPM+ that assigns appropriate bank/channel resources based on application memory/bandwidth demands, monitored through the PMU (performance-monitoring unit) and a low-overhead OS page-table scanning process. We implement BPM/BPM+ in the Linux 2.6.32.15 kernel and evaluate the technique on four-core and eight-core real machines by running a large number of randomly generated multiprogrammed and multithreaded workloads. Experimental results show that BPM/BPM+ can improve the overall system throughput by 4.7%/5.9% on average (up to 8.6%/9.5%) and reduce the unfairness by an average of 4.2%/6.1% (up to 15.8%/13.9%). Extension of Conference Paper: the additional contributions of this manuscript over the previously published work of L. Liu et al. at PACT-2012 are as follows. (1) We analyze and propose DRAM channel-level partitioning and extend BPM to BPM+. (2) We implement the proposed BPM+ in CentOS Linux 5.4 with kernel 2.6.32.15. (3) We evaluate the impact of BPM+ on system performance and fairness and show additional improvement over BPM. (4) We devise and implement a PMU-based dynamic BPM/BPM+ scheduling mechanism, which works well in practice and brings additional optimization opportunities.
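The dynamic BPM/BPM+ policy can be pictured as a simple feedback loop: sample each program's memory traffic through the PMU, then grow or shrink its dedicated bank set accordingly. The C sketch below shows such a loop under assumed thresholds and a one-bank-at-a-time adjustment step; the real mechanism's heuristics and counters are not specified here.

#include <stdint.h>

#define HIGH_BW_MBPS 2000   /* assumed threshold: bandwidth-hungry task */
#define LOW_BW_MBPS   200   /* assumed threshold: bandwidth-light task  */

struct task_mem_state {
    uint32_t bank_colors;   /* banks currently dedicated to this task */
    uint32_t max_colors;    /* hardware limit on assignable banks */
};

/* Called from a periodic OS tick with the PMU-derived bandwidth sample. */
static void adjust_banks(struct task_mem_state *t, uint64_t bw_mbps)
{
    if (bw_mbps > HIGH_BW_MBPS && t->bank_colors < t->max_colors)
        t->bank_colors++;   /* demand is high: grant another bank color */
    else if (bw_mbps < LOW_BW_MBPS && t->bank_colors > 1)
        t->bank_colors--;   /* underutilized: release a bank color */
}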