Data, addresses, and instructions are compressed by maintaining only significant bytes, with two or three extension bits appended to indicate the significant byte positions. This significance compression method is integrated into a 5-stage pipeline, with the extension bits flowing down the pipeline to enable pipeline operations only for the significant bytes. Consequently, register, logic, and cache activity (and dynamic power) are substantially reduced. An initial trace-driven study shows a reduction in activity of approximately 30-40% for each pipeline stage. Several pipeline organizations are studied. A byte-serial pipeline is the simplest implementation, but suffers a CPI (cycles per instruction) increase of 79% compared with a conventional 32-bit pipeline. Widening certain pipeline stages in order to balance processing bandwidth leads to an implementation with a CPI 24% higher than the baseline 32-bit design. Finally, full-width pipeline stages with operand gating achieve a CPI within 2-6% of the baseline 32-bit pipeline.
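To make the idea of significance compression concrete, the following is a minimal sketch and not the encoding used in the paper. It assumes a simplified scheme in which a 2-bit extension field stores the number of significant low-order bytes minus one, and the dropped upper bytes are pure sign extension. The function names (`sign_extend`, `compress`, `decompress`, `active_byte_fraction`) are hypothetical.

```python
def sign_extend(low: int, nbytes: int) -> int:
    """Sign-extend the low `nbytes` bytes to a 32-bit two's-complement word."""
    sign_bit = 1 << (8 * nbytes - 1)
    return ((low ^ sign_bit) - sign_bit) & 0xFFFFFFFF


def compress(value: int) -> tuple[int, bytes]:
    """Return (extension_bits, significant_bytes) for a 32-bit word.

    Assumption: extension_bits = (number of significant bytes) - 1.
    """
    value &= 0xFFFFFFFF
    nbytes = 4
    for n in range(1, 4):
        low = value & ((1 << (8 * n)) - 1)
        if sign_extend(low, n) == value:
            nbytes = n
            break
    # Little-endian so that the significant bytes come first.
    return nbytes - 1, value.to_bytes(4, "little")[:nbytes]


def decompress(ext: int, payload: bytes) -> int:
    """Rebuild the 32-bit word from the extension bits and stored bytes."""
    return sign_extend(int.from_bytes(payload, "little"), ext + 1)


def active_byte_fraction(words: list[int]) -> float:
    """Fraction of byte lanes that must be active, relative to 4 per word."""
    active = sum(len(compress(w)[1]) for w in words)
    return active / (4 * len(words))
```

For example, `compress(5)` keeps a single byte, while `compress(0x12345678)` keeps all four. A pipeline that gates byte lanes on the extension bits would only activate the lanes counted by `active_byte_fraction`.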
One of the main concerns in today's processor design is the issue logic. Instruction-level parallelism is usually favored by an out-of-order issue mechanism where instructions can issue independently of the program order. The out-of-order scheme yields the best performance but at the same time introduces important hardware costs such as an associative look-up, which might be prohibitive for wide-issue processors with large instruction windows. This associative search may slow down the clock rate and has an important impact on power consumption. In this work, two new issue schemes are presented that reduce the hardware complexity of the issue logic with minimal impact on the average number of instructions executed per cycle.
Introduction. It is well known that current superscalar organisations are approaching a point of diminishing returns. Moving from a 4-way issue to an 8-way issue architecture is not trivial because of the added hardware complexity and its implications for the cycle time. Nevertheless, the instruction-level parallelism (ILP) that an 8-way issue processor can exploit is well beyond that of a 4-way issue one. One of the solutions to this challenge is clustering. Clustering offers the advantages of partitioned schemes, where one can achieve high rates of ILP and sustain a high clock rate. A partitioned architecture tends to make the hardware simpler and its control and data paths faster. For instance, it has fewer register file ports, fewer data bus sources/destinations, and fewer alternatives for many control decisions. Current processors are partitioned into two subsystems (the integer one and the floating-point one). As has been shown [14,16], the FP subsystem can easily be extended to execute simple integer and logical operations. For instance, both register files nowadays hold 64-bit values, and simple integer units (no multiplication or division) can be embedded at a small hardware cost given today's transistor budgets. Furthermore, the hardware modifications required to existing architectures are minimal. This work focuses on this type of clustered architecture with two clusters, one for integer calculations and another for integer and floating-point calculations. The advantage of this architecture is that its floating-point registers, data path and, mainly, its issue logic are now used 100% of the time in any application. There are two main issues concerning clustered architectures. The first one is the communication overhead between clusters. Since inter-cluster communications can easily take one or more cycles, the higher the number of communications, the lower the performance will be, due to the delay introduced between dependent instructions.
The second issue is the workload balance. If the workload is not well balanced, one of the clusters might have more work than it can manage while the other is less productive than it could be. Thus, in order to achieve the highest performance we have to balance the workload and, at the same time, minimise the number of communications. The workload balance and the communication overhead depend on the technique used to distribute the program instructions between both clusters. Programs can be partitioned either at compile time (statically) or at run time (dynamically). The latter approach relies on steering logic that decides in which cluster each decoded instruction will be executed. The steering logic is responsible for optimising the trade-off between communication and workload balance and is therefore a key issue in the design. In this work, a new steering scheme is proposed and its performance is evaluated. We show that the proposed scheme outperforms a previously proposed static approach for the same architecture [16]. Moreov...
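The trade-off between communication and workload balance that a dynamic steering logic has to manage can be illustrated with a small sketch. This is a hypothetical heuristic written for illustration only, not the steering scheme proposed in the paper. The class `Steering`, the `imbalance_threshold` knob, and the use of a cumulative instruction count as the "load" of each cluster are all assumptions.

```python
from dataclasses import dataclass, field

INT_CLUSTER, FP_CLUSTER = 0, 1  # FP_CLUSTER also executes simple integer ops


@dataclass
class Steering:
    imbalance_threshold: int = 4  # hypothetical tuning knob
    load: list = field(default_factory=lambda: [0, 0])  # instructions steered so far
    reg_home: dict = field(default_factory=dict)  # register -> producing cluster
    communications: int = 0  # inter-cluster operand transfers so far

    def _remote_operands(self, sources, cluster):
        # Operands with an unknown producer are treated as local.
        return sum(1 for s in sources if self.reg_home.get(s, cluster) != cluster)

    def steer(self, dest, sources, is_fp=False):
        """Pick a cluster for one decoded instruction and update the state."""
        if is_fp:
            cluster = FP_CLUSTER  # only this cluster has FP units
        else:
            remote = [self._remote_operands(sources, c) for c in (0, 1)]
            imbalance = self.load[INT_CLUSTER] - self.load[FP_CLUSTER]
            if abs(imbalance) > self.imbalance_threshold:
                # Balance first when one cluster is far ahead.
                cluster = FP_CLUSTER if imbalance > 0 else INT_CLUSTER
            elif remote[0] != remote[1]:
                # Otherwise minimise inter-cluster communications.
                cluster = 0 if remote[0] < remote[1] else 1
            else:
                cluster = 0 if self.load[0] <= self.load[1] else 1
        self.communications += self._remote_operands(sources, cluster)
        self.load[cluster] += 1
        if dest is not None:
            self.reg_home[dest] = cluster
        return cluster
```

A dependent chain stays in one cluster until the imbalance exceeds the threshold, at which point the next instruction is moved and pays for one communication. Real steering logic would track in-flight instructions rather than a cumulative count.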
Multicore architectures dominate recent microprocessor design. This is due to several reasons: better performance, thread-level parallelism bounds in modern applications, diminishing returns from ILP, better thermal/power scaling (many small cores dissipate less than one large, complex core), and ease and reuse of design. This paper presents a thorough evaluation of multicore architectures. The architecture we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1 and L2 caches, and a shared bus interconnect. We consider parallel shared-memory applications. We explore the design space related to the number of cores, L2 cache size and processor complexity, showing the behavior of the different configurations/applications with respect to performance, energy consumption and temperature. Design tradeoffs are analyzed, stressing the interdependency of the metrics and design factors. In particular, we evaluate several chip floorplans. Their power/thermal characteristics are analyzed, and they show the importance of considering thermal effects at the architectural level to achieve the best design choice.
The issue logic of dynamically scheduled superscalar processors is one of their most complex and power-consuming parts. In this paper we present alternative issue-logic designs that are much simpler than the traditional scheme while retaining most of its ability to exploit ILP. These alternative schemes are based on two observations: most values produced by a program are used by very few instructions, and the latencies of most operations are deterministic.
Chip multiprocessor (CMP) systems have made the on-chip caches a critical resource shared among co-scheduled threads. Limited off-chip bandwidth, increasing on-chip wire delay, destructive inter-thread interference, and diverse workload characteristics pose key design challenges. To address these challenges, we propose CMP cooperative caching (CC), a unified framework to efficiently organize and manage on-chip cache resources. By forming a globally managed, shared cache using cooperative private caches, CC can effectively support two important caching applications: (1) reduction of average memory access latency and (2) isolation of destructive inter-thread interference. CC reduces the average memory access latency by balancing between cache latency and capacity optimizations. Based on private caches, CC naturally exploits their access latency benefits. To improve the effective cache capacity, CC forms a "shared" cache using replication control and LRU-based global replacement policies. Via cooperation throttling, CC provides a spectrum of caching behaviors between the two extremes of private and shared caches, thus enabling dynamic adaptation to suit workload requirements. We show that CC can achieve a robust performance advantage over private and shared cache schemes across different processor, cache and memory configurations, and a wide selection of multithreaded and multiprogrammed workloads. To isolate inter-thread caching interference, we add a time-sharing aspect on top of spatial cache partitioning. Our approach uses Multiple Time-sharing Partitions (MTP) to simultaneously improve throughput and fairness while maintaining QoS over the longer term. Each MTP partition unfairly improves at least one thread's throughput, and partitions favoring different threads are scheduled in a cooperative, time-sharing manner to either maintain fairness and QoS, or implement priority. We also integrate MTP with CC's LRU-based capacity sharing policy to combine their benefits.
The integrated scheme, Cooperative Caching Partitioning (CCP), divides the total execution epochs into those controlled by either MTP or the baseline CC policy, according to the fraction of threads that can benefit from each of them. Our simulation results show that for a wide range of multiprogrammed workloads, CCP can improve throughput, fairness and QoS for workloads suffering from destructive interference, while achieving the performance benefit of the baseline CC policy for other workloads.
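The way cooperating private caches can behave like a larger shared cache, and how a throttling parameter spans the range between private and shared behavior, can be sketched as follows. This is a toy model under stated assumptions, not the mechanism evaluated in the thesis: each private cache is a plain LRU set of block names, an evicted block is spilled to a randomly chosen peer with probability `cooperation`, a spilled block gets only one such chance, and a remote hit migrates the block instead of replicating it. The class and parameter names are hypothetical.

```python
import random
from collections import OrderedDict


class CooperativeCaches:
    def __init__(self, n_cores, capacity, cooperation=1.0, seed=0):
        # One LRU-ordered private cache per core: block -> "was spilled" flag.
        self.caches = [OrderedDict() for _ in range(n_cores)]
        self.capacity = capacity
        self.cooperation = cooperation  # 0.0 = purely private, 1.0 = always spill
        self.rng = random.Random(seed)

    def _insert(self, core, block, spilled=False):
        cache = self.caches[core]
        cache[block] = spilled
        cache.move_to_end(block)  # most recently used position
        if len(cache) <= self.capacity:
            return
        victim, was_spilled = cache.popitem(last=False)  # evict the LRU block
        # Victims displaced by a spill, or already spilled once, are dropped.
        if spilled or was_spilled or len(self.caches) < 2:
            return
        if self.rng.random() < self.cooperation:
            peers = [c for c in range(len(self.caches)) if c != core]
            self._insert(self.rng.choice(peers), victim, spilled=True)

    def access(self, core, block):
        """Return where the block was found: 'local', 'remote' or 'memory'."""
        local = self.caches[core]
        if block in local:
            local.move_to_end(block)
            return "local"
        for peer, cache in enumerate(self.caches):
            if peer != core and block in cache:
                del cache[block]  # migrate rather than replicate (assumption)
                self._insert(core, block)
                return "remote"
        self._insert(core, block)
        return "memory"
```

With `cooperation=1.0`, a block evicted from one core's cache can still be served on chip from a peer. With `cooperation=0.0`, the same access sequence goes off chip, which corresponds to purely private caches.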