Somayeh Sardashti scite author profile

The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and exible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

show abstract

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy Optimization

Sardashti

Wood

2014

IEEE Micro

View full text Add to dashboard Cite

Skewed Compressed Caches

Sardashti

Seznec

Wood

2014

View full text Add to dashboard Cite

Abstract-Cache compression seeks the benefits of a larger cache with the area and power of a smaller cache. Ideally, a compressed cache increases effective capacity by tightly compacting compressed blocks, has low tag and metadata overheads, and allows fast lookups. Previous compressed cache designs, however, fail to achieve all these goals.In this paper, we propose the Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance. SCC tracks superblocks to reduce tag overhead, compacts blocks into a variable number of sub-blocks to reduce internal fragmentation, but retains a direct tag-data mapping to find blocks quickly and eliminate extra metadata (i.e., no backward pointers). SCC does this using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account. In our experiments, SCC provides on average 8% (up to 22%) higher performance, and on average 6% (up to 20%) lower total energy, achieving the benefits of the recent Decoupled Compressed Cache [26] with a factor of 4 lower area overhead and lower design complexity.

show abstract

A Primer on Compression in the Memory Hierarchy

Sardashti

Arelakis

Stenstrm

2015

Synthesis Lectures on Computer Architecture

View full text Add to dashboard Cite

Synthesis Lectures on Computer Architecture publishes 50-to 100-page publications on topics pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals. e scope will largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA, MICRO, and ASPLOS.All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means-electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.

show abstract

F1 query

et al. 2018

View full text Add to dashboard Cite

F1 Query is a stand-alone, federated query processing platform that executes SQL queries against data stored in different filebased formats as well as different storage systems at Google (e.g., Bigtable, Spanner, Google Spreadsheets, etc.). F1 Query eliminates the need to maintain the traditional distinction between different types of data processing workloads by simultaneously supporting: (i) OLTP-style point queries that affect only a few records; (ii) low-latency OLAP querying of large amounts of data; and (iii) large ETL pipelines. F1 Query has also significantly reduced the need for developing hard-coded data processing pipelines by enabling declarative queries integrated with custom business logic. F1 Query satisfies key requirements that are highly desirable within Google: (i) it provides a unified view over data that is fragmented and distributed over multiple data sources; (ii) it leverages datacenter resources for performant query processing with high throughput and low latency; (iii) it provides high scalability for large data sizes by increasing computational parallelism; and (iv) it is extensible and uses innovative approaches to integrate complex business logic in declarative query processing. This paper presents the end-to-end design of F1 Query. Evolved out of F1, the distributed database originally built to manage Google's advertising data, F1 Query has been in production for multiple years at Google and serves the querying needs of a large number of users and systems.

show abstract

Decoupled compressed cache

Sardashti

Wood

2013

View full text Add to dashboard Cite

In multicore processor systems, last-level caches (LLCs) play a crucial role in reducing system energy by i) filtering out expensive accesses to main memory and ii) reducing the time spent executing in high-power states. Cache compression can increase effective cache capacity and reduce misses, improve performance, and potentially reduce system energy. However, previous compressed cache designs have demonstrated only limited benefits due to internal fragmentation and limited tags.In this paper, we propose the Decoupled Compressed Cache (DCC), which exploits spatial locality to improve both the performance and energy-efficiency of cache compression. DCC uses decoupled super-blocks and non-contiguous sub-block allocation to decrease tag overhead without increasing internal fragmentation. Non-contiguous sub-blocks also eliminate the need for energy-expensive re-compaction when a block's size changes. Compared to earlier compressed caches, DCC increases normalized effective capacity to a maximum of 4 and an average of 2.2 for a wide range of workloads. A further optimized Co-DCC (Co-Compacted DCC) design improves the average normalized effective capacity to 2.6 by co-compacting the compressed blocks in a super-block. Our simulations show that DCC nearly doubles the benefits of previous compressed caches with similar area overhead. We also demonstrate a practical DCC design based on a recent commercial LLC design.

show abstract

Yet Another Compressed Cache

Sardashti

Seznec

Wood

2016

ACM Trans. Archit. Code Optim.

View full text Add to dashboard Cite

Cache memories play a critical role in bridging the latency, bandwidth, and energy gaps between cores and off-chip memory. However, caches frequently consume a significant fraction of a multicore chip's area and thus account for a significant fraction of its cost. Compression has the potential to improve the effective capacity of a cache, providing the performance and energy benefits of a larger cache while using less area. The design of a compressed cache must address two important issues: (i) a low-latency, low-overhead compression algorithm that can represent a fixed-size cache block using fewer bits and (ii) a cache organization that can efficiently store the resulting variable-size compressed blocks. This article focuses on the latter issue. Here, we propose Yet Another Compressed Cache (YACC), a new compressed cache design that targets improving effective cache capacity with a simple design. YACC uses super-blocks to reduce tag overheads while packing variable-size compressed blocks to reduce internal fragmentation. YACC achieves the benefits of two state-of-the art compressed caches-Decoupled Compressed Cache (DCC) [Sardashti and Wood 2013a, 2013b] and Skewed Compressed Cache (SCC) [Sardashti et al. 2014]-with a more practical and simpler design. YACC's cache layout is similar to conventional caches, with a largely unmodified tag array and unmodified data array. Compared to DCC and SCC, YACC requires neither the significant extra metadata (i.e., back pointers) needed by DCC to track blocks nor the complexity and overhead of skewed associativity (i.e., indexing ways differently) needed by SCC. An additional advantage over previous work is that YACC enables modern replacement mechanisms, such as RRIP. For our benchmark set, compared to a conventional uncompressed 8MB LLC, YACC improves performance by 8% on average and up to 26%, and reduces total energy by 6% on average and up to 20%. An 8MB YACC achieves approximately the same performance and energy improvements as a 16MB conventional cache at a much smaller silicon footprint, with only 1.6% greater area than an 8MB conventional cache. YACC performs comparably to DCC and SCC but is much simpler to implement.

show abstract

Thread-Sensitive Instruction Issue for SMT Processors

Robatmili

Yazdani

Sardashti

et al. 2004

IEEE Comput. Arch. Lett.

View full text Add to dashboard Cite

12 3

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.