Effective instruction scheduling techniques for an interleaved cache clustered VLIW processor

Gibert, Enric; Sánchez, J.; González, Antonio

doi:10.1109/micro.2002.1176244

Cited by 14 publications

(19 citation statements)

References 20 publications


“…One simply provides the microarchitecture and topology in an abstract form; i.e., location, number, and spatial relationship of microarchitectural resources such as PEs, caches, and register files. While we demonstrated SPS's effectiveness for the TRIPS ISA and microarchitecture, we believe it is applicable to schedulers for other partitioned architectures such as WaveScalar [38], and may be useful for clustered VLIWs [20] and RAW [67,34].…”

Section: Spatial Path Scheduling (mentioning)

confidence: 97%

Tera-OP Reliable Intelligently Adaptive Processing System (TRIPS) Implementation

Keckler, Burger, McKinley, et al. 2008





“…Another way to partition the L1 data cache is to distribute a cache line among clusters in a word-interleaved manner [15]. In such a configuration, each cache module will hold some words of each memory block, depending on the data address and the interleaving factor of the architecture.…”

Section: Architecture (mentioning)

confidence: 99%
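The word-interleaved mapping described in the statement above can be illustrated with a minimal sketch. The 4-byte word size, the four clusters, the 32-byte block, and the names `home_cluster` and `split_block` are assumptions made for illustration; they are not taken from the cited papers, and "interleaving factor" is read here as the number of consecutive words kept together in one cache module.

```python
from typing import Dict, List

WORD_BYTES = 4  # assumed word size, not specified in the quoted text


def home_cluster(address: int, num_clusters: int = 4, interleave_words: int = 1) -> int:
    """Return the cluster whose cache module would hold the word at `address`."""
    word_index = address // WORD_BYTES
    # Consecutive groups of `interleave_words` words rotate across the clusters.
    return (word_index // interleave_words) % num_clusters


def split_block(block_base: int, block_bytes: int = 32, num_clusters: int = 4,
                interleave_words: int = 1) -> Dict[int, List[int]]:
    """Map each word address of one memory block to the cluster that holds it."""
    layout: Dict[int, List[int]] = {c: [] for c in range(num_clusters)}
    for addr in range(block_base, block_base + block_bytes, WORD_BYTES):
        layout[home_cluster(addr, num_clusters, interleave_words)].append(addr)
    return layout


if __name__ == "__main__":
    for cluster, words in split_block(0).items():
        print(cluster, words)
```

Under these assumptions every cache module ends up holding a slice of every block, which is why the cluster an access is scheduled on determines whether it is local or remote.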

“…These buses are controlled by the compiler, which is responsible for adding and scheduling an explicit copy operation whenever it assigns two register-flow dependent instructions to different clusters. This paper presents a comparative study of different architecture/compilation techniques that we have recently proposed for fully distributed clustered VLIW processors ([38], [15], [17]). For each proposed architecture, efficient instruction scheduling techniques are developed, which are strongly tied to the architectural configuration in order to exploit its particularities.…”

Section: Introduction (mentioning)

confidence: 99%
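The copy-insertion rule in the statement above (an explicit copy whenever two register-flow dependent instructions land on different clusters) can be sketched as follows. This is a simplified illustration, not the scheduler from the cited papers: the function name `required_copies`, the edge-list representation, and the choice to reuse one copy per value and destination cluster are all assumptions, and the sketch does not model bus latency or the cycle in which a copy is scheduled.

```python
from typing import Dict, List, Set, Tuple


def required_copies(edges: List[Tuple[str, str]],
                    cluster_of: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """List the (value, source cluster, destination cluster) copies a cluster
    assignment implies. `edges` holds (producer, consumer) register-flow
    dependences; `cluster_of` maps each instruction to its assigned cluster."""
    copies: List[Tuple[str, int, int]] = []
    seen: Set[Tuple[str, int]] = set()
    for producer, consumer in edges:
        src, dst = cluster_of[producer], cluster_of[consumer]
        # Assumption: one copy per (value, destination cluster) serves all consumers there.
        if src != dst and (producer, dst) not in seen:
            seen.add((producer, dst))
            copies.append((producer, src, dst))
    return copies


if __name__ == "__main__":
    deps = [("a", "b"), ("a", "c"), ("b", "c")]
    placement = {"a": 0, "b": 0, "c": 1}
    print(required_copies(deps, placement))
```

Counting these copies for a candidate placement is one simple way to see the cost that a cluster-assignment heuristic trades against workload balance.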

Distributed Data Cache Designs for Clustered VLIW Processors

Gibert, Sánchez, González, 2005, IEEE Trans. Comput.

Self Cite


Abstract: Wire delays are a major concern for current and forthcoming processors. One approach to deal with this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the L1 data cache typically remains centralized in what we call partially distributed architectures. However, as technology evolves, the relative latency of such a centralized cache will increase, leading to an important impact on performance. In this paper, we propose partitioning the L1 data cache among clusters for clustered VLIW processors. We refer to this kind of design as fully distributed processors. In particular, we propose and evaluate three different configurations: a snoop-based cache coherence scheme, a word-interleaved cache, and flexible L0 buffers managed by the compiler. For each alternative, instruction scheduling techniques targeted to cyclic code are developed. Results for the Mediabench suite show that the performance of such fully distributed architectures is always better than the performance of a partially distributed one with the same amount of resources. In addition, the key aspects of each fully distributed configuration are explored.


“…In [23], the authors proposed to distribute the L1 cache among clusters in a cache-coherent manner. In [10], a much simpler design was proposed, in which the L1 data cache is distributed among clusters in a word-interleaved manner. We compare our work to these two distributed cache configurations in Section 5.3.…”

Section: Related Work (mentioning)

confidence: 99%

“…The cache could be close to one or few clusters but not to all of them. Because of that, some recent works advocate for the distribution of the first level data cache among clusters as well [24][23][10]. Several configurations have been studied and instruction scheduling techniques have been proposed to exploit the underlying cache architecture.…”

Section: Introduction (mentioning)

confidence: 99%