The impact of exploiting instruction-level parallelism on shared-memory multiprocessors

Pai, Vijay S.; Ranganathan, Parthasarathy; Abdel-Shafi, Hazim; Adve, Sarita V.

doi:10.1109/12.752663

Cited by 17 publications

(18 citation statements)

References 15 publications

(15 reference statements)

“…This facilitates INSTRUCTION LEVEL PARALLELISM with loop as combined optimization. The impact of ILP processors on the performance of shared memory multiprocessors [17] with and without latency hiding optimizing software prefetching has been represented by Pai, Ranganathan, Shafi and Adve (1999). One of the critical goals in the code optimization for multiprocessor system on single chip architecture [4] is to minimize the number of off chip memory access.…”

Section: Related Work (mentioning)

confidence: 99%

Role of Multiblocks in Control Flow Prediction using Parallel Register Sharing Architecture

Kumar, Singh

2010

IJCA

In this paper we present control flow prediction (CFP) in a parallel register sharing architecture to achieve a high degree of ILP. The main idea is to go a step beyond the prediction of common branches by giving the architecture information about the CFG (Control Flow Graph) components of the program, so that it can make better branch decisions for ILP. The navigation bandwidth of the prediction mechanism depends upon the degree of ILP. It can be increased by increasing control flow prediction at compile time. This increases the size of initiation, which allows the overlapped execution of multiple independent flows of control. Multiple branch instructions can also be allowed. These are intermediate steps taken to increase the size of the dynamic window and so achieve a high degree of instruction-level parallelism exploitation.

“…[20] represents the impact of ILP processors on the performance of shared-memory multiprocessors, both without and with the latency hiding optimization of software pre-fetching. One of the critical goals in code optimization for Multiprocessor-System-on-a-Chip (MPSoC) architectures is to minimize the number of off-chip memory accesses.…”

Section: Epic (Explicitly Parallel… (mentioning)

confidence: 99%

A Modern Parallel Register Sharing Architecture for Code Compilation

Kumar, Singh

2010

IJCA

The design of many-core-on-a-chip has renewed an intense interest in parallel computing. On the implementation side, most applications are not able to use enough parallelism in a parallel register sharing architecture. The exploitation of the potential performance of superscalar processors has shown that the processor must be fed with sufficient instruction bandwidth. The fetcher and the Instruction Stream Buffer (ISB) are the key elements to achieve this target. Beyond the basic blocks, the instruction stream is not supported by current ISBs. The split-line instruction problem worsens this situation for x86 processors. With the implementation of a Line Weighted Branch Target Buffer (LWBTB), the advance branch information and reassembling of cache lines can be predicted by the ISB. The ISB can fetch more valid instructions in a cycle by reassembling the original line containing instructions for the next basic block. If the cache line size is more than 64 bytes, there is a good chance of having two basic blocks in the recognized instruction line. Code generation for a parallel register sharing architecture involves some issues that are not present in sequential code compilation and is inherently complex. To resolve such issues, a consistency contract between the code and the machine can be defined, and a compiler is required to preserve the contract during the transformation of code. In this paper, we present a correctness framework to ensure the protection of the contract, and then we use code optimization for verification under parallel code.

“…The graph shows multiprocessor and uniprocessor experiments (MP/UP) before and after clustering (Base/Clust), normalized to the given application and system size without clustering. For analysis, execution time is categorized into data memory stall, CPU, synchronization stall, and instruction memory stall times, following the conventions of previous work (e.g., [14]). Since writes can retire before completing and read hits are fast, nearly all data memory stalls stem from reads that miss in the L2 cache.…”

Section: Performance Of Latbench (mentioning)

confidence: 99%

“…Our previous work characterized the effectiveness of ILP processors in a shared-memory multiprocessor [14]. Although ILP techniques successfully and consistently reduced the CPU component of execution time, their impact on the memory (read) stall component was lower and more application-dependent, making read stall time a larger bottleneck in ILP-based multiprocessors than in previous-generation systems.…”

Section: Introduction (mentioning)

confidence: 99%