Disjoint out-of-order execution processor

Sharafeddin, Mageda; Jothi, Komal; Akkary, Haitham

doi:10.1145/2355585.2355592

Cited by 13 publications

(6 citation statements)

References 69 publications

“…As the effectiveness of thread partition is determined by these five parameters, and [LLoTG, ULoTG, DDC, LLoSD, ULoSD] represents the partition scheme. For example, a partition scheme could be [10, 50, 18, 20, 30]. These values indicate that thread granularity ranges from 10 to 50, and data dependence count is no more than 18, and spawning distance is set from 20 to 30 during the period of thread partition, and H_1, H_2, H_3, H_4, H_5 can be expressed as follows:…”

Section: Partitioning Scheme (mentioning)

confidence: 99%
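
To make the five-parameter scheme in the statement above concrete, here is a minimal sketch that stores [LLoTG, ULoTG, DDC, LLoSD, ULoSD] and checks a candidate thread against it. The class name, the field names, the `admits` helper, and the choice of inclusive bounds are assumptions made for illustration; the citing paper's own definitions of H_1 to H_5 are not reproduced in the excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionScheme:
    """Hypothetical container for [LLoTG, ULoTG, DDC, LLoSD, ULoSD]."""
    lower_granularity: int        # LLoTG: lower limit of thread granularity
    upper_granularity: int        # ULoTG: upper limit of thread granularity
    max_dependence_count: int     # DDC: data dependence count must not exceed this
    lower_spawn_distance: int     # LLoSD: lower limit of spawning distance
    upper_spawn_distance: int     # ULoSD: upper limit of spawning distance

    def admits(self, granularity: int, dependence_count: int,
               spawn_distance: int) -> bool:
        # Assumption: all bounds are inclusive.
        return (
            self.lower_granularity <= granularity <= self.upper_granularity
            and dependence_count <= self.max_dependence_count
            and self.lower_spawn_distance <= spawn_distance <= self.upper_spawn_distance
        )


# The example scheme quoted in the excerpt.
scheme = PartitionScheme(10, 50, 18, 20, 30)
```

Under these assumptions, a candidate thread with granularity 30, dependence count 12 and spawning distance 25 is admitted, while one with granularity 60 is rejected.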

IDaTPA: Importance Degree Based Thread Partitioning Approach in Thread Level Speculation

Li, Zhang, Wang et al. 2023

Preprint

As an auto-parallelization technique with the level of thread on multi-core, Thread-Level Speculation (TLS) which is also called Speculative Multithreading (SpMT), partitions programs into multiple threads and speculatively executes them under conditions of ambiguous data and control dependence. Thread partitioning approach plays a key role to the performance enhancement in TLS. The existing heuristic rules-based approach (HR-based approach) which is a one-size-fits-all strategy, can not guarantee to achieve the optimal thread partitioning. In this paper, an importance degree based thread partitioning approach (IDaTPA) is proposed to realize the partition of irregular programs into multithreads. IDaTPA implements biasing partitioning for every procedure with a machine learning method. It mainly includes: constructing sample set, expression of knowledge, calculation of similarity, prediction model and the partition of the irregular programs is performed by the prediction model. Using IDaTPA, the subprocedures in unseen irregular programs can obtain their optimal partition. On a generic SpMT processor (called Prophet) to perform the performance evaluation for multithreaded programs, the IDaTPA is evaluated and averagely delivers a speedup of 1.80 upon a 4-core processor. Furthermore, in order to obtain the portability evaluation of IDaTPA, we port IDaTPA to 8-core processor and obtain a speedup of 2.82 on average. Experiment results show that IDaTPA obtains a significant speedup increasement and Olden benchmarks respectively deliver a 5.75% performance improvement on 4-core and a 6.32% performance improvement on 8-core, and SPEC2020 benchmarks obtain a 38.20% performance improvement than the conventional HR-based approach.

“…Within this model that the closer the distance between x_q and x_i is, the bigger the weights are and the sum of weight is equal to 1, we can obtain the assignment weights in the formula (10).…”

Section: Generation Of Partition Scheme (mentioning)

confidence: 99%
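
The statement above only gives two properties of the weights: a sample x_i closer to the query x_q gets a larger weight, and the weights sum to 1. Formula (10) itself is not reproduced in the excerpt, so the sketch below swaps in plain normalized inverse-distance weighting, which satisfies both properties but may differ from the citing paper's formula. The function name, the Euclidean distance, the `eps` guard, and the sample vectors are all assumptions.

```python
import math


def distance_weights(query, samples, eps=1e-9):
    """Normalized inverse-distance weights (an assumed stand-in for formula (10))."""
    # Euclidean distance between the query x_q and each sample x_i (assumption).
    distances = [math.dist(query, x) for x in samples]
    # eps avoids division by zero when a sample coincides with the query.
    inverse = [1.0 / (d + eps) for d in distances]
    total = sum(inverse)
    # Dividing by the total makes the weights sum to 1.
    return [w / total for w in inverse]


# Hypothetical feature vectors: the first sample lies much closer to the query.
x_q = [1.0, 2.0, 3.0]
samples = [[1.0, 2.0, 4.0], [9.0, 9.0, 9.0]]
weights = distance_weights(x_q, samples)
```

With these vectors the first sample receives the larger weight, and the two weights add up to 1 within floating-point tolerance.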

IDaTPA: Importance Degree Based Thread Partitioning Approach in Thread Level Speculation

Li, Zhang, Wang et al. 2023

Preprint

“…The existing contributions on a hardware approach to automatize parallelization [18][19][20][21] are penalized by the low basic ILP measured in programs [10]. The hardware-based parallelization in Goossens et al [22] overcomes this limitation in 2 ways: (1) very distant ILP is caught when fetch is parallelized, (2) all stack memory false dependences and stack pointer true dependences are removed.…”

Section: Related Work and Conclusion (mentioning)

confidence: 99%

Computing on many cores

Goossens, Parello, Porada et al. 2017

Concurrency and Computation

Summary: This paper presents an alternative method to parallelize programs, better suited to manycore processors than actual operating system–/API‐based approaches like OpenMP and MPI. The method relies on a parallelizing hardware and an adapted programming style. It frees and captures the instruction‐level parallelism (ILP). A many‐core design is presented in which cores are multithreaded and able to fork new threads. The programming style is based on functions. The hardware creates a concurrent thread at each function call. The programming style and the hardware create the conditions to free the ILP, by eliminating the architectural dependences between a call and its continuation after return. We illustrate the method on a sum reduction, a matrix multiplication and a sort. We measure the ILP of the parallel runs and show that it is high enough to feed thousands of cores because it increases with data size. We compare our method to pthread parallelization, showing that (1) our parallel execution is deterministic, (2) our thread management is cheap, (3) our parallelism is implicit, and (4) our method parallelizes functions and loops. Implicit parallelism makes parallel code easy to write and read. Deterministic parallel execution makes parallel code easy to debug.

“…They gave solutions to allocate later and free sooner the needed resources to optimize their usage and so, take care of more "on-the-fly" instructions with the same resources. In 2012, Sharafeddine, Jothi and Akkary [12] proposed an architecture to partition a run into parallel threads, forking the leading thread at call. In the sum example this leads to fork on both of the highest levels calls but not on the lower levels, capturing only a small part of the distant ILP.…”

Section: ILP In Programs (mentioning)

confidence: 99%

Toward a Core Design to Distribute an Execution on a Manycore Processor

Goossens, Parello, Porada et al. 2015

Lecture Notes in Computer Science

Abstract. This paper presents a parallel execution model and a manycore processor design to run C programs in parallel. The model automatically builds parallel sections of machine instructions from the run trace. It parallelizes instructions fetches, renamings, executions and retirements. Predictor based fetch is replaced by a fetch-decode-and-partly-execute stage able to compute in-order most of the control instructions. Tomasulo's register renaming is extended to memory with a technique to match consumer/producer pairs. The Reorder Buffer is adapted to allow parallel retirement. The model is presented on a sum reduction example which is also used to give a short analytical evaluation of the model performance potential.
