Control-Flow Decoupling: An Approach for Timely, Non-Speculative Branching

Sheikh, Rami; Tuck, James; Rotenberg, Eric

doi:10.1109/tc.2014.2361526

Cited by 10 publications

(7 citation statements)

References 32 publications


“…Further reducing mispredictions in software may be hard. However, previous hardware proposals [41] could help: since the loop trip count is generated outside of the loop body, the count could be communicated to the branch predictor in hardware, completely eliminating branch mispredictions.…”

Section: Mitigating Branch Misprediction Penalty (mentioning)

confidence: 99%

SparseTrain

Gong, Ji, Fletcher, et al. 2020

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques


Our community has improved the efficiency of deep learning applications by exploiting sparsity in inputs. Most of that work, though, is for inference, where weight sparsity is known statically, and/or for specialized hardware. In this paper, we propose SparseTrain, a software-only scheme to leverage dynamic sparsity during training on general-purpose SIMD processors. SparseTrain exploits zeros introduced by the ReLU activation function to both feature maps and their gradients. Exploiting such sparsity is challenging because the sparsity degree is moderate and the locations of zeros change over time. SparseTrain identifies zeros in a dense data representation and performs vectorized computation. Variations of the scheme are applicable to all major components of training: forward propagation, backward propagation by inputs, and backward propagation by weights. Our experiments on a 6-core Intel Skylake-X server show that SparseTrain is very effective. In end-to-end training of VGG16, ResNet-34, and ResNet-50 with ImageNet, SparseTrain outperforms a highly-optimized direct convolution on the non-initial convolutional layers by 2.19x, 1.37x, and 1.31x, respectively. SparseTrain also benefits inference. It accelerates the non-initial convolutional layers of the aforementioned models by 1.88x, 1.64x, and 1.44x, respectively.

CCS Concepts: Computing methodologies → Neural networks; Shared memory algorithms; Vector / streaming algorithms.


“…Speculative multithreading executes pre-computation slices [56] with architectural support to validate speculations, relies on ultra-light-weight threads to perform prefetching [13,18,61] or requires hardware communication channels between the prefetching and the main thread [49,53,58]. CFD [64] requires an architectural queue to efficiently communicate branch predicates that are loaded early in advance. Other proposals, most notably Multiscalar [24,66,74], combine software and hardware to enable instruction level parallelism using compiler-generated code structures, i.e., tasks, which can be executed simultaneously on multiple processing units.…”

Section: Related Work (mentioning)

confidence: 99%

SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores

et al. 2018


Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both performance and high energy efficiency. In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design. Our approach is able to hide memory latency and attain increased memory and instruction level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms. We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches we do not rely on either software or hardware speculation that can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support, thus being able to transform applications that include highly complex control-flow and indirect memory accesses. The effectiveness of our software-hardware co-design is proven on the most limited but energy-efficient microarchitectures, non-speculative, in-order execution (InO) cores, which rely entirely on compile-time instruction scheduling.


Tran

Jimborean

Carlson

et al. 2018

Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation

