MIOpen: An Open Source Library For Deep Learning Primitives

Khan, Jehandad; Fultz, Paul; Tamazov, Artem; Lowell, Daniel; Liu, Chao; Melesse, Michael; Nandhimandalam, Murali; Nasyrov, K. A.; Perminov, I.; Shah, Tejash; Filippov, V. L.; Zhang, Jing; Zhou, Jing; Natarajan, B.; Daga, Mayank

doi:10.48550/arxiv.1910.00078

Cited by 3 publications

(3 citation statements)

References 0 publications

“…These APIs are backed by reference implementations that enable Flashlight to efficiently target CPUs, GPUs, and other accelerators. These include code generation and dedicated kernels for Intel, AMD, OpenCL, and CUDA devices, and leverage libraries such as cuDNN [Chetlur et al, 2014], MKL [Intel, 2020a], oneDNN [Intel, 2020b], ArrayFire [Malcolm et al, 2012], and MiOpen [Khan et al, 2019].…”

Section: Open Foundational Interfaces (mentioning)

confidence: 99%

Flashlight: Enabling Innovation in Tools for Machine Learning

Kahn, Pratap, Likhomanenko et al. 2022

Preprint

As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increase, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototyping new computational paradigms with existing frameworks. Large frameworks prioritize machine learning researchers and practitioners as end users and pay comparatively little attention to systems researchers who can push frameworks forward; we argue that both are equally important stakeholders. We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems by prioritizing open, modular, customizable internals and state-of-the-art, research-ready models and training setups across a variety of domains. Flashlight allows systems researchers to rapidly prototype and experiment with novel ideas in machine learning computation and has low overhead, competing with and often outperforming other popular machine learning frameworks. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together.

“…This paper target to catch up with NVIDIA's performance [16]. MIOpen is a deep learning accelerating library implemented for Radeon graphics by AMD [18]. At present, the implementation of each algorithm of the library is not perfect, and the performance fails to meet expectations, but that has certain guiding significance for the implementation of GPU in Winograd algorithm.…”

Section: Introduction (mentioning)

confidence: 99%

Optimizing Winograd convolution on GPUs via multithreaded communication

et al. 2023

Second International Conference on Algorithms, Microchips, and Network Applications (AMNA 2023)

In high-performance computing (HPC), convolution operations account for a large proportion of the work in convolutional neural networks, and convolutional neural networks are very common in image- and video-based deep learning applications; this paper therefore takes improving the performance of the convolution operation as its research direction. Convolution can be performed in many ways, such as computing it directly from the mathematical definition, converting it to a Fast Fourier Transform (FFT), converting it to batched matrix multiplication (im2col), or using the Winograd algorithm. For small filters, Winograd has unique advantages. This paper introduces an implementation of Winograd in the AMD ROCm environment and an optimization method for Winograd based on a multi-thread communication algorithm. For the Winograd convolution in ROCm 2.9.0, the speed of the algorithm was increased by more than 150% after the optimization in this paper. In certain computing-power situations, the performance of the optimized algorithm approaches or even exceeds that of cuDNN and MIOpen.

“…AMD followed a technical route, naming their analogous component the Matrix Core (MC) in their MI100 series data center GPU. This algorithm-architecture co-design marked a huge success with the fact that mainstream deep learning frameworks like PyTorch have embraced these designs with the help of vendor-provided high-performance libraries like cuDNN, cuBLAS and MIOpen [8]. Other vendors like Google and Tesla have also presented proprietary ASIC accelerators like TPU [9] and Dojo [10], aiming to accelerate the quantized workloads by exploiting special hardware components to calculate low-precision types of data elements.…”

Section: Introduction (mentioning)

confidence: 99%

Benchmarking GPU Tensor Cores on General Matrix Multiplication Kernels through CUTLASS

Huang, Zhang, Yang et al. 2023

Applied Sciences

GPUs have been broadly used to accelerate big data analytics, scientific computing and machine intelligence. Particularly, matrix multiplication and convolution are two principal operations that use a large proportion of steps in modern data analysis and deep neural networks. These performance-critical operations are often offloaded to the GPU to obtain substantial improvements in end-to-end latency. In addition, multifarious workload characteristics and complicated processing phases in big data demand a customizable yet performant operator library. To this end, GPU vendors, including NVIDIA and AMD, have proposed template and composable GPU operator libraries to conduct specific computations on certain types of low-precision data elements. We formalize a set of benchmarks via CUTLASS, NVIDIA’s templated library that provides high-performance and hierarchically designed kernels. The benchmarking results show that, with the necessary fine tuning, hardware-level ASICs like tensor cores could dramatically boost performance in specific operations like GEMM offloading to modern GPUs.
