This paper is concerned with accurate matrix multiplication in floating-point arithmetic. Recently, an accurate summation algorithm was developed by Rump et al. (SIAM J Sci Comput 31(1):189-224, 2008). The key technique of their method is a fast error-free splitting of floating-point numbers. Using this technique, we first develop an error-free transformation of the product of two floating-point matrices into a sum of floating-point matrices. Next, we partially apply this error-free transformation to develop an algorithm that aims to output an accurate approximation of the matrix product. In addition, an a priori error estimate is given. A characteristic of the proposed method is that, in terms of both computation and memory consumption, the dominant part of our algorithm consists of ordinary floating-point matrix multiplications. The routine for matrix multiplication is highly optimized using BLAS, so our algorithms show good computational performance. Although our algorithms require a significant amount of working memory, they are significantly faster than 'gemmx' in XBLAS when all matrix dimensions are large enough to realize nearly peak performance of 'gemm'. Numerical examples illustrate the efficiency of the proposed method. (Numer Algor (2012) 59:95-118)
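The error-free transformation underlying this approach can be illustrated at the scalar level. Below is a minimal sketch, assuming IEEE 754 double precision, of Veltkamp splitting and Dekker's TwoProduct, which express a floating-point product exactly as a sum of two floating-point numbers (the fast splitting of Rump et al. differs in detail, but the error-free idea is the same):

```c
/* Veltkamp splitting: x = hi + lo, where hi and lo each fit in about half
   of the 53-bit double significand. The constant 134217729 = 2^27 + 1 is
   specific to IEEE 754 double precision. */
void split(double x, double *hi, double *lo) {
    double c = 134217729.0 * x;
    *hi = c - (c - x);
    *lo = x - *hi;
}

/* Dekker's TwoProduct: returns p = fl(a*b) and writes e such that
   a*b = p + e holds exactly (barring overflow/underflow). */
double two_product(double a, double b, double *e) {
    double p = a * b;
    double ah, al, bh, bl;
    split(a, &ah, &al);
    split(b, &bh, &bl);
    /* Each intermediate operation below is exact by Dekker's analysis. */
    *e = ((ah * bh - p) + ah * bl + al * bh) + al * bl;
    return p;
}
```

Applying such a transformation entry-wise (with suitable splittings of whole matrices) is what turns a matrix product into an exact sum of floating-point matrices, each summand computable by an ordinary optimized 'gemm'.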
This paper is concerned with the tight enclosure of the matrix product AB for two floating-point matrices A and B. The aim is to compute component-wise upper and lower bounds of the exact result C of the matrix multiplication AB by floating-point arithmetic; namely, an interval matrix enclosing C is obtained. In this paper, new algorithms for enclosing C are proposed. The proposed algorithms are designed to exploit mainly the level-3 operations in BLAS. Although the proposed algorithms cost roughly twice as much as a standard algorithm promoted by Oishi and Rump, the accuracy of their results is better than that of the standard algorithm. At the end of this paper, we present numerical examples showing the efficiency of the proposed algorithms.

We define the radius of an interval [C̲_ij, C̄_ij] by (C̄_ij − C̲_ij)/2. A dominant approach for this purpose is the fast and useful method promoted by Oishi and Rump [7,9,12]; throughout this paper, we call it Oishi-Rump's method. It exploits the switches of rounding mode defined in IEEE 754; in particular, rounding upward and rounding downward are used. It is particularly worth noting that only matrix operations are required: concretely, two matrix products and only two switches of rounding mode suffice to enclose the matrix multiplication (1).

There are so-called optimized BLAS (basic linear algebra subprograms) implementations whose performance for matrix multiplication is near peak. For example, GotoBLAS [13], Intel Math Kernel Library, and ATLAS [14] are well known as optimized BLAS. Moreover, the matrix multiplication routines in these optimized BLAS can automatically be parallelized on symmetric multi-processing environments. As an advantage of Oishi-Rump's method, such convenient routines for matrix multiplication can be exploited for the enclosure.
Oishi-Rump's method is known to be well balanced between the tightness of the resultant interval matrix and the computational performance. This method has already been implemented in INTLAB, developed by Rump [15]; INTLAB is a fast and useful interval computation toolbox for MATLAB.

To obtain an interval result whose radius is tighter than that calculated by Oishi-Rump's method, one option is to use a multi-precision library. For instance, MPFR [16] is a fast tool for multi-precision floating-point arithmetic. Since it supports directed rounding in its arithmetic as in IEEE 754, an enclosure of the matrix multiplication can be obtained with MPFR, and the radius of the resulting interval matrix is expected to tighten in proportion to the working precision.

Recently, we have investigated an accurate algorithm for matrix multiplication [17,18]. By specializing that algorithm, we develop efficient algorithms for obtaining a tight enclosure of the matrix multiplication. As the proposed algorithms mainly exploit the matrix multiplication routines in BLAS, they receive much benefit from the optimized BLAS in terms of th...
This paper proposes a method for implementing dense matrix multiplication in FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA graphics processing units (GPUs). Tensor Cores are special processing units that perform 4 × 4 matrix multiplications on FP16 inputs with FP32 precision and return the result in FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on an error-free transformation of matrix multiplication. The proposed method has three prominent advantages: first, it can be built upon the cublasGemmEx routine using Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, including the correctly rounded result; third, it ensures bit-level reproducibility even for different numbers of cores and threads. The achievable performance of the method depends on the absolute-value range of the elements of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64-equivalent operation on the Titan RTX GPU (which has 130 TFlops on Tensor Cores), whereas cublasDgemm can achieve only 539 GFlops on the FP64 floating-point units. Our results reveal the possibility of utilizing hardware with limited FP32/FP64 resources and fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads.

Keywords: Tensor cores • FP16 • Half-precision • Low-precision • Matrix multiplication • GEMM • Linear algebra • Accuracy • Reproducibility