Matrix multiplication has been implemented in many programming languages, and improved performance has been reported in numerous articles under a variety of settings. The operation is of central interest to applications ranging from machine learning and animation to lightweight matrix-based key management protocols for IoT networks, so there is a continuing demand for faster implementations. In this work, we compared the run times of matrix multiplication in the popular languages C++, Java, and Python. The analysis showed that the Python implementation was by far the slowest, while Java was noticeably slower than C++. All three languages use a row-major storage scheme, so a naive triple loop incurs many cache misses. We show that simply changing the loop order yields further performance gains, and we evaluated this by comparing execution times under several loop orderings, observing large improvements due to better spatial locality. In addition, we implemented a parallel version of the same algorithm using OpenMP on eight logical cores and achieved a speed-up of roughly seven times over the serial implementation.
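As a minimal illustration of the loop-order effect described above (the flat row-major std::vector layout and the function names are our own assumptions, not the exact code benchmarked here), the sketch below contrasts the naive i-j-k ordering, whose innermost loop strides down a column of B, with the i-k-j ordering, whose innermost loop walks B and C row by row:

#include <cstddef>
#include <vector>

// Naive i-j-k order: the innermost loop reads B[k * n + j] with stride n,
// so consecutive iterations touch different cache lines of the row-major array.
void multiply_ijk(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

// Reordered i-k-j version: the innermost loop now reads B and writes C
// contiguously, giving much better spatial locality. C is assumed to be
// zero-initialized before the call.
void multiply_ikj(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}

Both functions perform the same n^3 multiply-add operations; only the memory access pattern differs, which is where the locality-related gains come from.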
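For the OpenMP parallelization, a simple sketch under the same assumptions (flat row-major storage, zero-initialized C; again not the paper's exact benchmark code) is to split the outer i loop across threads, since each thread then writes disjoint rows of C:

#include <cstddef>
#include <vector>

// OpenMP variant of the reordered i-k-j loop: rows of C are independent,
// so the outer loop can be distributed across the available logical cores.
void multiply_ikj_omp(const std::vector<double>& A, const std::vector<double>& B,
                      std::vector<double>& C, std::size_t n) {
    #pragma omp parallel for schedule(static)
    for (long long ii = 0; ii < static_cast<long long>(n); ++ii) {
        const std::size_t i = static_cast<std::size_t>(ii);
        for (std::size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}

A sketch like this would be compiled with OpenMP enabled (for example, g++ -O2 -fopenmp) and run with OMP_NUM_THREADS set to the number of logical cores available on the machine.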