Massively parallel computing technology has undergone a paradigm shift in recent years. The driving force behind this change is the demand for better graphics hardware in personal computers. The latest graphics processors from ATI, Intel and NVIDIA provide advanced multi-processor hardware to support popular graphics interfaces such as DirectX and OpenGL. These new graphics processing units (GPUs) employ the Single Instruction Multiple Data (SIMD) computing model, in which all processors in the GPU work simultaneously on a vast amount of data while executing identical instructions. This approach is well suited to graphics workloads because all pixels in an image require identical transformation and mapping instructions (a minimal kernel sketch illustrating this model appears at the end of this section).

The SIMD computing model, which revolutionized the GPU industry, is now making its way into mainstream computing. Matrix operations, which are at the core of many computer graphics algorithms, also appear in many linear algebra routines. More generally, numerical procedures that execute identical instructions on large amounts of data are suitable candidates for the SIMD hardware in advanced GPUs. However, developing parallel algorithms for GPU hardware is not straightforward, and the task is further complicated by the lack of a good software development kit (SDK) that encapsulates the hardware details in a software model.

At the time of writing, ATI has released a Stream Computing SDK [1], whereas NVIDIA has released a new version of its Compute Unified Device Architecture (CUDA) SDK [2]. In addition, NVIDIA is working on an Open Computing Language (OpenCL) for programming GPU hardware [3]. In this paper, a Transmission Line Matrix (TLM) engine implemented using the CUDA SDK is presented.
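To make the SIMD model concrete, the following sketch shows the pattern in CUDA: one kernel, launched over many threads, applies an identical instruction stream to different data elements. This is an illustrative example only, not part of the TLM engine presented in this paper; the kernel name scalePixels and the per-pixel scaling operation are assumptions chosen for clarity.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

/* Illustrative kernel (hypothetical): every thread executes the same
   instruction stream on a different data element -- the SIMD idea. */
__global__ void scalePixels(float *pixels, float gain, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* unique element index */
    if (i < n)                                      /* guard the last block  */
        pixels[i] *= gain;                          /* same op, different data */
}

int main(void)
{
    const int n = 1 << 20;                  /* one million "pixels" */
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scalePixels<<<blocks, threads>>>(d, 0.5f, n);   /* one kernel, n data items */

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    printf("pixel[0] = %f\n", h[0]);                /* expect 0.500000 */

    cudaFree(d);
    free(h);
    return 0;
}

Every one of the roughly one million threads runs the same kernel body on its own array element, mirroring the per-pixel uniformity of graphics operations noted above.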