Abstract-The unacceptably long execution time of non-rigid registration (NRR) is often a major obstacle to its routine clinical use. Parallel computing is an effective way to accelerate NRR, but developing efficient parallel NRR codes is very challenging. One desirable approach is to map an existing sequential algorithm onto a parallel architecture to gain speedup, rather than to design a new parallel algorithm. Multicore CPUs and GPUs together provide a cooperative architecture in which the Single Instruction Multiple Data (SIMD) and Single Program Multiple Data (SPMD) programming models can coexist and complement each other. We present a method to parallelize an NRR algorithm on this cooperative architecture. Our approach first separates the sequential algorithm into regular and irregular parts. We then map the regular part onto the GPU following the SIMD paradigm, and the irregular part onto the multicore CPUs in SPMD fashion. Unlike approaches that use multicores or a GPU alone, our approach achieves speedup for the whole application by taking advantage of all components of the cooperative parallel architecture across all individual parts of the application. This brings us closer to our goal: cheaper and faster NRR, leading to its more widespread use. Our evaluation on clinical brain MRI data shows that GPU-based block matching (the regular part) runs at least 1.9 times faster than on a typical cluster of workstations with eight high-performance nodes. The multicore implementation of the incremental finite element solver (the irregular part) achieves a speedup of up to 7 times over its sequential version. As a result, the total run time of the NRR code can be reduced to less than 1 minute, thereby satisfying the real-time requirement of its clinical application.
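
To illustrate why block matching counts as the regular part, the following is a minimal sequential sketch in Python, not the GPU implementation evaluated here. It assumes 2D images stored as lists of rows, a sum-of-squared-differences similarity measure, and an exhaustive search window; the function names (`ssd`, `match_block`, `block_matching`) and all parameters are hypothetical and chosen only for illustration. Every block runs the same fixed sequence of operations independently of the others, which is the property that suits a SIMD mapping.

```python
def ssd(fixed, moving, top, left, dy, dx, size):
    """Sum of squared differences between a block of `fixed` at (top, left)
    and the block of `moving` displaced by (dy, dx)."""
    total = 0
    for r in range(size):
        for c in range(size):
            diff = fixed[top + r][left + c] - moving[top + dy + r][left + dx + c]
            total += diff * diff
    return total


def match_block(fixed, moving, top, left, size, radius):
    """Exhaustively search displacements in [-radius, radius]^2 and return
    the (dy, dx) with the smallest SSD. Assumes both images have the same
    shape, so the zero displacement is always a valid candidate."""
    rows, cols = len(moving), len(moving[0])
    best = None  # (score, dy, dx)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r0, c0 = top + dy, left + dx
            # Skip displacements that push the block outside the moving image.
            if r0 < 0 or c0 < 0 or r0 + size > rows or c0 + size > cols:
                continue
            score = ssd(fixed, moving, top, left, dy, dx, size)
            if best is None or score < best[0]:
                best = (score, dy, dx)
    return best[1], best[2]


def block_matching(fixed, moving, size, radius):
    """Match every non-overlapping block of `fixed` against `moving`.
    Each iteration is independent of the others, so on a GPU each block
    (or each candidate displacement) could be assigned to its own thread."""
    rows, cols = len(fixed), len(fixed[0])
    return {
        (top, left): match_block(fixed, moving, top, left, size, radius)
        for top in range(0, rows - size + 1, size)
        for left in range(0, cols - size + 1, size)
    }
```

Calling `block_matching(fixed, moving, size, radius)` returns a dictionary from each block's top-left corner to its estimated displacement. The finite element solver, by contrast, works on an unstructured mesh with data-dependent memory access, which is why it is treated as the irregular part and is not sketched in this data-parallel style.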