In this work we present an efficient GPU implementation of the Fast Directional Chamfer Matching (FDCM) algorithm [10]. We also propose some extensions to the original FDCM algorithm; in particular, we extend it to handle templates of variable size in order to account for perspective effects. To the best of our knowledge, our work is the first to present a full implementation of a shape-based matching algorithm on a GPU. Further contributions are a highly optimized CPU version of the algorithm (using multi-threading and SSE2) and a thorough comparison of the pure GPU, pure CPU, and hybrid versions. The hybrid CPU-GPU version, which turns out to be the fastest, runs at 44 fps on PAL-resolution images.