“…It has increasingly been realized that in order to fully exploit present and future high-performance computing systems we require algorithms that parallelize well and which can be implemented efficiently on accelerators, such as GPUs [5]. In particular, for GPU computing much research effort has been undertaken to obtain efficient implementations (see, e.g., [6,8,17,18,19,31,34,39,41,44]).…”