HPC platforms are getting increasingly heterogeneous and hierarchical. The main source of heterogeneity in many individual computing nodes is due to the utilization of specialized accelerators such as GPUs alongside general purpose CPUs. Heterogeneous many-core processors will be another source of intra-node heterogeneity in the near future. As modern HPC clusters become more heterogeneous, due to increasing number of different processing devices, hierarchical approach needs to be taken with respect to memory and communication interconnects to reduce complexity. During recent years, many scientific codes have been ported to multicore and GPU architectures. To achieve optimum performance of these applications on CPU/GPU hybrid platforms software heterogeneity needs to be accounted for. Therefore, design and implementation of data parallel scientific applications for such highly heterogeneous and hierarchical platforms represent a significant scientific and engineering challenge. This chapter will present the state of the art in the solution of this problem based on the functional performance models of computing devices and nodes.