The main contributors to message delivery latency in message-passing environments are the copying operations needed to transfer a received message and bind it to the consuming process or thread; message copying accounts for a significant portion of the software communication overhead. Recently, a set of factors has been leading high-performance processor architectures toward designs that feature multiple processing cores on a single chip, known as chip multiprocessors (CMPs). The Cell Broadband Engine (BE) shows potential to provide high performance to parallel applications (e.g., MPI applications). However, the Cell's non-homogeneous architecture, along with the small local storage in its SPEs, imposes restrictions and challenges on parallel applications. In this work, we first characterize various data delivery mechanisms in the Cell BE processor; then, we propose techniques to facilitate the delivery of a message in MPI environments implemented on the Cell BE processor. We envision a cluster system comprising several Cell processors, each supporting several computation threads.