The current wave of advances in Machine Learning (ML) and Deep Learning (DL) has been triggered by the availability of large-scale datasets, efficient CPU and GPU hardware, and the development of easy-to-use software frameworks such as TensorFlow (TF), Caffe, and Torch. TensorFlow has been, by far, the most widely adopted ML/DL framework. However, little exists in the literature that provides a thorough understanding of the capabilities that TensorFlow offers for distributed training of large ML/DL models requiring computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+'X', where X is InfiniBand Verbs, Message Passing Interface (MPI), or GPUDirect RDMA, and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (#6 on Top500). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training, 2) effect of batch size on scaling efficiency, 3) impact of the MPI library used for No-gRPC approaches, and 4) type and size of DNN architectures (e.g., ResNet vs. MobileNet). Based on these experiments, we present two key insights: 1) overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of No-gRPC is heavily influenced by gradient aggregation using the Allreduce communication pattern. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits 1) CUDA kernels to perform large reductions on the GPU and 2) a pointer cache to avoid the overhead of queries to the CUDA driver. Our proposed designs have been implemented in MVAPICH2-GDR and offer 5-17× better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages on 16 GPUs (nodes). The proposed optimizations help Horovod-MPI achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8× and 3.2× higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.
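
To make the No-gRPC gradient aggregation pattern concrete, the following is a minimal sketch of a CUDA-aware MPI_Allreduce over a GPU-resident gradient buffer, the core communication step used by Baidu Allreduce and Horovod. It assumes a CUDA-aware MPI library (such as MVAPICH2-GDR) that accepts device pointers directly; the buffer name and gradient count are illustrative only and are not taken from the paper.

/* Sketch: gradient aggregation via CUDA-aware MPI_Allreduce.
 * Assumes a CUDA-aware MPI build (e.g., MVAPICH2-GDR) so that the
 * device buffer can be passed to MPI_Allreduce without an explicit
 * device-to-host copy. Names and sizes are illustrative. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const size_t num_grads = 1 << 20;   /* illustrative gradient count */
    float *d_grads;                     /* gradients resident on the GPU */
    cudaMalloc((void **)&d_grads, num_grads * sizeof(float));
    /* ... backward pass on each rank fills d_grads ... */

    /* Sum gradients across all ranks in place; a CUDA-aware MPI
     * reduces directly on (or from) the device buffer. */
    MPI_Allreduce(MPI_IN_PLACE, d_grads, (int)num_grads,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    /* Averaging by nranks is typically done by a small CUDA kernel
     * or folded into the optimizer's learning rate. */

    cudaFree(d_grads);
    MPI_Finalize();
    return 0;
}

In the gRPC-based approaches, by contrast, gradient exchange goes through parameter servers over the gRPC channel rather than through a collective Allreduce of this form.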