This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, a valid performance model, and well-executed optimizations, we break the teraflop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5× what is achievable on the same number of Intel® Xeon® nodes. Fully utilizing the compute capability of many-core wide-vector processors is a challenge for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, which has low inter-node communication cost, and aggressively optimize data movement in node-local computations by exploiting caches. Our coordination of a low-communication algorithm with a massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication-limited.