With fierce competition between CPU and graphics processing unit (GPU) platforms, performance evaluation has become a focus of various sectors. In this paper, we take the Smith-Waterman (S-W) algorithm, a well-known algorithm in the field of biosequence matching and database searching, as an example and demonstrate approaches that fully exploit its performance potential on CPU, GPU, and field-programmable gate array (FPGA) computing platforms. For CPU platforms, we apply two optimizations, single instruction, multiple data (SIMD) vectorization and multithreading, together with compiler options, to gain a speedup of more than 70-fold over naive CPU versions on quad-core CPU platforms. For GPU platforms, we propose a combination of coalesced global memory accesses, shared memory tiles, and loop unfolding, achieving a 50-fold speedup over initial GPU versions on an NVIDIA GeForce GTX 470 card. Experimental results show that the GTX 470 GPU achieves a 12-fold speedup over the Intel quad-core Q9400 CPU, rather than the 100-fold speedup reported by some studies, when both are built with the same manufacturing technology and both run fully optimized schemes. In addition, for FPGA platforms, we customize a linear systolic array for the S-W algorithm in a 45-nm FPGA chip from Xilinx (XC6VLX760), with up to 1024 processing elements. At a clock rate of only 133 MHz, the FPGA platform reaches the highest performance and is the most power-efficient platform, consuming only 25 W compared with 190 W for the GTX 470 GPU.

... higher performance/power computation ratio. In addition to offering customized circuit structures similar to those of application-specific integrated circuit (ASIC) chips, FPGA chips are reconfigurable (i.e., the behavior of an FPGA chip can be changed dynamically at run time).
Hence, FPGA-based reconfigurable computing platforms combine some of the flexibility of software with the high performance of ASIC hardware.

Despite many reports on the superiority of GPU or FPGA acceleration over CPUs, many open questions still cause confusion, as shown by debates in academic papers and website discussions [1][2][3][4][5]. On the website of NVIDIA (2701 San Tomas Expressway, Santa Clara, CA 95050, USA), a blog post [5] provides links to 10 users who have documented GPU performance increases of 100-fold or more. For FPGA acceleration, examples can be found in the proceedings of FCCM (International Symposium on Field-Programmable Custom Computing Machines) [6], which report speedups of orders of magnitude over CPU implementations. However, some obvious shortcomings in these performance comparisons have led to a misunderstanding of the features of these three completely different computing architectures. First, the reported CPU performance results were measured from naive software versions compiled without compiler optimization options. Second, the reported execution times did not include the time for transferring initial data to the accelerators or collecting results from them. Third, neither single instruction, multiple data (SIMD) instructions nor cache optimizations [7] were utilized in the evaluation of CPU performance. In cont...