This paper explores real-time Convolutional Neural Network (CNN) inference on Field Programmable Gate Arrays (FPGAs) implemented in Synchronous Message Exchange (SME). We compare SME to the widespread FPGA tool, High-Level Synthesis (HLS), and compare both the SME and HLS implementations of CNNs with a PyTorch implementation of the same CNN on CPU/GPU. We find that the SME implementation is more flexible than the HLS implementation, as it allows for more customization of the hardware. Programming with SME is more difficult than with HLS, although easier than with traditional Hardware Description Languages. Finally, for a test use case, we find that the SME implementation on FPGA is approximately 2.8/1.4/2.0 times more energy efficient than the CPU/GPU/ARM implementations at larger batch sizes, with the HLS implementation on FPGA falling between the CPU/ARM and GPU implementations in terms of energy efficiency. At a batch size of 1, appropriate for edge-device inference, the gap in energy efficiency between the FPGA and CPU/GPU/ARM implementations becomes more pronounced: the SME implementation on FPGA is approximately 83/47/8 times more energy efficient than the CPU/GPU/ARM implementations, and the HLS implementation on FPGA is approximately 40/23/4 times more energy efficient.