Decarbonizing the building sector by improving the energy efficiency of the existing building stock through targeted, efficient retrofits remains challenging. Today, the energy efficiency of a building is generally determined through on-site visits by certified energy auditors, which makes the process slow, costly, and geographically incomplete. To accelerate the identification of promising retrofit targets at scale, we propose to estimate building energy efficiency from remotely sensed data sources only. To this end, we collect street view, aerial view, footprint, and satellite-borne land surface temperature (LST) data for almost 40,000 buildings across four diverse geographies in the United Kingdom. We train multiple end-to-end deep learning models on the fused input data to classify buildings as energy efficient (EU rating A-D) or inefficient (EU rating E-G), and analyze the best-performing models both quantitatively and qualitatively. We further study the predictive power of each data source in an ablation study. The best end-to-end deep learning model achieves a macro-averaged F1-score of 62.06% and outperforms the k-NN- and SVM-based baseline models by 5.62 to 11.47 percentage points. This work thus shows the potential and complementary nature of remotely sensed data for predicting building energy efficiency and opens up opportunities for future work to integrate additional data sources.
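To make the evaluation setup concrete, the following is a minimal sketch of the binary label mapping (ratings A-D versus E-G) and of a macro-averaged F1-score, which averages the per-class F1-scores with equal weight. The function names `to_binary_label` and `macro_f1` and the toy labels are assumptions for illustration only; they are not taken from the study's code or data.

```python
from typing import Sequence

EFFICIENT = {"A", "B", "C", "D"}    # grouped as "efficient" in the text
INEFFICIENT = {"E", "F", "G"}       # grouped as "inefficient" in the text


def to_binary_label(rating: str) -> str:
    """Map an A-G rating to the binary class (hypothetical helper)."""
    rating = rating.strip().upper()
    if rating in EFFICIENT:
        return "efficient"
    if rating in INEFFICIENT:
        return "inefficient"
    raise ValueError(f"unknown rating: {rating!r}")


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up ratings (not data from the study).
true_labels = [to_binary_label(r) for r in ["B", "D", "E", "G"]]
pred_labels = [to_binary_label(r) for r in ["C", "F", "E", "F"]]
score = macro_f1(true_labels, pred_labels)
```

Because the macro average weights both classes equally, a model cannot reach a high score by predicting only the more frequent class.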