The El Niño Southern Oscillation (ENSO) is a natural climate phenomenon, characterized mainly by periodic variation of sea surface temperature anomalies in the eastern equatorial Pacific, and it has significant impacts on the global climate. Traditional ENSO prediction methods rely on complex physical models and empirical rules, and they still suffer from significant biases and uncertainties. We design a novel multi-head spatiotemporal convolutional attention module, the TSCA Block. The module employs a dual-branch structure to fully exploit spatiotemporal features: a spatiotemporal attention mechanism captures long-range dependencies in time and space, and subsequent convolutional modules enhance the features locally. We also propose ENSO-Former, a spatiotemporal fusion prediction model based on multivariable, dual-branch Transformers. By repeating basic building blocks at multiple stages and adopting long skip connections between shallow and deep layers, ENSO-Former captures the multiscale features and dynamic changes of ENSO more effectively. The model is trained on multiple meteorological variables: it is first pre-trained on data simulated by various models and then fine-tuned via transfer learning on reanalysis datasets. Experimental results show that, over the period from 1983 to 2021, the model achieves a correlation skill of 0.51 when predicting ENSO 20 months in advance, an improvement of 0.06 over the current state-of-the-art deep learning model. The source code will be available at https://github.com/shaxiaoyuyu/ensoformer.git.
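The abstract does not define correlation skill. In ENSO forecasting it is commonly taken to be the Pearson correlation, at a fixed lead time, between the predicted and observed time series of an ENSO index. The sketch below illustrates that common definition under this assumption; the function name `correlation_skill` and the toy series are hypothetical and are not taken from the ENSO-Former code.

```python
import numpy as np


def correlation_skill(pred, obs):
    """Pearson correlation between a predicted and an observed index series.

    Assumption: both inputs are 1-D series of the same length, aligned in
    time and evaluated at one fixed lead time (hypothetical helper, not the
    authors' implementation).
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    if pred.shape != obs.shape or pred.ndim != 1:
        raise ValueError("pred and obs must be 1-D arrays of equal length")

    # Remove the mean of each series so only co-variation is measured.
    p = pred - pred.mean()
    o = obs - obs.mean()

    # Covariance normalized by the product of the standard deviations.
    return float((p * o).sum() / np.sqrt((p ** 2).sum() * (o ** 2).sum()))


# Toy series for illustration only (not real forecast or reanalysis data).
observed = np.array([0.2, 0.8, 1.5, 0.9, -0.3, -1.1, -0.6, 0.1])
predicted = np.array([0.1, 0.5, 1.0, 1.1, 0.2, -0.7, -0.9, -0.2])
skill = correlation_skill(predicted, observed)
```

Under this definition the skill lies in [-1, 1], so a reported value of 0.51 would mean that the forecasts still track a substantial part of the observed variability at that lead time.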