Facial expression is one of the most direct and effective cues for recognizing emotion, and it is widely used in human‐computer interaction, affective computing, and other research fields. Expression recognition can be divided into discrete expression classification and continuous dimensional emotion recognition. Most existing work on multi‐dimensional emotion estimation considers only data collected under laboratory conditions. In this paper, facial emotion estimation is performed on real‐world images, combining the advantages of multi‐task learning and the attention mechanism. We improve the multi‐task attention network (MTAN) in two respects: task and feature. At the task level, we propose the multi‐task collaborative attention network (MTCAN), which is based on task correlation, to address task deviation in multi‐task learning. At the feature level, we build on MTCAN to propose MTACN, which uses the self‐attention mechanism to measure the importance of each attention module for each specific task. This allows the model to capture the local‐to‐global connection in one step and to fully exploit the features at different levels for each task. Experimental results on the AffectNet dataset show that the proposed model performs significantly better than the original network, and that its root‐mean‐square error (RMSE) and concordance correlation coefficient (CCC) results are superior to those of other existing models. © 2021 Institute of Electrical Engineers of Japan. Published by Wiley Periodicals LLC.
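For readers unfamiliar with the two evaluation metrics named in the abstract, the following is a minimal sketch of how they are commonly computed for a continuous emotion dimension such as valence or arousal. It assumes that the correlation coefficient referred to is Lin's concordance correlation coefficient (CCC), the usual agreement measure for dimensional emotion estimation; the function names `rmse` and `ccc` and the sample values are illustrative and are not taken from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient (population statistics).

    CCC = 2 * cov / (var_true + var_pred + (mean_true - mean_pred) ** 2)
    It lies in [-1, 1]; unlike Pearson correlation, it also penalizes
    a shift or scale difference between predictions and labels.
    Note: the denominator is zero if both inputs are the same constant.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p)
              for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


# Hypothetical valence labels and predictions, for illustration only.
labels = [0.1, -0.4, 0.8, 0.3]
preds = [0.2, -0.3, 0.6, 0.1]
print(rmse(labels, preds), ccc(labels, preds))
```

A lower RMSE and a higher CCC both indicate closer agreement with the annotations. The two are usually reported together because RMSE alone does not show whether predictions follow the trend of the labels.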