Multi-factor authentication (MFA) is a system for authenticating an individual’s identity using two or more pieces of data (known as factors). The reason for using more than two factors is to further strengthen security through the use of additional data for identity authentication. Sequential MFA requires a number of steps to be followed in sequence for authentication; for example, with three factors, the system requires three authentication steps. In this case, to proceed with MFA using a deep learning approach, three artificial neural networks (ANNs) are needed. In contrast, in parallel MFA, the authentication steps are processed simultaneously. This means that processing is possible with only one ANN. A convolutional neural network (CNN) is a method for learning images through the use of convolutional layers, and researchers have proposed several systems for MFA using CNNs in which various modalities have been employed, such as images, handwritten text for authentication, and multi-image data for machine learning of facial emotion. This study proposes a CNN-based parallel MFA system that uses concatenation. The three factors used for learning are a face image, an image converted from a password, and a specific image designated by the user. In addition, a secure password image is created at different bit-positions, enabling the user to securely hide their password information. Furthermore, users designate a specific image other than their face as an auxiliary image, which could be a photo of their pet dog or favorite fruit, or an image of one of their possessions, such as a car. In this way, authentication is rendered possible through learning the three factors—that is, the face, password, and specific auxiliary image—using the CNN. The contribution that this study makes to the existing body of knowledge is demonstrating that the development of an MFA system using a lightweight, mobile, multi-factor CNN (MMCNN), which can even be used in mobile devices due to its low number of parameters, is possible. Furthermore, an algorithm that can securely transform a text password into an image is proposed, and it is demonstrated that the three considered factors have the same weight of information for authentication based on the false acceptance rate (FAR) values experimentally obtained with the proposed system.