Underwater object recognition presents a unique set of challenges due to the complex and dynamic characteristics of marine environments. This paper introduces a novel multilayered architecture that leverages Swin Transformer modules to process segmented image patches derived from aquatic scenes. A key component of our approach is the Feature Alignment Module (FAM), which addresses the complexities of underwater object recognition by enabling the model to selectively emphasize essential features; it aggregates multi-level features from multiple network stages, thereby enriching the depth and scope of the feature representation. Furthermore, the architecture incorporates multiple detection heads, each embedded with the ACmix module, which fuses convolution and self-attention mechanisms to refine detection precision. By combining the strengths of the Swin Transformer, the FAM, and the ACmix module, the proposed method achieves significant improvements in underwater object detection. To demonstrate the robustness and effectiveness of the proposed method, we conducted experiments on the UTDAC2020 dataset, highlighting its potential and contributions to the field.
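
The core idea behind the ACmix-style detection head — running a convolution branch and a self-attention branch over the same feature map and summing their weighted outputs — can be sketched as below. This is a minimal illustrative PyTorch module, not the paper's implementation: the class name `ConvAttnFusion`, the learnable branch weights `alpha`/`beta`, and all layer sizes are assumptions made for exposition.

```python
import torch
import torch.nn as nn

class ConvAttnFusion(nn.Module):
    """Toy fusion of a convolution branch and a self-attention branch,
    loosely in the spirit of ACmix. All names and hyperparameters here
    are illustrative assumptions, not the paper's implementation."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        # Local branch: a standard 3x3 convolution.
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Global branch: multi-head self-attention over spatial tokens.
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Learnable scalars weighting the two branches before summation.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        conv_out = self.conv(x)
        # Flatten spatial dimensions into a token sequence: (B, H*W, C).
        tokens = x.flatten(2).transpose(1, 2)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        # Restore the (B, C, H, W) layout and fuse the two branches.
        attn_out = attn_out.transpose(1, 2).reshape(b, c, h, w)
        return self.alpha * conv_out + self.beta * attn_out

# Quick shape check on a dummy feature map.
x = torch.randn(2, 32, 16, 16)
y = ConvAttnFusion(32)(x)
print(tuple(y.shape))  # (2, 32, 16, 16)
```

The fused output keeps the input's spatial resolution, so a module like this can drop into a detection head between the neck features and the classification/regression layers.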