supervised setting when only little labeled data is available. Specifically, we investigate web text-aided one-shot learning, which identifies unlabeled data from novel classes based on a single observation using an adaptive attention mechanism.

This thesis is organized as follows. Chapter 1 introduces the motivation behind web resource-aided image classification. Chapter 2 reviews related work in this field, including image representation learning, text representation learning, and multimodal fusion learning. Chapter 3 investigates decision-level data fusion for web-aided image classification. An adaptive combiner for two separate bimodal classifiers is developed at the decision level. This adaptive fusion algorithm is inspired by the multisensory integration mechanism of humans, and adaptability is achieved by reliability-dependent weighting of the different sensory modalities. In Chapter 4, a novel text model, namely the semantic matching neural network (SMNN), is proposed, in which semantic matching is quantified by cosine similarity between the embedded text input and task-specific semantic filters. The SMNN is capable of learning semantic features from the text associated with web images, and the resulting text features offer improved reliability and applicability compared to text features obtained by other methods. The SMNN text features and convolutional neural network visual features are then jointly learned in a shared representation, which aims to capture the correlations between the two modalities at the feature level. Improving upon the task-specific filters of the SMNN, Chapter 5 presents a novel semantic CNN (s-CNN) model for high-level text representation learning that encodes semantic correlation based on task-generic semantic filters. However, to achieve better applicability and generalization across tasks, the s-CNN model inevitably introduces surplus semantic filters, which may lead to semantic overlap and feature redundancy.
To address this issue, the s-CNN Clustered (s-CNNC) model, which uses filter clusters instead of individual filters, is presented. Interacting with image CNN models, the s-CNNC model can further boost image classification under a multimodal framework that can be trained end-to-end. Chapter 6 develops an adaptive encoder-decoder attention network that uses web text to aid one-shot image classification. Without any ground-truth semantic clues, e.g., class tag information, our model is able to extract useful information from web-sourced data instead. To address the noisy nature of web text, an adaptive mechanism is introduced to determine when to attend to text-inferred visual features and when to rely on the original visual features. Finally, Chapter 7 summarizes my PhD work and discusses prospects for future research.
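To make the semantic matching idea of Chapter 4 concrete, the core SMNN scoring step can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the embedding dimension, the toy filter values, and the names `semantic_match` and `cosine_similarity` are all inventions of this sketch.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def semantic_match(text_embedding, semantic_filters):
    """Score an embedded text against each semantic filter.

    Returns one cosine-similarity score per filter; the score vector
    serves as the semantic text feature.
    """
    return np.array([cosine_similarity(text_embedding, f) for f in semantic_filters])

# Toy example: a 4-d text embedding scored against 3 hypothetical semantic filters.
text = np.array([0.2, 0.9, 0.1, 0.0])
filters = np.array([
    [0.0, 1.0, 0.0, 0.0],  # filter most aligned with the embedding
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])
features = semantic_match(text, filters)
```

Each score acts as one dimension of the semantic text feature, so the number of filters determines the feature dimensionality, which is why surplus filters translate directly into redundant feature dimensions.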
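The adaptive mechanism of Chapter 6, deciding when to trust noisy text-inferred visual features, can be caricatured as a learned gate that forms a convex combination of the two feature vectors. This is a hedged sketch rather than the thesis architecture: `adaptive_fuse`, `gate_w`, and `gate_b` are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fuse(visual_feat, text_inferred_feat, gate_w, gate_b):
    """Gate between the original visual feature and the text-inferred one.

    A scalar gate g in (0, 1), computed from both features, decides how much
    to trust the (possibly noisy) text-inferred feature.
    """
    g = sigmoid(np.dot(gate_w, np.concatenate([visual_feat, text_inferred_feat])) + gate_b)
    return g * text_inferred_feat + (1.0 - g) * visual_feat

# Toy usage with random stand-ins for features and gate parameters.
rng = np.random.default_rng(0)
v = rng.normal(size=8)    # original visual feature
t = rng.normal(size=8)    # text-inferred visual feature
w = rng.normal(size=16)   # stand-in for learned gate weights
fused = adaptive_fuse(v, t, w, 0.0)
```

Because the gate produces a convex combination, each fused component lies between the corresponding components of the two inputs: reliable text pushes g toward 1, while noisy text lets the model fall back on the original visual feature.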