Multimodal feature generation framework for semantic image classification

Znaidia, Amel; Shabou, Aymen; Popescu, Adrian; Borgne, Hervé Le; Hudelot, Céline

doi:10.1145/2324796.2324842

Cited by 11 publications

(8 citation statements)

References 19 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In the early fusion methods [28], the features extracted from input data are firstly combined and then sent as input for annotation. In the late fusion methods [38,22], the local decisions are firstly obtained based on different modalities, then these decisions are combined for the final decision. The major disadvantage of multi-modal methods is that multimodal features are also required in the prediction process.…”

Section: Introductionmentioning

confidence: 99%

A Cross-modal Multi-task Learning Framework for Image Annotation

Xie

Pan

et al. 2014

Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management

View full text Add to dashboard Cite

With the advance of internet, multi-modal data can be easily collected from many social websites such as Wikipedia, Flickr, YouTube, etc. Images shared on the web are usually associated with social tags or other textual information. Although existing multi-modal methods can make use of associated text to improve image annotation, the disadvantages of them are that associated text is also required for a new image to be predicted. In this paper, we propose the crossmodal multi-task learning (CMMTL) framework for image annotation. Labeled and unlabeled multi-modal data are both levaraged for training in CMMTL, and it finally obtains visual classifiers which can predict concepts for a single image without any associated information. CMMTL integrates graph learning, multi-task learning and cross-modal learning into a joint framework, where a shared subspace is learned to preserve both cross-modal correlation and concept correlation. The optimal solution of the proposed framework can be obtained by solving a generalized eigenvalue problem. We conduct comprehensive experiments on two real world image datasets: MIR Flickr and NUS-WIDE, to evaluate the performance of the proposed framework. Experimental results demonstrate that CMMTL obtains a significant improvement over several representative methods for crossmodal image annotation.

show abstract

Section: Introductionmentioning

confidence: 99%

A Cross-modal Multi-task Learning Framework for Image Annotation

Xie

Pan

et al. 2014

Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management

View full text Add to dashboard Cite

show abstract

“…Here, a noise filtering algorithm is necessarily developed to remove the irrelevant web texts for the BoW-based model. Since web resources have great reliability diversity, it may not be an optimal practice to allocate fixed weights to the visual feature-based and text feature-based classifiers as in [9][10][11]105]. In this chapter, an adaptive fusion algorithm is developed for the integration of the visual feature-based and web textual feature-based classification results.…”

Section: Motivationsmentioning

confidence: 99%

“…Different from homogeneous web data-aided approaches, heterogeneous web dataaided frameworks [9][10][11] have been developed to explore different modality data and facilitate image classification, such as image tags or descriptions in the form of short text. Compared to homogeneous frameworks, heterogeneous frameworks not only use the extra images that have the same feature representation for training, but also investigate different feature representations for the web text information.…”

mentioning

confidence: 99%

See 1 more Smart Citation

Web text-aided image classification

Wang¹

View full text Add to dashboard Cite

vi supervised setting when only little labeled data is available. Specially, we investigate web text-aided one-shot learning that is able to identify unlabeled data from novel classes based on single observation using an adaptive attention mechanism. This thesis is organized as follows. Chapter 1 introduces the motivation behind the web resources-aided image classification. Chapter 2 reviews the related works in this field, including image representation learning, text representation learning and multimodal fusion learning. Chapter 3 investigates the decision-level data fusion for web-aided image classification. An adaptive combiner for two separate bimodal classifiers is developed in decision level. This adaptive fusion algorithm is inspired by the multisensory integration mechanism of human. And the adaptability is achieved by reliabilitydependent weighting of different sensory modalities. In Chapter 4, a novel text modeling namely the semantic matching neural network (SMNN) is proposed, which is quantified by cosine similarity measures between embedded text input and task-specific semantic filters. It is capable of learning semantic features from the associated text of web images. The SMNN text features have improved reliability and applicability, compared to the text features obtained from other methods. Then, the SMNN text features and convolutional neural network visual features are jointly learned in a shared representation, which aims to capture the correlations between the two modalities in the feature level. Improving upon task-specific filters for SMNN, Chapter 5 presents a novel semantic CNN (s-CNN) model for high-level text representation learning to encode semantic correlation based on task-generic semantic filters. However, the s-CNN model inevitably brings about surplus semantic filters to achieve better applicability and generalization in universal tasks. Moreover, the surplus filters may lead to semantic overlaps and feature redundancy issue. To address this issue, the s-CNN Clustered (s-CNNC) models that uses filter clusters instead of individual filters is presented. Interacting with the image CNN models, the s-CNNC models can further boost image classification under a multimodal framework, which can be trained end-toend. Chapter 6 develops an adaptive encoder-decoder attention network that uses web text to aid one-shot image classification. Without any ground truth semantic clues, e.g., class tag information, our model is able to extract useful information Contents vii from web sourced data instead. To address the noise nature of web text, the adaptive mechanism is introduced to determine when to attend to text-inferred visual features and when to rely on original visual features. The summarization and future prospect of my PhD work is finally discussed in Chapter 7.

show abstract

“…The results of these individual analyses are then fused together to decide which annotations are relevant to the input image. In [114], the authors adopted a late-fusion strategy in which they trained three SVM classifiers: one for a feature vector representing all visual features and two classifiers for two different representations of context data. The classifiers returns three scores that are then fed to a final SVM classifier.…”

Section: Combining Visual and Context Featuresmentioning

confidence: 99%

Mining User Activity Data in Social Media Services

Costa¹

View full text Add to dashboard Cite

Social media services have a growing impact in our society. Individuals often rely on social media to get their news, decide which products to buy or to communicate with their friends. As consequence of the widespread adoption of social media, a large volume of data on how users behave is created every day and stored into large databases. Learning how to analyze and extract useful knowledge from this data has a number of potential applications. For instance, a deeper understanding on how legitimate users interact with social media services could be explored to design more accurate spam and fraud detection methods. This PhD research is based on the following hypothesis: data generated by social media users present patterns that can be exploited to improve the effectiveness of tasks such as prediction, forecasting and modeling in the domain of social media. To validate our hypothesis, we focus on designing data mining methods tailored to social media data. The main contributions of this PhD can be divided into three parts. First, we propose Act-M, a mathematical model that describes the timing of users actions. We also show that Act-M can be used to automatically detect bots among social media users based only on the timing (i.e. time-stamp) data. Our second contribution is VnC (Vote-and-Comment), a model that explains how the volume of different types of user interactions evolve over time when a piece of content is submitted to a social media service. In addition to accurately matching real data, VnC is useful, as it can be employed to forecast the number of interactions received by social media content. Finally, our third contribution is the MFS-Map method. MFS-Map automatically provides textual annotations to social media images by efficiently combining visual and metadata features. Our contributions were validated using real data from several social media services. Our experiments show that the Act-M and VnC models provided a more accurate fit to the data than existing models for communication dynamics and information diffusion, respectively. MFS-Map obtained both superior precision and faster speed when compared to other widely employed image annotation methods.

show abstract

Multimodal feature generation framework for semantic image classification

Cited by 11 publications

References 19 publications

A Cross-modal Multi-task Learning Framework for Image Annotation

A Cross-modal Multi-task Learning Framework for Image Annotation

Web text-aided image classification

Mining User Activity Data in Social Media Services

Contact Info

Product

Resources

About