A probabilistic model for multimodal hash function learning

Zhen, Yi; Yeung, Dit-Yan

doi:10.1145/2339530.2339678

Cited by 177 publications

(105 citation statements)

References 31 publications

Supporting

Mentioning

104

Contrasting

“…In our experiments, since few works focus on the local feature representation-based hashing scheme for cross-modal retrieval, we can only systematically compare the proposed BSE method with six prevailing global hashing methods for cross-modal retrieval tasks: CVH [11], MLBE [22], IMH [15], CMSSH [21], CHMIS [14], CMFH [16], and QCH [39]. For fair comparison, all the methods are implemented on the same SIFT features and word vectors in the image and text domains, respectively.…”

Section: B Compared Methods and Experimental Settings

mentioning

confidence: 99%

“…With extended SpH [18], Kumar and Udupa [11] proposed cross-view hashing (CVH) to generate binary codes for each modality via canonical correlation analysis (CCA). Multimodal latent binary embedding (MLBE) [22] is another cross-modal hashing method considering both the intermodal and intramodal similarity via a probabilistic model. To learn the hash function with good generalization, co-regularized hashing [13] was proposed to project data far from zero.…”

mentioning

confidence: 99%

Binary Set Embedding for Cross-Modal Retrieval

Liu, Shao. 2017. IEEE Trans. Neural Netw. Learning Syst.

Abstract: Cross-modal retrieval is such a challenging topic that traditional global representations fail to bridge the semantic gap between images and texts to a satisfactory level. Using local features from images and words from documents directly can be more robust in scenarios with large intraclass variations and small interclass discrepancies. In this paper, we propose a novel unsupervised binary coding algorithm called binary set embedding (BSE) to obtain meaningful hash codes for local features from the image domain and words from the text domain. Understanding image features with the word vectors learned from the human language instead of the provided documents from data sets, BSE can map samples into a common Hamming space effectively and efficiently, where each sample is represented by the sets of local feature descriptors from the image and text domains. In particular, BSE explores relationships among local features at both the feature level and the image (text) level, which can balance the sensitivity of each other. Furthermore, a recursive orthogonalization procedure is applied to reduce the redundancy of codes. Extensive experiments demonstrate the superior performance of BSE compared with state-of-the-art cross-modal hashing methods using either image or text queries.

“…Other recent works include CMSSH [1], MLBE [25] and LSCMR [12]. CMSSH uses a boosting method to learn the projection function for each dimension of the latent space.…”

Section: Related Work

mentioning

confidence: 99%

“…There has been a long stream of research on multi-modal retrieval [27,1,15,9,25,12,26,20]. These works share a similar query processing strategy which consists of two major steps.…”

Section: Introduction

mentioning

confidence: 99%

“…We observe that most existing works, such as CVH [9], IMH [20], MLBE [25], CMSSH [1], and LSCMR [12], require a substantial amount of prior knowledge about the training data to learn effective mapping functions. Preparing prior knowledge in terms of large training dataset is labor-intensive, and due to manual intervention, the prepared knowledge may not be comprehensive in capturing the regularities (e.g., distribution or similarity relationships) of data.…”

Section: Introduction

mentioning

confidence: 99%

Effective multi-modal retrieval based on stacked auto-encoders

et al. 2014

Multi-modal retrieval is emerging as a new search paradigm that enables seamless information retrieval from various types of media. For example, users can simply snap a movie poster to search relevant reviews and trailers. To solve the problem, a set of mapping functions is learned to project high-dimensional features extracted from data of different media types into a common low-dimensional space so that metric distance measures can be applied. In this paper, we propose an effective mapping mechanism based on deep learning (i.e., stacked auto-encoders) for multi-modal retrieval. Mapping functions are learned by optimizing a new objective function, which captures both intra-modal and inter-modal semantic relationships of data from heterogeneous sources effectively. Compared with previous works, which require a substantial amount of prior knowledge such as similarity matrices of intra-modal data and ranking examples, our method requires little prior knowledge. Given a large training dataset, we split it into mini-batches and continually adjust the mapping functions for each batch of input. Hence, our method is memory efficient with respect to the data volume. Experiments on three real datasets illustrate that our proposed method achieves significant improvement in search accuracy over the state-of-the-art methods.
