Proceedings of the 2021 International Conference on Multimedia Retrieval
DOI: 10.1145/3460426.3463625

TEACH: Attention-Aware Deep Cross-Modal Hashing

Abstract: Hashing methods for cross-modal retrieval have recently been widely investigated due to the explosive growth of multimedia data. Generally, real-world data is imperfect and more or less redundant, making the cross-modal retrieval task challenging. However, most existing cross-modal hashing methods fail to deal with this redundancy, leading to unsatisfactory performance on such data. In this paper, to address this issue, we propose a novel cross-modal hashing method, namely aTtEntion-Aware deep Cross-modal Hashing (TEACH). …
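The abstract's core idea, using attention to down-weight redundant parts of each modality before hashing, can be illustrated with a minimal sketch. This is not the authors' TEACH implementation; the module layout, feature dimension and code length below are assumptions chosen purely for illustration.

```python
# Minimal sketch of an attention-gated hashing branch (illustrative only;
# layer sizes and structure are assumptions, not the TEACH architecture).
import torch
import torch.nn as nn

class AttentionHashBranch(nn.Module):
    def __init__(self, feat_dim=4096, code_len=64):
        super().__init__()
        # Attention head: scores each feature dimension so redundant
        # dimensions can be suppressed before hashing.
        self.attention = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())
        # Hash head: projects attended features to relaxed codes in (-1, 1).
        self.hash_layer = nn.Sequential(nn.Linear(feat_dim, code_len), nn.Tanh())

    def forward(self, x):
        attended = x * self.attention(x)            # down-weight redundant parts
        relaxed_code = self.hash_layer(attended)    # used during training
        binary_code = torch.sign(relaxed_code)      # used at retrieval time
        return relaxed_code, binary_code
```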

Cited by 13 publications (6 citation statements)
References 40 publications (45 reference statements)

“…Unsupervised hashing methods cannot construct a multi-label similarity matrix to guide the learning of hash codes due to the inability to obtain the labels of the samples. As described in [31, 38, 41], building a similarity matrix using deep neural networks to capture the complementary and coexistence information of the original data is a superior method, which can provide effective self-supervision for the learning of hash functions. In particular, we use mini-batch visual features to build the visual-modality similarity matrix.…”
Section: Methods (mentioning; confidence: 99%)
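The statement above refers to building a mini-batch similarity matrix from deep visual features as a form of self-supervision. Below is a minimal sketch of that construction; the choice of cosine similarity and the absence of any thresholding are assumptions, since the citing paper's exact formula is not reproduced here.

```python
# Sketch of a mini-batch visual similarity matrix built from deep features
# (cosine similarity is an assumption; the cited works' exact definitions vary).
import torch
import torch.nn.functional as F

def visual_similarity_matrix(features: torch.Tensor) -> torch.Tensor:
    """features: (batch_size, dim) deep visual features of one mini-batch.
    Returns a (batch_size, batch_size) matrix with entries in [-1, 1]."""
    f = F.normalize(features, dim=1)   # L2-normalise each feature vector
    return f @ f.t()                   # pairwise cosine similarities

# Example: a 4-sample batch of 512-dimensional features.
S_v = visual_similarity_matrix(torch.randn(4, 512))
```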
“…By focusing on the information that is most critical to the current target among the many inputs, reducing attention to other information, or even filtering out irrelevant information, the attention mechanism can solve the information-redundancy problem and improve the efficiency and accuracy of task processing. In recent years, attention-based cross-modal retrieval methods [30, 31, 32] have begun to be explored. Attention-aware deep adversarial hashing (ADAH) [33] proposes an adversarial hash network with an attention mechanism that enhances the measure of content similarity by selectively attending to the informative parts of multi-modal data.…”
Section: Related Work (mentioning; confidence: 99%)
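To make the "selectively attending to the informative parts" idea concrete, here is a rough sketch of attention-weighted pooling over image regions. It is not the ADAH architecture; the 1x1-convolution scoring and softmax weighting are assumptions chosen for illustration.

```python
# Rough sketch of attention-weighted pooling over spatial regions
# (illustrative; not the ADAH or TEACH attention module).
import torch
import torch.nn as nn

class RegionAttentionPooling(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        # A 1x1 convolution scores how informative each spatial location is.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feature_map):                      # (B, C, H, W)
        b, c, h, w = feature_map.shape
        weights = torch.softmax(self.score(feature_map).view(b, -1), dim=1)
        weights = weights.view(b, 1, h, w)
        # Weighted sum keeps informative regions and filters out irrelevant ones.
        return (feature_map * weights).sum(dim=(2, 3))   # (B, C)
```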
“…tags or pair-wise similarity information) and generally achieve better accuracy. To highlight useful information and suppress redundant information, TEACH [14] and MMACH [15] add an attention mechanism to the feature-learning process, and the latter additionally utilizes multi-label information to further improve accuracy. HSSAH [16] replaces the binary similarity matrix with an asymmetric high-level semantic similarity to retain richer semantic information.…”
Section: A Non-continuous Cross-modal Hashing (mentioning; confidence: 99%)
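The contrast drawn above, a binary "share any label" matrix versus a richer multi-label semantic similarity, can be sketched as follows. The Jaccard-style soft similarity is only an assumed example; HSSAH and MMACH define their own (and, in HSSAH's case, asymmetric) formulations.

```python
# Sketch contrasting a binary similarity matrix with a soft, multi-label one
# (the Jaccard-style formula is an assumption, not the HSSAH definition).
import torch

def binary_similarity(labels: torch.Tensor) -> torch.Tensor:
    """S_ij = 1 if samples i and j share at least one label, else 0."""
    return (labels @ labels.t() > 0).float()

def soft_semantic_similarity(labels: torch.Tensor) -> torch.Tensor:
    """S_ij in [0, 1]: number of shared labels divided by the label union."""
    inter = labels @ labels.t()
    union = labels.sum(1, keepdim=True) + labels.sum(1) - inter
    return inter / union.clamp(min=1)

labels = torch.tensor([[1., 0., 1.],    # sample 0: labels {0, 2}
                       [1., 1., 0.],    # sample 1: labels {0, 1}
                       [0., 0., 1.]])   # sample 2: label  {2}
print(binary_similarity(labels))        # coarse 0/1 relations
print(soft_semantic_similarity(labels)) # graded semantic relations
```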
“…As an efficient information-retrieval paradigm for big multimedia data [1-3], cross-modal retrieval [4-9] uses one modality as a query to search data of another modality. Among the existing techniques [10-18], cross-modal hashing [16-20] is popular for its fast retrieval speed and low storage cost. The core of cross-modal hashing is to map high-dimensional data into a low-dimensional common Hamming space, in which multi-modal instances (text, image, audio and video) with similar semantics lie closer together and dissimilar instances lie far apart.…”
Section: Introduction (mentioning; confidence: 99%)
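The "common Hamming space" idea in the excerpt above boils down to the toy retrieval step below: both modalities are encoded into ±1 codes, and a query from one modality ranks items of the other by Hamming distance. The encoders are mocked with random codes purely for illustration.

```python
# Toy cross-modal retrieval by Hamming distance (encoders replaced by random
# +/-1 codes; real systems would produce these with learned hash functions).
import numpy as np

def hamming_distance(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """query: (L,) in {-1,+1}; database: (N, L) in {-1,+1}."""
    code_len = database.shape[1]
    return (code_len - database @ query) // 2   # number of differing bits

rng = np.random.default_rng(0)
text_query_code = np.sign(rng.standard_normal(32)).astype(int)          # text branch
image_db_codes = np.sign(rng.standard_normal((1000, 32))).astype(int)   # image branch
ranking = np.argsort(hamming_distance(text_query_code, image_db_codes))
print(ranking[:5])   # indices of the five nearest images to the text query
```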