2019
DOI: 10.1364/oe.27.034056

Attention-based fusion network for human eye-fixation prediction in 3D images


Cited by 15 publications (7 citation statements)
References 14 publications
“…The CNN is used to extract representational image features, and the RNN is naturally suited to recognizing sequence data, so this network architecture fits image-based text line recognition well. The convolutional RNN [15] is a representative approach of this type: it uses a CNN to extract high-level semantic features of the image, converts the extracted features into a feature sequence, then uses a bidirectional long short-term memory (BiLSTM) network [2,16][17][18][19] to capture the contextual information in both directions along the sequence, and finally applies Connectionist Temporal Classification (CTC) [20] to decode the sequence features into the final text recognition result.…”
Section: Introduction (mentioning)
confidence: 99%
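The pipeline described in this statement (CNN feature extraction, conversion to a width-wise feature sequence, BiLSTM context modeling, CTC decoding) can be illustrated with a minimal PyTorch-style sketch. This is not the cited authors' implementation; the layer sizes, alphabet size, and input resolution below are assumptions chosen only to make the example runnable.

```python
# Minimal CRNN-style sketch (illustrative only): a small CNN extracts feature
# maps, the width axis is treated as a time sequence, a BiLSTM models context
# in both directions, and CTC aligns the per-step scores with the label string.
# All layer sizes and the alphabet size are assumptions, not paper values.
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # CNN backbone: collapse the height axis so each column is one time step.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),            # (B, 256, 1, W')
        )
        # Bidirectional LSTM over the width-wise feature sequence.
        self.rnn = nn.LSTM(256, 128, num_layers=2,
                           bidirectional=True, batch_first=True)
        # Per-time-step class scores; index 0 is reserved for the CTC blank.
        self.fc = nn.Linear(2 * 128, num_classes + 1)

    def forward(self, x):                               # x: (B, 1, H, W)
        feats = self.cnn(x).squeeze(2)                  # (B, 256, W')
        out, _ = self.rnn(feats.permute(0, 2, 1))       # (B, W', 256)
        return self.fc(out).log_softmax(-1)             # (B, W', num_classes + 1)

if __name__ == "__main__":
    model = CRNNSketch(num_classes=36)                  # e.g. digits + letters
    images = torch.randn(4, 1, 32, 128)
    log_probs = model(images).permute(1, 0, 2)          # (T, B, C) for CTCLoss
    targets = torch.randint(1, 37, (4, 10))             # label indices, no blank
    input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
    target_lengths = torch.full((4,), 10, dtype=torch.long)
    loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
    print(loss.item())
```

At inference time, greedy or beam-search CTC decoding collapses repeated symbols and removes blanks to produce the recognized text.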
“…Cross-modal attention [14,34] extends the earlier practice of simply combining feature vectors from different modalities: it enhances the multimodal feature representation by assigning different attention weights to model the information interactions between modalities. Lv et al. [35] fed multilevel features from a three-stream network into a channel attention mechanism to model the common feature space and focus the network on salient targets. In addition, Attn-Hybrid Net [36] was used to alleviate the redundancy between hybrid features.…”
Section: Attention-based Fusion (mentioning)
confidence: 99%
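A rough sketch of the kind of channel-attention fusion described here (features from several streams concatenated and reweighted by learned per-channel weights, in the spirit of squeeze-and-excitation) is shown below. The stream count, channel widths, and reduction ratio are assumptions for illustration and are not taken from Lv et al.'s model.

```python
# Illustrative channel-attention fusion: concatenate multi-stream feature maps
# along the channel axis, then reweight each channel with a learned attention
# weight so the network can emphasize channels tied to salient targets.
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    def __init__(self, channels_per_stream: int, num_streams: int, reduction: int = 8):
        super().__init__()
        fused = channels_per_stream * num_streams
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # global spatial pooling
        self.excite = nn.Sequential(                     # per-channel attention weights
            nn.Linear(fused, fused // reduction), nn.ReLU(),
            nn.Linear(fused // reduction, fused), nn.Sigmoid(),
        )

    def forward(self, streams):                          # list of (B, C, H, W) tensors
        x = torch.cat(streams, dim=1)                    # (B, C * num_streams, H, W)
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c))      # (B, C * num_streams)
        return x * w.view(b, c, 1, 1)                    # channel-reweighted fusion

# Example with three hypothetical streams of equal channel width.
if __name__ == "__main__":
    fuse = ChannelAttentionFusion(channels_per_stream=64, num_streams=3)
    rgb, depth, prior = (torch.randn(2, 64, 28, 28) for _ in range(3))
    print(fuse([rgb, depth, prior]).shape)               # torch.Size([2, 192, 28, 28])
```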
“…Lv et al. [34] proposed a three‐stream architecture to accurately predict human visual attention on 3D images. Xi et al.…”
Section: Related Work (mentioning)
confidence: 99%