Visual saliency prediction has improved dramatically since the advent of convolutional neural networks (CNNs). Although CNNs achieve excellent performance, they still cannot learn global and long-range contextual information well and lack interpretability due to the locality of convolution operations. We propose a saliency prediction model based on multi-prior enhancement and cross-modal attention collaboration (ME-CAS). Concretely, we design a transformer-based Siamese network architecture as the backbone for feature extraction. One transformer branch captures the contextual information of the image through the self-attention mechanism to obtain a global saliency map. At the same time, we build a prior learning module to learn the human visual center-bias prior, contrast prior, and frequency prior. The multi-prior result is fed into the other Siamese branch to learn detailed low-level visual features and obtain a saliency map of local information. Finally, we use an attention calibration module to guide the cross-modal collaborative learning of global and local information and generate the final saliency map. Extensive experimental results demonstrate that our proposed ME-CAS achieves superior results on public benchmarks compared with competing saliency prediction models. Moreover, the multi-prior learning modules enhance the expression of salient image details and improve model interpretability.
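To make the architecture described in the abstract more concrete, the following is a minimal PyTorch sketch of the ME-CAS pipeline. It is not the authors' implementation: the module names (PriorLearningModule, MECAS), the dimensions, the simple image-processing proxies used here for the center-bias, contrast, and frequency priors, and the softmax-weighted fusion standing in for the attention calibration step are all assumptions made for illustration; the paper defines the actual design.

# Hypothetical sketch of the ME-CAS pipeline from the abstract (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PriorLearningModule(nn.Module):
    """Builds a multi-prior enhanced input from center-bias, contrast, and frequency priors."""
    def __init__(self, size=224, sigma=0.25):
        super().__init__()
        # Center-bias prior: a fixed 2-D Gaussian centered on the image (assumed form).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, size), torch.linspace(-1, 1, size), indexing="ij")
        center = torch.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))
        self.register_buffer("center_prior", center[None, None])  # (1, 1, H, W)

    def forward(self, x):
        # Contrast prior: local deviation from a blurred copy of the image.
        blurred = F.avg_pool2d(x, 9, stride=1, padding=4)
        contrast = (x - blurred).abs().mean(1, keepdim=True)
        # Frequency prior: high-frequency residual (image minus low-pass version).
        low = F.interpolate(F.avg_pool2d(x, 8), size=x.shape[-2:], mode="bilinear",
                            align_corners=False)
        freq = (x - low).abs().mean(1, keepdim=True)
        center = self.center_prior.expand(x.size(0), -1, -1, -1)
        priors = torch.cat([center, contrast, freq], dim=1)       # (B, 3, H, W)
        return x * (1 + priors)                                   # prior-enhanced image


class MECAS(nn.Module):
    """Siamese transformer: shared encoder, global branch + prior-enhanced local branch."""
    def __init__(self, size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.size, self.patch = size, patch
        self.prior = PriorLearningModule(size)
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)        # shared (Siamese) weights
        self.head = nn.Conv2d(dim, 1, 1)
        # Attention calibration stand-in: learn per-pixel weights for global vs. local maps.
        self.calibrate = nn.Conv2d(2, 2, 3, padding=1)

    def _branch(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)         # (B, N, dim)
        tokens = self.encoder(tokens)                             # self-attention over patches
        h = w = self.size // self.patch
        feat = tokens.transpose(1, 2).reshape(-1, tokens.size(-1), h, w)
        sal = self.head(feat)
        return F.interpolate(sal, size=(self.size, self.size), mode="bilinear",
                             align_corners=False)

    def forward(self, img):
        global_map = self._branch(img)                # global context branch
        local_map = self._branch(self.prior(img))     # multi-prior enhanced branch
        weights = torch.softmax(self.calibrate(torch.cat([global_map, local_map], 1)), 1)
        fused = weights[:, :1] * global_map + weights[:, 1:] * local_map
        return torch.sigmoid(fused)                   # final saliency map in [0, 1]


if __name__ == "__main__":
    model = MECAS()
    out = model(torch.randn(2, 3, 224, 224))
    print(out.shape)  # torch.Size([2, 1, 224, 224])

The key design point this sketch tries to reflect is weight sharing: both the raw image and its prior-enhanced counterpart pass through the same transformer encoder, and only the fusion step decides how much each resulting saliency map contributes per pixel.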
Fazhan YANG
China University of Mining and Technology
Xingge GUO
China University of Mining and Technology
Song LIANG
China University of Mining and Technology
Peipei ZHAO
China University of Mining and Technology
Shanhua LI
China University of Mining and Technology
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Fazhan YANG, Xingge GUO, Song LIANG, Peipei ZHAO, Shanhua LI, "Siamese Transformer for Saliency Prediction Based on Multi-Prior Enhancement and Cross-Modal Attention Collaboration" in IEICE TRANSACTIONS on Information,
vol. E106-D, no. 9, pp. 1572-1583, September 2023, doi: 10.1587/transinf.2022EDP7220.
Abstract: Visual saliency prediction has improved dramatically since the advent of convolutional neural networks (CNN). Although CNN achieves excellent performance, it still cannot learn global and long-range contextual information well and lacks interpretability due to the locality of convolution operations. We proposed a saliency prediction model based on multi-prior enhancement and cross-modal attention collaboration (ME-CAS). Concretely, we designed a transformer-based Siamese network architecture as the backbone for feature extraction. One of the transformer branches captures the context information of the image under the self-attention mechanism to obtain a global saliency map. At the same time, we build a prior learning module to learn the human visual center bias prior, contrast prior, and frequency prior. The multi-prior input to another Siamese branch to learn the detailed features of the underlying visual features and obtain the saliency map of local information. Finally, we use an attention calibration module to guide the cross-modal collaborative learning of global and local information and generate the final saliency map. Extensive experimental results demonstrate that our proposed ME-CAS achieves superior results on public benchmarks and competitors of saliency prediction models. Moreover, the multi-prior learning modules enhance images express salient details, and model interpretability.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2022EDP7220/_p
@ARTICLE{e106-d_9_1572,
author={Fazhan YANG and Xingge GUO and Song LIANG and Peipei ZHAO and Shanhua LI},
journal={IEICE TRANSACTIONS on Information},
title={Siamese Transformer for Saliency Prediction Based on Multi-Prior Enhancement and Cross-Modal Attention Collaboration},
year={2023},
volume={E106-D},
number={9},
pages={1572--1583},
abstract={Visual saliency prediction has improved dramatically since the advent of convolutional neural networks (CNN). Although CNN achieves excellent performance, it still cannot learn global and long-range contextual information well and lacks interpretability due to the locality of convolution operations. We proposed a saliency prediction model based on multi-prior enhancement and cross-modal attention collaboration (ME-CAS). Concretely, we designed a transformer-based Siamese network architecture as the backbone for feature extraction. One of the transformer branches captures the context information of the image under the self-attention mechanism to obtain a global saliency map. At the same time, we build a prior learning module to learn the human visual center bias prior, contrast prior, and frequency prior. The multi-prior input to another Siamese branch to learn the detailed features of the underlying visual features and obtain the saliency map of local information. Finally, we use an attention calibration module to guide the cross-modal collaborative learning of global and local information and generate the final saliency map. Extensive experimental results demonstrate that our proposed ME-CAS achieves superior results on public benchmarks and competitors of saliency prediction models. Moreover, the multi-prior learning modules enhance images express salient details, and model interpretability.},
keywords={},
doi={10.1587/transinf.2022EDP7220},
ISSN={1745-1361},
month={September},}
TY - JOUR
TI - Siamese Transformer for Saliency Prediction Based on Multi-Prior Enhancement and Cross-Modal Attention Collaboration
T2 - IEICE TRANSACTIONS on Information
SP - 1572
EP - 1583
AU - Fazhan YANG
AU - Xingge GUO
AU - Song LIANG
AU - Peipei ZHAO
AU - Shanhua LI
PY - 2023
DO - 10.1587/transinf.2022EDP7220
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E106-D
IS - 9
JA - IEICE TRANSACTIONS on Information
Y1 - September 2023
AB - Visual saliency prediction has improved dramatically since the advent of convolutional neural networks (CNN). Although CNN achieves excellent performance, it still cannot learn global and long-range contextual information well and lacks interpretability due to the locality of convolution operations. We proposed a saliency prediction model based on multi-prior enhancement and cross-modal attention collaboration (ME-CAS). Concretely, we designed a transformer-based Siamese network architecture as the backbone for feature extraction. One of the transformer branches captures the context information of the image under the self-attention mechanism to obtain a global saliency map. At the same time, we build a prior learning module to learn the human visual center bias prior, contrast prior, and frequency prior. The multi-prior input to another Siamese branch to learn the detailed features of the underlying visual features and obtain the saliency map of local information. Finally, we use an attention calibration module to guide the cross-modal collaborative learning of global and local information and generate the final saliency map. Extensive experimental results demonstrate that our proposed ME-CAS achieves superior results on public benchmarks and competitors of saliency prediction models. Moreover, the multi-prior learning modules enhance images express salient details, and model interpretability.
ER -