Exploring Dual Stream Global Information for Image Captioning

Tiantao Xian, Zhixin Li, Tianyu Chen, Huifang Ma

Length: 00:05:18

13 May 2022

In recent years, image captioning methods based on the encoder-decoder framework have achieved promising results, but most of them do not exploit global information. In general, visual global information can provide more fine-grained details for recognizing small objects, while textual global information provides a coarse understanding of the visual scene. In this paper, we propose the Dual Global Enhanced Transformer (DGET) to explicitly utilize both visual and textual global information. In the encoding stage, a novel Global Enhanced Encoder (GEE) combines two visual features with different properties to obtain a global enhanced visual representation. During decoding, we propose a Global Enhanced Decoder (GED) that utilizes the textual global information explicitly. To validate our model, we conduct extensive experiments on the COCO image captioning dataset and achieve superior performance over many state-of-the-art methods.
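The abstract does not say how the GEE combines its two visual features, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes the two streams are region features and grid features, pools the grid features into a single global vector, and lets each region attend over the grid plus that global vector with standard scaled dot-product attention. The function names `cross_attend` and `fuse_with_global`, the mean pooling, and the residual connection are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Scaled dot-product attention: each query row attends over context rows."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (n_queries, n_context)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context, weights

def fuse_with_global(region_feats, grid_feats):
    """Hypothetical fusion step (an assumption, not taken from the paper)."""
    # Mean-pool the grid features into one global vector of shape (1, d).
    global_vec = grid_feats.mean(axis=0, keepdims=True)
    # Let regions attend over the grid cells plus the pooled global vector.
    context = np.concatenate([grid_feats, global_vec], axis=0)
    attended, weights = cross_attend(region_feats, context)
    # Residual connection keeps the original region information.
    return region_feats + attended, weights

rng = np.random.default_rng(0)
region_feats = rng.normal(size=(5, 8))   # 5 regions, feature dimension 8
grid_feats = rng.normal(size=(9, 8))     # 3x3 grid flattened to 9 cells
enhanced, weights = fuse_with_global(region_feats, grid_feats)
print(enhanced.shape, weights.shape)     # (5, 8) (5, 10)
```

A real encoder would add learned query, key and value projections, multiple heads, and layer normalization. They are omitted here to keep the fusion idea visible.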

Tags: global information, transformer, image captioning, attention mechanism

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
