Captioning Transformer With Scene Graph Guiding
Haishun Chen, Ying Wang, Xin Yang, Jie Li
Image captioning is a challenging task that aims to generate descriptions of images. Most existing approaches adopt the encoder-decoder architecture, where the encoder takes the image as input and the decoder predicts the corresponding word sequence. However, a common defect of these methods is that they ignore the abundant semantic relationships among relevant regions, which can mislead the decoder into producing inaccurate captions. To alleviate this issue, we propose a novel model that exploits the rich semantic relationships provided by a scene graph to guide the word-generation process. To some extent, the scene graph narrows the semantic gap between images and descriptions, and hence improves the quality of the generated sentences. Extensive experimental results demonstrate that our model achieves superior performance on various quantitative metrics.
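The abstract does not spell out how the scene graph steers decoding, but one plausible reading is that (subject, relation, object) triplets are embedded and cross-attended alongside region features at every word-prediction step. Below is a minimal PyTorch sketch of that idea, assuming Faster R-CNN-style 2048-d region features and fixed object/relation vocabularies; the class name, module layout, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SceneGraphGuidedCaptioner(nn.Module):
    # Hypothetical sketch: a transformer decoder whose cross-attention memory
    # contains both image-region features and scene-graph triplet embeddings.
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3,
                 num_objects=1000, num_relations=50, region_dim=2048):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.obj_emb = nn.Embedding(num_objects, d_model)
        self.rel_emb = nn.Embedding(num_relations, d_model)
        # Project detector region features (e.g. Faster R-CNN) to model width.
        self.region_proj = nn.Linear(region_dim, d_model)
        # Fuse each (subject, relation, object) triplet into one memory vector.
        self.triplet_fuse = nn.Linear(3 * d_model, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, regions, triplets, captions):
        # regions:  (B, R, region_dim) visual features of detected regions
        # triplets: (B, T, 3) integer ids for (subject, relation, object)
        # captions: (B, L) token ids of the caption being decoded
        reg = self.region_proj(regions)                    # (B, R, d_model)
        subj = self.obj_emb(triplets[..., 0])
        rel = self.rel_emb(triplets[..., 1])
        obj = self.obj_emb(triplets[..., 2])
        graph = self.triplet_fuse(torch.cat([subj, rel, obj], dim=-1))
        # The decoder cross-attends to regions AND relation triplets jointly,
        # so relational cues can guide each word-generation step.
        memory = torch.cat([reg, graph], dim=1)            # (B, R+T, d_model)
        tgt = self.word_emb(captions)
        L = captions.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(hidden)                            # (B, L, vocab_size)

# Toy forward pass: 2 images, 36 regions, 12 triplets, 20-token captions.
model = SceneGraphGuidedCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 36, 2048),
               torch.randint(0, 50, (2, 12, 3)),
               torch.randint(0, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 10000])
```

Concatenating graph embeddings into the cross-attention memory is only one possible fusion strategy; the paper's actual guiding mechanism may differ (e.g. a separate attention branch or gated fusion).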