Stacked GRU-based Image Captioning

Authors

  • Sangita Nemade, Shefali Sonavane

Abstract

Image captioning is the task of automatically assigning a caption to an image that clearly describes its semantic
content. It is therefore essential to extract semantically significant information from the image and express it in
natural language. Image captioning is challenging because it must bridge the natural language processing
(NLP) and computer vision research areas. Convolutional neural networks (CNNs) and
recurrent neural networks (RNNs) are generally used for image captioning. The gated recurrent unit (GRU), a variant of the RNN,
maintains only a shallow relationship among the sequential hidden states, the last hidden state, and its output at each
timestep. To address this problem, this paper presents an image captioning framework consisting of
DenseNet201 and stacked GRUs. The semantic information acquired from an image is supplied as additional input to
each stacked GRU to guide the model toward the resulting caption. Each word is generated from a
conditional probability distribution that the model computes from the image's visual features and the previously
generated words; the resulting word sequence forms the caption. The performance of this framework is assessed on the Flickr8k
and MS COCO datasets, on which it achieves acceptable improvement for image captioning.
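
The architecture described in the abstract can be illustrated with a minimal PyTorch sketch. All layer sizes and hyperparameters below (embed_dim, hidden_dim, feat_dim, the feature projection) are illustrative assumptions, not the paper's reported configuration; the sketch only shows how the DenseNet201 image features can be concatenated to the input of each stacked GRU layer and how a conditional distribution over the next word is produced at every timestep.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class StackedGRUCaptioner(nn.Module):
    """Sketch of a DenseNet201 + stacked-GRU captioning model.

    The image feature vector is concatenated to the input of *each*
    GRU layer at every timestep, as the abstract describes. Sizes are
    illustrative assumptions, not the paper's configuration.
    """

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=256):
        super().__init__()
        # DenseNet201 encoder; the classifier head is dropped and the
        # 1920-channel feature map is pooled and projected to feat_dim.
        densenet = models.densenet201(weights="DEFAULT")
        self.encoder = densenet.features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.feat_proj = nn.Linear(1920, feat_dim)

        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Two stacked GRU cells; each receives the image features as an
        # extra input alongside the previous word / lower-layer output.
        self.gru1 = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.gru2 = nn.GRUCell(hidden_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids
        feats = self.pool(torch.relu(self.encoder(images))).flatten(1)
        feats = self.feat_proj(feats)                      # (B, feat_dim)
        B, T = captions.shape
        h1 = feats.new_zeros(B, self.gru1.hidden_size)
        h2 = feats.new_zeros(B, self.gru2.hidden_size)
        logits = []
        for t in range(T):
            x = self.embed(captions[:, t])                 # previous word
            h1 = self.gru1(torch.cat([x, feats], dim=1), h1)
            h2 = self.gru2(torch.cat([h1, feats], dim=1), h2)
            logits.append(self.out(h2))
        # A softmax over each step's logits gives P(w_t | w_<t, image);
        # sampling or beam search over it yields the caption at inference.
        return torch.stack(logits, dim=1)                  # (B, T, vocab)
```

At training time the returned logits would be compared against the ground-truth caption with a cross-entropy loss; at inference the per-step distribution is decoded greedily or with beam search to generate the word sequence.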

Published

2020-03-31

Section

Articles