“…Recurrent neural networks (RNNs), as one of the most widespread neural network architectures, mainly focus on a wide variety of applications where data is sequential, such as text classification [38], language modeling [60], speech recognition [64], machine translation [56]. On the other hand, RNNs have also shown their notable success in image processing [17], including but not limited to text recognition in scenes [55], facial expression recognition [6], visual question answering [4], handwriting recognition [52]. As a well-known architecture of RNNs, LSTM [27] has been widely utilized to encode various input (e.g., image, text, audio and video) to improve the recognition performance [60].…”