A Novel Approach on Visual Question Answering by Parameter Prediction using Faster Region Based Convolutional Neural Network

Visual Question Answering (VQA) is a challenging task at the intersection of Natural Language Processing (NLP) and Computer Vision (CV). In this task, a machine must find the answer to a natural language question about an image. Questions can be open-ended or multiple-choice. VQA datasets contain three main components: questions, images and answers. Researchers have addressed the VQA problem with deep learning architectures that combine two networks, a Convolutional Neural Network (CNN) for the visual (image) representation and a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells for the textual (question) representation, and train the combined network end to end to generate the answer. Such models can answer common, simple questions that are directly related to the image's content. However, different types of questions need different levels of understanding to produce correct answers. To address this, we use Faster Region-based CNN (Faster R-CNN) to extract image features, with an extra fully connected layer whose weights are dynamically obtained from LSTM cells according to the question. We claim in this paper that a single R-CNN architecture can solve the problems related to VQA by modifying the weights in the parameter prediction layer. We train the network end to end by Stochastic Gradient Descent (SGD), starting from pre-trained Faster R-CNN and LSTM models, and test it on benchmark VQA datasets.


I. Introduction
Understanding an image with computer vision or image processing techniques is a complex problem that has been studied for the last two decades. Researchers later brought image processing and Natural Language Processing (NLP) together to address image understanding. Traditionally, researchers applied image understanding to Visual Question Answering by means of machine learning. Recently, deep learning architectures built on artificial neural networks have improved visual image understanding [1,2,3]. Object recognition in an image is performed by a Convolutional Neural Network (CNN). Recurrent Neural Networks (RNN) with Long Short Term Memory (LSTM) cells have raised the bar on sequence prediction tasks such as machine translation [18,19]. The CNN produces the feature representation of the image, and LSTMs process the representation of the question and answer. Researchers directly combined both networks and trained them end to end to generate the answer [20,21]. However, this kind of approach can only answer common, simple questions related to the image's content, e.g. 'What is the shape …?' or 'How many?' To find a correct answer for different types of questions, the image must be understood in different ways [4,5].
There are various kinds of questions and answers that are used to train a VQA system. Questions are mainly of these types: fine-grained recognition (e.g. "What kind of animal is in the ground?"), object detection (e.g. "How many birds are there?"), activity recognition (e.g. "Is this boy smiling?"), knowledge-based reasoning (e.g. "Which animal is the reptile in this picture?") and commonsense reasoning (e.g. "Is the boy expecting a job?"). The answer format falls into two categories: yes/no (binary) or multiple-choice. For open-ended, free-form questions the answer has no particular form; it can be one word or several words.
Images are mostly of two types: real (real-world images such as photographs of nature, humans, objects, etc.) and abstract (clip-art, animated scenes). That is why a single type of object recognition is not sufficient for the visual question answering problem; selecting the appropriate recognition task is also significant [9,10].
The computer vision and NLP communities have released different types of approaches. The main methodology among them is to combine the image and question features in a deep learning architecture that classifies the jointly combined features to produce a correct answer. Some approaches use a CNN to extract image features and bag of words (BOW) to obtain question features [11,12].
VQA is an AI-complete task, and it is multi-disciplinary and heterogeneous in nature. A visual/image question answering model can solve complex real-life problems (discussed in future work). It also tackles several sub-problems in a single task: image processing, question answering in NLP and, sometimes, knowledge representation. The motivation for this research problem comes from a variety of standpoints, such as new datasets and techniques in computer vision and image processing, the integration of vision, language and common sense, and the merging of machine learning and deep learning methods. VQA makes it possible to train a machine to learn and answer questions like a human being and to learn about the pictorial world [13]. A VQA system can learn what it needs to learn and what it does not. VQA systems are relevant to various social applications, including extracting situational information from visual content, making decisions from large amounts of investigation data and, last but not least, interaction with robots. This research aims to improve the lives of visually impaired people and to transform society through interaction between human beings and visual information [14].
VQA needs various kinds of image understanding: not only caption generation or image entity recognition, but also visual scene understanding and reasoning with knowledge about the image. The approaches that we discuss in related work treat the VQA problem as a classification task, and some of them design network architectures that solve it in a joint embedding manner. To overcome the limitation described above, we use Faster Region-based CNN (Faster R-CNN) to extract image features and update the weights of that CNN according to the question. To update the weights (parameters), we apply different types of functions to the weight matrix of the main network. Our aim is to generate answers to free-form, open-ended questions about an image.
In Section III we discuss the basic concepts behind the deep learning tools (CNN, RNN, LSTM and word embedding) that are used to build a model for the VQA problem.
In Sections IV and V we describe our proposed model and the methodology used to solve the VQA problem in the fields of CV and NLP.
Our key contribution is a methodology that depends on the question. We apply Faster R-CNN (Region-based CNN) [7] with a parameter prediction layer whose weights are updated according to the question. We apply a word embedding method [12] to the question and build a parameter prediction network from a series of LSTM cells. The weights of the entire network are initialized from pre-trained Faster R-CNN [7] and LSTM [9] models, and the whole network is trained with Stochastic Gradient Descent [22]. Section VI concludes the paper.

II. Related Work
Malinowski et al. [2] proposed a model that estimates the probability of an answer to a given question about an image. This model gives a single-word answer. The authors proposed another model that produces multi-word answers. The model is trained and tested on the DAQUAR dataset.
Kafle et al. [1] proposed a model that is based on Bayes' rule of probability. They trained it on the COCO-QA dataset, which contains four types of answers: counting, object, location and color. The model evaluates the probability of an answer conditioned on the answer type.
Malinowski et al. [3] proposed a method called Neural Image QA, in which image and question features are jointly fed into an LSTM encoder-decoder to generate the answer. Image features are extracted by a CNN and the question embedding is obtained from an LSTM. The main limitation of the model is that it can only be applied to large datasets. They also propose two metrics: "Average Consensus", which accounts for human disagreement, and "Min Consensus", which accounts for the disagreement in the answers given by humans.
Ren et al. [6] proposed a method that uses an RNN and visual semantic embedding without intermediate stages such as image segmentation and object detection. Their main contribution is a question generation algorithm that transforms an image description dataset into question-answer format. They treated the VQA problem as classification rather than answer generation.
Gao et al. [21] proposed a slightly different procedure that uses LSTMs to encode the question and generate the answer in two ways. First, weights are shared within the LSTM encoder-decoder architecture, which focuses on the grammatical structure of the sentence. Second, the image features extracted by the CNN are not fed directly into the encoder at each time step.
Noh et al. [5] introduced a new VQA approach named "DPPnet", which learns a CNN with a dynamic parameter prediction layer. Here the weight matrix of the CNN is dynamically predicted by a parameter prediction network. The parameters are predicted by a Gated Recurrent Unit (GRU), which takes the question as input and produces candidate weights as output. They used a hashing technique to arrange the final weight matrix of the CNN. This method improved the accuracy on the benchmark dataset.
Lin et al. [14] did not use any RNN to encode the question. They use three kinds of CNN to answer a question about an image: a CNN for the image, a CNN for the sentence and an extra multimodal CNN. The sentence CNN uses three layers of convolution and max pooling to create the sentence representation. The relations between the feature vectors of the image and the question are captured by a multimodal convolution step. They added two extra fully connected layers: one for multimodal convolution, in which one image feature is embedded between two consecutive semantic components of the question, and a softmax classification layer that predicts the answer.
Shih et al. [15] proposed an attention-based model, henceforth referred to as Where To Look (WTL). The authors use VGG net to extract the image features. They average the word vectors of the question to obtain the question vector. They then compute an attention vector over the set of image features to choose which regions to focus on. The final image representation is the sum of the attention-weighted regions of the image. Finally, they concatenate the question features with this weighted sum and feed the result through a fully connected softmax layer to predict the answer.
Yang et al. [16] proposed a model called Stacked Attention Networks (SAN) that encodes the question using either a CNN or an LSTM. The question encoding is then used to attend over the image. Next, the question encoding and the attention weights are concatenated to calculate attention over the original image. This model increases the accuracy of the VQA system through the use of attention.
Andreas et al. [17] proposed a new approach called Neural Module Network (NMN) that is based on semantic parsing of the question. The parse tree is converted into modules that answer the question. The modules are composable and independent. Question parsing is performed to identify the grammatical relations between parts of the sentence. They use ad-hoc handwritten rules to convert the parse tree into structured queries. The main problem of this model is the parsing of the question: because the network structure is fixed once the question is parsed, parsing errors cannot be recovered.

III. Basic Concepts
A. Convolutional Neural Network (CNN)
A CNN is used to extract a feature vector from an image. It has two phases: feature extraction and output prediction. Feature extraction uses two kinds of layers, convolution layers and sub-sampling (pooling) layers, while a fully connected layer performs classification or output prediction. In a convolution layer, each output value is obtained by a series of multiplications followed by a summation, and the resulting feature map is then passed through an activation function.
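As a rough illustration of the feature extraction phase described above (convolution, activation, pooling), the following pure-Python sketch applies a single filter to a small image. The 5x5 input, the 2x2 kernel, the stride of 1 and the choice of ReLU as activation are assumptions made for illustration only; they are not the settings of our model.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Multiply the window by the kernel element-wise, then sum.
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap: Matrix) -> Matrix:
    """Activation function applied to every value of the feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


# Toy input and filter (hypothetical values).
image = [
    [1.0, 2.0, 0.0, 1.0, 3.0],
    [0.0, 1.0, 2.0, 3.0, 1.0],
    [1.0, 0.0, 1.0, 2.0, 2.0],
    [2.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 1.0, 0.0, 1.0],
]
kernel = [[1.0, 0.0], [0.0, -1.0]]

# 5x5 image -> 4x4 feature map -> 2x2 pooled features.
features = max_pool(relu(conv2d_valid(image, kernel)))
```

In a full CNN the pooled features of the last stage would be flattened and passed to the fully connected layer for output prediction.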

B. Recurrent Neural Network (RNN)
A Recurrent Neural Network [4,24,25] is used to learn sequential data such as series of video frames, text or music. The main difference between an RNN and a feed-forward ANN is an additional weight matrix applied to the previous hidden state: the hidden state is fed back as input at every time step, and this is repeated along the sequence. The main problem with RNNs is the vanishing gradient problem: during back-propagation, the gradient (a product of partial derivatives of the error) tends to vanish. In the mathematical formulation of the RNN, $o_t$ is the output and $x_t$ is the input at time $t$, $h_t$ is the state of the hidden layer(s) at time $t$, and $\theta$ encapsulates the weight and bias parameters. Fig. 2 shows a simple graphical model that explains the relation between these three variables in an RNN computation graph.
In Fig. 2, the values $\theta_i$, $\theta_h$ and $\theta_o$ represent the parameters associated with the inputs, the previous hidden layer states and the outputs, respectively.
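The recurrence can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions: the hidden state uses a tanh activation, the output is a linear read-out of the hidden state, biases are omitted, and all matrix values are hypothetical. The names `theta_i`, `theta_h` and `theta_o` mirror the parameter groups named above.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(x_t: Vector, h_prev: Vector, theta_i: Matrix,
             theta_h: Matrix, theta_o: Matrix) -> Tuple[Vector, Vector]:
    """One time step: new hidden state h_t and output o_t."""
    pre = [a + b for a, b in zip(matvec(theta_i, x_t), matvec(theta_h, h_prev))]
    h_t = [math.tanh(p) for p in pre]   # assumed activation
    o_t = matvec(theta_o, h_t)          # assumed linear read-out
    return h_t, o_t


def run_rnn(inputs: List[Vector], h0: Vector, theta_i: Matrix,
            theta_h: Matrix, theta_o: Matrix) -> Tuple[List[Vector], Vector]:
    """Apply the same parameters at every time step of the sequence."""
    h, outputs = h0, []
    for x_t in inputs:
        h, o_t = rnn_step(x_t, h, theta_i, theta_h, theta_o)
        outputs.append(o_t)
    return outputs, h


# Hypothetical parameters: 2 inputs, 2 hidden units, 1 output.
theta_i = [[0.5, -0.2], [0.1, 0.3]]
theta_h = [[0.1, 0.0], [0.0, 0.1]]
theta_o = [[1.0, -1.0]]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, final_h = run_rnn(sequence, [0.0, 0.0], theta_i, theta_h, theta_o)
```

Because the same `theta_h` is multiplied in at every step, gradients propagated back through many steps are products of many similar factors, which is where the vanishing gradient problem comes from.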

C. Long Short Term Memory (LSTM)
LSTM [9] resolves the vanishing gradient problem and has been successful in natural language processing applications such as machine translation and speech recognition. An LSTM cell mainly consists of three gates (an input gate, an output gate and a forget gate) and a cell state. In essence, the LSTM decides which part of what the network has learned is relevant and what to forget.

The cell state is the long-term memory: it represents everything learned over time, whereas the hidden state acts as the current memory. The forget gate is also called the remember vector; it learns what to forget and what to remember. The input gate is also called the save vector; it determines how much of the input gets into the cell state. The output gate is also called the focus vector and is akin to an attention mechanism (which part of the data should be focused on). Each of these gates is a perceptron, i.e. a single-layer neural network. LSTM functionality is therefore mainly based on forgetting, remembering and paying attention to data. The mathematical formulation of the LSTM is given in equations (3) to (8). In equation (3), the output of the forget gate tells the cell state what to forget: a value of 0 at a given position of the LSTM matrix erases the information there, while a value of 1 keeps it. Here the sigmoid activation function is applied to the weighted input and previous hidden state.
In equations (4) and (5), the output of the input gate is a sigmoid function with range [0, 1]. Because this output is never negative, the input gate alone cannot remove information from the cell state. The output of the input modulation gate is a tanh activation function with range [-1, 1]; its negative values permit the cell state to forget information.
Equation (6) is the cell state equation. The previous cell state forgets information when it is multiplied by the output of the forget gate, and new information is added through the output of the input gate. Equation (7) is the output gate equation. The output of the output gate selects which values of the LSTM matrix move forward to the next hidden state.
In equation (8), the hidden state is computed; it defines what information is passed on to the next step of the sequence.
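For reference, a standard LSTM formulation that matches the descriptions of equations (3) to (8) is sketched below. The symbols ($W_*$, $U_*$ and $b_*$ for weights and biases, $\sigma$ for the sigmoid, $\odot$ for element-wise multiplication, $f_t$, $i_t$, $g_t$, $o_t$ for the forget, input, input modulation and output gates) and the assignment of equation numbers are assumptions based on the common textbook form; they are not a reproduction of the original typeset equations.

```latex
\begin{align}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \tag{3} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \tag{4} \\
g_t &= \tanh\left(W_g x_t + U_g h_{t-1} + b_g\right) \tag{5} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \tag{6} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \tag{7} \\
h_t &= o_t \odot \tanh\left(c_t\right) \tag{8}
\end{align}
```

In this form, a forget-gate entry of 0 erases the corresponding entry of $c_{t-1}$ and an entry of 1 keeps it, while the sign of $g_t$ decides whether the new contribution raises or lowers the cell state.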

E. Word Embedding
Words can be represented as vectors of real-valued numbers. In such a representation, words with similar vectors should be semantically similar, that is, they should represent related concepts. For practical purposes these vectors typically have a dimension in the tens or hundreds, which is far smaller than the vocabulary size. Word2vec, GloVe and Skip-thought [12] are some word embedding methods.
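The idea that related words have similar vectors is usually checked with cosine similarity. The sketch below uses three hand-made 3-dimensional vectors; the words and values in `toy_embeddings` are hypothetical and stand in for vectors that a method such as GloVe would learn from data.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; real ones would have tens or hundreds of dimensions.
toy_embeddings: Dict[str, List[float]] = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
```

With these toy values, "cat" is closer to "dog" than to "car", which is the behavior expected from a useful embedding.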

IV. Proposed Work
In this section we discuss each tool that we use in the model and how they work together for solving this heterogeneous problem.

A. Image Features
In our method, we use the pre-trained Faster R-CNN model with VGG-16 [7] (a 16-layer CNN in the Faster R-CNN framework). The input image is fed through Faster R-CNN [7] to get a vector representation. The R-CNN network is trained on the Visual Genome [13] dataset, using the image annotations to focus on the relevant elements of the images. We vary the value of K (the number of image regions) according to the content of each image. After the convolution layers of Faster R-CNN there is a region proposal network (Figs. 4 and 5), which selects parts of the feature map of the input image, called proposals. Each proposal is divided into regions, and selecting the maximum feature value from each region is called ROI (Region of Interest) pooling. ROI pooling yields the required image feature vector. We remove the last fully connected layer and attach three fully connected layers. The second-to-last fully connected layer is the parameter prediction layer, whose weights are updated according to the question. The dimension of the final output vector is equal to the number of possible answers. In the final (fully connected) layer, a softmax classifier [23] is applied to the output vector to calculate the probability of each answer. The output of the classification network, $O = [o_1, \ldots, o_n]^T$, is

$O = W_r(Q)\, I + b$    (9)

where $I = [i_1, \ldots, i_n]^T$ is the input vector of the parameter prediction layer, $W_r(Q)$ is the weight matrix constructed by the LSTM network for the question $Q$, and $b$ is the bias.
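The ROI pooling step described in this subsection can be sketched as follows. This is a simplified illustration, not the Faster R-CNN implementation: the proposal is given directly in feature-map coordinates, the output grid is 2x2, and the function name `roi_max_pool` and the toy feature map are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def roi_max_pool(fmap: Matrix, roi: Tuple[int, int, int, int],
                 out_size: int = 2) -> Matrix:
    """Max-pool one proposal into a fixed out_size x out_size grid.

    roi = (top, left, bottom, right) in feature-map cells,
    with bottom and right exclusive.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    pooled = []
    for i in range(out_size):
        # Row range of bin i; every bin covers at least one cell.
        r0 = top + (i * height) // out_size
        r1 = max(r0 + 1, top + ((i + 1) * height) // out_size)
        row = []
        for j in range(out_size):
            c0 = left + (j * width) // out_size
            c1 = max(c0 + 1, left + ((j + 1) * width) // out_size)
            # Keep only the maximum feature value of the bin.
            row.append(max(fmap[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map (hypothetical values).
fmap = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0, 1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    [1.0, 3.0, 5.0, 7.0, 9.0, 2.0],
]

pooled_a = roi_max_pool(fmap, (0, 1, 4, 5))  # 4x4 proposal
pooled_b = roi_max_pool(fmap, (1, 0, 4, 3))  # 3x3 proposal
```

Both proposals are reduced to the same 2x2 shape although they differ in size, which is what allows proposals of any size to feed fully connected layers of fixed input dimension.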

B. Question Features
First we tokenize the question sentence, that is, we split it into words at spaces and punctuation; numbers are also treated as words. Questions are clipped to 14 words and any extra words are discarded. We then use pre-trained GloVe (Global Vectors for Word Representation) word embeddings, in which each word is mapped to a 300-dimensional vector. Questions shorter than 14 words are padded with vectors of zeros. After word embedding, we feed the output sequence into the LSTM. The output of the LSTM ($Q_l$) is multiplied by the weight matrix ($W_l$) of the last fully connected layer of the LSTM network to obtain the candidate weight vector. The candidate weight vector, denoted $L = [l_1, \ldots, l_Z]$, is

$L = W_l\, Q_l$    (10)

where $W_l$ is the weight matrix of a fully connected layer of the LSTM network and $Q_l$ is the final output vector of the question.
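The preprocessing steps of this subsection (tokenize, clip to 14 words, embed, pad with zeros) can be sketched as below. This is a minimal sketch under assumptions: the regular expression used for tokenization, the lower-casing, and the use of a zero vector for words missing from the embedding table are illustrative choices, and `embeddings` is a plain dictionary standing in for a loaded GloVe table.

```python
import re
from typing import Dict, List

MAX_WORDS = 14   # question length used in this section
EMBED_DIM = 300  # GloVe vector size used in this section


def tokenize(question: str) -> List[str]:
    """Lower-case and split on spaces and punctuation; numbers count as words."""
    return re.findall(r"[a-z0-9]+", question.lower())


def embed_question(question: str, embeddings: Dict[str, List[float]],
                   max_words: int = MAX_WORDS,
                   dim: int = EMBED_DIM) -> List[List[float]]:
    """Return exactly max_words vectors of size dim for one question."""
    zero = [0.0] * dim
    tokens = tokenize(question)[:max_words]            # clip extra words
    # Assumption: out-of-vocabulary words map to the zero vector.
    vectors = [list(embeddings.get(tok, zero)) for tok in tokens]
    # Pad short questions with zero vectors.
    vectors += [list(zero) for _ in range(max_words - len(vectors))]
    return vectors


# Hypothetical one-word embedding table for demonstration.
toy_table = {"birds": [1.0] * EMBED_DIM}
sequence = embed_question("How many birds are there?", toy_table)
```

The resulting fixed-length sequence of 14 vectors is what would be fed, one vector per time step, into the LSTM.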

C. Parameter Prediction
We update the weight (parameter) matrix of the second-to-last layer of the R-CNN on the basis of the candidate vector of the LSTM network, using a parameter reduction technique to keep the network compact. The weight matrix of the second-to-last fully connected layer of Faster R-CNN is denoted $W_r(Q) = [w^r_{11}, \ldots, w^r_{kl}]$, where $w^r_{kl}$ is the weight between the $k$-th output neuron and the $l$-th input neuron. £$(k, l)$ is a function that maps a key $(k, l)$ to a natural number in $\{1, \ldots, Z\}$, where $Z$ is the dimensionality of the candidate vector $L$. The final value is given by

$w^r_{kl} = l_{£(k,l)} \cdot \omega(k, l)$    (11)

where $l_{£(k,l)}$ is the element of the candidate vector $L$ at index £$(k, l)$ and $\omega : \mathbb{N} \times \mathbb{N} \to \{+1, -1\}$ is another function that removes the bias. With these two functions we update the weight matrix $W_r(Q)$ according to the question $Q$. In the final layer, a softmax classifier is applied to the output vector to calculate the probability of each answer.
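The construction of the question-dependent weight matrix can be sketched as follows. This is an illustrative sketch, not our implementation: `bucket` plays the role of the mapping function £ and `sign` the role of ω, and implementing both with CRC32 is purely an assumption (any fixed, deterministic hash would do). Indices are 0-based in the code, whereas the text uses {1, ..., Z}. All numeric values are hypothetical.

```python
import math
import zlib
from typing import List


def bucket(k: int, l: int, z: int) -> int:
    """Map the key (k, l) to an index into the candidate vector (0-based)."""
    return zlib.crc32(f"{k},{l}".encode()) % z


def sign(k: int, l: int) -> int:
    """Map the key (k, l) to +1 or -1."""
    return 1 if zlib.crc32(f"sign:{k},{l}".encode()) % 2 == 0 else -1


def build_weight_matrix(candidates: List[float], n_out: int,
                        n_in: int) -> List[List[float]]:
    """Fill an n_out x n_in matrix by sharing the Z candidate weights."""
    z = len(candidates)
    return [[candidates[bucket(k, l, z)] * sign(k, l) for l in range(n_in)]
            for k in range(n_out)]


def predict_layer(w: List[List[float]], inputs: List[float],
                  bias: List[float]) -> List[float]:
    """Output of the layer: weight matrix times input, plus bias."""
    return [sum(wi * x for wi, x in zip(row, inputs)) + b
            for row, b in zip(w, bias)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over answer scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate vector from the question network (Z = 4).
candidate_vector = [0.5, -1.0, 2.0, 0.25]
weights = build_weight_matrix(candidate_vector, n_out=3, n_in=5)
scores = predict_layer(weights, [1.0, 0.0, 2.0, 1.0, 0.5], [0.0, 0.0, 0.0])
probabilities = softmax(scores)
```

The point of the design is that a 3x5 matrix (15 weights) is generated from only 4 predicted values, so the question network has far fewer outputs to predict than the layer has weights.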

D. Training
Stochastic gradient descent is used to train the network. Before training, questions are tokenized [10] and upper-case letters are converted to lower case; answers are preprocessed in the same way. To prevent overfitting, we identify the epoch that gives the best performance on the VQA dataset and then rerun training for the same number of epochs. The weights of the weight-updating (parameter prediction) layer of the R-CNN change in each batch. To clip the gradients of the LSTM network, we set a range for the gradient norm and scale down any gradient that exceeds it. We reuse the same hyperparameters to normalize the loss. We stop training the LSTM early when the training loss has not improved for several epochs. When the LSTM network starts to overfit, we stop tuning the LSTM while training of the other parts of the network continues.
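Gradient clipping by norm, as used for the LSTM, can be sketched in a few lines. This is a generic sketch of the technique; the function name `clip_by_norm` and the threshold of 1.0 in the example are assumptions, not the values used in our experiments.

```python
import math
from typing import List


def clip_by_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # already within range: unchanged
    scale = max_norm / norm         # shrink, keeping the direction
    return [g * scale for g in grads]


large = clip_by_norm([3.0, 4.0], max_norm=1.0)   # norm 5.0 is scaled down
small = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 is left as is
```

Clipping changes only the length of the gradient, not its direction, so an unusually large gradient still moves the parameters the same way, just by a bounded step.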

E. Training Error by Stochastic Gradient Descent (SGD)
The use of SGD in neural networks is motivated by the high cost of running back-propagation over the full training set. SGD overcomes this cost and still leads to fast convergence. Back-propagation provides the gradients but does not say how to use them; SGD uses them to minimize the objective function. The standard gradient descent algorithm updates the parameters $\theta$ of the objective function $J(\theta)$ as

$\theta = \theta - \alpha \nabla_\theta E[J(\theta)]$    (12)

where $\alpha$ is the learning rate and the expectation in equation (12) is approximated by evaluating the cost and gradient over the full training set. Stochastic Gradient Descent (SGD) simply does away with the expectation in the update and computes the gradient of the parameters using only a single or a few training examples.
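The difference from full-batch gradient descent is easiest to see on a toy problem. The sketch below fits a line to six noiseless points with mini-batch SGD; the data, the squared-error objective, the learning rate, the batch size and the number of epochs are all assumptions chosen for illustration and are unrelated to the settings of our network.

```python
import random
from typing import List, Tuple


def sgd_linear(data: List[Tuple[float, float]], lr: float = 0.05,
               epochs: int = 200, batch_size: int = 2,
               seed: int = 0) -> Tuple[float, float]:
    """Fit y = w * x + b by mini-batch SGD on the squared error."""
    rng = random.Random(seed)
    samples = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Gradient estimated from a few examples only, not the full set.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]]
w, b = sgd_linear(data)
```

Each update touches only two examples, so its cost does not grow with the size of the training set, yet the parameters still approach the slope and intercept that generated the data.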

V. Experiment
We tested our model on several VQA datasets (COCO-QA, VQA, DAQUAR-all and DAQUAR-reduced) [13], both before and after fine-tuning the LSTM and R-CNN. We also describe the differences between existing methods and our method, and we present test results for sample images and questions in this section. Table I lists the differences between previous approaches and our approach in terms of the method used, the answer type and the image features (name of the pre-trained CNN model). Joint embedding means combining the image and question representations. An attention method focuses on particular regions of the image or parts of the question rather than on the whole image or question. A knowledge base stores facts about the world; it has an inference engine that reasons about these facts using rules and logic and deduces new facts.

A. Datasets
Datasets generally consist of triplets (question, answer and image). Some datasets contain additional annotations. Most of the answers are single words or phrases (see Table II).

B. Evaluation Metrics
For the DAQUAR and COCO-QA datasets there are two evaluation metrics: classification accuracy and WUPS, which is based on the Wu-Palmer similarity [6] and uses the WordNet [5] taxonomy to calculate the semantic similarity between words. It measures the similarity between the machine-generated answer and the human answer.
Accuracy: the classification accuracy is the proportion of questions for which the predicted answer matches the ground truth. WUPS: the score is computed from $\mu(a, t)$, the thresholded Wu-Palmer similarity between a prediction $a$ and the ground truth $t$. We use two threshold values in our evaluation, 0.9 and 0.0. This means that 0 <= score < 1.
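The two metrics can be sketched for single-word answers as follows. Several things here are assumptions: real Wu-Palmer similarities come from WordNet, whereas `toy_wup` below looks them up in a small hand-made table with hypothetical values; the down-weighting factor of 0.1 for similarities below the threshold follows the commonly used definition of WUPS and is not stated in this section; and scores are reported as percentages.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical similarity table standing in for WordNet-based Wu-Palmer scores.
TOY_SIMILARITY: Dict[FrozenSet[str], float] = {
    frozenset(["cat", "dog"]): 0.8,
    frozenset(["red", "blue"]): 0.5,
}


def toy_wup(a: str, t: str) -> float:
    """Toy similarity: 1.0 for identical words, table lookup otherwise."""
    if a == t:
        return 1.0
    return TOY_SIMILARITY.get(frozenset([a, t]), 0.0)


def thresholded(sim: float, threshold: float) -> float:
    """Down-weight similarities below the threshold (assumed factor 0.1)."""
    return sim if sim >= threshold else 0.1 * sim


def accuracy(preds: List[str], truths: List[str]) -> float:
    """Percentage of exactly matching answers."""
    correct = sum(1 for a, t in zip(preds, truths) if a == t)
    return 100.0 * correct / len(preds)


def wups(preds: List[str], truths: List[str],
         wup: Callable[[str, str], float], threshold: float) -> float:
    """Average thresholded similarity, as a percentage."""
    total = sum(thresholded(wup(a, t), threshold)
                for a, t in zip(preds, truths))
    return 100.0 * total / len(preds)


predictions = ["cat", "dog", "red"]
ground_truth = ["cat", "cat", "blue"]

acc = accuracy(predictions, ground_truth)
wups_00 = wups(predictions, ground_truth, toy_wup, threshold=0.0)
wups_09 = wups(predictions, ground_truth, toy_wup, threshold=0.9)
```

With a threshold of 0.0 every partially similar answer earns partial credit, while a threshold of 0.9 makes the score behave almost like exact-match accuracy.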

C. Results and Discussion
In this section, the proposed model is evaluated on four datasets: DAQUAR-all, DAQUAR-reduced, COCO-QA and VQA test-dev. We compare it with existing methods with respect to accuracy and WUPS score. Tables III, IV and V compare the results obtained by our proposed model with those of other existing models (Neural QA, 3-CNN, Attribute LSTM, DPPNet, Bayesian) in terms of accuracy, WUPS@0.0 and WUPS@0.9. The proposed algorithm consistently outperforms the existing approaches on all benchmarks. Figs. 6 and 7 show how the proposed model performs on these different types of questions. The qualitative results of the proposed model are presented in Fig. 6. In general, the proposed method deals effectively with different sorts of questions that need different levels of semantic understanding, which demonstrates that the system is able to recognize the task depending on the question. Fig. 7, in turn, shows that the proposed model is effective at answering different questions on various images.
Some major advantages of our proposed model are the following:
• The accuracy of this model is better than that of previous VQA models, because we use a single pre-trained Faster R-CNN model to obtain the visual features.
• We solve the VQA problem not only through joint embedding but also by updating the weights (parameters) of the network (R-CNN); that is why the number of probabilities (classes) in the last classification layer is smaller than in the other approaches.
The model also has limitations. A CNN loses its internal information about the position and placement of an object, and it routes all the information to the same neurons, which may not be able to deal with this kind of information. A CNN makes predictions by looking at an image and checking whether certain components are present in it; if they are, it classifies the image accordingly.
Fine-tuning the hyperparameters of our proposed model is non-trivial, and we need a large dataset to train the model. We observed that when the model is trained on the DAQUAR-reduced dataset, the loss increases drastically after several epochs. These are the complexities of using a CNN architecture.

VI. Conclusion
This paper focuses on a deep learning based model for both open-ended and multiple-choice questions. It describes several experiments and the contribution of each design choice, and it highlights the importance of several mechanisms of a VQA model. It shows how Faster R-CNN speeds up and strengthens object detection in an image, increasing overall accuracy by 13% over the other models compared (Neural QA, 3-CNN, Attribute LSTM, Bayesian). The paper describes how our model updates the weight matrix of a fully connected layer of Faster R-CNN according to the candidate weights obtained from the question. It has been shown that the VQA problem can be solved by a single R-CNN model. As future work, capsule networks should be investigated for the object detection part of the VQA system.