“All our knowledge has its origins in our perceptions.” ~Leonardo da Vinci.
Perception is the organization, identification, and interpretation of sensory information in order to represent and understand the presented information or environment. It can differ from person to person, be shaped by one’s surroundings, or closely match someone else’s perception.
Automatic synthesis of realistic images from text is both interesting and useful, and several AI systems have tackled it in recent years. GAN-INT-CLS, which pairs an RNN text encoder with a GAN decoder, came out in 2016 and was the first paper to propose text-to-image synthesis using generative adversarial modeling. In recent years, generic and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations. Meanwhile, deep convolutional generative adversarial networks (GANs) have begun to generate highly compelling images of specific categories, such as faces, album covers, and room interiors. Text-conditioned models such as GAWWN and StackGAN build on this progress.
Automatically generating images from natural language descriptions is a fundamental problem in many applications, such as art generation and computer-aided design. It also drives research progress in multimodal learning, in inference across vision and language, and in one of the most important areas of all: biomedical imaging and research, which has been among the most active research areas in recent years.
Class imbalance is a serious problem in Artificial Intelligence. With Perception.AI, we aim to address it by generating images for rare classes that otherwise occur infrequently or not at all.
In this project, we propose to implement a text-to-image adversarial generative model that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation.
The project is split into three parts. First, we review Deep Learning concepts for data and modeling and how to apply them to different tasks, including vision and language tasks. Next, we move to development, where we take the models we trained and incorporate them into a real-world application. Finally, we deploy our application on Google Cloud Platform (GCP).
We used the Caltech-UCSD Birds 200 (CUB-200) dataset. It is an image dataset with photos of 200 bird species (mostly North American).
It contains approximately 12 thousand images, with one bounding box per image for object detection and 10 captions corresponding to each image.
The components of this work listed below cover the frontend, backend, and all aspects of pipeline enablement:
We took StackGAN as our baseline model and implemented it from scratch.
It supported two important ideas:
The code for this baseline is available at the GitHub link provided at the end of the article. A drawback of StackGAN is that only a single sentence embedding is used as input, so there is no word-level association between the sentence and the image.
Therefore, we researched further and switched to AttnGANs for an improved model.
With a novel attentional generative network, the AttnGAN can synthesize fine-grained details at different subregions of the image by paying attention to the relevant words in the natural language description. In addition, a deep attentional multimodal similarity model is proposed to compute a fine-grained image-text matching loss for training the generator. It supports two important ideas:
The AttnGAN builds on StackGAN by adding an attention network, which allows it to capture word-level information along with the broader sentence-level information that StackGAN already has in the form of an embedding. It does this by passing the sentence through a bidirectional LSTM, which outputs sentence-level and word-level features. The sentence-level feature is a D-dimensional vector, whereas the word-level features form a D x T matrix, where T is the number of words in the text description and D is the dimension of the embedding.
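To make the shapes concrete, here is a minimal NumPy sketch of a bidirectional encoder. It is an illustration under stated assumptions, not the model used in this project: a plain tanh RNN stands in for each LSTM direction, the weights are random, and the names `run_rnn` and `encode_text` are hypothetical.

```python
import numpy as np

def run_rnn(embeds, w_in, w_h):
    """Plain tanh RNN standing in for one LSTM direction (assumption).
    embeds: (T, E) word embeddings. Returns hidden states of shape (T, H)."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x in embeds:
        h = np.tanh(x @ w_in + h @ w_h)
        states.append(h)
    return np.stack(states)

def encode_text(embeds, params_fwd, params_bwd):
    """Return word features (D, T) and a sentence feature (D,), with D = 2H."""
    fwd = run_rnn(embeds, *params_fwd)              # left-to-right states
    bwd = run_rnn(embeds[::-1], *params_bwd)[::-1]  # right-to-left states, back in word order
    word_feats = np.concatenate([fwd, bwd], axis=1).T  # (D, T)
    sent_feat = np.concatenate([fwd[-1], bwd[0]])      # final state of each direction
    return word_feats, sent_feat

rng = np.random.default_rng(0)
T, E, H = 7, 16, 8  # words, embedding size, hidden size per direction (illustrative)
make_params = lambda: (0.1 * rng.standard_normal((E, H)), 0.1 * rng.standard_normal((H, H)))
word_feats, sent_feat = encode_text(rng.standard_normal((T, E)), make_params(), make_params())
print(word_feats.shape, sent_feat.shape)
```

The word features keep one column per word, which is what the attention stages consume later, while the sentence feature summarizes the whole description in a single vector.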
The first step is to pass the sentence-level embedding through conditioning augmentation. This draws a random sample from a normal distribution whose mean and (diagonal) covariance are both computed from the sentence vector. This makes the model more robust by giving it a larger variety of samples.
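The sampling step can be sketched in a few lines of NumPy. This is a minimal sketch assuming the mean and log-variance come from two linear maps of the sentence embedding; the weight matrices `w_mu` and `w_logvar`, the sizes, and the function name are hypothetical and stand in for learned layers.

```python
import numpy as np

def conditioning_augmentation(sentence_emb, w_mu, w_logvar, rng):
    """Sample c ~ N(mu(e), diag(sigma(e)^2)) via the reparameterization trick."""
    mu = sentence_emb @ w_mu             # mean predicted from the sentence embedding
    logvar = sentence_emb @ w_logvar     # log of the diagonal variance
    sigma = np.exp(0.5 * logvar)
    eps = rng.standard_normal(mu.shape)  # fresh noise on every call
    return mu + sigma * eps, mu, logvar

rng = np.random.default_rng(0)
D, C = 256, 100  # embedding size and conditioning size (illustrative values)
e = rng.standard_normal(D)
w_mu = 0.01 * rng.standard_normal((D, C))      # hypothetical weights
w_logvar = 0.01 * rng.standard_normal((D, C))  # hypothetical weights
c1, mu, logvar = conditioning_augmentation(e, w_mu, w_logvar, rng)
c2, _, _ = conditioning_augmentation(e, w_mu, w_logvar, rng)
print(c1.shape, np.allclose(c1, c2))
```

Calling the function twice on the same sentence gives two different conditioning vectors, which is exactly the extra variety the generator sees during training.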
We now take the sentence-level representation, concatenate it with a random noise vector, and pass it through the first feature generator, which is responsible for most of the upsampling and outputs a hidden state.
The hidden state is then passed on to the next stage of the feature generator along with the word-level embeddings. This stage computes, for every sub-region of the image, a set of attention weights indicating how important each word was in drawing that sub-region. The third feature generator does the same thing as the second.
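The word attention in these stages can be sketched as follows. This is a simplified illustration: it assumes the word features have already been brought to the same dimension D as the hidden state (in the AttnGAN paper a learned layer does this projection), and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def word_attention(hidden, words):
    """hidden: (D, N) features of N sub-regions; words: (D, T) word features.
    Returns a word-context matrix (D, N) and attention weights (N, T)."""
    scores = hidden.T @ words       # (N, T) relevance of each word to each sub-region
    attn = softmax(scores, axis=1)  # each sub-region spreads weight 1 over the words
    context = words @ attn.T        # (D, N) weighted mix of word features per sub-region
    return context, attn

rng = np.random.default_rng(0)
D, N, T = 32, 64, 7  # feature size, sub-regions, words (illustrative values)
context, attn = word_attention(rng.standard_normal((D, N)), rng.standard_normal((D, T)))
print(context.shape, attn.shape)
```

Each row of `attn` is the per-word importance for one sub-region, which is also the quantity that can be visualized as an attention map.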
Training this model is quite interesting, as every feature generator has its own image generator and discriminator. Each feature generator passes its hidden state to its image generator, which is just a set of convolutional layers that convert the hidden state into an RGB image. The generated image is then passed through a discriminator, which tries to tell whether the image is real or fake. Both the generator and the discriminator improve based on this feedback.
The loss for the discriminator is broken up into two parts: the unconditional loss and the conditional loss. The unconditional loss measures how well the discriminator can tell whether the image is real or not. The conditional loss measures how closely the sentence matches the image, that is, whether the image makes sense given the sentence.
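A minimal sketch of this two-part loss, assuming the discriminator outputs probabilities in (0, 1) and that both parts are binary cross-entropy terms weighted by one half. The function names are hypothetical, and any extra terms a real implementation may use (for example, mismatched image-sentence pairs) are left out.

```python
import numpy as np

def bce(prob, target):
    """Binary cross-entropy between predicted probabilities and a 0/1 target."""
    prob = np.clip(prob, 1e-7, 1 - 1e-7)  # avoid log(0)
    return float(-(target * np.log(prob) + (1 - target) * np.log(1 - prob)).mean())

def discriminator_loss(p_real, p_fake, p_real_cond, p_fake_cond):
    """p_real / p_fake: D(image); p_real_cond / p_fake_cond: D(image, sentence)."""
    uncond = 0.5 * (bce(p_real, 1.0) + bce(p_fake, 0.0))          # is the image real?
    cond = 0.5 * (bce(p_real_cond, 1.0) + bce(p_fake_cond, 0.0))  # does it match the sentence?
    return uncond + cond

half = np.full(4, 0.5)
confused = discriminator_loss(half, half, half, half)
perfect = discriminator_loss(np.ones(4), np.zeros(4), np.ones(4), np.zeros(4))
print(confused, perfect)
```

A discriminator that outputs 0.5 everywhere pays a loss of 2 ln 2, while one that separates real from fake perfectly pays almost nothing.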
The loss for the generator can be divided into two parts: the sum of the losses for all generators, and lambda times the DAMSM loss.
Let’s go through the first part of the generator loss first: the sum of the losses for all generators, that is, image generators one, two, and three. Each generator loss is made up of an unconditional and a conditional loss. It tries to capture whether the image looks real or fake and how well the sentence-level vector matches the image.
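Written out, the generator objective described above looks roughly like this. The notation is an assumption based on the AttnGAN formulation as we understand it: m = 3 is the number of generators, D_i is the discriminator paired with generator G_i, \hat{x}_i is the image produced at stage i, and \bar{e} is the sentence vector.

```latex
\begin{aligned}
L &= L_G + \lambda\, L_{DAMSM}, \qquad L_G = \sum_{i=0}^{m-1} L_{G_i}, \\
L_{G_i} &= \underbrace{-\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log D_i(\hat{x}_i)\big]}_{\text{unconditional loss}}
\;\underbrace{-\,\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log D_i(\hat{x}_i, \bar{e})\big]}_{\text{conditional loss}}
\end{aligned}
```

The unconditional term rewards images the discriminator takes for real, and the conditional term rewards images the discriminator judges to match the sentence.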
To capture how well the word-level features are reflected in the image, the authors of the paper came up with the DAMSM loss, which stands for Deep Attentional Multimodal Similarity Model. The image produced by our generator is passed through the Inception-v3 model to obtain global and local image features, which are passed through a perceptron layer so that they have the same dimensions as the word-level embeddings. A dot product is then taken between the image features and the word-level features to compute a similarity matrix. For each word, a weighted sum over all the sub-regions is calculated, and the cosine similarity between the word and this weighted region representation is computed. A high similarity means that the word had an impact on the corresponding region of the image, which gives us the attention part of the model.
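The word-to-region matching can be sketched in NumPy as below. This is a simplified illustration, not the project's training code: it assumes the region features are already projected to the word-embedding dimension, covers only the matching score for one image-sentence pair (not the full loss over a batch), and uses hypothetical names and illustrative values for the smoothing factors `gamma1` and `gamma2`.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def word_region_score(words, regions, gamma1=5.0, gamma2=5.0):
    """words: (D, T) word features; regions: (D, N) projected image-region features.
    Returns an image-sentence matching score and the (T, N) attention weights."""
    sim = words.T @ regions                # (T, N) dot-product similarity matrix
    sim = softmax(sim, axis=0)             # normalize over words for each region
    alpha = softmax(gamma1 * sim, axis=1)  # attention of each word over the regions
    context = alpha @ regions.T            # (T, D) weighted sum of regions per word
    w = words.T                            # (T, D)
    cos = (context * w).sum(axis=1) / (
        np.linalg.norm(context, axis=1) * np.linalg.norm(w, axis=1) + 1e-8
    )                                      # cosine similarity per word
    score = np.log(np.exp(gamma2 * cos).sum()) / gamma2  # soft maximum over words
    return float(score), alpha

# Toy case: word i is built to match region i exactly.
score, alpha = word_region_score(10 * np.eye(4), 10 * np.eye(4))
print(score, alpha.argmax(axis=1))
```

In the toy case each word attends most strongly to its own region, which is the behavior the loss is meant to encourage between real words and the image regions they describe.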
This is the high-level view from development to deployment, illustrating the interactions of the various components:
The user can use the application through the deployed webpage.
Here, the user can enter a description of the bird they want an image of in the text box.
After the user hits the ‘Generate’ button, a POST request is submitted and our model comes into play. It generates an image matching the description, along with the feature maps. The output is returned to the frontend via a GET request.
We can then view the generated images. When the user hits the ‘Get Perception’ button, the feature maps are displayed. Feature maps help the user understand where the model is focusing for a given set of words.
The proposed AttnGAN significantly outperforms the previous state-of-the-art, boosting the best-reported inception score by 14.14% on the CUB dataset. It shows that the layered attentional GAN is able to automatically select the condition at the word level for generating different parts of the image. The experimental results show that, compared to previous state-of-the-art approaches, the AttnGAN is more effective for generating complex scenes due to its novel attention mechanism that catches fine-grained word level and sub-region level information in text-to-image generation.
We hope that with Perception.AI we achieve a well-generalized and diverse model that can be taken further in medical studies to combat class imbalance in medical datasets.
Anshika Gupta, Harsh Vardhan, Meghana Sarikonda & Vishnu M