## A short introduction to generative models

In recent years, a type of machine learning model known as the Generative Adversarial Network (GAN) has become a hot topic. This is mainly due to the ability of GANs to generate good-looking and convincing artificial images. One example, generated by the PG-GAN published in [1], can be seen below.

Source: https://github.com/tkarras/progressive_growing_of_gans

Do you recognize these celebrities? No? They are fake celebrities generated by a machine learning model!!

The idea of a GAN was initially proposed by Ian Goodfellow in [2]. It is based on game theory and involves training two competing neural networks: the generator network G and the discriminator network D. The goal is to train the generator G to sample from the data distribution by transforming a noise vector z. The discriminator D is trained to distinguish samples generated by G from samples drawn from the data distribution (i.e. images of celebrities in the above example). The architecture of the model is presented below.

Practically, the training process can be explained as a duel game, where:

- Generator **G(z)** tries to fool the discriminator by generating real-looking images,
- Discriminator **D(x)** tries to distinguish between real and fake images.
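The duel can be made concrete with a small numeric sketch. It assumes the standard cross-entropy objective from [2] for the discriminator and the commonly used non-saturating variant for the generator; the discriminator outputs below are made-up numbers, not results from any trained model.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Cross-entropy loss for D: real images should score 1, fake images 0."""
    real_term = sum(-math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating loss for G: G wants D to score its fakes as real."""
    return sum(-math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical D outputs: the probability that an image is real.
d_on_real = [0.9, 0.8, 0.95]
d_on_fake = [0.1, 0.3, 0.2]

print("D loss:", discriminator_loss(d_on_real, d_on_fake))
print("G loss:", generator_loss(d_on_fake))
```

In this sketch a confident discriminator has a low loss while the generator's loss is high, which is what pushes G to produce more convincing images.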

GAN models are widely used for generating artificial images, but they have plenty of other applications: semi-supervised classification, image retrieval, style search and many others. It is also worth mentioning that a GAN-generated picture, created by students, was recently sold at auction for $432,500 [3]. We would like to share with you our most recent research [4], which uses GAN models to learn binary descriptors.

## Learning binary codes

Compact binary representations of images are instrumental for a multitude of computer vision applications, including image retrieval, simultaneous localization and mapping (SLAM), and large-scale 3D reconstruction. Traditionally, computer vision has relied for this purpose on image descriptors such as ORB, BRIEF, SIFT and SURF [5,6]. Descriptors first detect local feature points and then describe them, which allows the points to be matched across sequences of image frames. In addition, these features can be used as input to SVM classifiers, which associate pictures with classes (e.g. images of dogs vs. cats).

The modern approach in this field aims at learning binary features directly by using deep neural networks (see the figure below).

Practically, we would like to create a neural network that returns vectors of binary features for given input images. It is highly desirable (especially when the codes are used for image retrieval) to receive similar codes for similar images. In the example considered here, the codes of the images with cars are more similar to each other than to the code generated for the image with a cat.

It has been shown in [7,8] that GANs can be used for this purpose in the same manner as Convolutional Neural Networks (CNNs) are used for image recognition. CNNs were the first deep learning approach to beat the classical SVM classifiers [9] in the ImageNet competition [10].

## BinGAN approach – one of the four papers in NIPS from Poland

During our work at Tooploox we have created BinGAN, a state-of-the-art model which **will be presented at Neural Information Processing Systems (NIPS) 2018**, one of the largest and most important machine learning conferences in the world.

We proudly present our BinGAN model, which makes use of the training properties of GANs to learn characteristic binary features for image retrieval and matching. The main idea is to take the discriminator of the trained GAN model, cut off the classification layer and use the remaining model to extract binary features. When the number of hidden units in the intermediate layers is large, the vector representations describe images better, because the network has more parameters to adjust and can fit the data more closely. However, this requires more memory and more computation.

In order to build lower-dimensional vector representations of images, make the learning procedure more effective and avoid overfitting of the network, we developed a special regularization method. In our approach, we introduce a combination of regularizers called the Distance Matching Regularizer (DMR) and the Binarization Representation Entropy (BRE) regularizer. The regularizers are included in the loss function, which is minimized during the training procedure.

## Distance Matching Regularizer (DMR)

The DMR transfers Hamming distances from high-dimensional to low-dimensional layers. Roughly speaking, the information from the deeper layers is propagated into the shallow ones. The BRE increases the diversity of binary vectors.

To introduce the Distance Matching Regularizer, we first have to define the Hamming distance. Given two binary vectors of equal length, it is the number of positions at which the corresponding values are different. For example, the vectors (1, 0, 1, 1) and (1, 1, 0, 1) differ at the second and third positions, so their Hamming distance is 2.
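The definition translates directly into a few lines of code. This is a minimal sketch; the function name and the example vectors are chosen here for illustration only.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


first = [1, 0, 1, 1, 0]
second = [1, 1, 0, 1, 0]
print(hamming_distance(first, second))  # prints 2
```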

The goal of this regularizer is to transfer the Hamming-distance structure of the higher-dimensional layer to the lower-dimensional one, which has fewer units and therefore less capacity to represent it.

To describe it more precisely, we have to introduce two functions:

(1)

This function is applied to each element of the high-dimensional layer and results in binary codes containing -1 or 1 for each element.

(2)

is the hyperparameter.

The soft sign is used in the low-dimensional layers and acts as a quantization technique. Its outputs are continuous values between -1 and 1. For example, with a small value of the hyperparameter, a large positive input maps to a value just below 1 and a large negative input maps to a value just above -1. The function is continuous so that gradients can be backpropagated through it, which is needed for the training procedure, and because its outputs are close to the binary values -1 and 1, it is still possible to calculate the Hamming distance.
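The two functions can be sketched as follows. The exact formulas are not reproduced in the text, so this sketch assumes the softsign approximation `a / (|a| + eps)` used in the BRE work [11], with `eps` as the hyperparameter; the names `sign_binarize`, `softsign` and the sample activations are illustrative.

```python
def sign_binarize(values):
    """Hard binarization: map each activation to -1 or 1 (0 maps to 1 here)."""
    return [1 if v >= 0 else -1 for v in values]


def softsign(values, eps=0.001):
    """Smooth stand-in for sign: a / (|a| + eps), with outputs strictly inside (-1, 1)."""
    return [v / (abs(v) + eps) for v in values]


activations = [-2.0, 0.5, 3.0]  # hypothetical layer outputs
print(sign_binarize(activations))  # prints [-1, 1, 1]
print(softsign(activations))       # values close to -1, 1 and 1
```

A smaller `eps` pushes the outputs closer to -1 and 1, at the cost of steeper gradients around zero.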

The notation we will use to explain the regularizers is listed here:

- will be denoted as a low-dim layer with hidden units,
- will be denoted as a high-dim layer with units,
- is the result of function from units of layer ,
- is the result of function from units of layer ,
- is the result of function from units of layer ,
- is the result of function from units of layer .

These are the components used in the regularizer.

To explain more precisely how the regularizer works, we have to look into the definitions. The Hamming distance between two binary vectors with entries in {-1, 1} can be expressed using a dot product: for vectors of length M, it equals (M − b₁·b₂) / 2, where b₁·b₂ is their dot product. As a consequence, distant vectors are characterized by low-valued dot products and close vectors are characterized by high values.
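The link between the dot product and the Hamming distance is easy to check numerically. For vectors with entries in {-1, 1}, every matching position adds 1 to the dot product and every mismatch subtracts 1, so the distance is half of the length minus the dot product. The helper names and vectors below are illustrative.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def hamming_from_dot(a, b):
    """Hamming distance of two {-1, 1} vectors of length M: (M - a.b) / 2."""
    return (len(a) - dot(a, b)) // 2


b1 = [1, -1, 1, 1]
b2 = [1, 1, -1, 1]
print(dot(b1, b2))               # prints 0
print(hamming_from_dot(b1, b2))  # prints 2
```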

Combining the Hamming distance definition with , which is the empirical expected value of the loss function used in DMR, we get:

(3)

Where is the number of images in batch, so the loss is calculated between each image in the batch (k,j are the numbers of the images from mini-batches). The term in the equation above, consisting of vectors, is assumed to be constant during training time, so gradients are only computed to update the layers that produce the codes. Both terms are normalized by their number of elements and (which are the vector dimensions).
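A rough sketch of such a loss is given below. The equation itself is not reproduced in the text, so the form used here is an assumption: the mean absolute difference, over all pairs of different images in the batch, between the dot product of the high-dimensional codes and the dot product of the low-dimensional codes, each divided by its code length. All names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dmr_loss(low_codes, high_codes):
    """Assumed DMR form: mean |h_k.h_j / H - b_k.b_j / M| over all pairs k != j."""
    n = len(low_codes)
    m = len(low_codes[0])   # low-dimensional code length
    h = len(high_codes[0])  # high-dimensional code length
    total = 0.0
    for k in range(n):
        for j in range(n):
            if k == j:
                continue
            # The high-dimensional similarity acts as a fixed target.
            target = dot(high_codes[k], high_codes[j]) / h
            current = dot(low_codes[k], low_codes[j]) / m
            total += abs(target - current)
    return total / (n * (n - 1))


high = [[1, 1, 1, 1], [1, 1, -1, -1]]  # normalized dot product: 0
print(dmr_loss([[1, 1], [1, -1]], high))  # prints 0.0 (distances match)
print(dmr_loss([[1, 1], [1, 1]], high))   # prints 1.0 (low codes collapsed)
```

In a real training loop the low-dimensional codes would come from the softsign outputs, and only they would receive gradients.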

It can be visualized in this way:

Here NiN is the Network-in-Network layer, nin1 is the lower-dimensional layer and nin2 is the higher-dimensional layer. The GPool in figure 4 is the average pooling layer.

## Adjusted Binarization Representation Entropy (BRE)

The second regularizer we introduce is called the Adjusted Binarization Representation Entropy (BRE). This regularizer increases the diversity of binary vectors in the low-dimensional layer. The regularizer consists of two parts.

The first part of this regularizer called calculates the average of values for hidden units and forces the normalization of vector products. The is the average of element batch, calculated from the softsign values. Thanks to this, it forces the vector to have an average of 0, which is important for the calculation of loss function.

(4)

Our BRE regularizer differs from the original in the part that we call , which is a weighted version of the original (defined in [11]). Basically, values which have dot product different than zero are downweighted, and values that have dot products close to zero are upweighted. Z is the normalization value. The regularizer minimizes the correlation and hence increases diversity between image representations. This can also be seen as maximizing entropy.

(5)
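The two parts can be sketched in code. This is only an illustration under stated assumptions: the first part is taken to be the average squared batch mean of each unit, and the second part a weighted average of normalized absolute dot products between pairs of codes. The weight `exp(-|dot| / M)` is a hypothetical choice that merely follows the description (pairs with dot products far from zero get smaller weights); it is not taken from the text.

```python
import math


def mean_term(codes):
    """Average over units of the squared batch mean; 0 when every unit is balanced."""
    n, m = len(codes), len(codes[0])
    unit_means = [sum(code[i] for code in codes) / n for i in range(m)]
    return sum(mu * mu for mu in unit_means) / m


def weighted_correlation_term(codes):
    """Weighted average of |b_k . b_j| / M over all pairs k != j."""
    n, m = len(codes), len(codes[0])
    weighted_sum, z = 0.0, 0.0
    for k in range(n):
        for j in range(n):
            if k == j:
                continue
            corr = abs(sum(x * y for x, y in zip(codes[k], codes[j]))) / m
            weight = math.exp(-corr)  # hypothetical weighting, see lead-in
            weighted_sum += weight * corr
            z += weight  # Z normalizes the weights
    return weighted_sum / z


batch = [[1, 1], [1, -1]]  # orthogonal codes, first unit unbalanced
print(mean_term(batch))                  # prints 0.5
print(weighted_correlation_term(batch))  # prints 0.0
```

Minimizing both terms together favors codes whose units are balanced across the batch and whose pairs are uncorrelated.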

The final training loss has a form:

(6)

The BRE term is defined as the sum of its two parts, and the lambdas are hyperparameters of the model that are set experimentally.

## Architecture

For the image matching task the discriminator is composed of:

- 7 convolutional layers (3×3 kernels, 3 layers with 96 kernels and 4 layers with 128 kernels),
- Two network-in-network (NiN) [15] layers (with 256 and 128 units respectively),
- A discriminative layer.

For image retrieval the discriminator is composed of:

- 7 convolutional layers (3×3 kernels, 3 layers with 96 kernels and 4 layers with 192 kernels),
- Two NiN layers with 192 units,
- One fully-connected layer in three variants (16, 32 or 64 units) and a discriminative layer.

For the low-dimensional feature space we take the fully-connected layer, and for the high-dimensional space we take the average-pooled last NiN layer.

## Experiments

### Image retrieval

In the task of image retrieval, we have a query image (marked in red in the left figure and shown in the leftmost column in the right figure) and we search for the images whose binary descriptors have the smallest Hamming distance to the descriptor of the query image.
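The retrieval step itself is a nearest-neighbour search in Hamming space. A minimal sketch, using a brute-force scan and a tiny made-up database of 4-bit codes (real systems would use longer codes and faster bitwise operations):

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def retrieve(query_code, database_codes, top_k=3):
    """Return the indices of the top_k database codes closest to the query."""
    ranked = sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )
    return ranked[:top_k]


# Hypothetical database of binary descriptors, one per image.
database = [
    [1, 1, -1, -1],
    [1, -1, 1, -1],
    [-1, -1, 1, 1],
    [1, 1, -1, 1],
]
query = [1, 1, -1, -1]
print(retrieve(query, database, top_k=2))  # prints [0, 3]
```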

In this experiment, we use the CIFAR-10 dataset to evaluate the quality of our approach in image retrieval. CIFAR-10 has 10 categories, each composed of 6,000 color images with a resolution of 32 × 32. The whole dataset has 50,000 training and 10,000 testing images.

*Table 1. Results on CIFAR-10 (mAP).*

Table 1 shows the mean average precision (mAP) of the top 1000 returned images for different numbers of hash bits on the CIFAR-10 dataset. Our method outperforms DBD-MQ, the unsupervised method that previously reported state-of-the-art results on this dataset, for 16, 32 and 64 bits. The improvement in mAP reaches over 40%, 31%, and 15%, respectively. The most significant performance boost can be observed for the shortest binary strings. Thanks to the loss terms introduced in our method, we explicitly model the distribution of the information in a low-dimensional binary space.
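For readers unfamiliar with the metric, here is a sketch of how mAP over the top returned images can be computed. Several variants of average precision exist; this one normalizes by the number of relevant results found in the returned list, which is an assumption and not necessarily the exact protocol behind Table 1. A result counts as relevant when it shares the class of the query.

```python
def average_precision(relevance):
    """AP for one ranked list; relevance[i] is True if result i matches the query class."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(relevance_lists, top_k=1000):
    """Mean of the per-query AP values, each computed on the top_k results."""
    aps = [average_precision(r[:top_k]) for r in relevance_lists]
    return sum(aps) / len(aps)


# Two hypothetical queries with ranked relevance flags.
queries = [[True, False, True], [False, True]]
print(mean_average_precision(queries))
```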

### Image matching

As mentioned before, BinGAN can be used as an image descriptor. This means that similar patches should have similar vector representations.

*Table 2. Results on Brown dataset (FPR@95%).*

To evaluate the performance of our approach on the image matching task, we use the Brown dataset. We train binary local feature descriptors using our BinGAN method and compare them with competing previous methods.

The Brown dataset is composed of three subsets of patches: Yosemite, Liberty and Notredame. The resolution of the patches is 64 × 64, although we subsample them to 32 × 32 to increase the processing efficiency. Next, we use our method to create binary descriptors. The data is split into training and test sets according to the provided ground truth, with 50,000 training pairs (25,000 matched and 25,000 non-matched) and 10,000 test pairs (5,000 matched and 5,000 non-matched).

In Table 2 we present the false positive rates at 95% true positives (FPR@95%) obtained for our BinGAN descriptor and for the state-of-the-art binary descriptors on the Brown dataset. As we can see, BinGAN has the lowest error in most cases.
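The FPR@95% metric can be sketched as follows: choose the smallest distance threshold that accepts at least 95% of the matched pairs, then measure how many non-matched pairs fall under the same threshold. The function name and the distances below are made up for illustration, and tie handling in published evaluations may differ.

```python
def fpr_at_95_tpr(matched_distances, non_matched_distances):
    """FPR at the smallest threshold that accepts at least 95% of matched pairs."""
    ordered = sorted(matched_distances)
    n = len(ordered)
    # Integer ceiling of 0.95 * n, minus 1 to get a zero-based index.
    cutoff_index = (95 * n + 99) // 100 - 1
    threshold = ordered[cutoff_index]
    false_positives = sum(d <= threshold for d in non_matched_distances)
    return false_positives / len(non_matched_distances)


# Hypothetical Hamming distances between descriptor pairs.
matched = list(range(1, 21))    # 20 matched pairs with distances 1..20
non_matched = [5, 19, 25, 30]   # 4 non-matched pairs
print(fpr_at_95_tpr(matched, non_matched))  # prints 0.5
```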

See more in: https://arxiv.org/pdf/1806.06778.pdf

The code for our method is available at: github.com/maciejzieba/binGAN

### What’s next?

Currently, there are two research branches in Tooploox.

1. The first one involves generating point clouds using GANs.

2. Next, we are thinking about using BinGAN for the style search approach that has already been developed at Tooploox.

**Authors:**

Piotr Semberecki

Maciej Zięba

*Literature:*

[1] Karras, Tero, et al. “Progressive growing of GANs for improved quality, stability, and variation.” ICLR, 2018.

[2] Goodfellow, Ian, et al. “Generative adversarial nets.” NIPS, 2014.

[3] Christie’s sells its first AI portrait for $432,500, beating estimates of $10,000 https://www.vox.com/the-goods/2018/10/29/18038946/art-algorithm

[4] Zięba et al. “BinGAN: Learning Compact Binary Descriptors with a Regularized GAN” NIPS, 2018.

[5] Lowe, David G. “Distinctive image features from scale-invariant keypoints.” International journal of computer vision 60.2 (2004): 91-110.

[6] Rublee, Ethan, et al. “ORB: An efficient alternative to SIFT or SURF.” Computer Vision (ICCV), 2011 IEEE international conference on. IEEE, 2011.

[7] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[8] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016.

[9] Lin, Yuanqing, et al. “Imagenet classification: fast descriptor coding and large-scale SVM training.” Large scale visual recognition challenge (2010).

[10] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.

[11] Y. Cao, G. W. Ding, K. Y.-C. Lui, and R. Huang. Improving GAN training via binarized representation entropy (BRE) regularization. In ICLR, 2018.