Hidden biases and unbalanced datasets are one of the most painful challenges in building ethical AI solutions. By delivering tools to produce more accurate synthetic data, a group of researchers backed by Tooploox contributes to overcoming the problems that arise from the scarcity of data.
According to a Datagen report, a whopping 99% of data science teams working with image recognition software have had a project canceled due to insufficient training data. Data scarcity severely impacts real-life solutions.
A good example of this comes from Uber. Drivers representing ethnic minorities pointed out that the facial recognition software used by the company was unable to process the images of their faces. This effectively excluded them from the database of drivers, rendering them unable to work due solely to their ethnic background.
Harvard University points out that facial recognition technology exceeds 90% accuracy in general, yet there are significant differences when it comes to recognizing minorities. The technology performs best when recognizing the faces of light-skinned males while delivering significantly worse accuracy (up to 34.4 percentage points lower, depending on the technology) when recognizing dark-skinned females.
Datasets and AI ethics
The only way to mitigate the risk of hidden biases in a dataset is to provide the model with a sufficient amount of data, representing as many cases as possible. One possible solution is to generate training data with neural networks.
What is synthetic data?
Synthetic data is a term referring to images, sounds, or texts generated by AI. These can be used to enrich and balance a dataset used by another model to deliver better results. It is one of the key techniques in data generation for machine learning.
When using Generative Adversarial Networks (GANs) for synthetic data generation, the quality can be lifelike, with generated images hard or impossible to distinguish from real ones. The network can deliver images of basically anything it was trained on.
The key is that the network does not pick an image it has seen before in order to show it – it generates a new image from random noise, comparable to a human being drawing “a house,” rather than a particular house they have seen before. A network fed on images of landscapes can deliver beautiful panoramas of nonexistent countrysides.
Similarly, a neural network trained on a dataset consisting of cars will not deliver an image of a Toyota Yaris or Fiat Multipla, but just “a car” with four wheels, a windscreen, and the proper shape, while remaining not a particular, existing car.
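The sampling step described above can be sketched in a few lines. The “generator” below is a hypothetical stand-in (a fixed random matrix), not a real trained GAN; the point is only that every call decodes fresh random noise into a new image, rather than retrieving a stored one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained GAN generator: a fixed linear map from a
# 16-dimensional latent (noise) vector to a flattened 8x8 "image".
# A real generator would be a deep network trained adversarially.
W = rng.normal(size=(64, 16))

def generate(n_samples: int) -> np.ndarray:
    """Sample fresh noise vectors and decode them into images."""
    z = rng.normal(size=(n_samples, 16))   # random noise, not stored images
    images = np.tanh(z @ W.T)              # values in (-1, 1), like GAN output
    return images.reshape(n_samples, 8, 8)

batch = generate(4)
print(batch.shape)  # (4, 8, 8)
```

Because the input is random noise, no two batches repeat, which is exactly why a generator produces “a car” rather than any particular car it has seen.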
Synthetic data can be sufficient for multiple tasks, including object or facial recognition. On the other hand, using it to train algorithms for medical image processing may prove far too risky.
The limitations of GAN-generation
GANs have proven their usefulness by delivering outstanding, lifelike generated images. The key limitation of this technology is that, apart from shaping the training dataset, the user has no influence over the generated media.
If the GAN was trained on a database of faces, the output will show a face. Thus, if the dataset the GAN was trained on was diverse and rich, the output could show the face of a child, an adult, a man, a woman, with or without glasses – you name it.
How to generate a dataset with synthetic images
Thus, if the main dataset is to be enriched with a synthetic image dataset of a particular type, be they images of red pick-ups or dark-skinned females, one has two ways to go:
- Train the new network from scratch, using a narrow dataset – this approach requires a lot of computing power and resources. Also, building the new dataset meets the same limitations – if there was an insufficient representation of a particular data type, it may be challenging to collect the required data for training the new network.
- Run as many trials as possible – if the training dataset contained some images of the desired class, one can simply repeat enough trials to collect the desired outputs. This can take thousands of trials and, due to the data scarcity mentioned at the beginning, automating this process with facial recognition tools can be challenging.
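The second option amounts to rejection sampling: keep generating and discard everything that a classifier does not flag as the target class. A toy sketch, where both the generator and the “classifier” are hypothetical stand-ins:

```python
import random

random.seed(42)

# Hypothetical stand-ins: a generator that emits one sample per call,
# and a check that plays the role of a classifier for the target class.
CLASSES = ["sedan", "hatchback", "red pickup", "van"]

def generate_sample() -> str:
    return random.choice(CLASSES)

def collect(target: str, needed: int):
    """Rejection sampling: keep generating until enough target samples appear."""
    kept, trials = [], 0
    while len(kept) < needed:
        sample = generate_sample()
        trials += 1
        if sample == target:       # the "classifier" accepts the sample
            kept.append(sample)
    return kept, trials

kept, trials = collect("red pickup", 10)
print(len(kept), trials)  # always 10 kept; trials is several times larger
```

With a rare class, the ratio of trials to accepted samples explodes, which is why this path becomes costly at scale.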
Both are costly and time-consuming ways to overcome data scarcity. This severe limitation of GAN-backed synthetic data generation was tackled by a team of researchers from Tooploox and Jagiellonian University in a paper titled PluGeN: Multi-Label Conditional Generation From Pre-Trained Models.
Tackling synthetic data limitations with PluGeN
The research aims to make large models, usually trained on millions of images, deliver outputs with the particular qualities a data scientist desires. For example, it can be used to generate synthetic faces of dark-skinned people or red pickups driving through a forest – you name it.
The goal was to use an already-trained network without the need to modify it. To do so, the research team, consisting of researchers from Tooploox, Jagiellonian University, Wroclaw University of Science and Technology, and the Institute of Pharmacology PAS, designed a plugin network that can modify the output by infusing the internal neural network’s processes with a set of desired attributes.
The thinking machine
A neural network operates on a numeric representation of incoming data, usually legible only to that particular network. The input, be it an image, sound, movie, or anything else, is encoded into a set of numbers.
While these representations are illegible to humans, data scientists can still obtain and work with them.
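As a sketch, the encoding step turns each input into such a numeric vector. The “encoder” here is a hypothetical stand-in (a random projection), standing in for the trained network's internal encoding:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in encoder: projects a flattened 8x8 image down to a 16-number
# representation. In a real system this would be the trained network's
# own internal encoding, meaningful only to that network.
E = rng.normal(size=(16, 64))

def encode(image: np.ndarray) -> np.ndarray:
    """Map an image to its numeric (latent) representation."""
    return E @ image.reshape(64)

image = rng.normal(size=(8, 8))
z = encode(image)
print(z.shape)  # (16,)
```

The 16 numbers mean nothing to a human reader, but they are trivial to store, compare, and pair with labels.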
Labeling the images
The research team built a dataset by feeding the neural network particular, labeled images and extracting their numeric representations from the network. The research dataset was built using the benchmark CelebA and HQFaces pre-made sets. Images taken from these were then used to extract numeric representations and build representation-description pairs.
The dataset consisted of 10,000 labeled representations, giving the model data in which to find correlations. For example, the system was able to spot that when the label contained “red,” a particular number combination appeared in the representation.
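The resulting training set for the plugin is conceptually just a list of (representation, labels) pairs. A minimal sketch with hypothetical attribute names and randomly generated placeholder data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical attributes; a real label set would come from annotated
# images such as those in CelebA.
ATTRIBUTES = ["red", "pickup"]

dataset = []
for _ in range(10_000):
    representation = rng.normal(size=16)  # placeholder for the network's encoding
    labels = {a: bool(rng.integers(0, 2)) for a in ATTRIBUTES}
    dataset.append((representation, labels))

print(len(dataset))            # 10000
print(sorted(dataset[0][1]))   # ['pickup', 'red']
```

Training on pairs like these is how the plugin learns which directions in the numeric representation correspond to which labels.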
The correlations spotted depend only on the quality and type of the labels. The more detailed the label, the more precise the effect the model can deliver.
Injecting an idea
Having knowledge about the ways a large neural network processes images, the plugin network can tune the output by forcing the large network to produce images with the desired parameters.
For example, if the network is producing images of cars, the data scientist can force it to deliver only red pickups, assuming the plugin network was trained on a dataset containing enough examples.
Using the plugin network, data scientists can force a large, pretrained network to produce output with the desired qualities. This brings several advantages:
- Saving on computing power – the “large” network has already been trained and its architecture and operations remain unchanged. The plugin network is trained on numerical representations and labels. Processing this data requires a fraction of the computing power required to train a full-scale image processing neural network.
- Saving on time – building a narrow dataset this way takes far less time than collecting one by repeatedly sampling the large network's outputs. Also, the small size of the plugin's training data greatly reduces training time.
- Keeping the large network untouched – the plugin network can be added and removed at will, so there is no need to retrain or intervene in the large network at all. As such, it can be used “on production” without any risk of disrupting already existing solutions.
- Much higher control of output – the plugin network delivers output of desired parameters, enabling the data scientists to collect outputs faster and with much less effort.
- More flexibility – because the large network remains unchanged, it is possible to prepare just a set of plugin networks to control the output, adding or removing them at will. Rebuilding the large model is entirely unnecessary.
The plugin network enables data scientists to produce the output of desired qualities, yet there are also some limitations to this approach:
- There must be enough labeled data for the model – if the model is to produce a particular output, the data scientists need to prepare a proper training dataset. The sample can be much smaller than for an image recognition tool, but it still has to be large enough for the model to spot correlations between the labels and the desired quality.
- The desired quality must be present in the large network's dataset – if the dataset used to train the large neural network did not contain a particular quality, that output will be impossible to obtain. Thus, getting images of red pickups from a network producing images of cars is possible, while getting a star destroyer or a car with a dog's ears is not.
- The plugin network will not work with neural networks that are too universal – it is possible to build a neural network that recognizes and generates human faces, cars, cats, and buildings with ease. But the plugin network delivered by the Tooploox-backed team works only with a focused network that delivers images of one type, e.g., only faces or only cars.
This work delivers a new way to tune the outputs of a neural network to contain the desired qualities, making the production of synthetic data much easier and swifter. As such, it becomes possible to tackle imbalances in datasets and deliver more responsible AI solutions.
This work was delivered by a team consisting of Maciej Wołczyk, Magdalena Proszewska, Łukasz Maziarka, and Marek Śmieja of the Jagiellonian University; Patryk Wielopolski of the Wroclaw University of Science and Technology; Rafał Kurczab of the Institute of Pharmacology PAS; and Maciej Zięba, representing the Wroclaw University of Science and Technology and Tooploox.