With the tremendous growth in the capabilities of computer vision, it is no surprise to see new ways of harnessing them arise. Autonomous driving is currently one of the most challenging yet most inspiring areas where modern computer vision technologies are finding application.
It was a sunny day, only a month before the start of World War I, when Lawrence Sperry did the unthinkable. In front of a crowd, during a flight show, he climbed out of the cockpit of his aircraft and stepped out onto a wing. The aircraft, of course, was equipped with one of the earliest versions of an autopilot device, the Sperry gyroscopic stabilizer.
The Sperry Corporation delivered multiple devices based on the gyroscope, from flight stabilizers and maritime autopilot systems to bombsights and ballistic computers. The idea of self-flying aircraft was relatively easy to achieve, even without the existence of modern computing systems.
But in the crowded and volatile environment of the on-ground world, an autopilot, or any more sophisticated form of autonomous driver, had to wait until the modern computing era emerged. The latest achievements in computer vision are the foundation of progress toward this dream.
What is computer vision?
Computer vision is an umbrella term for a set of techniques used to interpret image-based data by computers. While image processing has been possible for a long time, computers were unable to interpret those images in any way.
Simple heuristics were used to deliver semi-intelligent ways of processing an image – for example, gamma analysis enabled cameras to check if a particular area had been breached. But not much more.
With artificial neural networks, the capabilities of machines have risen to previously unseen levels. A machine can now understand the context of a scene and filter out the noise to process only the most important information. For example, a child running into the street or a car nearby would be considered more important than, say, a cloud in the sky or a petrol station on the horizon.
But the sole fact that computer vision in an automotive context can tell the difference between a car and a human or a tree and a building does not mean that it can rival the perceptive skills of a human driver. This ability is only a prelude to more sophisticated technologies to come.
But these advances do not come without their own challenges. This text will discuss:
- Challenge 1: Car sensors and multimodal data
- Challenge 2: Gathering representative training data
- Challenge 3: Object detection
- Challenge 4: Semantic instance segmentation
- Challenge 5: Stereovision and multi-camera vision
- Challenge 6: Object tracking
- Challenge 7: 3D scene analysis
As with any tech advanced enough, the possibilities are endless, assuming we overcome the challenges.
Challenges and solutions in computer vision for autonomous vehicles
In autonomous vehicles, the quality and reliability of computer vision solutions can be a matter of life or death, either for the driver or for others on the road.
Autonomous car sensors
Sensors are basically the senses of an autonomous vehicle and the foundation of its further actions. Currently, there are four leading sensor types used to monitor the surroundings and provide the controller neural network with the information required to make decisions:
- Camera
- Radar
- Lidar
- Ultrasound sensor
There is a wide array of hardware and software configurations, resulting in millions of possible pitfalls. Even a slight change in one sensor configuration or another, for example its sensitivity or its position relative to an axis of the vehicle, can alter the readings. Because of this, the neural network can have a problem making proper interpretations.
Also, as with humans, only a combination of senses provides the driver with information accurate enough to ensure the safe movement of speeding, heavy machinery. A human driver depends on sight, hearing, and motion sense. An autonomous vehicle needs to extract much more than that from only the sensors mentioned above. The challenge lies in building a comprehensive image of the outer world and processing it effectively.
More about challenges and possible solutions when it comes to autonomous car sensors can be found in one of our recent blog posts.
Gathering representative training data
It is fair to say that an AI solution can only be as good as the data it was trained or validated on. In this case, gathering and processing datasets for the training of autonomous vehicles is a challenge in itself.
When it comes to building datasets for autonomous drivers, there are at least two challenges to consider:
- Gathering data – the best approach is to drive around and record basically everything.
- Labeling data – all the data gathered needs to be properly labeled, usually requiring heavy, manual human labour.
The first challenge can be approached either by gathering data through existing semi-autonomous cars and driving support systems or by generating it in an artificial, simulated reality based on computer game engines. The most popular environment of this latter variety is CARLA.
For labeling the data there is no simple answer, with the CAPTCHA mechanism being one of the most successful approaches. Within this slightly annoying ritual of proving one is not a robot, a user is shown two words or an image and needs to spot the correct elements. When it comes to words, one is already verified, while the other is labeled by the user.
The same goes for the image. With thousands of Internet users tirelessly labeling images in search of traffic lights, road signs, or pedestrians, it has become possible to gather a decent number of labeled images that are further used in training image recognition-based solutions.
On the other hand, a company needs to be extremely careful when gathering training data, so the labels sometimes have to be reviewed to ensure the accuracy of the dataset. When the dataset consists of thousands of images, this is a challenge in itself.
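The voting principle behind CAPTCHA-style labeling can be sketched in a few lines. The helper below is purely illustrative (not part of any real labeling platform): it aggregates redundant labels from many users by majority vote and keeps only the images whose winning label reaches a consensus threshold, leaving ambiguous ones for manual review.

```python
from collections import Counter

def aggregate_labels(votes, min_agreement=0.8):
    """Majority-vote aggregation of redundant crowd labels.

    votes: dict mapping image id -> list of labels from different users.
    Returns a dict of image id -> label for images whose winning label
    reaches the agreement threshold; ambiguous images are left out
    for manual review.
    """
    accepted = {}
    for image_id, labels in votes.items():
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            accepted[image_id] = label
    return accepted
```

Raising `min_agreement` trades dataset size for label accuracy, which is exactly the review burden described above.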
Out of distribution problem
There is also a significant challenge in delivering cars for different regions, be they North America, Europe, or Africa. While for a human driver it is not a challenge to generalize a pine and a baobab into a simple “tree” category and move on without a second glance, an autonomous car system can struggle with this problem and become utterly confused in an unfamiliar environment.
Considering the general problem of generalization in neural networks, a car trained on data gathered in the US can be significantly less capable of driving autonomously in Europe. The narrow streets of Edinburgh’s old town are incomparable to the largely well-planned, wide streets of US towns and cities – and this is a factor that is hard for a neural network to cope with.
Thus, when building a dataset aimed to train a worldwide-capable car, it is crucial to build datasets consisting of images from all parts of the world. And that makes the datasets even bigger – like the whole of Google Street View, but fully labeled.
Object detection
The primary and most basic task of computer vision algorithms is to recognize an object in a picture. While computers outperform humans in multiple image recognition tasks, there are several that are particularly interesting in the context of autonomous vehicles.
- Object recognition needs to be done in real time. Input from a camera is sometimes based on a stream of scan lines constantly flowing from the sensor and used to refresh an ever-changing image, rather than on a series of complete, whole frames. Thus, there is a need to recognize objects without ever actually seeing them in full.
- There are multiple elements in an environment that can be confusing for an autonomous system – a truck trailer in front of a car can be a good example.
This first challenge can be solved by training a model on the data delivered by the sensor as an output, practically switching the model toward signal analysis rather than image recognition.
The second challenge is an example of a typical problem of AI being unable to generalize and having no prior knowledge of a subject. An interesting solution comes from enriching image recognition with partial evidence – a technique that enables the neural network to use a piece of additional information (for example, context) to exclude the least probable outcomes.
So if there is a car hovering higher than five meters off the ground, it is probably a billboard and there is no need to decelerate.
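One simple way to use such partial evidence is a post-processing step that re-weights the detector’s class scores with a contextual prior. The sketch below hard-codes the billboard example above; the function name, class list, and threshold are illustrative assumptions, not a production rule set.

```python
def apply_context_prior(class_scores, height_above_road_m):
    """Re-weight raw detector class scores with a contextual prior:
    real cars do not hover, so above ~5 m the 'car' hypothesis is
    heavily down-weighted in favour of 'billboard'.

    class_scores: dict of class name -> raw score (summing to 1).
    Returns re-normalized scores.
    """
    prior = {"car": 1.0, "billboard": 1.0}
    if height_above_road_m > 5.0:
        prior["car"] = 0.05  # a car at that height is implausible
    weighted = {c: s * prior.get(c, 1.0) for c, s in class_scores.items()}
    total = sum(weighted.values())
    return {c: s / total for c, s in weighted.items()}
```

In a real system such priors would themselves be learned or estimated, but the effect is the same: context vetoes the least probable interpretation.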
Traffic sign recognition is an iconic task for an autonomous vehicle’s neural network. The challenge in traffic sign recognition lies in doing it quickly and in a highly volatile environment. A sign can be dirty, covered by leaves, twisted to an odd angle, or modified in one way or another.
Also, it is common to rearrange signs on the road or put some up temporarily, for example informing about a detour or road construction. So the net needs to be swift in processing and vigilant in spotting signs.
An even more significant challenge comes with pedestrians. The machine not only needs to recognize a pedestrian without a moment’s doubt but also needs to be able to estimate his or her pose. If the pedestrian’s motion indicates that he or she is going to cross the road, the vehicle needs to spot that and react quickly.
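A production system relies on learned pose estimation, but the decision it ultimately feeds can be sketched as a crude heuristic: the pedestrian stands off the road and their velocity points toward it. Every name and threshold below is an illustrative assumption, not how any real intent predictor works.

```python
def heading_toward_road(position, velocity, road_edge_x):
    """Crude crossing-intent check in the vehicle's ground-plane frame.

    position, velocity: (x, y) tuples in metres and m/s; the road edge
    is the vertical line x = road_edge_x, with the road on its right.
    Returns True when the pedestrian is off the road but moving toward
    it faster than incidental drift.
    """
    px, _ = position
    vx, _ = velocity
    return px < road_edge_x and vx > 0.3  # faster than standing drift
```

A learned model would replace the fixed 0.3 m/s threshold with pose cues (torso orientation, gait phase) to react before the motion itself begins.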
Semantic segmentation and semantic instance segmentation
Semantic segmentation and semantic instance segmentation are intuitively similar problems with different sets of challenges.
Semantic segmentation is about detecting multiple entities in one image and providing each one with a separate label. Obviously, there can be a car, a road sign, a biker, and a truck on the road at the same time – and that’s what semantic segmentation is all about.
Semantic instance segmentation is about spotting the difference between each object in a scene. For a car’s system it is not enough to know that there are simply three cars on the road – it needs to be able to differentiate between them easily in order to track their individual behavior. While semantic segmentation marks each car, each tree, and each pedestrian as a member of its class, semantic instance segmentation labels them as car1, car2, tree1, tree2, etc.
The challenge is to deliver additional information about the number of objects on the road and their positions relative to each other, without having to name them.
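The distinction can be illustrated by deriving instance labels from a plain semantic mask with connected components – a naive stand-in for real instance segmentation, since it assumes distinct objects of one class never touch, which real scenes cannot guarantee.

```python
from collections import deque

def instances_from_semantic(mask):
    """Turn a semantic mask into instance labels by 4-connected flood
    fill: touching pixels of the same class become one instance, named
    'car1', 'car2', 'tree1', etc.

    mask: 2D list of class names, '' for background.
    Returns a 2D list of instance names (None for background).
    """
    h, w = len(mask), len(mask[0])
    labels = [[None] * w for _ in range(h)]
    counts = {}
    for y in range(h):
        for x in range(w):
            cls = mask[y][x]
            if not cls or labels[y][x]:
                continue
            counts[cls] = counts.get(cls, 0) + 1
            name = f"{cls}{counts[cls]}"
            queue = deque([(y, x)])
            labels[y][x] = name
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w \
                            and mask[ny][nx] == cls and not labels[ny][nx]:
                        labels[ny][nx] = name
                        queue.append((ny, nx))
    return labels
```

Real instance segmentation networks (e.g. Mask R-CNN-style models) learn to separate even overlapping objects, which is precisely what this post-processing trick cannot do.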
- Performance – as mentioned above, it is challenging to deliver truly real-time object recognition due to the limitations of the sensor itself. The same limitation applies to semantic segmentation and semantic instance segmentation just as it does to object recognition.
- Confusion – while machines are increasingly efficient in their tasks, one needs to remember that there is always a factor of unpredictability present in an artificial neural network. Thus the network can be confused by factors like unusual lighting conditions or weather.
Tackling these problems is tied to acquiring and leveraging bigger datasets that provide the neural network with more examples to generalize from. Providing the network with artificial data, either generated manually or via Generative Adversarial Networks, is one of the simplest ways to address this challenge.
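A minimal sketch of the artificial-data idea: cheap transforms that show the network the same scene under a new viewpoint and new lighting. A real pipeline would layer GAN-generated scenes on top of such basics; the function below is only an assumption-laden toy operating on a grayscale image stored as nested lists.

```python
import random

def augment(image, rng=None):
    """Cheap augmentation mimicking a new viewpoint and new lighting:
    a horizontal flip plus a random brightness gain.

    image: 2D list of grayscale pixel values in [0, 255].
    rng: optional random.Random for reproducibility.
    """
    rng = rng or random.Random()
    flipped = [row[::-1] for row in image]          # mirror the scene
    gain = rng.uniform(0.6, 1.4)                    # simulate lighting change
    return [[min(255, int(p * gain)) for p in row] for row in flipped]
```

Each pass over the dataset can then present slightly different imagery, which helps against the lighting- and weather-induced confusion described above.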
Stereovision and multi-camera vision
Depth estimation is one of the key factors in ensuring the safety of a vehicle and its passengers. While there are multiple tools available, including cameras, radar, and lidar, it is common to support them with multi-camera vision.
Knowing the distance between camera lenses and the exact location of an object on images taken by them is the first step toward building a stereo vision system. In theory, the rest is simple – the system uses triangulation to make depth estimations.
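The triangulation step can be written down directly. Assuming an idealized rectified, parallel stereo pair, the depth of a point is Z = f·B/d, where f is the focal length in pixels, B the baseline between the lenses, and d the horizontal disparity between the point’s positions in the two images.

```python
def stereo_depth(focal_px, baseline_m, x_left_px, x_right_px):
    """Depth by triangulation for a rectified, parallel stereo pair:
    Z = f * B / d, with d the horizontal disparity in pixels.

    Assumes perfectly aligned cameras, which is exactly the setup
    that is hard to guarantee in practice.
    """
    disparity = x_left_px - x_right_px
    if disparity <= 0:
        return float("inf")  # zero disparity: treat as infinitely far
    return focal_px * baseline_m / disparity
```

The formula also shows why small pixel-level shifts matter: at large depths a one-pixel disparity error changes the estimate dramatically.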
- Camera configuration – the distance between lenses and the sensitivity of the sensor can differ, delivering additional challenges for a depth estimation system.
- Non-parallel representation – the cameras used in autonomous cars can deliver slightly different images, without pixel-to-pixel world representation. Thus, if there is a hardware shift in a pixel representation of an image, the system can find it more challenging to calculate the distance.
- Perspective distortion – the bigger the distance between the camera lenses, the better the depth estimation. Yet this comes with another challenge – perspective distortion, which needs to be accounted for in the depth estimation.
The Tooploox engineers have overcome this challenge through feature engineering and by combining the stereoscopic data with information from lidar and radar devices. With enough artificial data generated to tweak the algorithm and real data to validate the effects, the team was able to deliver top-class results.
A good example of multi-objective camera usage in the automotive industry is Light, one of Tooploox’s clients.
Object tracking
Object tracking aims to provide the autonomous vehicle’s control system with information about an object’s current motion and its predicted motion. While object recognition informs the system that a particular object is a car, a truck, or a tram, this feature delivers information on whether that object is accelerating, decelerating, or maneuvering.
- Risk estimation – the network needs not only to predict movement but also to anticipate behavior, much as a human driver is often more careful when driving near cyclists than when driving among other cars, since any incident would be more dangerous for a person on a bicycle.
- Volatile background – when tracking an object, the network needs to deal with changes in the background. Other vehicles approach, the road changes color, or trees replace the fields behind it. This is not a problem for a human driver, but it can be utterly confusing for a neural network.
- Confusing objects – objects on a road are fairly repetitive, with dozens of red cars passing by each day. The tracking software can possibly mismatch one red car with another and thus provide the controller network with inaccurate information.
Providing the controller neural network with multimodal data gathered by lidar and radar is a good answer to this challenge. While lidar can struggle with identifying the type of a particular object, it delivers the object’s exact position with pinpoint accuracy. Radar provides less accurate data, yet it is independent of the scene and rarely affected by other factors like weather conditions.
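The association step behind tracking – deciding which new detection continues which existing track – can be sketched as greedy intersection-over-union matching between frames. This toy version (all names and the threshold are illustrative) also shows where the confusing-objects problem bites: two similar boxes can swap identities.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def associate(tracks, detections, min_iou=0.3):
    """Greedy frame-to-frame association: each existing track claims
    the unclaimed detection it overlaps most, if the overlap is large
    enough.

    tracks: dict of track id -> last known box.
    detections: list of boxes in the new frame.
    Returns dict of track id -> detection index.
    """
    assignments, used = {}, set()
    for track_id, box in tracks.items():
        scored = [(iou(box, d), i) for i, d in enumerate(detections)
                  if i not in used]
        if scored:
            best_iou, best_i = max(scored)
            if best_iou >= min_iou:
                assignments[track_id] = best_i
                used.add(best_i)
    return assignments
```

Production trackers add motion models and appearance features on top of this, and, as the text notes, lidar and radar positions to disambiguate look-alike vehicles.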
3D scene analysis
Combining all information gathered from the techniques described above, the controlling system should construct a 3D representation of the surrounding world. This can be compared to one’s imagination, where the computations are done, the effects estimated and the outcome produced.
- Accuracy – in 3D scene analysis, even the slightest inaccuracies tend to stack into a larger mistake and result in drift. What appears harmless at low speeds becomes significant as velocity rises.
- Multiple unpredictable objects – what is fairly straightforward on a highway gets complicated in urban traffic, where the street network grows more complex, as do the intentions of the other actors on the road.
Without the supporting role of lidar and radar, effective and fast 3D scene analysis would be extremely challenging. Tooploox engineers and researchers have approached this challenge by working with point cloud data for object identification.
By this method, the system controlling the autonomous vehicle receives accurate information about objects in the scene from two independent sources – a camera with image recognition capabilities and a lidar system with point cloud data analysis and identification.
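The two-source cross-check can be sketched as a simple nearest-neighbour match on the ground plane: a camera detection is confirmed when a lidar object lies close enough to it. The data layout and distance threshold below are illustrative assumptions, not the actual fusion scheme.

```python
import math

def fuse_detections(camera_objs, lidar_objs, max_dist_m=1.5):
    """Cross-check two independent sources: keep camera detections
    that a lidar point-cloud object confirms on the ground plane.

    camera_objs: list of (x, y, label) in metres, label from image
    recognition. lidar_objs: list of (x, y, label); lidar labels may
    be None, since lidar struggles with object type.
    """
    confirmed = []
    for cx, cy, label in camera_objs:
        for lx, ly, _ in lidar_objs:
            if math.hypot(cx - lx, cy - ly) <= max_dist_m:
                confirmed.append((cx, cy, label))
                break
    return confirmed
```

The camera supplies the semantic label, the lidar the trusted position; detections seen by only one source can then be treated with lower confidence.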
Image recognition capabilities are at the foundation of autonomous vehicle control systems. The good news is that the systems can be enriched with multiple sensors as well as provided with various other forms of data. Modern solutions leverage the capabilities delivered by HD maps or GPS systems to get better information to work with.
If you wish to get more information regarding autonomous vehicles or wish to discuss this matter, don’t hesitate to contact us now!