What is it?
- The visual crowd detector is an application for the analysis and monitoring of gathered footage to identify groups of people as well as places that have had contact with the largest number of people over time.
- The application was designed to support security measures during the COVID-19 pandemic, namely keeping to safe distances and the frequent disinfection of hazardous areas.
- It was prototyped for the Hack the Crisis hackathon, and you can find the working prototype in this repo (hackathon-quality code).
How does it work?
The application consists of an AI object detection model and the logic for processing its outputs.
- We use the Mask R-CNN model with the Inception backbone, as it offers decent speed with satisfactory detection accuracy. You can find the weights of other pre-trained models here. Any model with “Masks” as the output is compatible with our application.
- We analyze monitored footage frame by frame. In the preparation stage, we divide the first frame of the footage into a grid of squares. We use the same grid for the whole of the footage as it’s quite safe to assume that the size of frames won’t change over time.
- Then we detect all objects belonging to the class called “person,” or in non-nerdy English – we find all the people visible in the frame.
- Once we know where the people are, meaning that we know their exact pixel locations as well as the coordinates of the bounding boxes of each person, we can assign them to the closest square on the grid. To be exact, we compare distances between the centers of bounding boxes and the centers of squares in the grid.
- When we find that a single square is occupied by 3 or more persons, which is of course only an arbitrary parameter that can be modified, we mark silhouettes of those persons with a red mask.
- Thanks to this, an operator of the monitoring system can easily spot large groups of people even when observing multiple monitoring screens at the same time.
- At the same time, information about the assignment of each person to a given square in the grid is accumulated with each frame. After the data from at least 10 frames (another parameter that can be modified) has been collected, we mark the top 3 squares that were occupied by the largest number of people.
- This functionality would really shine in a situation where we have a large area that can be freely accessed by many people and not enough resources to constantly disinfect it as a whole.
- For example, if a city were responsible for the disinfection of public benches, the system could be used to identify the benches that were used the most and should be disinfected more frequently.
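The processing logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the prototype's actual code: the function names, the box format `(x1, y1, x2, y2)`, and the 100-pixel cell size are assumptions made for the example, while the thresholds (3 people per square, 10 frames, top 3 squares) are the defaults mentioned in the text. The detection model is left out; the sketch starts from bounding boxes that a detector is assumed to have already produced.

```python
from collections import Counter

def grid_centers(frame_w, frame_h, cell):
    """Map each grid square (row, col) to the pixel coordinates of its center.

    Squares in the last row/column may extend past the frame edge when the
    frame size is not a multiple of the cell size.
    """
    cols = -(-frame_w // cell)  # ceiling division
    rows = -(-frame_h // cell)
    return {(r, c): (c * cell + cell / 2, r * cell + cell / 2)
            for r in range(rows) for c in range(cols)}

def nearest_square(box, centers):
    """Return the square whose center is closest to the bounding-box center."""
    x1, y1, x2, y2 = box
    bx, by = (x1 + x2) / 2, (y1 + y2) / 2
    return min(centers,
               key=lambda sq: (centers[sq][0] - bx) ** 2 + (centers[sq][1] - by) ** 2)

def process_frame(boxes, centers, totals, crowd_size=3):
    """Assign people to squares, update running totals, flag crowded people.

    Returns one boolean per box: True if that person's square holds at least
    `crowd_size` people in this frame (these would get the red mask).
    """
    assignment = [nearest_square(box, centers) for box in boxes]
    per_square = Counter(assignment)
    totals.update(per_square)  # accumulate occupancy across frames
    return [per_square[sq] >= crowd_size for sq in assignment]

def busiest_squares(totals, frames_seen, min_frames=10, top_n=3):
    """Return the most-occupied squares once enough frames were collected."""
    if frames_seen < min_frames:
        return []
    return [sq for sq, _ in totals.most_common(top_n)]

# Hypothetical detections: three people close together, one far away.
centers = grid_centers(400, 200, cell=100)
totals = Counter()
boxes = [(10, 10, 50, 90), (30, 20, 70, 80), (60, 5, 90, 95), (310, 110, 350, 190)]
for _ in range(10):  # pretend the same detections repeat for 10 frames
    flags = process_frame(boxes, centers, totals)

print(flags)                                    # [True, True, True, False]
print(busiest_squares(totals, frames_seen=10))  # [(0, 0), (1, 3)]
```

Because the grid is regular, the nearest square could also be found by integer division of the box center by the cell size; the explicit distance comparison is kept here only because it mirrors the description above.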
What can be improved?
- Apart from work on the object detection model itself, which could be improved as a first step by training it specifically on the given monitoring footage, there is a lot of room for improvement in the processing logic.
- Currently, we don’t take into account any parameters related to the configuration of the monitoring cameras, especially how they’re installed. If the camera is not installed perpendicularly to the observed surface, the same distance in pixels between objects close to the camera and between objects far away from it is treated as equal, although the real distances are very different.
- For example, 10 pixels of distance between people close to the camera could translate to a few centimeters and 10 pixels between people far away from the camera to a few meters.
- Finally, the approach with the grid of squares may be too coarse in some applications. If we needed more accurate indications of where people actually were, we would propose the use of a heatmap created based on the detected silhouettes.
- We could go even a step further and turn it into a system for finding what people touched and where.
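Two of the ideas above can be illustrated with a short, hedged sketch. The first function uses the standard pinhole-camera approximation (one pixel covers roughly depth divided by focal length, for distances parallel to the image plane) to show why equal pixel distances mean different real distances. The second accumulates person masks into a heatmap. Neither is part of the prototype: the function names, the focal length of 1000 pixels, the example depths, and the tiny 3x4 masks are all assumptions chosen for illustration.

```python
def metres_per_pixel(depth_m, focal_length_px):
    """Pinhole approximation: real-world size covered by one pixel at a depth."""
    return depth_m / focal_length_px

def add_masks_to_heatmap(heatmap, masks):
    """Add 1 to a heatmap cell for each person mask that covers it."""
    for mask in masks:
        for y, row in enumerate(mask):
            for x, covered in enumerate(row):
                if covered:
                    heatmap[y][x] += 1
    return heatmap

# The same 10-pixel gap at two assumed depths, with an assumed focal length.
focal_length_px = 1000
near_gap_m = 10 * metres_per_pixel(2, focal_length_px)    # person 2 m away
far_gap_m = 10 * metres_per_pixel(50, focal_length_px)    # person 50 m away
print(f"near: {near_gap_m:.2f} m, far: {far_gap_m:.2f} m")

# Hypothetical binary masks (1 = pixel belongs to a person) from two frames.
heatmap = [[0] * 4 for _ in range(3)]
frame_1_masks = [[[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]]
frame_2_masks = [[[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]]
for masks in (frame_1_masks, frame_2_masks):
    add_masks_to_heatmap(heatmap, masks)

print(heatmap)  # [[1, 2, 1, 0], [1, 2, 1, 0], [0, 0, 0, 0]]
```

In a real deployment the depth of each person is not known directly, so a calibration step (for example, mapping image points to ground-plane coordinates) would be needed before distances could be compared; the sketch only shows the size of the effect.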