As proud human beings, we like to think that some of our attributes are un-programmable: feelings, intuition, aesthetics, artistry… Emotions are reserved for living beings. Even though we can imagine Sophia being happy, it is a different kind of happiness, perhaps inferior to ours. And even if we allow an autonomous system to perceive our expressed sadness, we are definitely not going to believe it can understand us.
On the other hand, how accurately can emotions be read from our faces? It is not an easy task, given that even humans struggle with it sometimes. After all, you’re not always 100% sure whether your significant other is sad or angry.
Nonetheless, emotion recognition sounds like an exciting challenge for machine learning, and I was curious to see how far we can go in predicting expressed emotions from face images – especially after trying out various available open-source models and seeing a lot of room for improvement.
The architecture we created, EmotionalDAN, was inspired by the Deep Alignment Network for face alignment. Face alignment is the task of automatically locating the shape of face components such as the eyes and nose. In other words, such a model outputs the locations of the 68 most important landmarks of the face (eye corners, lip corners, eyebrows, etc.).
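To make the landmark output concrete, here is a minimal numpy sketch of what a face-alignment model produces. The (x, y) coordinates are random placeholders, and the iBUG-300W landmark ordering (points 36–41 for the left eye, 42–47 for the right) is my assumption of the common convention, not necessarily the exact indexing EmotionalDAN uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of a face-alignment model: 68 (x, y) landmark
# coordinates for a 224x224 image. Random placeholders, assumed to be
# in the common iBUG-300W ordering.
landmarks = rng.random((68, 2)) * 224

# In the iBUG scheme, points 36-41 outline the left eye and 42-47 the
# right eye, so eye centres can be estimated by averaging those points.
left_eye = landmarks[36:42].mean(axis=0)
right_eye = landmarks[42:48].mean(axis=0)
inter_eye_distance = np.linalg.norm(left_eye - right_eye)
print(landmarks.shape)
```

A distance between the eyes like this is also a handy normalizer, since it is roughly invariant to how large the face appears in the image.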
Our hypothesis was that by learning to predict facial landmarks, the neural network should also become better at predicting facial expressions. As has been shown before, multi-task learning can result in improved learning efficiency and accuracy compared to training the models separately.
Deep Alignment Network is trained in consecutive stages that allow for refinement of the facial landmarks. There is also a transfer of information between stages that keeps track of the normalized face input, the feature map and the landmark heat map. These features seemed especially beneficial for learning facial expressions.
On top of the last two dense layers in the original DAN architecture, we added a new fully-connected layer for the emotion branch, with the number of neurons corresponding to the number of emotion classes we were trying to predict. In the literature, facial expression recognition is usually framed as a seven-class classification problem – happiness, sadness, anger, surprise, fear, disgust and neutral. On the other hand, we wanted to check how the model performs on an easier but much less ambiguous task – predicting one of three emotion classes: neutral, positive and negative. Hence we experimented with both the 7-class and the 3-class classification problem.
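The added emotion branch can be sketched as a single fully-connected layer with a softmax over the chosen number of classes. This is a numpy illustration with random placeholder weights and an assumed 256-dimensional dense-layer output – it shows the shape of the computation, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def emotion_head(features, n_classes):
    """One fully-connected layer + softmax, sketching the emotion branch
    added on top of DAN's dense features. Weights are random
    placeholders, not trained parameters."""
    w = rng.normal(size=(features.shape[-1], n_classes))
    b = np.zeros(n_classes)
    logits = features @ w + b
    # numerically stable softmax over the class dimension
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

features = rng.normal(size=(1, 256))  # hypothetical dense-layer output
probs7 = emotion_head(features, 7)    # happiness ... neutral
probs3 = emotion_head(features, 3)    # neutral / positive / negative
print(probs7.shape, probs3.shape)
```

Switching between the 7-class and 3-class setups then only changes the width of this final layer; the rest of the network is shared.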
The training objective combines two terms: the first is the predicted landmark distance from the ground truth, normalized by the distance between the pupils, and the second is the cross-entropy loss for emotion classification.
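In notation, that combined loss can be sketched as follows, where $\hat{S}$ and $S^{*}$ are the predicted and ground-truth landmark coordinates, $d_{\text{pupils}}$ is the interpupillary distance, and $\hat{y}_c$ is the predicted probability of emotion class $c$. The weighting factor $\lambda$ between the two terms is my assumption for illustration, not necessarily the exact form used in the paper:

$$
\mathcal{L} \;=\; \frac{\lVert \hat{S} - S^{*} \rVert}{d_{\text{pupils}}} \;+\; \lambda \left( -\sum_{c} y_c \log \hat{y}_c \right)
$$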
Finding a huge dataset for training that contains both emotion and landmark labels was easier than I thought. I was lucky to stumble upon the recent (2017) AffectNet database, which contains over 1M face images collected from the Internet by querying three major search engines with 1250 emotion-related keywords in six different languages.
Where is the AI looking?
Apart from knowing the accuracy of your model, it is even more exciting to get a grasp of how the model learns and how a decision is made. To gain some interpretability from our model, we applied a popular technique called Grad-CAM, which provides visual explanations from deep networks via gradient-based localization.
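The core of Grad-CAM is small enough to sketch in a few lines: average the gradients of the class score over each channel of a convolutional feature map, use those averages as channel weights, and keep only the positive part of the weighted sum. This numpy sketch assumes the feature maps and gradients have already been extracted from the network (in practice the framework's autodiff provides them); the `(7, 7, 64)` shape is an arbitrary example:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Minimal Grad-CAM: feature_maps are a conv layer's activations and
    gradients is d(class score)/d(feature_maps), both of shape (H, W, C)."""
    weights = gradients.mean(axis=(0, 1))  # global-average-pool the gradients per channel
    cam = np.tensordot(feature_maps, weights, axes=([2], [0]))  # weighted sum over channels
    cam = np.maximum(cam, 0)               # ReLU: keep only positively contributing regions
    if cam.max() > 0:
        cam = cam / cam.max()              # normalize to [0, 1] for visualization
    return cam

rng = np.random.default_rng(0)
cam = grad_cam(rng.normal(size=(7, 7, 64)), rng.normal(size=(7, 7, 64)))
print(cam.shape)
```

The resulting low-resolution map is then upsampled to the input image size and overlaid as a heat map, which is what the visualizations below show.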
What’s really interesting is that even though we did not feed any emotion-related spatial information to the network, the model is capable of learning on its own which face regions it should look at when trying to understand facial expressions. We humans intuitively look at a person’s eyes and mouth to notice a smile or sadness, but a neural network only sees a matrix of pixels.
Looking at the Grad-CAM activations, it appears the model figured out that the eyes and mouth are the most important indicators of expressed emotion. Other regions that were often activated include the forehead (surprise, fear) and the nose (disgust).
Another interesting thing to do was to check how those activated regions vary per category. To do that, I took face images from the test set, calculated Grad-CAM activations on them, grouped them by emotion label and calculated their averages.
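That grouping-and-averaging step can be sketched in a few lines of numpy. Here the heat maps and labels are random stand-ins for the real test-set Grad-CAM outputs, and the `48x48` map size and 3-class labels are arbitrary assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real data: one Grad-CAM heat map per test image,
# plus its emotion label (0 = neutral, 1 = positive, 2 = negative).
cams = rng.random(size=(100, 48, 48))
labels = rng.integers(0, 3, size=100)

# Group the heat maps by label and average within each group.
mean_maps = {int(lab): cams[labels == lab].mean(axis=0)
             for lab in np.unique(labels)}
print(sorted(mean_maps))
```

Each entry of `mean_maps` is then a single per-emotion heat map that can be overlaid on an average face for inspection.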
Even though the mean maps don’t look that different between labels – which is not surprising, given that we only check which face regions are activated, not how strongly – there is something emotional about them. For example, I love how the mean activations for disgust look really disgusted and unhappy.
If you are interested in numerical results on how our model compares against benchmarks (spoiler alert: it kicks ass!), I recommend adding our paper from this year’s CVPR workshops to your reading list. For further reading, there is also an extended version on arXiv (currently under review for publication).
For those who are into fewer words and more code, there is also a GitHub repo with the EmotionalDAN implementation in TensorFlow. Enjoy!