The Client & the Challenge
A technology company that is a leading player in conversational AI and automation wanted a robust and efficient face super resolution (FSR) system that could transform small, low-quality facial images captured by web and CCTV cameras into usably sized, high-resolution images suitable for recognizing facial expressions. The intent was to augment conversational AI with facial expression recognition so that it could comprehend participants’ emotions during online meetings.
Automation is the process of converting recurring, routine processes into autonomous operations that can be performed with zero or minimal human intervention. Automation improves efficiency, minimises errors, saves cost and labour, and can be applied to any sector. The market for automation software, which was valued at USD 19.9 billion in 2021, is anticipated to reach USD 76.4 billion by 2030, growing at a CAGR of 16.5% between 2022 and 2030. Artificial Intelligence is a key enabler of automation. Conversational AI, through technologies such as chatbots and virtual assistants, automatically comprehends, processes and responds to customer queries. The global conversational AI market is expected to almost triple from USD 6.8 billion in 2021 to USD 18.4 billion in 2026, growing at a CAGR of 21.8% during the forecast period.
Intent and emotion inference is a key aspect of conversational AI and automation, and it is more precise when audio and visual inputs are processed together. Understanding emotions from visual inputs such as facial images requires high-resolution images with enough detail to recognize and comprehend facial expressions. Facial images obtained from webcams during online meetings are typically small and low in resolution. For facial recognition AI models to work effectively, these images need to be transformed into usably sized, high-resolution images. Thus, it was necessary to develop a high-quality, low-latency face super resolution system that could be used with facial expression recognition models to decipher participants’ emotions during online meetings.
We used a blend of vision AI and deep learning to solve the customer's challenge. Here is a breakdown of the steps we followed:
Step 1: Leverage the power of generative adversarial networks
Generative adversarial networks (GANs) are algorithmic architectures that pit two neural networks against each other (hence “adversarial”): a generator that produces new data and a discriminator that tries to tell generated data from real data. Through this contest the generator learns the regularities and patterns in the input data and produces new instances that are very close to the original, which makes GANs a good fit for face super resolution. After detailed analysis and experimentation, GFP-GAN (Generative Facial Prior GAN) and RESR-GAN (Real-Enhanced Super-Resolution GAN, commonly written Real-ESRGAN), two popular models for face restoration and for image and video super resolution respectively, were selected to build our FSR system.
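To make the adversarial idea concrete, here is a minimal sketch of the two competing objectives in plain Python. It is an illustration only, not the training code used in this project: the function names and the example scores are hypothetical, and production super resolution GANs typically combine an adversarial term like this with pixel-level and perceptual losses.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy loss for the discriminator.

    d_real: scores in (0, 1) the discriminator gave to real high-resolution images
    d_fake: scores in (0, 1) the discriminator gave to generated images
    The loss is low when real images score near 1 and generated images near 0.
    """
    real_term = sum(-math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss.

    The loss is low when the discriminator scores generated images near 1,
    i.e. when the generator succeeds in fooling it.
    """
    return sum(-math.log(p + eps) for p in d_fake) / len(d_fake)


# Hypothetical discriminator scores, chosen only to show the opposing goals.
real_scores = [0.90, 0.95]
weak_fakes = [0.10, 0.05]    # generated images that are easy to spot
strong_fakes = [0.80, 0.90]  # generated images that fool the discriminator

print("D loss, weak fakes:  ", discriminator_loss(real_scores, weak_fakes))
print("D loss, strong fakes:", discriminator_loss(real_scores, strong_fakes))
print("G loss, weak fakes:  ", generator_loss(weak_fakes))
print("G loss, strong fakes:", generator_loss(strong_fakes))
```

As the generated images become more convincing, the generator loss falls while the discriminator loss rises. Training alternates between the two networks so that each improvement on one side forces the other to improve as well.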
Step 2: Train the GFP-GAN and RESR-GAN models
The GFP-GAN and RESR-GAN models were trained with approximately 80K images spread across three emotion categories (positive, neutral and negative) to reduce bias toward any one category. Based on experimental analysis, the RESR-GAN model was chosen for productization: it offered an accuracy of more than 85% while preserving the identity, artefacts and facial expressions of the original image after upsampling. The RESR-GAN model also showed a low latency of 13.89 ms per image.
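A per-image latency figure like the one above is usually obtained by timing repeated inference calls after a short warm-up. The sketch below shows one way to do that with the standard library. The helper names are hypothetical and `fake_upscale` is only a stand-in for a real model call, so the numbers it prints say nothing about the actual system. For reference, 13.89 ms per image corresponds to roughly 72 images per second when images are processed one at a time.

```python
import statistics
import time


def measure_latency_ms(fn, inputs, warmup=3):
    """Return the latency in milliseconds of each call fn(x) for x in inputs."""
    for x in inputs[:warmup]:
        fn(x)  # warm-up calls are executed but not timed
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def throughput_per_second(mean_latency_ms):
    """Images per second for sequential, single-stream processing."""
    return 1000.0 / mean_latency_ms


def fake_upscale(image):
    """Hypothetical stand-in for a model call: repeats every pixel twice."""
    return [px for px in image for _ in range(2)]


# Twenty dummy "images" of 64 pixels each, used only to exercise the timer.
dummy_images = [[0] * 64 for _ in range(20)]
latencies = measure_latency_ms(fake_upscale, dummy_images)

print("mean latency (ms):  ", statistics.mean(latencies))
print("median latency (ms):", statistics.median(latencies))
print("throughput at 13.89 ms/image:", throughput_per_second(13.89))
```

Reporting the median alongside the mean helps show whether a few slow calls are skewing the average.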
Step 3: Deploy the deep residual learning model
The RESR-GAN model was exported to ONNX, the Open Neural Network Exchange format, for deployment. ONNX not only supports interoperability across frameworks, but also enables accelerated inferencing through runtimes and libraries designed to maximise performance on hardware such as NVIDIA GPUs and Habana Gaudi.
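As a rough illustration of how an exported model can target different hardware, the sketch below selects ONNX Runtime execution providers in order of preference and falls back to the CPU when no GPU provider is present. It assumes the `onnxruntime` package is installed for `load_session`. The helper names and the model file name in the usage comment are hypothetical, and this is not the project's actual deployment code.

```python
def pick_providers(available,
                   preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred execution providers that are available, in preference order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError("none of the preferred execution providers is available")
    return chosen


def load_session(model_path):
    """Create an ONNX Runtime inference session for the exported model."""
    # Imported lazily so that pick_providers can be used without onnxruntime.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


# Provider selection on a hypothetical GPU machine and a CPU-only machine.
print(pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
print(pick_providers(["CPUExecutionProvider"]))

# Usage with a hypothetical file name (requires onnxruntime and a real model):
#   session = load_session("fsr_model.onnx")
#   outputs = session.run(None, {session.get_inputs()[0].name: input_array})
```

Keeping provider selection separate from session creation means the same exported model file can be served on GPU and CPU-only machines without code changes.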
- Face super resolution achieved with more than 85% accuracy.
- Fast face image transformation at a low latency of 13.89 ms per image.
- More than 1 million face images transformed.
- Automatic transformation of low-quality facial images into high-resolution images.
- High pixel density and enhanced image details that facilitate recognition of facial expressions and emotions.