EmoAR


EmoAR is a mobile AR app that aims to recognize facial expressions in real time and to superimpose virtual content according to the recognized expression.


Description

EmoAR (2019/20) is an AR application prototype (ARCore support is required) that aims to recognize human facial expressions in real time and to superimpose virtual content according to the recognized expression. For example, depending on the predicted facial expression, EmoAR would overlay emojis or randomized famous quotes about that expression in AR.

The live AR camera stream of an Android device is input to a detection step (tiny YOLO) that locates faces in the video frames in real time. The detected face regions are fed into a model that was trained on the public FER dataset (from a 2013 Kaggle competition). The facial expression of each detected face is determined in real time by this model, which we trained with PyTorch and converted to TensorFlow Lite for use on Android. Depending on the model's prediction, different virtual content overlays the face. This virtual augmentation of the face is done with Augmented Reality (ARCore).
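
The per-frame logic boils down to a few steps; below is an illustrative PyTorch sketch of it (the Android app itself runs the converted TensorFlow Lite model). `detect_faces` and `fer_model` are placeholder names for the tiny-YOLO face detector and the trained classifier, not actual identifiers from our code.

```python
# Illustrative per-frame pipeline in PyTorch; `detect_faces` (tiny YOLO) and
# `fer_model` (the trained classifier) are placeholder names.
import torch
from PIL import Image
from torchvision import transforms

CLASSES = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

preprocess = transforms.Compose([
    transforms.Grayscale(),        # FER2013 images are grayscale
    transforms.Resize((48, 48)),   # 48x48 px, as in the training data
    transforms.ToTensor(),
])

def classify_frame(frame: Image.Image, detect_faces, fer_model):
    """Crop each detected face from the frame and predict its expression."""
    expressions = []
    for (x, y, w, h) in detect_faces(frame):                   # boxes from tiny YOLO
        face = preprocess(frame.crop((x, y, x + w, y + h))).unsqueeze(0)
        with torch.no_grad():
            probs = torch.softmax(fer_model(face), dim=1)
        expressions.append(CLASSES[int(probs.argmax())])
    return expressions
```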

We also deployed the PyTorch model to a web app using Flask and Heroku, but without the AR feature.

  • Demo video Android app: https://youtu.be/R7-69Vf8r_4?t=120
  • Web app demo: https://emoar1.herokuapp.com/
  • Project description, German: https://www.terweyxr.de/post/augmented_reality_kuenstliche_intelligenz
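
Server-side, the Flask part of the web app essentially exposes a single prediction route. A minimal sketch follows, with the route name, model file name and preprocessing as assumptions rather than the exact production code:

```python
import io

import torch
from flask import Flask, jsonify, request
from PIL import Image
from torchvision import transforms

app = Flask(__name__)

CLASSES = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]
preprocess = transforms.Compose([
    transforms.Grayscale(),          # the FER model was trained on grayscale images
    transforms.Resize((48, 48)),
    transforms.ToTensor(),
])

model = torch.load("fer_model.pt", map_location="cpu")   # hypothetical model file
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    """Accept an uploaded image and return the predicted expression."""
    image = Image.open(io.BytesIO(request.files["file"].read()))
    tensor = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        idx = int(model(tensor).argmax(dim=1))
    return jsonify({"expression": CLASSES[idx]})

if __name__ == "__main__":
    app.run()
```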

Instead of relying on OpenCV’s techniques, we access the AR camera stream directly, use YOLO to locate a person in the video frame, crop this AR camera image, convert it to a Bitmap and feed a copy of it into our custom model (trained with PyTorch, converted to TensorFlow Lite) to determine the facial expression of the detected face in real time. Most of these tasks run asynchronously.

How we built EmoAR with PyTorch:

  • Trained CNNs with PyTorch specifically for mobile applications and web applications
  • Trained CNNs both with a custom architecture and via transfer learning with pre-trained PyTorch models (see the sketch after this list)
  • Model conversion from PyTorch to ONNX to TensorFlow for use in mobile applications, Unity3d and Android
  • Development of a REST API using Flask
  • PyTorch model deployment to a web app on Heroku
  • Deployment of a converted PyTorch model to Unity3d and to Android
  • Combination of Deep Learning and Augmented Reality (AR) in an Android app: Deep Learning determines the kind of visual AR content.
  • Sharing the AR camera feed with other APIs so that it can be fed to the CNN, while rendering on the OpenGL thread
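
A minimal sketch of the transfer-learning variant mentioned above, using MobileNetV2 as one example backbone; which and how many layers we unfroze varied between experiments, so the choices below are illustrative:

```python
# Transfer-learning sketch with MobileNetV2 as an example backbone; layer
# choices are illustrative, not our exact configuration.
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(pretrained=True)

# Freeze the feature extractor, then unfreeze only the last few blocks.
for param in model.features.parameters():
    param.requires_grad = False
for param in model.features[-3:].parameters():
    param.requires_grad = True

# Replace the classifier head with one for the 7 FER2013 classes.
# (The 48x48 grayscale FER images are resized and expanded to 3 channels
# before being fed to such an ImageNet backbone.)
model.classifier[1] = nn.Linear(model.last_channel, 7)
```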

Data Description:

We used the FER2013 dataset from Kaggle for training. [ https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/overview ] It was prepared by Pierre-Luc Carrier and Aaron Courville and consists of grayscale facial images of size 48x48 px.

The faces are categorized into 7 classes:

0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral

A total of 28,709 examples was used for training the models, which were then validated on 3,589 examples.
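
The Kaggle release ships as a single fer2013.csv with each image flattened into a space-separated pixel string; a loading sketch under that assumption:

```python
# Loading sketch for the Kaggle fer2013.csv (columns: emotion, pixels, Usage).
import numpy as np
import pandas as pd
import torch

df = pd.read_csv("fer2013.csv")

def to_tensors(split):
    """Convert one split ('Training', 'PublicTest', ...) to image/label tensors."""
    rows = df[df["Usage"] == split]
    pixels = np.stack([np.array(p.split(), dtype=np.float32) for p in rows["pixels"]])
    images = torch.from_numpy(pixels).view(-1, 1, 48, 48) / 255.0   # 48x48 grayscale
    labels = torch.tensor(rows["emotion"].values)                   # 0=Angry ... 6=Neutral
    return images, labels

train_x, train_y = to_tensors("Training")     # 28,709 examples
val_x, val_y = to_tensors("PublicTest")       # 3,589 examples
```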

Dataset issues we encountered:

  • the dataset is unbalanced (one possible mitigation is sketched after this list)
  • some images are ambiguous or show mixed expressions
  • some images contain no faces at all, only watermarks, icons, graphic content, blank frames or other useless information
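
One common way to counter the class imbalance is to oversample the rare classes with a weighted sampler (or, alternatively, to weight the loss). A sketch, assuming the `train_x`/`train_y` tensors from the loading snippet above:

```python
# Oversampling rare classes with a weighted sampler; train_x / train_y are the
# tensors from the loading sketch above.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

class_counts = torch.bincount(train_y, minlength=7).float()
sample_weights = (1.0 / class_counts)[train_y]        # rarer classes are drawn more often

train_loader = DataLoader(
    TensorDataset(train_x, train_y),
    batch_size=64,
    sampler=WeightedRandomSampler(sample_weights, num_samples=len(train_y)),
)
```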

About model training:

We experimented with and trained several models of different architectures, most of them pre-trained:

ResNeXt50, Inception ResNet v2, MobileNet v2, SqueezeNet, DenseNet121, ResNet101, and a custom CNN

We experimented with

  • data augmentation, e.g. rotation and horizontal flip
  • unfreezing some of the last layers
  • SGD and Adam optimizers
  • different learning rates
  • schedulers – MultiStepLR, ReduceLROnPlateau
  • weight initialization of the linear layers in a custom CNN
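
An illustrative combination of these settings; the hyperparameter values are placeholders rather than our final configuration, and the model defined here is only a stand-in for the custom CNN or a pre-trained backbone:

```python
# Illustrative combination of the settings listed above; hyperparameter values
# are placeholders, not our final configuration.
import torch
import torch.nn as nn
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(10),        # augmentation: small rotations
    transforms.RandomHorizontalFlip(),    # augmentation: horizontal flip
    transforms.ToTensor(),
])

# Placeholder network; in practice this is the custom CNN or a pre-trained backbone.
model = nn.Sequential(nn.Flatten(), nn.Linear(48 * 48, 7))

def init_linear(m):
    """Weight initialization for the linear layers of the custom CNN."""
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model.apply(init_linear)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # we also tried SGD
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 20], gamma=0.1)
# alternatively: torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)
```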

We started cleaning the FER dataset, trained the models with PyTorch for the web app, and converted our best PyTorch models to TensorFlow Lite.

About model conversion (as of autumn 2019):

Initially, we wanted to deploy to an Augmented Reality app (iOS and Android) via Unity3d, using TensorFlowSharp to do the inference. (TensorFlow .pb files are supported by Unity3d.) The conversion chain is as follows: PyTorch → ONNX → TensorFlow .pb
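
A rough sketch of that conversion chain using the onnx and onnx-tf packages; the file names, the placeholder model and the 48x48 grayscale dummy input are assumptions, since the actual input shape depends on the chosen architecture:

```python
# PyTorch -> ONNX -> TensorFlow (.pb), sketched with the onnx and onnx-tf packages.
import torch
import torch.nn as nn
import onnx
from onnx_tf.backend import prepare

# Placeholder for the trained classifier; input shape is an example (1x1x48x48 grayscale).
model = nn.Sequential(nn.Flatten(), nn.Linear(48 * 48, 7)).eval()
dummy_input = torch.randn(1, 1, 48, 48)

torch.onnx.export(model, dummy_input, "emoar.onnx")     # PyTorch -> ONNX
tf_rep = prepare(onnx.load("emoar.onnx"))               # ONNX -> TensorFlow via onnx-tf
tf_rep.export_graph("emoar.pb")                         # write the .pb graph
```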

We also tried the recently released Inference Engine by Unity3d with the Barracuda backend of the ML-Agents Toolkit. Due to incompatibilities between Barracuda and both the TensorFlow versions and our models' architectures, which led to crashes of Unity, we dropped the app development in Unity3d (as of autumn 2019).

We switched to development in Android (Java) with TensorFlow Lite and ARCore. The conversion to TensorFlow Lite reduced the model size by approximately 66%, i.e. to about a third of its original size.
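
The last step, from the frozen TensorFlow graph to a .tflite file, looked roughly like the following with the TF 1.x API available at the time; the tensor names are placeholders for the real graph node names:

```python
import tensorflow as tf

# TF 1.x converter API (autumn 2019); "input"/"output" stand in for the real node names.
converter = tf.lite.TFLiteConverter.from_frozen_graph(
    "emoar.pb", input_arrays=["input"], output_arrays=["output"])
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # optional weight quantization
tflite_model = converter.convert()

with open("emoar.tflite", "wb") as f:
    f.write(tflite_model)
```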

An alternative would be converting PyTorch to Caffe2 for use in Android, or converting from Keras to TensorFlow Lite (which we also tried for the Android app, as this is a quite straightforward approach).

About the Android project:

We used the following Android APIs and frameworks:

  • android.graphics
  • android.opengl
  • com.google.ar.core
  • org.tensorflow.lite

to name but a few.

We access the AR camera stream, use Tiny YOLO to locate a person in the video frame, crop this AR camera image, convert it to a Bitmap and feed it into our custom model (converted to TensorFlow Lite) to determine the expression of the detected face in real time.

To overlay and place virtual 3D content with ARCore, AR point clouds and AR planes are currently used.

We calculate the coordinates of the feature point that is closest to the face and place our virtual 3D arrow at that position. Depending on the recognized expression, the 3D model's texture changes.
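
In the app this placement logic runs in Java on the point cloud returned by ARCore; sketched language-agnostically in Python, it is a simple nearest-point search:

```python
import numpy as np

def closest_feature_point(face_center, point_cloud):
    """Return the feature point nearest to the face position.

    face_center: (x, y, z) world coordinates of the detected face
    point_cloud: N x 3 array of feature point coordinates
    """
    points = np.asarray(point_cloud, dtype=np.float32)
    distances = np.linalg.norm(points - np.asarray(face_center, dtype=np.float32), axis=1)
    return points[int(distances.argmin())]   # anchor the 3D arrow at this point
```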

Next steps: how to improve the project

  • Use a dataset of facial landmarks and/or 3D meshes of faces, because the app currently works best when flat images of faces are used for inference. This would improve the model performance enormously, especially when the app has to deal with rotated faces. Although we have found a few public datasets of facial 3D meshes, they would still have to be labelled according to facial expression classes. Due to a lack of time and of GPU resources for training such large datasets, we have not yet tackled this. A better approach than the current point-cloud and plane based placement would be to use the 3D face mesh that ARCore and ARKit generate for face tracking, both to superimpose the virtual 3D content and for model inference. This requires a dataset of such 3D face meshes that is labeled with expression classes and also provides landmark information.
  • Label these datasets according to the facial expression classes
  • This would improve the AR tracking quality as well as the TensorFlow model performance
  • Extend the dataset with more classes
  • Overlay virtual objects on multiple faces simultaneously

Incorporation of PySyft (Federated Learning and Encryption) in this project, and future prospects:

  1. Users' data and training data will not leave the device, and individual updates will not be stored on any servers or in data centers.
  2. The data used for training the models in this project are handled with copyright and privacy issues taken into consideration.

To improve our PyTorch model with new, high-quality private images, we could incorporate federated learning and encryption in order to secure the predictions of our classification models:

The user's device would download the current shared model from the cloud and improve it by learning from and training on local data on the device. The modifications would then be summarized as an update, which is sent back to the cloud, where it is averaged with other users' updates to improve the commonly shared model. All communication would have to be encrypted. This way, federated learning enables end devices to collaboratively learn a commonly shared prediction model while keeping all training data on the device.
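
Conceptually, the loop described above amounts to federated averaging. The following is a plain PyTorch simulation of that idea (not actual PySyft code); all names are illustrative:

```python
import copy
import torch

def local_update(global_model, local_loader, epochs=1, lr=1e-3):
    """Train a copy of the shared model on one device's private data."""
    model = copy.deepcopy(global_model)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in local_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    return model.state_dict()               # only the weight update leaves the "device"

def federated_round(global_model, client_loaders):
    """Average the clients' updated weights back into the shared model."""
    updates = [local_update(global_model, loader) for loader in client_loaders]
    averaged = {}
    for key in updates[0]:
        tensors = [u[key] for u in updates]
        averaged[key] = (torch.stack([t.float() for t in tensors]).mean(dim=0)
                         if tensors[0].is_floating_point() else tensors[0])
    global_model.load_state_dict(averaged)
    return global_model
```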

Why we have not applied PySyft yet, as of September 2019:

Problems using Differential Privacy and Federated Learning with our dataset:

  1. Issues with combining encryption and federated learning with CNNs and BatchNorm layers in pre-trained models.
  2. No GPU support for PySyft resulting in longer training times [ OpenMined/PySyft#1893 ]

The team: B. Terwey, A. Antony, M. Calincan, M. Zatyln
