Head Pose Estimation using Deep Learning


Estimating a person's head pose is an important task with applications such as fitting 3D objects to the face, user interaction in computer systems and applications, and tracking a driver’s direction of view. There are two practical approaches to estimating head pose from an image of a human head. The landmark-based approach estimates the landmarks of the face and then solves the 2D-to-3D correspondence problem against a 3D human head model. The appearance-based approach trains a deep convolutional neural network to predict the head pose directly from the image. An example of the appearance-based approach is the HopeNet architecture, which was presented in “Fine-Grained Head Pose Estimation Without Keypoints” and produced state-of-the-art results compared with commonly used landmark-based approaches. However, HopeNet was trained on synthetic datasets and tested on only one dataset of real data (BIWI), because other real datasets that pair images of human heads with Euler angles are lacking. The 2018 iPad Pro has a TrueDepth sensor which, together with the Apple Neural Engine, can capture detailed information about the human face and extract the head pose from it. We present a new approach to creating a custom dataset of head images with corresponding intrinsic Euler angles (pitch, yaw, roll) and 3D position: we record videos with a 2018 iPad Pro and collect the frames together with their Euler angles. We show that this data collection method, combined with the HopeNet architecture, has great potential for head pose estimation, producing prediction errors similar to those of the 2018 iPad Pro.
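As an illustration of how prediction errors on Euler angles might be compared, the following minimal sketch computes a per-angle mean absolute error in degrees. It is not taken from the work itself: the function names, the (pitch, yaw, roll) tuple layout, the wrap-around handling, and the sample values are all assumptions chosen for the example.

```python
from typing import Sequence, Tuple

# Hypothetical sample layout: (pitch, yaw, roll) in degrees.
Angles = Tuple[float, float, float]


def wrapped_difference(predicted: float, reference: float) -> float:
    """Signed angular difference in degrees, wrapped into [-180, 180)."""
    return (predicted - reference + 180.0) % 360.0 - 180.0


def mean_absolute_error(
    predictions: Sequence[Angles], references: Sequence[Angles]
) -> Angles:
    """Mean absolute error for pitch, yaw and roll, computed per angle."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    totals = [0.0, 0.0, 0.0]
    for pred, ref in zip(predictions, references):
        for axis in range(3):
            totals[axis] += abs(wrapped_difference(pred[axis], ref[axis]))
    n = len(predictions)
    return (totals[0] / n, totals[1] / n, totals[2] / n)


# Made-up values for illustration only, not measurements from any dataset.
predicted = [(10.0, -20.0, 5.0), (0.0, 179.0, -3.0)]
reference = [(12.0, -25.0, 5.0), (1.0, -179.0, -1.0)]

mae = mean_absolute_error(predicted, reference)
print("MAE (pitch, yaw, roll):", mae)
```

Wrapping the difference means that a prediction of 179 degrees against a reference of -179 degrees counts as a 2 degree error rather than 358 degrees. Whether wrapping matters in practice depends on the range of angles in the dataset.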