
GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB

Leo's notes 2020. 1. 15. 21:48

Goal

Track hand pose from unconstrained monocular RGB video streams at real-time framerates

Background

Multi-view methods

  • Hard to set up (require calibration)
  • Hard to operate on general hand motions in unconstrained scenes
  • Expensive

Monocular methods

  • No multi-camera setup overhead
  • Depth-sensor-based methods do not work in all scenes (e.g., outdoors in sunlight) and have higher power consumption
  • Not robust to occlusions by objects
  • RGB-only methods cannot distinguish 3D poses that project to the same 2D joint positions

Learning-based methods

  • Difficult to obtain annotated data with sufficient real-world variations
  • Suffer from occlusions due to objects being manipulated by the hand
  • Synthetic data has a domain gap when models trained on this data are applied to real input
  • Hard to obtain real-synthetic image pairs

Dataset

Since annotating 3D joint positions in hundreds of thousands of real hand images is infeasible, synthetically generated images are commonly used.

Real hand images

28,903 real hand images captured with a desktop webcam

Synthetic hand images

The SynthHands dataset, drawn from state-of-the-art prior work:

  • Real-time hand tracking under occlusion from an egocentric RGB-D sensor
  • Learning to Estimate 3D Hand Pose from Single RGB Images

GeoConGAN

Translates synthetic images into real-looking ones; based on CycleGAN

  • Uses adversarial discriminators to learn cycle-consistent forward and backward mappings (see the sketch below)
  • Has two trainable translators, synth2real and real2synth
  • Does not require paired images
  • Preserves poses during translation
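
A minimal sketch of the cycle-consistency objective, assuming `synth2real` and `real2synth` are the two translator networks; the helper below is illustrative rather than the paper's exact formulation:

```python
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(synth, real, synth2real, real2synth):
    """Penalize images that fail to map back to themselves after a full cycle."""
    # synth -> "real" -> synth should reproduce the input, and vice versa.
    loss_synth = l1(real2synth(synth2real(synth)), synth)
    loss_real = l1(synth2real(real2synth(real)), real)
    return loss_synth + loss_real
```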

The silhouettes used to keep poses consistent during translation are extracted by SilNet, a binary segmentation network based on a simple U-Net.

  • Has three stride-2 convolutions and three deconvolutions
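
A sketch of such an encoder-decoder under that layout; the channel counts, kernel sizes, and the omission of U-Net skip connections are assumptions for brevity:

```python
import torch
import torch.nn as nn

class SilNet(nn.Module):
    """Per-pixel binary (hand/background) segmentation: three stride-2
    convolutions followed by three deconvolutions."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),  # silhouette logits
        )

    def forward(self, x):
        return torch.sigmoid(self.decoder(self.encoder(x)))
```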

Data Augmentation

  • Composite GANerated images with random background images (see the sketch below)
  • Randomly re-texture objects by leveraging the object masks
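
A minimal compositing sketch, assuming float images in [0, 1] and per-pixel masks; the function name and signature are illustrative:

```python
import numpy as np

def composite(hand_rgb, hand_mask, background, object_rgb=None, object_mask=None):
    """Paste a (GANerated) hand crop onto a random background using its mask.
    Images are HxWx3 float arrays in [0, 1]; masks are HxW in [0, 1]."""
    m = hand_mask[..., None]
    out = m * hand_rgb + (1.0 - m) * background
    if object_rgb is not None:  # optionally overlay a randomly textured object
        om = object_mask[..., None]
        out = om * object_rgb + (1.0 - om) * out
    return out
```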

Hand Joints Regression

Train a CNN, RegNet (derived from ResNet50), that predicts the 2D and 3D positions of 21 hand joints, using about 440,000 samples (60% GANerated)

  • 2D joint positions are represented as heatmaps in image space: the spread of each heatmap represents uncertainty (see the sketch below)
  • 3D positions are represented as coordinates relative to the root joint: this resolves the depth ambiguity
  • An additional refinement module based on a projection layer (ProjLayer) better coalesces the 2D and 3D predictions
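
A sketch of the two output representations, assuming a Gaussian heatmap per joint (the resolution and sigma are illustrative choices) and root-relative 3D coordinates:

```python
import numpy as np

def joint_heatmap(u, v, size=64, sigma=2.0):
    """Render one 2D joint as a Gaussian heatmap; the spread encodes uncertainty."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))

def root_relative(joints_3d, root_index=0):
    """Express 21x3 joint positions relative to the root joint."""
    return joints_3d - joints_3d[root_index]
```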

Kinematic Skeleton Fitting

Kinematic Hand Model

Comprises one root joint and 20 finger joints.

  • Per-user skeleton adaptation: average the relative bone lengths of the 2D predictions over 30 frames while the user holds the hand parallel to the camera image plane (see the sketch below)
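
A sketch of this calibration step, assuming a hypothetical 21-joint parent table and one 2D prediction per frame:

```python
import numpy as np

# Hypothetical parent index per joint (root = -1); the exact ordering depends
# on the joint convention the tracker uses.
PARENT = [-1, 0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11, 0, 13, 14, 15, 0, 17, 18, 19]

def relative_bone_lengths(joints_2d):
    """Bone lengths from one frame of 2D predictions, normalized by their sum."""
    lengths = np.array([np.linalg.norm(joints_2d[j] - joints_2d[p])
                        for j, p in enumerate(PARENT) if p >= 0])
    return lengths / lengths.sum()

def adapt_skeleton(frames_2d):
    """Average relative bone lengths over ~30 calibration frames."""
    return np.mean([relative_bone_lengths(f) for f in frames_2d], axis=0)
```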

2D Fitting Term

Minimize the distance between the model's joint positions projected onto the image plane and the 2D heatmap maxima (a sketch follows).
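
A sketch of this term, assuming hypothetical `forward_kinematics` and `project` helpers and per-joint detection confidences as weights:

```python
import numpy as np

def e2d(theta, heatmap_maxima, confidences, forward_kinematics, project):
    """Confidence-weighted squared distance between projected model joints
    and the 2D heatmap maxima (helper names are placeholders)."""
    joints_3d = forward_kinematics(theta)  # 21x3 model joint positions
    joints_2d = project(joints_3d)         # 21x2 projections onto the image
    diff = joints_2d - heatmap_maxima
    return np.sum(confidences * np.sum(diff ** 2, axis=1))
```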

3D Fitting Term

  • Obtain a good hand articulation by using the predicted relative 3D joint positions
  • Resolve depth ambiguities that are present when using 2D joint positions only
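
A sketch of this 3D term under the same assumed helpers; scale handling between the model and the prediction is omitted:

```python
import numpy as np

def e3d(theta, pred_rel_3d, forward_kinematics, root_index=0):
    """Squared distance between the model's root-relative joints and the
    CNN's root-relative 3D predictions."""
    joints_3d = forward_kinematics(theta)
    rel = joints_3d - joints_3d[root_index]
    return np.sum((rel - pred_rel_3d) ** 2)
```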

Joint Angle Constraints

Penalize anatomically implausible hand articulations by enforcing that joints do not bend too far
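
A sketch of such a joint-limit penalty, assuming per-joint lower and upper angle bounds:

```python
import numpy as np

def e_limit(theta, theta_min, theta_max):
    """Quadratic penalty once a joint angle leaves its anatomical range."""
    over = np.maximum(theta - theta_max, 0.0)
    under = np.maximum(theta_min - theta, 0.0)
    return np.sum(over ** 2 + under ** 2)
```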

Temporal Smoothness

Penalize deviations from constant velocity
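
A sketch that penalizes the change in velocity (i.e., acceleration) of the pose vector across three consecutive frames:

```python
import numpy as np

def e_temp(theta_t, theta_prev, theta_prev2):
    """Penalize deviation from constant velocity."""
    velocity_now = theta_t - theta_prev
    velocity_before = theta_prev - theta_prev2
    return np.sum((velocity_now - velocity_before) ** 2)
```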

Optimization

  • Minimize the energy with a gradient-descent strategy (see the sketch below)
  • Exploit the fact that the root joint and its four direct child joints (the non-thumb MCP joints) form a rigid structure
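
A minimal sketch of the per-frame optimization using a finite-difference gradient; the step count, learning rate, and any weighting of the energy terms above are illustrative, not values from the paper:

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-5):
    """Finite-difference gradient, adequate for a low-dimensional pose vector."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

def fit_frame(theta0, energy, steps=20, lr=1e-2):
    """Plain gradient descent on the total energy for one frame, e.g.
    energy = lambda t: e2d(...) + e3d(...) + e_limit(...) + e_temp(...)."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - lr * numerical_grad(energy, theta)
    return theta
```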

Experiments

Use the Percentage of Correct Keypoints (PCK) score, a popular criterion for evaluating pose estimation accuracy (sketched below).
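
A sketch of the PCK computation; the array shapes and threshold units (pixels or millimeters) depend on the evaluation setup:

```python
import numpy as np

def pck(pred, gt, threshold):
    """Fraction of predicted keypoints within `threshold` of the ground truth.
    `pred` and `gt` are Nx21x2 (or Nx21x3) arrays of joint positions."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return np.mean(dists < threshold)
```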

Conclusion

A real-time hand tracking system with:

  • Monocular RGB-only input, without multi-view or depth cameras
  • Synthetic generation of training data via a geometrically consistent image-to-image translation network (GeoConGAN)
  • A convolutional neural network (RegNet) trained for joint regression
  • Kinematic skeleton fitting

Advantages over prior work:

  • No setup overhead
  • Lower power consumption than depth-based methods
  • Works on unconstrained images and does not require paired training images
  • Preserves poses during translation
  • Robust to occlusions and varying camera viewpoints
  • More precise than state-of-the-art methods

Reference

Mueller et al., "GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB," CVPR 2018.
https://handtracker.mpi-inf.mpg.de/projects/GANeratedHands/content/GANeratedHands_CVPR2018.pdf