Capturing 3D interacting hand motion is important for applications such as AR/VR and social signal understanding. Existing works mainly rely on depth images, multi-view images, or image sequences as input.
A recent paper published on arXiv.org proposes reconstructing 3D interacting hands from a single monocular RGB image.
The researchers introduce a two-stage framework that estimates the 3D poses and shapes of two closely interacting hands, producing accurate poses with minimal collisions. First, a convolutional neural network predicts initial meshes for both hands. In the second stage, a novel factorized refinement strategy ameliorates collisions: the error is decomposed into contributing factors, which are corrected one at a time.
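The idea of factorized refinement, correcting one source of error at a time while leaving the rest of the prediction untouched, can be illustrated with a minimal sketch. This is not the authors' code: the sphere-based hand proxy, the `penetration` measure, and the choice of global translation as the first refined factor are all illustrative assumptions.

```python
# Illustrative sketch (NOT the paper's implementation): collision-aware
# refinement of one factor at a time. Each hand is approximated here by a
# set of spheres (center + radius); this proxy and the refined "factor"
# (global translation of one hand) are hypothetical stand-ins for the
# factors decomposed in the paper.
from dataclasses import dataclass

@dataclass
class Sphere:
    x: float
    y: float
    z: float
    r: float

def penetration(a: Sphere, b: Sphere) -> float:
    """Penetration depth between two spheres (0.0 if not colliding)."""
    d = ((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2) ** 0.5
    return max(0.0, a.r + b.r - d)

def total_collision(left: list, right: list) -> float:
    """Sum of pairwise penetrations between the two hand proxies."""
    return sum(penetration(a, b) for a in left for b in right)

def refine_translation(left: list, right: list,
                       step: float = 0.01, max_iters: int = 200) -> list:
    """Refine one factor only: nudge the right hand's global translation
    along x in small steps until collisions vanish (or max_iters is hit),
    leaving all other factors of the initial prediction untouched."""
    for _ in range(max_iters):
        if total_collision(left, right) == 0.0:
            break
        for s in right:
            s.x += step
    return right
```

In this toy setting, a colliding initial prediction is resolved by small translation steps; a full method would refine further factors (e.g. per-hand pose) in turn, each under a constraint that keeps the refined meshes close to the pose-accurate first-stage output.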
Extensive evaluations on large-scale datasets show that the proposed method reduces generated collisions by 71.4% and improves pose estimation accuracy by 16.5% compared with existing methods.
3D interacting hand reconstruction is essential to facilitate human-machine interaction and human behavior understanding. Previous works in this field either rely on auxiliary inputs such as depth images, or they can only handle a single hand if monocular single RGB images are used. Single-hand methods tend to generate collided hand meshes when applied to closely interacting hands, since they cannot explicitly model the interactions between two hands. In this paper, we make the first attempt to reconstruct 3D interacting hands from monocular single RGB images. Our method can generate 3D hand meshes with both precise 3D poses and minimal collisions. This is made possible via a two-stage framework. Specifically, the first stage adopts a convolutional neural network to generate coarse predictions that tolerate collisions but encourage pose-accurate hand meshes. The second stage progressively ameliorates the collisions through a series of factorized refinements while retaining the preciseness of 3D poses. We carefully investigate potential implementations for the factorized refinement, considering the trade-off between efficiency and accuracy. Extensive quantitative and qualitative results on large-scale datasets such as InterHand2.6M demonstrate the effectiveness of the proposed approach.
Research paper: Rong, Y., Wang, J., Liu, Z., and Loy, C. C., “Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements”, 2021. Link: https://arxiv.org/abs/2111.00763