Abstract
Neural radiance fields (NeRFs) are promising 3D representations for scenes, objects, and humans. However, most existing methods require multi-view inputs and per-scene training, which limits their real-life applications. Moreover, current methods focus on single-subject cases, leaving scenes of interacting hands, which involve severe inter-hand occlusions and challenging view variations, unsolved. To tackle these issues, this paper proposes a generalizable visibility-aware NeRF (VA-NeRF) framework for interacting hands. Specifically, given an image of interacting hands as input, our VA-NeRF first obtains a mesh-based representation of the hands and extracts their corresponding geometric and textural features. Subsequently, a feature fusion module that exploits the visibility of query points and mesh vertices is introduced to adaptively merge features of both hands, enabling the recovery of features in unseen areas. Additionally, our VA-NeRF is optimized together with a novel discriminator within an adversarial learning paradigm. In contrast to conventional discriminators that predict a single real/fake label for the synthesized image, the proposed discriminator generates a pixel-wise visibility map, providing fine-grained supervision for unseen areas and encouraging the VA-NeRF to improve the visual quality of synthesized images. Experiments on the InterHand2.6M dataset demonstrate that our proposed VA-NeRF outperforms conventional NeRFs significantly. Project Page: https://github.com/XuanHuang0/VANeRF.
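The exact fusion architecture is not given in this excerpt; the following is a minimal PyTorch sketch of how per-vertex features could be merged with visibility-dependent weights, as the abstract describes. The module name, the (visibility, distance) inputs, the small MLP, and all tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisibilityWeightedFusion(nn.Module):
    """Hypothetical sketch: fuse features of reference mesh vertices for each
    query point, weighting each vertex by its estimated visibility."""

    def __init__(self):
        super().__init__()
        # Small MLP mapping (visibility, distance) of each vertex to a fusion logit.
        self.weight_mlp = nn.Sequential(
            nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, vert_feats, vert_vis, vert_dist):
        # vert_feats: (N, K, C) features of K reference vertices per query point
        # vert_vis:   (N, K)    visibility scores in [0, 1]
        # vert_dist:  (N, K)    distances from the query point to each vertex
        logits = self.weight_mlp(torch.stack([vert_vis, vert_dist], dim=-1))  # (N, K, 1)
        weights = torch.softmax(logits, dim=1)                                # normalize over K vertices
        return (weights * vert_feats).sum(dim=1)                              # (N, C) fused feature
```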
Framework
Experiment
Conclusion
In this paper, we introduce a single-image generalizable visibility-aware neural radiance field framework for image synthesis of interacting hands. The proposed framework leverages the visibility of 3D points for feature fusion and adversarial learning. Our feature fusion is achieved by fusing features of reference vertices closely related to query points, with fusion weights determined by point visibility. Our adversarial learning is accomplished through the training of a pixel-wise discriminator capable of estimating visibility maps. With these two components working in concert, the proposed method obtains reliable features and high-quality results, even in challenging scenarios involving heavy occlusions and large view variations. The proposed method is evaluated on InterHand2.6M and achieves performance superior to state-of-the-art generalizable models.
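The pixel-wise discriminator is only described at a high level here; below is a minimal, assumption-laden sketch of a fully convolutional discriminator that outputs a per-pixel visibility map rather than a single real/fake score, together with a possible supervision term. The layer configuration, the sigmoid output, and the binary cross-entropy loss against ground-truth visibility maps are illustrative choices, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelwiseVisibilityDiscriminator(nn.Module):
    """Hypothetical sketch: fully convolutional discriminator that predicts a
    per-pixel visibility map for a synthesized (or real) image."""

    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, img):
        # img: (B, 3, H, W); output: (B, 1, H, W) visibility map in [0, 1]
        return torch.sigmoid(self.net(img))


def visibility_map_loss(pred_map, gt_visibility):
    # Pixel-wise supervision: unseen (occluded) regions receive fine-grained
    # feedback instead of a single image-level real/fake label.
    return F.binary_cross_entropy(pred_map, gt_visibility)
```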
Acknowledgement
This work was supported in part by the National Key R&D Program of China under Grant No. 2020AAA0109700, Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), National Natural Science Foundation of China (NSFC) under Grants No. 61976233, No. 92270122, No. 62372482, and No. 61936002, Mobility Grant Award under Grant No. M-0461, Shenzhen Science and Technology Program (Grant No. RCYX20200714114642083), Shenzhen Science and Technology Program (Grant No. GJHZ20220913142600001), Nansha Key R&D Program under Grant No. 2022ZD014, and Sun Yat-sen University under Grants No. 22lgqb38 and 76160-12220011.