- Mude Lin, Liang Lin, Xiaodan Liang, Keze Wang, Hui Cheng, “Recurrent 3D Pose Sequence Machines”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. (oral presentation)
3D human articulated pose recovery from monocular image sequences is very challenging due to diverse appearances, viewpoints, and occlusions, and because 3D human pose is inherently ambiguous from monocular imagery alone. It is thus critical to exploit rich spatial and temporal long-range dependencies among body joints for accurate 3D pose sequence prediction. Existing approaches usually manually design elaborate prior terms and human body kinematic constraints to capture structure, which are often insufficient to exploit all intrinsic structures and do not scale to all scenarios. In contrast, this paper presents a Recurrent 3D Pose Sequence Machine (RPSM) that automatically learns image-dependent structural constraints and sequence-dependent temporal context through multi-stage sequential refinement. At each stage, our RPSM is composed of three modules that predict the 3D pose sequences based on the previously learned 2D pose representations and 3D poses: (i) a 2D pose module extracting image-dependent pose representations, (ii) a 3D pose recurrent module regressing 3D poses, and (iii) a feature adaption module serving as a bridge between modules (i) and (ii) to enable the representation transformation from the 2D to the 3D domain. These three modules are assembled into a sequential prediction framework that refines the predicted poses over multiple recurrent stages. Our RPSM is thus capable of implicitly encoding the 3D geometric structural information and temporal coherence in the 3D pose sequences. Extensive evaluations on the Human3.6M and HumanEva-I datasets show that our RPSM substantially surpasses all state-of-the-art approaches for 3D pose estimation, e.g., with over a 20% reduction of the mean per joint error on Human3.6M.
As illustrated in Fig. 1, we propose a novel Recurrent 3D Pose Sequence Machine (RPSM) to resolve 3D pose sequence generation from monocular frames, which recurrently refines the predicted 3D poses at multiple stages. At each stage, RPSM consists of three consecutive modules: 1) a 2D pose module that extracts 2D pose-aware features; 2) a feature adaption module that transforms the representation from the 2D to the 3D domain; 3) a 3D pose recurrent module that estimates 3D poses for each frame while incorporating temporal dependency in the image sequence. These three modules are combined into a unified framework at each stage. The monocular image sequences are passed through multiple stages to gradually refine the predicted 3D poses. We train the network parameters recurrently across multiple stages in a fully end-to-end way.
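The multi-stage pipeline described above can be sketched as follows. This is an illustrative numpy mock-up, not the authors' implementation: the three modules are stubbed with random projections, and the feature dimension (128) and number of stages (3) are assumptions for illustration; only the 17-joint, 51-dimensional pose output follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N_JOINTS, N_STAGES = 5, 17, 3        # frames, joints, refinement stages
FEAT_DIM = 128                          # feature size (assumed for illustration)
POSE_DIM = N_JOINTS * 3                 # 51-D 3D pose vector

def pose_2d_module(frame_feat, prev_2d_feat, prev_pose):
    """Stub: extract 2D pose-aware features, conditioned on the previous
    stage's 2D features and predicted 3D pose (zeros at the first stage)."""
    x = np.concatenate([frame_feat, prev_2d_feat, prev_pose])
    return np.tanh(W_2d @ x)

def feature_adaption(feat_2d):
    """Stub: transform 2D pose-aware features into the 3D domain."""
    return np.tanh(W_ad @ feat_2d)

def pose_3d_recurrent(feat_3d, hidden, prev_pose):
    """Stub: recurrent regression of the 51-D 3D pose; the state `hidden`
    is carried across frames, mimicking the shared LSTM."""
    hidden = np.tanh(W_h @ np.concatenate([feat_3d, hidden, prev_pose]))
    return W_out @ hidden, hidden

W_2d  = rng.normal(0, 0.1, (FEAT_DIM, 2 * FEAT_DIM + POSE_DIM))
W_ad  = rng.normal(0, 0.1, (FEAT_DIM, FEAT_DIM))
W_h   = rng.normal(0, 0.1, (FEAT_DIM, 2 * FEAT_DIM + POSE_DIM))
W_out = rng.normal(0, 0.1, (POSE_DIM, FEAT_DIM))

frames   = rng.normal(size=(T, FEAT_DIM))   # per-frame image features
poses    = np.zeros((T, POSE_DIM))          # 3D poses, refined stage by stage
feats_2d = np.zeros((T, FEAT_DIM))          # previous-stage 2D pose-aware features

for stage in range(N_STAGES):
    hidden = np.zeros(FEAT_DIM)             # recurrent state shared across frames
    for t in range(T):
        f2d = pose_2d_module(frames[t], feats_2d[t], poses[t])
        f3d = feature_adaption(f2d)
        poses[t], hidden = pose_3d_recurrent(f3d, hidden, poses[t])
        feats_2d[t] = f2d

print(poses.shape)  # (5, 51): one refined 51-D pose per frame
```

The key point the sketch captures is the data flow: each stage re-consumes the previous stage's 2D features and 3D poses, so predictions are refined rather than recomputed from scratch.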
Fig. 1. An overview of the proposed Recurrent 3D Pose Sequence Machine architecture. Our framework predicts the 3D human poses for all of the monocular image frames, and then sequentially refines them with multi-stage recurrent learning. At each stage, every frame of the input sequence is sequentially passed through three neural network modules: i) a 2D pose module extracting the image-dependent pose representations; ii) a feature adaption module transforming the pose representations from the 2D to the 3D domain; iii) a 3D pose recurrent module predicting the human joints in 3D coordinates. Note that the parameters of the 3D pose recurrent module are shared across all frames to preserve temporal motion coherence. Given the initial 3D joint predictions and 2D features from the first stage, we perform multi-stage refinement to recurrently improve the pose accuracy. From the second stage onward, the previously predicted 17 joints (51 dimensions) and the 2D pose-aware features are fed as input to the 2D pose module and the 3D pose recurrent module, respectively. The final 3D pose sequence is obtained after recurrently performing this multi-stage refinement.
Fig. 2. Detailed network architecture of our proposed RPSM at the k-th stage. An input frame of size 368 x 368 is sequentially fed into the 2D pose module, the feature adaption module, and the 3D pose recurrent module to predict the locations of 17 joint points (a 51-dimensional output). The 2D pose module consists of 15 convolution layers shared across all stages and 2 specialized convolution layers for each stage. The specialized convolution layers take the shared features and the 2D pose-aware features from the previous stage as input, and output specialized features to the feature adaption module as well as to the next stage. The feature adaption module consists of two convolution layers and one fully-connected layer with 1024 units. Finally, the adapted features, the hidden states of the LSTM layer, and the previously predicted 3D poses are concatenated together as the input of the 3D pose recurrent module to produce the 3D pose of each frame.
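The concatenation feeding the 3D pose recurrent module can be checked with a quick shape computation. The 1024-unit adapted features and the 51-D pose come from the caption; the LSTM hidden-state size of 1024 is an assumption for illustration.

```python
import numpy as np

FEAT   = 1024    # adapted features from the 1024-unit fully-connected layer
HIDDEN = 1024    # LSTM hidden-state size (assumed, not stated in the caption)
POSE   = 17 * 3  # previously predicted 3D pose: 17 joints x 3 coords = 51-D

adapted   = np.zeros(FEAT)
hidden    = np.zeros(HIDDEN)
prev_pose = np.zeros(POSE)

# Input to the 3D pose recurrent module: concatenation of the three parts.
lstm_input = np.concatenate([adapted, hidden, prev_pose])
print(lstm_input.shape[0])  # 2099
```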
Table 1. Quantitative comparisons on the Human3.6M dataset using 3D pose errors (in millimeters) for different actions of subjects 9 and 11. The entries with the smallest 3D pose errors for each category are bold-faced. Our RPSM achieves a significant improvement over all compared state-of-the-art approaches, i.e., it reduces the mean error by 21.52%.
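The 3D pose error reported in the tables is the mean per joint position error: the Euclidean distance between predicted and ground-truth joints, averaged over joints and frames. A minimal numpy sketch on synthetic poses (the noise level and array sizes are made up for illustration, and this omits any benchmark-specific alignment protocol):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance (mm)
    between predicted and ground-truth joints, over joints and frames."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

rng = np.random.default_rng(1)
gt = rng.normal(size=(100, 17, 3)) * 100     # synthetic ground truth, mm
pred = gt + rng.normal(size=gt.shape) * 10   # predictions with ~10 mm noise per axis

print(mpjpe(pred, gt))   # average joint error of the noisy predictions, in mm
print(mpjpe(gt, gt))     # 0.0 for a perfect prediction
```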
Table 2. Quantitative comparisons on the HumanEva-I dataset using 3D pose errors (in millimeters) for the “Walking”, “Jogging” and “Boxing” sequences. ‘-’ indicates that the corresponding method has not reported accuracy on that action. The entries with the smallest 3D pose errors for each category are bold-faced. Our RPSM outperforms all the compared state-of-the-art methods by a clear margin.
Fig. 3. Qualitative comparisons on the Human3.6M dataset. The 3D poses are visualized from the side view, and the camera is also depicted. The results of the two methods of Zhou et al., our RPSM, and the ground truth are illustrated from left to right, respectively. Our RPSM achieves much more accurate estimations than both methods of Zhou et al.
We have proposed a novel Recurrent 3D Pose Sequence Machine (RPSM) for estimating 3D human pose from a sequence of monocular images. Through the proposed unified architecture with 2D pose, feature adaption, and 3D pose recurrent modules, our RPSM can learn to recurrently integrate rich spatio-temporal long-range dependencies in an implicit and comprehensive way. We also proposed employing multiple sequential stages to refine the estimation results via 3D pose geometry information. Extensive evaluations on two public 3D human pose datasets validate the effectiveness and superior performance of our RPSM. In future work, we will extend the proposed framework to other sequence-based human-centric analysis tasks such as human action and activity recognition.
 L. Bo and C. Sminchisescu. Twin gaussian processes for structured prediction. IJCV, 87(1-2):28–52, 2010.
 Y. Du, Y. Wong, Y. Liu, F. Han, Y. Gui, Z. Wang, M. S. Kankanhalli, and W. Geng. Marker-less 3d human motion capture with monocular image sequence and height-maps. In ECCV, 2016.
 C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. PAMI, 36(7):1325–1339, 2014.
 I. Kostrikov and J. Gall. Depth sweep regression forests for estimating 3d human pose from images. In BMVC, 2014.
 S. Li, W. Zhang, and A. B. Chan. Maximum-margin structured learning with deep networks for 3d human pose estimation. In ICCV, 2015.
 I. Radwan, A. Dhall, and R. Goecke. Monocular image 3d human pose estimation under self-occlusion. In ICCV, 2013.
 M. Sanzari, V. Ntouskos, and F. Pirri. Bayesian image based 3d pose estimation. In ECCV, 2016.
 E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer. A joint model for 2D and 3D pose estimation from a single image. In CVPR, 2013.
 E. Simo-Serra, A. Ramisa, G. Alenyà, C. Torras, and F. Moreno-Noguer. Single image 3d human pose estimation from noisy observations. In CVPR, 2012.
 Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR, 2013.
 B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3d body poses from motion compensated sequences. In CVPR, 2016.
 J. Tompson, A. Jain, Y. Lecun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
 W. Yang, W. Ouyang, H. Li, and X. Wang. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In CVPR, 2016.
 X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei. Deep kinematic pose regression. arXiv preprint arXiv:1609.05317, 2016.
 X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3d human pose estimation from monocular video. In CVPR, 2016.