A paper is mainly divided into Abstract, Introduction, Related Work, Method, Evaluation, Discussion, and References.






1)Convolutional Pose Machines

Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. 


In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation. 


The contribution of this paper is to implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation. 


We achieve this by designing a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference.

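Then it states how this is done concretely: by designing a network architecture built from CNNs that operates on the belief maps from the previous stage, progressively producing refined body part locations, without the need for explicit graphical-model inference.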

Our approach addresses the characteristic difficulty of vanishing gradients during training by providing a natural learning objective function that enforces intermediate supervision, thereby replenishing back-propagated gradients and conditioning the learning
procedure. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks including the MPII, LSP, and FLIC datasets. 

Finally it states the problem the method is designed to address, and claims that the method is strong, with state-of-the-art performance.
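
The intermediate supervision described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a squared-error loss per stage (the abstract does not name the loss), and the names `l2_map_loss`, `intermediate_supervision_loss`, and the toy 2x2 belief maps are hypothetical. It only shows the idea that every stage is compared against the target, so every stage contributes its own error signal.

```python
from typing import List

Map = List[List[float]]  # one belief map as a 2-D grid of scores


def l2_map_loss(pred: Map, target: Map) -> float:
    """Sum of squared differences between two belief maps of equal shape."""
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )


def intermediate_supervision_loss(stage_preds: List[Map], target: Map) -> float:
    """Total objective: each stage's output is compared against the same
    target, so the loss is a sum over stages rather than the last stage only."""
    return sum(l2_map_loss(pred, target) for pred in stage_preds)


# Hypothetical toy data: three stages refining a 2x2 belief map.
target = [[0.0, 1.0], [0.0, 0.0]]
stage_preds = [
    [[0.5, 0.5], [0.0, 0.0]],    # stage 1: coarse estimate
    [[0.25, 0.75], [0.0, 0.0]],  # stage 2: closer to the target
    [[0.0, 1.0], [0.0, 0.0]],    # stage 3: matches the target
]
total = intermediate_supervision_loss(stage_preds, target)
```

Because the loss is a sum over stages, a gradient with respect to an early stage's output does not have to travel through all later stages, which is the intuition behind "replenishing back-propagated gradients".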







⑤ Finally, state the difficulty that the method addresses, and that the method achieves state-of-the-art results on the xxx dataset



In this work we show a systematic design for xxxx


The contribution of this paper is to xxx


We achieve this by xxxx


Our approach addresses the xxx problem by xxxxx. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks including the xxxx


2) End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation

Recently, Deep Convolutional Neural Networks (DCNNs) have been applied to the task of human pose estimation, and have shown its potential of learning better feature representations and capturing contextual relationships. 


However, it is difficult to incorporate domain prior knowledge such as geometric relationships among body parts into DCNNs. 


In addition, training DCNN-based body part detectors without consideration of global body joint consistency introduces ambiguities, which increases the complexity of training. 


In this paper, we propose a novel end-to-end framework for human pose estimation that combines DCNNs with the expressive deformable mixture of parts. 


We explicitly incorporate domain prior knowledge into the framework, which greatly regularizes the learning process and enables the flexibility of our framework for loopy models or tree-structured models. 


The effectiveness of jointly learning a DCNN with a deformable mixture of parts model is evaluated through intensive experiments on several widely used benchmarks. 

The proposed approach significantly improves the performance compared with state-of-the-art approaches, especially on benchmarks with challenging articulations. 

States the results of the method: strong on several benchmarks, and again state of the art.



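① Set the stage

② Introduce the problem to be solved and highlight its importance

③ Introduce your method in one sentence

④ Describe in one sentence how it is done

⑤ Your method is strong: state of the art on the xxx dataset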





In addition


In this paper, we propose a novel end-to-end framework for xxxx that xxxxxxxx


The effectiveness of xxxxx is evaluated through intensive experiments on xxxx benchmarks. The proposed approach significantly improves the performance compared with state-of-the-art approaches


3)Human Pose Estimation with Iterative Error Feedback

Hierarchical feature extractors such as Convolutional Networks (ConvNets) have achieved impressive performance on a variety of classification tasks using purely feedforward processing. 


Feedforward architectures can learn rich representations of the input space but do not explicitly model dependencies in the output spaces, that are quite structured for tasks such as articulated human pose estimation or object segmentation. 


Here we propose a framework that expands the expressive power of hierarchical feature extractors to encompass both input and output spaces, by introducing top-down feedback. 


Instead of directly predicting the outputs in one go, we use a self-correcting
model that progressively changes an initial solution by feeding back error predictions, in a process we call Iterative Error Feedback (IEF). 


IEF shows excellent performance on the task of articulated pose estimation in the challenging MPII and LSP benchmarks, matching the state-of-the-art without requiring ground truth scale annotation. 

States that the method is strong: IEF performs very well.
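
The self-correcting loop described in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: `iterative_error_feedback` and `toy_corrector` are hypothetical names, the pose is a flat list of coordinates, and the "model" is a hand-written stand-in that moves halfway toward a fixed point, where the paper would use a learned network that predicts a correction from the image and the current estimate.

```python
from typing import Callable, List

Pose = List[float]  # flat list of joint coordinates


def iterative_error_feedback(
    image: object,
    initial_pose: Pose,
    predict_correction: Callable[[object, Pose], Pose],
    steps: int,
) -> Pose:
    """Start from an initial solution and repeatedly add the predicted
    correction, instead of predicting the final output in one pass."""
    pose = list(initial_pose)
    for _ in range(steps):
        correction = predict_correction(image, pose)
        pose = [p + c for p, c in zip(pose, correction)]
    return pose


def toy_corrector(image: object, pose: Pose) -> Pose:
    """Hypothetical stand-in for a learned model: ignores the image and
    proposes a step halfway toward a fixed target pose."""
    target = [8.0, 4.0]
    return [0.5 * (t - p) for t, p in zip(target, pose)]


refined = iterative_error_feedback(None, [0.0, 0.0], toy_corrector, steps=3)
```

With this toy corrector the estimate moves from `[0.0, 0.0]` to `[4.0, 2.0]`, then `[6.0, 3.0]`, then `[7.0, 3.5]`, which shows the progressive refinement the abstract refers to.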



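① Set the stage in one sentence

② Introduce the problem to be solved

③ Propose the method

④ Highlight how the method differs

⑤ Highlight that the method is strong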



xxxx is good but does not xxxxx


Here we propose a framework that xxxx, by xxxxx 


Instead of xxx, we use xxxx 


Our method shows excellent performance on the task of xxx in the challenging xxx and xxx benchmarks


4)Personalizing Human Video Pose Estimation

We propose a personalized ConvNet pose estimator that automatically adapts itself to the uniqueness of a person’s appearance to improve pose estimation in long videos. 


We make the following contributions: 

(i) we show that given a few high-precision pose annotations, e.g. from a generic ConvNet pose estimator, additional annotations can be generated throughout the video using a combination of image-based matching for temporally distant frames, and dense optical
flow for temporally local frames; 

(ii) we develop an occlusion aware self-evaluation model that is able to automatically select the high-quality and reject the erroneous additional annotations; 

(iii) we demonstrate that these high-quality annotations can be used to fine-tune a ConvNet pose estimator and thereby personalize it to lock on to key discriminative features of the person’s appearance. 


The outcome is a substantial improvement in the pose estimates for the target video using the personalized ConvNet compared to the original generic ConvNet. 


Our method outperforms the state of the art (including top ConvNet methods) by a large margin on two standard benchmarks, as well as on a new challenging YouTube video dataset. Furthermore, we show that training from the automatically generated annotations
can be used to improve the performance of a generic ConvNet on other benchmarks. 




① Propose the method

② State the contributions, listing them one by one

③ Emphasize that the method is strong


We propose a xxx that xxxx


We make the following contributions: 

(i) we show that xxx 

(ii) we develop a xxx 

(iii) we demonstrate that xxx


Our method outperforms the state of the art by a large margin on two standard benchmarks, as well as on a new challenging YouTube video dataset


5)Structured Feature Learning for Pose Estimation

In this paper, we propose a structured feature learning framework to reason the correlations among body joints at the feature level in human pose estimation. 


Different from existing approaches of modeling structures on score maps or predicted labels, feature maps preserve substantially richer descriptions of body joints. 

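Emphasizes how the method differs: existing methods model structure on score maps or predicted labels, whereas feature maps preserve substantially richer descriptions of the joints.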

The relationships between feature maps of joints are captured with the introduced geometrical transform kernels, which can be easily implemented with a convolution layer. 

Features and their relationships are jointly learned in an end-to-end learning system. A bi-directional tree structured model is proposed, so that the feature channels at a body joint can well receive information from other joints. 


The proposed framework improves feature learning substantially. With very simple post processing, it reaches the best mean PCP on the LSP and FLIC datasets. Compared with the baseline of learning features at each joint separately with ConvNet, the mean PCP
has been improved by 18% on FLIC. The code is released to the public. 




① Propose the method

② Emphasize how the method differs


④ The method is strong



In this paper, we propose a xxx learning framework to xxx 


Different from existing approaches of xxx, sssss can xxxxx 


The proposed framework improves xxxxx substantially. With xxxxx, it reaches the best mean PCP on the xxx and sss datasets. 

Compared with the baseline of xxx the mean PCP has been improved by 18% on FLIC.


6)Chained Predictions Using Convolutional Neural Networks

In this paper, we present an adaptation of the sequence-to-sequence model for structured output prediction in vision tasks. 


In this model the output variables for a given input are predicted sequentially using neural networks. The prediction for each output variable depends not only on the input but also on the previously predicted output variables. The model is applied to spatial
localization tasks and uses convolutional neural networks (CNNs) for processing input images and a multi-scale deconvolutional architecture for making spatial predictions at each time step. We explore the impact of weight sharing with a recurrent connection
matrix between consecutive predictions, and compare it to a formulation where these weights are not tied. Untied weights are particularly suited for problems with a fixed sized structure, where different classes of output are predicted in different steps. 


We show that chained predictions achieve top performing results on human pose estimation from single images and videos. 
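
The chaining idea in the abstract above, where each output depends on the input and on the outputs predicted before it, can be sketched as follows. This is a toy illustration, not the paper's architecture: `chained_predict` is a hypothetical helper, the input is a single number instead of an image, and the per-step models are plain functions standing in for the untied (not weight-shared) networks the abstract mentions.

```python
from typing import Callable, List, Sequence

StepModel = Callable[[float, List[float]], float]


def chained_predict(x: float, step_models: Sequence[StepModel]) -> List[float]:
    """Predict output variables one at a time; each step sees the input
    and every output predicted in earlier steps."""
    outputs: List[float] = []
    for model in step_models:
        outputs.append(model(x, outputs))
    return outputs


# Hypothetical untied models: a different function for each step.
models: List[StepModel] = [
    lambda x, prev: x + 1.0,            # step 1 uses only the input
    lambda x, prev: prev[0] * 2.0,      # step 2 uses the first output
    lambda x, prev: prev[0] + prev[1],  # step 3 uses both earlier outputs
]
chain = chained_predict(1.0, models)
```

Using one function per step mirrors the untied-weights formulation, which the abstract says suits problems with a fixed-size structure; a weight-shared variant would reuse a single model across all steps.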




① Propose the method

② Describe how the method works

③ The method is strong



In this paper, we present an adaptation of xxx model for xxx in vision tasks.


We show that our method achieves top performing results on human pose estimation from single images and videos.


7)Stacked Hourglass Networks for Human Pose Estimation

This work introduces a novel convolutional network architecture for the task of human pose estimation. 


Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. 

We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling
and upsampling that are done to produce a final set of predictions. 


State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods. 




① Propose the method

② Describe how the method works

③ The method is strong


8)Multi-Person Pose Estimation with Local Joint-to-Person Associations

Despite of the recent success of neural networks for human pose estimation, current approaches are limited to pose estimation of a single person and cannot handle humans in groups or crowds. 


In this work, we propose a method that estimates the poses of multiple persons in an image in which a person can be occluded by another person or might be truncated. 


To this end, we consider multi-person pose estimation as a joint-to-person association problem. 

We construct a fully connected graph from a set of detected joint candidates in an image and resolve the joint-to-person association and outlier detection using integer linear programming. Since solving joint-to-person association jointly for all persons in
an image is an NP-hard problem and even approximations are expensive, we solve the problem locally for each person. 


On the challenging MPII Human Pose Dataset for multiple persons, our approach achieves the accuracy of a state-of-the-art method, but it is 6,000 to 19,000 times faster. 




① Introduce the problem to be solved

② Propose the method

③ Describe the method in detail

④ Highlight that the method is strong



Despite the recent success of xxx, current approaches are limited to xxx. 


On the challenging xxx Dataset for sss, our approach achieves the accuracy of a state-of-the-art method





However, it is difficult to xxxx 

xxxx is good but does not xxxxx 

Despite the recent success of xxx, current approaches are limited to xxx.


In addition, training DCNN-based body part detectors without consideration of global body joint consistency introduces ambiguities, which increases the complexity of training.


The contribution of this paper is to xxx 

In this work we show a systematic design for xxxx 

In this paper, we propose a novel end-to-end framework for xxxx that xxxxxxxx 

In this paper, we propose a xxx framework to xxx 

In this paper, we present an adaptation of xxx model for xxx in vision tasks. 

Here we propose a framework that xxxx, by xxxxx 

We propose a xxxx that can do xxxx 

We achieve this by xxxx

We make the following contributions: 

(i) we show that xxx 

(ii) we develop a xxx 

(iii) we demonstrate that xxx


Instead of xxx, we use xxxx 

Different from existing approaches of modeling ssss, xxx can aaaaaa


Our approach addresses the xxx problem by xxxxx. We demonstrate state-of-the-art performance and outperform competing methods on standard benchmarks including the xxxx 

Our method shows excellent performance on the task of xxx in the challenging xxx and xxx benchmarks 

Our method outperforms the state of the art by a large margin on two standard benchmarks, as well as on a new challenging YouTube video dataset 

The proposed framework improves xxxxx substantially. With xxxxx, it reaches the best mean PCP on the xxx and sss datasets. Compared with the baseline of xxx the mean PCP has been improved by 18% on FLIC. 

The effectiveness of xxxxx is evaluated through intensive experiments on xxxx benchmarks. The proposed approach significantly improves the performance compared with state-of-the-art approaches 

We show that our method achieves top performing results on human pose estimation from single images and videos. 

State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods. 

On the challenging xxx Dataset for sss, our approach achieves the accuracy of a state-of-the-art method

