READING NOTE: A Pursuit of Temporal Accuracy in General Activity Detection
2017-03-15 20:13
477 查看
TITLE: A Pursuit of Temporal Accuracy in General Activity Detection
AUTHOR: Yuanjun Xiong, Yue Zhao, Limin Wang, Dahua Lin, Xiaoou Tang
ASSOCIATION: The Chinese University of Hong Kong, ETH
FROM: arXiv:1703.02716
A cascaded classification pipeline is introduced that explicitly distinguishes between relevance and completeness of a candidate instance.
To evaluate the actionness, a binary classifier is learnt based on the Temporal Segment Network proposed in Temporal segment networks: Towards good practices for deep action recognition.
To generate temporal region proposals, the basic idea is to group consecutive snippets with high actionness scores. The scheme first obtains a number of action fragments by thresholding – a fragment here is a consecutive sub-sequence of snippets whose actionness scores are above a certain threshold, referred to as actionness threshold.
Then, to generate a region proposal, a fragment is picked as a starting point and expanded recursively by absorbing succeeding fragments. The expansion terminates when the portion of low-actionness snippets goes beyond a threshold, a positive value which is referred to as the tolerance threshold. Beginning with different fragments, we can obtain a collection of different region proposals.
Note that this scheme is controlled by two design parameters: the actionness threshold and the tolerance threshold. The final proposal set is the union of those derived from individual combination of the two values. This scheme is called Temporal Actionness Grouping, illustrated in the above figure, which has several advantages:
Thanks to the actionness classifier, the generated proposals are mostly focused on action-related contents, which greatly reduce the number of needed proposals.
Action fragments are sensitive to temporal transitions. Hence, as a bottom-up method that relies on merging action fragments, it often yields proposals with more accurate temporal boundaries.
With the multi-threshold design, it can cover a broad range of actions without the need of case-specific parameter tuning. With these properties, the proposed method can achieve high recall with just a moderate number of proposals. This also benefits the training of the classifiers in the next stage.
Activity Classification
A classifier is trained based on TSN. During training, region proposals that overlap with a ground-truth instance with an IOU above 0.7 will be used as positive samples. A proposal is considered as a negative sample only when less than 5% of its time span overlaps with any annotated instances. Only the proposals classified as non-background classes will be retained for completeness filtering. The probability from the activity classifier is denoted as Pa.
Completeness Filtering
To evaluate the completeness, a simple feature representation is extracted and used to train class-specific SVMs. The feature comprises three parts: (1) A temporal pyramid of two levels. The first level pools the snippet scores within the proposed region. The second level split the segment into two parts and pool the snippet scores inside each part. (2) The average classification scores of two short periods – the ones before and after the proposed region. The method is illustrated in the following figure.
The output of the SVMs for one class is denoted as Sc.
Then final detection confidence for each proposal is
SDet=Pa×Sc
AUTHOR: Yuanjun Xiong, Yue Zhao, Limin Wang, Dahua Lin, Xiaoou Tang
ASSOCIATION: The Chinese University of Hong Kong, ETH
FROM: arXiv:1703.02716
CONTRIBUTIONS
A novel proposal scheme is proposed that can efficiently generate candidates with accurate temporal boundaries.A cascaded classification pipeline is introduced that explicitly distinguishes between relevance and completeness of a candidate instance.
METHOD
The proposed action detection framework starts with evaluating the actionness of the snippets of the video. A set of temporal action proposals (in orange color) are generated with temporal actionness grouping (TAG). The proposals are evaluated against the cascaded classifiers to verify their relevance and completeness. Only proposals being complete instances are produced by the framework. Non-complete proposals and background proposals are rejected by a cascaded classification pipeline. The framework is illustrated in the following figure.Temporal Region proposals
The temporal region proposals are generated with a bottom-up procedure, which consists of three steps: extract snippets, evaluate snippet-wise actionness, and finally group them into region proposals.To evaluate the actionness, a binary classifier is learnt based on the Temporal Segment Network proposed in Temporal segment networks: Towards good practices for deep action recognition.
To generate temporal region proposals, the basic idea is to group consecutive snippets with high actionness scores. The scheme first obtains a number of action fragments by thresholding – a fragment here is a consecutive sub-sequence of snippets whose actionness scores are above a certain threshold, referred to as actionness threshold.
Then, to generate a region proposal, a fragment is picked as a starting point and expanded recursively by absorbing succeeding fragments. The expansion terminates when the portion of low-actionness snippets goes beyond a threshold, a positive value which is referred to as the tolerance threshold. Beginning with different fragments, we can obtain a collection of different region proposals.
Note that this scheme is controlled by two design parameters: the actionness threshold and the tolerance threshold. The final proposal set is the union of those derived from individual combination of the two values. This scheme is called Temporal Actionness Grouping, illustrated in the above figure, which has several advantages:
Thanks to the actionness classifier, the generated proposals are mostly focused on action-related contents, which greatly reduce the number of needed proposals.
Action fragments are sensitive to temporal transitions. Hence, as a bottom-up method that relies on merging action fragments, it often yields proposals with more accurate temporal boundaries.
With the multi-threshold design, it can cover a broad range of actions without the need of case-specific parameter tuning. With these properties, the proposed method can achieve high recall with just a moderate number of proposals. This also benefits the training of the classifiers in the next stage.
Detecting Action Instances
this is accomplished by a cascaded pipeline with two steps: activity classification and completeness filtering.Activity Classification
A classifier is trained based on TSN. During training, region proposals that overlap with a ground-truth instance with an IOU above 0.7 will be used as positive samples. A proposal is considered as a negative sample only when less than 5% of its time span overlaps with any annotated instances. Only the proposals classified as non-background classes will be retained for completeness filtering. The probability from the activity classifier is denoted as Pa.
Completeness Filtering
To evaluate the completeness, a simple feature representation is extracted and used to train class-specific SVMs. The feature comprises three parts: (1) A temporal pyramid of two levels. The first level pools the snippet scores within the proposed region. The second level split the segment into two parts and pool the snippet scores inside each part. (2) The average classification scores of two short periods – the ones before and after the proposed region. The method is illustrated in the following figure.
The output of the SVMs for one class is denoted as Sc.
Then final detection confidence for each proposal is
SDet=Pa×Sc
相关文章推荐
- READING NOTE: Pushing the Limits of Deep CNNs for Pedestrian Detection
- 【CV论文阅读】+【搬运工】LocNet: Improving Localization Accuracy for Object Detection + A Theoretical analysis of feature pooling in Visual Recognition
- Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
- READING NOTE: Face Detection with End-to-End Integration of a ConvNet and a 3D Model
- General issues in using variables -- Code complete reading notes(10)
- End-to-end learning of action detection from frame glimpses in videos 阅读笔记
- READING NOTE: Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Trackin
- 37 of the General Questions in Interview
- 显著性检测(二)Spatio-temporal Saliency Detection Using Phase Spectrum of Quaternion Fourier Transform
- P.Laguna/AUTOMATIC DETECTION OF WAVE BOUNDARIES IN MULTILEAD ECG SIGNALS VALIDATION WITH THE CSE DB
- General Object Initiating Function by the Example of $injector.instantiate Implementation in angular
- Programming In Scala Reading Note 8
- Reading Note: Progressive Growing of GANs for Improved Quality, Stability, and Variation
- 论文阅读: End-to-end Learning of Action Detection from Frame Glimpses in Videos
- READING NOTE: Do semantic parts emerge in Convolutional Neural Networks?
- Programming In Scala Reading Note 1
- Programming In Scala Reading Note 2
- Note of big data dummies:Understanding the role of relational databases in big data
- Think in dataset under evaluation of TensorFlow object detection API
- READING NOTE: Feature Pyramid Networks for Object Detection