Two-Stream SR-CNNs for Action Recognition in Videos
2017-06-01 15:36
paper:http://www.bmva.org/bmvc/2016/papers/paper108/index.html
code:https://github.com/yifita/action.sr_cnn
third author's homepage:http://wanglimin.github.io/
![](https://img-blog.csdn.net/20170601135924974?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvYm9qYWNraG9zcmVtYW4=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
dataset: UCF101, JHMDB (split 1)
accuracy: 92.6 (UCF101), 53.77 (JHMDB split 1)
framework
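The input is still two-stream, but both the RGB and the optical-flow inputs are passed through Faster R-CNN. The detected regions are divided into three groups (scene, person, objects), and each group is fed into the network for training.
The inputs are first passed through standard convolutional and pooling layers. We replace the last pooling layer with a RoiPooling [2] layer, which separates features for different semantic cues into parallel fully connected layers (called channels) using bounding boxes proposed from a Faster R-CNN [18] object detector (see subsection 3.2).
Each channel produces its own score. Because a frame can contain several objects, the authors use MIL (Multiple Instance Learning) to pick out the most useful information. Finally, all scores pass through a fusion layer that gives the final prediction.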
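The notes above do not say which operator the MIL layer uses. A minimal sketch, assuming the common choice of a per-class max over the object boxes (this choice and the name `mil_max` are assumptions for illustration, not taken from the paper), and assuming at least one object box is present:

```python
def mil_max(instance_scores):
    """Collapse per-box scores into one score vector for the object channel.

    instance_scores[i][l] is the score of class l predicted from object box i.
    Assumption: MIL is realized as a per-class max over boxes.
    """
    # zip(*rows) iterates over classes; max picks the most confident box per class.
    return [max(per_class) for per_class in zip(*instance_scores)]


# Three object boxes, two classes.
object_scores = [[0.1, 0.7], [0.4, 0.2], [0.3, 0.3]]
channel_score = mil_max(object_scores)
```

With a max, each class is scored by its single most informative box, so cluttered frames with many irrelevant objects do not dilute the channel score.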
Fusion
The authors propose four fusion strategies:
Max fusion takes the maximum score value among all channels for each class, essentially picking the strongest channel.
Sum fusion directly adds up the scores from different channels, i.e. each channel is treated equally.
Category-wise weighted fusion (Weighted-1) combines channel scores via a weighted sum, aiming to represent the varied relative contribution of each channel for different classes using their corresponding weights.
As for correlation-wise weighted fusion (Weighted-2), the scores of other classes are also taken into consideration, implicitly encoding the correlation information between classes.
Given L classes and C channels, the numbers of weights for Weighted Sum-1 and Weighted Sum-2 are L×C and L×L×C, respectively. All weights are trained together with the main parameters of the networks through back-propagation.
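The four strategies can be written down in a few lines. This is a minimal pure-Python sketch based only on the descriptions above; the function names, the `scores[c][l]` layout, and the weight indexing are assumptions chosen so that the weight counts come out as L×C and L×L×C.

```python
def max_fusion(scores):
    # scores[c][l]: score of class l from channel c; take the strongest channel per class.
    return [max(per_class) for per_class in zip(*scores)]


def sum_fusion(scores):
    # Every channel contributes equally.
    return [sum(per_class) for per_class in zip(*scores)]


def weighted1_fusion(scores, w):
    # w[c][l]: one weight per (channel, class) pair, i.e. L*C weights.
    C, L = len(scores), len(scores[0])
    return [sum(w[c][l] * scores[c][l] for c in range(C)) for l in range(L)]


def weighted2_fusion(scores, w):
    # w[c][l][k]: contribution of channel c's class-k score to output class l,
    # i.e. L*L*C weights, so scores of other classes also influence class l.
    C, L = len(scores), len(scores[0])
    return [
        sum(w[c][l][k] * scores[c][k] for c in range(C) for k in range(L))
        for l in range(L)
    ]


# Two channels (C=2), three classes (L=3).
scores = [[1.0, 2.0, 3.0], [4.0, 0.0, 1.0]]
ones = [[1.0] * 3 for _ in range(2)]
identity = [[[1.0 if l == k else 0.0 for k in range(3)] for l in range(3)] for _ in range(2)]
```

With all-ones weights Weighted-1 reduces to sum fusion, and with identity matrices Weighted-2 does as well, which shows that each strategy generalizes the previous one. In the paper the weights are learned by back-propagation rather than set by hand.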
Semantic channels
Detector. We extend the original Faster R-CNN model [18] from 20 object categories to 118 categories (listed in supplement) selected from ILSVRC2014 [19] (200 categories) and VOC 2007+2012 [1] detection challenge (20 categories), excluding categories such as small objects, food and most of animals. The complete training data is comprised of 196,780 images.
Object detection. Object detection in video datasets is challenging due to low resolution and motion blur. The authors therefore discard any detection that meets one of the following conditions:
prediction confidence below a threshold (0.1)
side length below the set value of 20 pixels
no overlap with the person region
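The three filtering rules above can be sketched as a single predicate. This is a rough illustration, not the authors' code: boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels, "length" is read as the shorter side of the box, and "overlap" is read as any positive intersection area with a person box. The names `intersection_area` and `keep_object` are hypothetical.

```python
def intersection_area(a, b):
    # Boxes are (x1, y1, x2, y2); returns 0 when the boxes do not intersect.
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def keep_object(box, score, person_boxes, min_score=0.1, min_side=20):
    """Return True if an object detection survives all three filters."""
    x1, y1, x2, y2 = box
    if score < min_score:                      # low-confidence prediction
        return False
    if min(x2 - x1, y2 - y1) < min_side:       # box too small (assumed: shorter side)
        return False
    # Keep only objects that overlap with at least one person box.
    return any(intersection_area(box, p) > 0 for p in person_boxes)


persons = [(50, 50, 150, 250)]
```

For example, `keep_object((100, 100, 160, 160), 0.5, persons)` keeps the box, while the same box with a score of 0.05, or a box far away from the person, is dropped.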
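Person detection. The person detections likewise need post-processing:
filter out incorrect and irrelevant detections
recover detections that are missing in individual frames
refine the bounding box positions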
Implementation details
basic network: VGG16
follow the setting of two-stream()
input size: 256x340
spatial stream: RGB
temporal stream: 10-frame stacking of optical flow fields
data augmentation: corner cropping scheme
training
dropout rate: 0.8 (for both fc layers)
spatial: 10,000 iterations
lr: 1×10⁻³ (decreased every 4,000 iterations)
temporal: 19,000 iterations
lr: 5×10⁻³ (reduced every 8,000 iterations)
batch size: 256 for both
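The two learning-rate schedules above are step schedules. A small sketch of how the rate evolves, assuming the rate is multiplied by 0.1 at each step (the decay factor is not given in these notes, so `gamma=0.1` is an assumption, and `step_lr` is a hypothetical helper):

```python
def step_lr(iteration, base_lr, step_size, gamma=0.1):
    """Learning rate after `iteration` steps of a step-decay schedule.

    gamma is an assumed decay factor; the notes only give the step interval.
    """
    return base_lr * gamma ** (iteration // step_size)


# Spatial stream: base lr 1e-3, decayed every 4,000 of 10,000 iterations.
spatial = [step_lr(i, 1e-3, 4000) for i in (0, 4000, 8000)]
# Temporal stream: base lr 5e-3, decayed every 8,000 of 19,000 iterations.
temporal = [step_lr(i, 5e-3, 8000) for i in (0, 8000, 16000)]
```

Under this assumption each stream sees two decays before training ends.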
testing
For testing, we follow the same routine as [20], which selects 10 samples from 5 crops and 2 flips for each frame. The final classification result for one video is given by averaging the classification scores of 25 evenly sampled frames with all their valid crops.
evaluation
See the original paper.