
Two-Stream SR-CNNs for Action Recognition in Videos

2017-06-01 15:36
paper:http://www.bmva.org/bmvc/2016/papers/paper108/index.html

code:https://github.com/yifita/action.sr_cnn

Third author's homepage: http://wanglimin.github.io/


dataset: UCF101, JHMDB (split 1)

accuracy: 92.6% (UCF101), 53.77% (JHMDB split 1)

framework

The input is still two-stream, but both the RGB frames and the optical flow are passed through Faster R-CNN; the resulting regions are grouped into three semantic cues (scene, person, object), each of which is fed into the network for training.



The inputs are first passed through standard convolutional and pooling layers. We replace the last pooling layer with a RoI pooling [2] layer, which separates features for different semantic cues into parallel fully connected layers (called channels), using bounding boxes proposed by a Faster R-CNN [18] object detector (see subsection 3.2).
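A minimal PyTorch sketch of this idea (the released code is Caffe/MATLAB based, so class names, dimensions and details here are illustrative assumptions, not the authors' implementation): shared conv features are RoI-pooled with the detector's boxes and routed into one fully connected "channel" per semantic cue.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class SemanticChannel(nn.Module):
    """One fully connected branch (scene / person / object) on top of shared conv features."""
    def __init__(self, num_classes, in_dim=512 * 7 * 7):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 4096), nn.ReLU(), nn.Dropout(0.8),
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.8),
            nn.Linear(4096, num_classes),
        )

    def forward(self, conv_feat, rois):
        # conv_feat: (N, 512, H/16, W/16) VGG16 conv5 feature maps
        # rois: (K, 5) boxes as [batch_idx, x1, y1, x2, y2] from the Faster R-CNN detector
        pooled = roi_pool(conv_feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
        return self.fc(pooled.flatten(1))  # (K, num_classes) per-box class scores
```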

Each channel produces its own scores. Since a frame may contain several objects, the authors adopt MIL (Multiple Instance Learning) to combine the most informative ones. Finally, all channel scores pass through a fusion layer to obtain the final prediction.
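A common MIL instantiation, shown here purely for illustration, is per-class max pooling over the object channel's per-box scores, so only the strongest instance contributes:

```python
import torch

def mil_pool(object_scores: torch.Tensor) -> torch.Tensor:
    # object_scores: (num_object_boxes, num_classes) scores from the object channel
    return object_scores.max(dim=0).values  # (num_classes,) strongest instance per class
```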

Fusion

The authors propose four fusion strategies:

Max fusing takes the maximum score value among all channels for each class, essentially picking the strongest channel.

Sum fusion directly adds up the scores from different channels, i.e. each channel is treated equally.

Category-wise weighted fusion (Weighted-1) combines channel scores via weighted sum, aiming to represent varied relative contribution of each channel for different classes using their corresponding weights.

As for correlation-wise weighted fusion (Weighted-2), the scores of other classes are also taken into consideration, implicitly encoding the correlation information between classes.

Given L classes and C channels, the numbers of weights for Weighted-1 and Weighted-2 are L×C and L×L×C respectively. All weights are trained together with the main network parameters through back-propagation.
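A sketch of the four fusion variants, assuming the channel scores are stacked into a tensor of shape (C, L); the module structure and weight initialisation are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn

class Fusion(nn.Module):
    def __init__(self, num_channels, num_classes, mode="weighted2"):
        super().__init__()
        self.mode = mode
        if mode == "weighted1":
            # L x C weights: one weight per (channel, class) pair
            self.w = nn.Parameter(torch.ones(num_channels, num_classes))
        elif mode == "weighted2":
            # L x L x C weights: every class score of every channel can feed every class
            self.w = nn.Parameter(torch.eye(num_classes).repeat(num_channels, 1, 1))

    def forward(self, scores):  # scores: (C, L) per-channel class scores
        if self.mode == "max":
            return scores.max(dim=0).values          # strongest channel per class
        if self.mode == "sum":
            return scores.sum(dim=0)                 # all channels treated equally
        if self.mode == "weighted1":
            return (self.w * scores).sum(dim=0)      # category-wise weighted sum
        # weighted2: correlation-wise weighted sum over channels and classes
        return torch.einsum("cl,clk->k", scores, self.w)
```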

Semantic channels

Detector. We extend the original Faster R-CNN model [18] from 20 object categories to 118 categories (listed in the supplement), selected from ILSVRC2014 [19] (200 categories) and the VOC 2007+2012 [1] detection challenges (20 categories), excluding categories such as small objects, food and most animals. The complete training set comprises 196,780 images.

Object detection. Object detection in video datasets is challenging due to low resolution and motion blur, so the authors define several rules to filter out detections (a sketch follows the list):

predictions whose confidence is below a threshold (0.1)

boxes whose length is below a preset value of 20 pixels

detections that have no overlap with a detected person
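A minimal sketch of these filtering rules: the thresholds follow the text, while the box format, the "shorter side" reading of the 20-pixel rule, and the IoU helper are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def keep_object(box, score, person_boxes, conf_thresh=0.1, min_side=20):
    """Return True if an object detection survives the three filters above."""
    x1, y1, x2, y2 = box
    if score < conf_thresh:                      # low-confidence prediction
        return False
    if min(x2 - x1, y2 - y1) < min_side:         # box shorter than 20 px (assumed: shorter side)
        return False
    # discard objects that do not overlap any detected person
    return any(iou(box, p) > 0 for p in person_boxes)
```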

Person detection. This likewise requires (a sketch of the recovery step follows the list):

filtering out incorrect and irrelevant detections

recovering detections missing in isolated frames

refining bounding box locations
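The text only lists these steps. As an illustration of the recovery step, one plausible approach (not necessarily the authors') is to linearly interpolate the missing person box between the nearest detected frames:

```python
import numpy as np

def interpolate_box(prev_box, next_box, prev_t, next_t, t):
    """Linearly interpolate an (x1, y1, x2, y2) box for frame t between two detections."""
    alpha = (t - prev_t) / float(next_t - prev_t)
    return tuple((1 - alpha) * np.asarray(prev_box, float) + alpha * np.asarray(next_box, float))
```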

Implementation details

basic network: VGG16

follow the settings of the two-stream network

input size : 256x340

spatial stream: RGB

temporal stream: 10-frame stacking of optical flow fields

data augmentation: corner cropping scheme

training

dropout rate : 0.8 (for both fc layers)

spatial: 10,000 iterations

lr: 1×10^-3 (decreased every 4,000 iterations)

temporal: 19,000 iterations

lr: 5×10^-3 (reduced every 8,000 iterations)

batch size : 256 for both
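For reference, the hyper-parameters above can be collected into one small config; the step-decay factor (0.1) is an assumption added only to make the schedule concrete, since the text just says the learning rate is decreased at the stated intervals.

```python
# Training settings from the text; gamma (decay factor) is an assumed value.
TRAIN_CFG = {
    "spatial":  {"iterations": 10000, "base_lr": 1e-3, "lr_step": 4000},
    "temporal": {"iterations": 19000, "base_lr": 5e-3, "lr_step": 8000},
    "batch_size": 256,
    "dropout": 0.8,
}

def learning_rate(stream, it, gamma=0.1):
    """Step-decay learning rate for the given stream at iteration it."""
    cfg = TRAIN_CFG[stream]
    return cfg["base_lr"] * gamma ** (it // cfg["lr_step"])
```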

testing

For testing, we follow the same routine as [20], which selects 10 samples (5 crops and 2 flips) from each frame. The final classification result for one video is given by averaging the classification scores of 25 evenly sampled frames over all their valid crops.
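A sketch of this test-time protocol, assuming 224×224 crops and a caller-provided score_fn that runs the network on a single crop; the crop helper is an illustrative assumption.

```python
import numpy as np

def five_crops(img, size=224):
    """4 corner crops + center crop of an H x W x C frame (assumed crop size 224)."""
    h, w = img.shape[:2]
    tl = img[:size, :size]; tr = img[:size, w - size:]
    bl = img[h - size:, :size]; br = img[h - size:, w - size:]
    cy, cx = (h - size) // 2, (w - size) // 2
    return [tl, tr, bl, br, img[cy:cy + size, cx:cx + size]]

def video_score(frames, score_fn, num_samples=25):
    """Average class scores over 25 evenly sampled frames x (5 crops x 2 flips) each."""
    idx = np.linspace(0, len(frames) - 1, num_samples).astype(int)
    scores = []
    for i in idx:
        for crop in five_crops(frames[i]):
            scores.append(score_fn(crop))            # original crop
            scores.append(score_fn(crop[:, ::-1]))   # horizontal flip
    return np.mean(scores, axis=0)
```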

evaluation

See the original paper.