[Casual post] Brief reading notes on Mask R-CNN

2017-11-28 19:31

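Many papers borrow heavily from, or closely resemble, ideas used in other vision tasks, so it helps to look at the related work first. The idea behind Mask R-CNN [1] is simple: perform object segmentation (the mask) at the same time as object detection. Its strength is that it handles several different tasks, such as instance segmentation, object detection, and even keypoint detection.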

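I start by quoting another article's review of earlier instance segmentation work [2,3]. A conventional FCN cannot be applied directly to instance segmentation, mainly because convolution is translation invariant: the same image pixel produces the same response wherever it sits, so the network is insensitive to position. Instance segmentation, however, operates on regions, and the same pixel can carry different semantics in different regions. Some degree of position sensitivity is therefore needed.

The common approach to instance segmentation uses separate sub-networks for detection and segmentation, in three stages: 1) the original image is fed to an FCN to obtain shared feature maps; 2) ROI pooling produces fixed-size feature maps for each ROI; 3) fully connected layers produce a mask for each ROI (fully connected layers introduce a translation-variant property, which is one reason for using them). This approach has several drawbacks: 1) ROI pooling loses spatial information when it transforms the features, which lowers segmentation accuracy, especially for large objects; 2) the fully connected layers introduce too many parameters; 3) the segmentation sub-network has to run once for every ROI, which is computationally inefficient. If I understand correctly, Mask R-CNN seems to solve only the first of these. The second is easy to solve, although possibly at some cost in accuracy.

Apart from the overall idea, the authors emphasize two points.

1 Mask representation

In object segmentation a mask is usually represented as 1*C*H*W (Caffe's N*C*H*W layout), where H*W matches the size of the original image and every location has C channels, one per object class. Mask R-CNN differs in two ways. First, the mask does not cover the whole image: the RPN first produces ROIs and segmentation is done inside each ROI crop, so there are as many masks as there are ROIs. Take one ROI as an example and assume its width and height are Wr and Hr. Second, the mask for this ROI is not a single map as described above but C maps, each of which is a binary image for one object class, so the mask has dimensions C*1*Hr*Wr.

The overall logic is as follows. The Faster R-CNN branch regresses a box and determines which object the ROI contains, for example a horse. At segmentation time the binary image for "horse" is selected, and every pixel that the mask branch marks as 1 in that image is a horse pixel. The authors argue that this decouples classification from segmentation, because segmentation no longer has to decide which class each pixel belongs to. (Personally I think the decoupling is at least not complete. Segmentation still involves a classification task; otherwise, what would stop the binary image for "horse" from responding to pixels that belong to a person?)

2 RoIAlign

Training a CNN requires every feature map to have the same dimensions (except for an FCN with a batch size of 1). In the setting above, each ROI is a box, and because objects vary in size, so do the boxes. How do we keep the feature dimensions consistent? Fast R-CNN, and Faster R-CNN after it, already use a scheme called ROI pooling: however large the box is, it can always be split, in proportion to its own size, into 4 or some other number of equal bins, and pooling over each bin then yields a feature of a fixed dimension. The C++ code for this operation is attached further down.

Before pooling, however, we need the feature-map value for each point of the ROI. With VGG, for example, an original image of size W*H gives a feature map (conv5_3) of size W/16 * H/16, and because VGG uses 3*3 kernels with a pad of 1 throughout, the coordinates are not shifted. For a point (x, y) in the ROI, we can simply take the feature-map value at (round(x/16), round(y/16)) as its feature. The problem is that x/16 is generally not an integer, so 16 consecutive pixels along each axis of the original image map to the same feature-map point. The resulting ROI feature is effectively a blurred image, which is bad for a pixel-level segmentation task. The authors therefore use bilinear interpolation. If I understand correctly, the corresponding point is taken directly as the floating-point (x/16, y/16), which falls between four pixels of the feature map, so its value can be interpolated from them.

3 Mask resolution

There is one more point that the authors do not emphasize but that may deserve attention: the mask and detection feature maps have different resolutions, because the mask branch includes an upsampling step.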
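To make the mask representation from section 1 concrete, here is a minimal C++ sketch of picking the binary mask of the predicted class for one ROI. The function name `select_class_mask`, the flat memory layout (C contiguous planes of Hr*Wr probabilities), and the 0.5 threshold are assumptions of this sketch, not something taken from the paper's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: `masks` holds the mask-branch output for ONE ROI as
// num_classes contiguous planes of height*width probabilities (assumed layout).
// Returns the binary mask of the class chosen by the detection branch.
std::vector<unsigned char> select_class_mask(const std::vector<float>& masks,
                                             int num_classes, int height,
                                             int width, int predicted_class,
                                             float threshold = 0.5f) {
  assert(num_classes > 0 && height > 0 && width > 0);
  assert(predicted_class >= 0 && predicted_class < num_classes);
  const std::size_t plane = static_cast<std::size_t>(height) * width;
  assert(masks.size() == plane * static_cast<std::size_t>(num_classes));

  // Only the plane of the predicted class is read. The other planes are
  // ignored, so pixels never compete across classes.
  const float* src = masks.data() + plane * predicted_class;
  std::vector<unsigned char> binary(plane);
  for (std::size_t i = 0; i < plane; ++i) {
    binary[i] = src[i] >= threshold ? 1 : 0;  // assumed threshold of 0.5
  }
  return binary;
}
```

With two classes and a 2*2 ROI, calling `select_class_mask(masks, 2, 2, 2, 1)` reads only the second plane. The class decision comes entirely from the detection branch, which is the decoupling the authors describe.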

// ROI pooling forward pass (CPU), Caffe layer code.
template <typename Dtype>
void ROIPoolingLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
                                         const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  const Dtype* bottom_rois = bottom[1]->cpu_data();
  // Number of ROIs
  int num_rois = bottom[1]->num();
  int batch_size = bottom[0]->num();
  int top_count = top[0]->count();
  Dtype* top_data = top[0]->mutable_cpu_data();
  caffe_set(top_count, Dtype(-FLT_MAX), top_data);
  int* argmax_data = max_idx_.mutable_cpu_data();
  caffe_set(top_count, -1, argmax_data);

  // For each ROI R = [batch_index x1 y1 x2 y2]: max pool over R
  for (int n = 0; n < num_rois; ++n) {
    int roi_batch_ind = bottom_rois[0];
    // Image coordinates are scaled to the feature map and ROUNDED here.
    int roi_start_w = round(bottom_rois[1] * spatial_scale_);
    int roi_start_h = round(bottom_rois[2] * spatial_scale_);
    int roi_end_w = round(bottom_rois[3] * spatial_scale_);
    int roi_end_h = round(bottom_rois[4] * spatial_scale_);
    CHECK_GE(roi_batch_ind, 0);
    CHECK_LT(roi_batch_ind, batch_size);

    // Force malformed ROIs to be at least 1x1
    int roi_height = max(roi_end_h - roi_start_h + 1, 1);
    int roi_width = max(roi_end_w - roi_start_w + 1, 1);
    const Dtype bin_size_h = static_cast<Dtype>(roi_height)
                             / static_cast<Dtype>(pooled_height_);
    const Dtype bin_size_w = static_cast<Dtype>(roi_width)
                             / static_cast<Dtype>(pooled_width_);

    const Dtype* batch_data = bottom_data + bottom[0]->offset(roi_batch_ind);

    for (int c = 0; c < channels_; ++c) {
      for (int ph = 0; ph < pooled_height_; ++ph) {
        for (int pw = 0; pw < pooled_width_; ++pw) {
          // Compute pooling region for this output unit:
          //  start (included) = floor(ph * roi_height / pooled_height_)
          //  end (excluded) = ceil((ph + 1) * roi_height / pooled_height_)
          int hstart = static_cast<int>(floor(static_cast<Dtype>(ph)
                                              * bin_size_h));
          int wstart = static_cast<int>(floor(static_cast<Dtype>(pw)
                                              * bin_size_w));
          int hend = static_cast<int>(ceil(static_cast<Dtype>(ph + 1)
                                           * bin_size_h));
          int wend = static_cast<int>(ceil(static_cast<Dtype>(pw + 1)
                                           * bin_size_w));

          // Shift the bin to the ROI origin and clip it to the feature map
          hstart = min(max(hstart + roi_start_h, 0), height_);
          hend = min(max(hend + roi_start_h, 0), height_);
          wstart = min(max(wstart + roi_start_w, 0), width_);
          wend = min(max(wend + roi_start_w, 0), width_);

          bool is_empty = (hend <= hstart) || (wend <= wstart);

          const int pool_index = ph * pooled_width_ + pw;
          if (is_empty) {
            top_data[pool_index] = 0;
            argmax_data[pool_index] = -1;
          }

          // Max over the bin; argmax is stored for the backward pass
          for (int h = hstart; h < hend; ++h) {
            for (int w = wstart; w < wend; ++w) {
              const int index = h * width_ + w;
              if (batch_data[index] > top_data[pool_index]) {
                top_data[pool_index] = batch_data[index];
                argmax_data[pool_index] = index;
              }
            }
          }
        }
      }
      // Increment all data pointers by one channel
      batch_data += bottom[0]->offset(0, 1);
      top_data += top[0]->offset(0, 1);
      argmax_data += max_idx_.offset(0, 1);
    }
    // Increment ROI data pointer
    bottom_rois += bottom[1]->offset(1);
  }
}
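In contrast to the rounding in the ROI pooling code above, the bilinear lookup described in section 2 can be sketched as follows. This is only a minimal illustration of sampling a feature map at the fractional point (x/16, y/16), not the paper's RoIAlign implementation, which samples several points per bin and aggregates them. The names `bilinear_sample` and `sample_image_point`, the single-channel row-major layout, and the clamping at the border are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Bilinearly sample a single-channel, row-major feature map (height x width)
// at a fractional location (x, y) given in feature-map coordinates.
float bilinear_sample(const std::vector<float>& feat, int height, int width,
                      float x, float y) {
  assert(height > 0 && width > 0);
  assert(feat.size() == static_cast<std::size_t>(height) * width);
  // Clamp so that all four neighbours stay inside the map (sketch assumption).
  x = std::min(std::max(x, 0.0f), static_cast<float>(width - 1));
  y = std::min(std::max(y, 0.0f), static_cast<float>(height - 1));
  const int x0 = static_cast<int>(std::floor(x));
  const int y0 = static_cast<int>(std::floor(y));
  const int x1 = std::min(x0 + 1, width - 1);
  const int y1 = std::min(y0 + 1, height - 1);
  const float dx = x - static_cast<float>(x0);  // horizontal weight
  const float dy = y - static_cast<float>(y0);  // vertical weight
  const float top = feat[y0 * width + x0] * (1.0f - dx) +
                    feat[y0 * width + x1] * dx;
  const float bottom = feat[y1 * width + x0] * (1.0f - dx) +
                       feat[y1 * width + x1] * dx;
  return top * (1.0f - dy) + bottom * dy;
}

// Map an image-space point to feature-map coordinates WITHOUT rounding
// (e.g. spatial_scale = 1/16 for VGG conv5_3) and interpolate there.
float sample_image_point(const std::vector<float>& feat, int height, int width,
                         float img_x, float img_y, float spatial_scale) {
  return bilinear_sample(feat, height, width, img_x * spatial_scale,
                         img_y * spatial_scale);
}
```

For a 2*2 feature map with values 0, 1, 2, 3, the image point (8, 8) with a scale of 1/16 lands at (0.5, 0.5), midway between all four cells, so the sampled value is their average instead of the value of a single rounded cell.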

[1] He K, Gkioxari G, Dollár P, et al. Mask R-CNN[J]. arXiv preprint arXiv:1703.06870, 2017.
[2] Li Y, Qi H, Dai J, et al. Fully convolutional instance-aware semantic segmentation[J]. arXiv preprint arXiv:1611.07709, 2016.
[3] https://zhuanlan.zhihu.com/p/27500215
