
Paper Reading Notes - SSD: Single Shot MultiBox Detector

2017-10-20 17:08

SSD: Single Shot MultiBox Detector

Paper

Slide

Code-Caffe

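Abstract

SSD performs object detection with a single forward pass of a deep neural network. It discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.

At prediction time, the network produces, for each default box, a score for the presence of each object category, together with adjustments to the box so that it better matches the object shape.

By combining predictions from multiple feature maps with different resolutions, SSD handles objects of different sizes more naturally. Compared with methods that require object proposals, SSD eliminates proposal generation and the subsequent pixel or feature resampling stages, and encapsulates all computation in a single network. SSD retains good accuracy even when the input image is small.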

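The SSD backbone is VGG16. Its 2 fully connected layers are replaced by convolutional layers, and 4 extra groups of convolutional layers are appended to build the network. The feature maps output by several different convolutional layers (six in the paper's SSD300 configuration) are each processed by two convolutional layers with 3×3 kernels. One of them outputs the classification confidences, producing N+1 confidence values per default box. The other outputs the localization regression, producing 4 coordinate values (x, y, w, h) per default box.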

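Fig.1. The SSD framework. (a) During training, SSD only needs an input image and the ground truth boxes of its objects. (b) and (c) are feature maps at different scales, of size 8×8 and 4×4 respectively. At each location a small set (here 4) of default boxes with different aspect ratios is evaluated. For each default box, the network predicts both the box offsets and the confidences for all object categories (c1, c2, ..., cp). At training time, default boxes are first matched to ground truth boxes. For example, two default boxes are matched to the cat and one default box to the dog; these are treated as positives and the rest as negatives. The model loss is a weighted sum of the localization loss and the confidence loss.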

1. Model

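SSD is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the object classes present in those boxes, followed by NMS (non-maximum suppression) to produce the final detections.

Base network: VGG-16

Fig.2. Comparison between SSD and YOLO. SSD adds several feature layers after the base network, which predict the offsets to default boxes of different scales and aspect ratios, along with their associated confidences.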

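Key features of SSD:

Multi-scale feature maps for detection

The added convolutional feature layers decrease in size progressively, which allows detections to be predicted at multiple scales.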

Convolutional predictors for detection

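As shown in Fig.2, for a feature layer of size m×n with p channels, a small 3×3×p kernel produces either a category score or an offset relative to the default box. In other words, small convolutional kernels predict both the object category and the offsets of each bounding box.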

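Default boxes and aspect ratios

For each of the multiple feature maps output by the network, a set of default bounding boxes is associated with every feature map cell. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its feature map cell is fixed. At each feature map cell, the network predicts the offsets relative to the default box shapes in that cell, as well as the per-class scores that indicate the presence of a class instance in each box. That is, for each of the k boxes at a given location, it computes c class scores and 4 offsets relative to the default box.

This requires (c+4)k filters applied around each location of an m×n feature map, yielding (c+4)kmn outputs. Fig.1 illustrates the default boxes. They are similar to the anchor boxes in Faster R-CNN. Assigning different default boxes to several feature maps of different resolutions lets the model efficiently discretize the space of possible output boxes.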
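To make the (c+4)kmn count concrete, here is a minimal sketch. The function name `num_outputs` and the sample values (k = 4 as in Fig.1, c = 21 classes) are illustrative assumptions, not settings stated in the notes.

```python
def num_outputs(m, n, k, c):
    """Values predicted on an m x n feature map with k default boxes per
    location, c class scores and 4 offsets per box: (c + 4) * k * m * n."""
    filters = (c + 4) * k  # one 3x3 filter per predicted value at a location
    return filters * m * n


# Illustrative (assumed) values: the 8x8 and 4x4 maps of Fig.1, k = 4, c = 21.
outputs_8x8 = num_outputs(8, 8, 4, 21)
outputs_4x4 = num_outputs(4, 4, 4, 21)
```

The count grows linearly with the number of cells, so the higher-resolution maps account for most of the predictions.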

2. Training

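The key difference between training SSD and training a typical CNN detector that uses region proposals is that ground truth information must be assigned to specific outputs in the fixed set of detector outputs. Once this assignment is determined, the loss function and backpropagation are applied end-to-end.

Training also involves choosing the set of default boxes and the scales for detection, as well as hard negative mining and data augmentation strategies.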

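2.1 Matching strategy

During training we need to determine which default boxes correspond to each ground truth box.

For each ground truth box, the candidates are default boxes that vary over location, aspect ratio, and scale.

First, each ground truth box is matched to the default box with the best jaccard overlap (as in MultiBox).

Then, default boxes are matched to any ground truth box with a jaccard overlap higher than a threshold (0.5).

This simplifies the learning problem: the network can predict high scores for multiple overlapping default boxes, rather than having to pick only the one with the maximum overlap.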
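The two matching steps can be sketched as follows. This is a simplified illustration, not the reference implementation: boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples, the names `jaccard` and `match` are hypothetical, and a default box that overlaps several ground truths above the threshold is assigned only to the best one.

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def match(defaults, truths, threshold=0.5):
    """Return, for each default box, the matched ground truth index or -1."""
    if not defaults:
        return []
    matched = []
    # Threshold step: a default box takes the ground truth it overlaps most,
    # provided that overlap exceeds the threshold.
    for d in defaults:
        best_j, best_iou = -1, threshold
        for j, g in enumerate(truths):
            iou = jaccard(d, g)
            if iou > best_iou:
                best_j, best_iou = j, iou
        matched.append(best_j)
    # Best-overlap step: every ground truth claims its best default box,
    # even when that overlap is below the threshold.
    for j, g in enumerate(truths):
        best_i = max(range(len(defaults)), key=lambda i: jaccard(defaults[i], g))
        matched[best_i] = j
    return matched
```

Running the best-overlap step last guarantees that every ground truth box ends up with at least one positive default box.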

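2.2 Training objective

Let $x_{ij}^p \in \{1, 0\}$ be an indicator for matching the i-th default box to the j-th ground truth box of category p, with $\sum_i x_{ij}^p \ge 1$.

The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf):

$$L(x,c,l,g) = \frac{1}{N}\left(L_{conf}(x,c) + \alpha L_{loc}(x,l,g)\right)$$

where

N: the number of matched default boxes. If N = 0, the loss is set to 0.

$L_{loc}$: the Smooth L1 loss between the predicted box (l) and the ground truth box (g). It regresses to offsets for the center (cx, cy), width w, and height h of the default bounding box (d).

$L_{conf}$: the softmax loss over the confidences of multiple classes (c).

$\alpha = 1$: the weight term.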

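$$L_{loc}(x,l,g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx,cy,w,h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^m - \hat{g}_j^m\right)$$

$$\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w}$$

$$\hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h}$$

$$\hat{g}_j^{w} = \log\left(\frac{g_j^{w}}{d_i^{w}}\right)$$

$$\hat{g}_j^{h} = \log\left(\frac{g_j^{h}}{d_i^{h}}\right)$$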
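A small sketch of the localization target encoding and the per-box Smooth L1 term may help here. Boxes are assumed to be `(cx, cy, w, h)` tuples, the function names are hypothetical, and the Smooth L1 definition used (0.5x² for |x| < 1, |x| − 0.5 otherwise) is the standard one from Fast R-CNN, which these notes do not spell out.

```python
import math


def smooth_l1(x):
    """Standard Smooth L1: quadratic near zero, linear elsewhere."""
    x = abs(x)
    return 0.5 * x * x if x < 1.0 else x - 0.5


def encode(g, d):
    """Regression targets for ground truth g relative to default box d.
    Both boxes are (cx, cy, w, h)."""
    return (
        (g[0] - d[0]) / d[2],   # center x offset, scaled by default width
        (g[1] - d[1]) / d[3],   # center y offset, scaled by default height
        math.log(g[2] / d[2]),  # log width ratio
        math.log(g[3] / d[3]),  # log height ratio
    )


def loc_loss_one_box(pred, g, d):
    """Localization loss of one positive default box: the sum of Smooth L1
    over the four offsets (cx, cy, w, h)."""
    return sum(smooth_l1(l - t) for l, t in zip(pred, encode(g, d)))
```

Summing `loc_loss_one_box` over all positive default boxes gives the unnormalized localization loss; dividing by N happens in the overall objective.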

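$$L_{conf}(x,c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right)$$

$$\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$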
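The confidence loss can be sketched in a few lines. This is an illustration under assumed conventions: class index 0 stands for background, each box carries a list of raw scores, and a label of 0 marks a negative. The names `softmax` and `conf_loss` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conf_loss(scores, labels):
    """Sum of -log(softmax probability of the target class) over boxes.
    Positives use their matched class index, negatives use index 0
    (background), which covers both sums of the confidence loss."""
    return -sum(math.log(softmax(s)[y]) for s, y in zip(scores, labels))
```

For example, a box with equal scores for two classes contributes log 2 to the loss, whichever class is the target.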

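2.3 Choosing scales and aspect ratios for default boxes

Feature maps from different layers of the network have different receptive field sizes.

In SSD, the default boxes do not need to correspond to the actual receptive fields of each layer. The default boxes are designed so that specific feature maps learn to be responsive to particular object scales.

Suppose m feature maps are used for prediction. The scale of the default boxes for each feature map is computed as:

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

where $s_{min} = 0.2$ and $s_{max} = 0.9$, meaning the lowest layer has a scale of 0.2, the highest layer has a scale of 0.9, and all layers in between are regularly spaced.

Different aspect ratios are imposed on the default boxes, denoted $a_r \in \{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}$. The width and height of each default box are then $w_k^a = s_k\sqrt{a_r}$ and $h_k^a = s_k / \sqrt{a_r}$.

For the aspect ratio of 1, one extra default box is added with scale $s'_k = \sqrt{s_k s_{k+1}}$, giving 6 default boxes per feature map location. The center of each default box is set to $\left(\frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|}\right)$, where $|f_k|$ is the size of the k-th feature map and $i, j \in [0, |f_k|)$.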
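The scale formula and the box layout can be sketched as below. This is a minimal illustration: the function names are hypothetical, boxes are returned as `(cx, cy, w, h)` in normalized image coordinates, m is assumed to be at least 2, and the caller must supply the next scale s_{k+1} (the formula does not define it for the last feature map).

```python
import math


def scales(m, s_min=0.2, s_max=0.9):
    """Scale s_k for k = 1..m, evenly spaced between s_min and s_max."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def default_boxes(f_k, s_k, s_next, ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """All default boxes (cx, cy, w, h) for a square f_k x f_k feature map."""
    # One (w, h) shape per aspect ratio, plus the extra ratio-1 box
    # with scale sqrt(s_k * s_{k+1}).
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]
    extra = math.sqrt(s_k * s_next)
    shapes.append((extra, extra))
    boxes = []
    for j in range(f_k):
        for i in range(f_k):
            cx, cy = (i + 0.5) / f_k, (j + 0.5) / f_k
            boxes.extend((cx, cy, w, h) for w, h in shapes)
    return boxes
```

With m = 6 the scales come out as roughly 0.2, 0.34, 0.48, 0.62, 0.76 and 0.9, and a 4×4 feature map yields 4 × 4 × 6 = 96 default boxes.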

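In practice, the distribution of default boxes can be designed to best fit a specific dataset. How to design the optimal tiling remains an open question.

By combining the predictions for default boxes of different scales and aspect ratios from all locations of many feature maps, we obtain a diverse set of predictions that covers many input object sizes and shapes. In Fig.1, for example, the dog is matched to a default box in the 4×4 feature map but not to any default box in the 8×8 feature map. The reason is that the boxes in the 8×8 map have different scales that do not match the dog box, so they are treated as negatives during training.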

2.4 Hard negative mining

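After the matching step, most default boxes are negatives, especially when the number of possible default boxes is large. This creates a severe imbalance between positive and negative training examples.

Therefore, not all negative examples are used. Instead, the negatives are sorted by the highest confidence loss of each default box and the top ones are picked, so that the ratio between negatives and positives is at most about 3:1.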
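The selection rule can be sketched as follows. The function name and argument layout are assumptions for illustration: `conf_losses[i]` is the confidence loss of default box i and `is_positive[i]` marks matched boxes.

```python
def hard_negative_indices(conf_losses, is_positive, neg_pos_ratio=3):
    """Indices of the negatives kept for training: the ones with the highest
    confidence loss, capped at neg_pos_ratio negatives per positive."""
    num_pos = sum(1 for pos in is_positive if pos)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return negatives[: neg_pos_ratio * num_pos]
```

With one positive and six negatives, only the three negatives with the largest loss are kept and the rest are ignored in the confidence loss.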

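2.5 Data augmentation

Each training image is randomly sampled by one of the following options:

Use the entire original input image.

Sample a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.

Randomly sample a patch.

The size of each sampled patch is [0.1, 1] times the original image size, and its aspect ratio is between 1/2 and 2. If the center of a ground truth box lies inside the sampled patch, the overlapping part of that box is kept.

After sampling, each patch is resized to a fixed size and horizontally flipped with probability 0.5. Some additional distortions, such as added noise and image distortion, are also applied.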
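The geometric part of this augmentation can be sketched as below. Several things here are assumptions for illustration: boxes and patches are `(xmin, ymin, xmax, ymax)` in normalized image coordinates, patch width and height are drawn independently and rejected until the aspect ratio fits, and the minimum jaccard overlap check, resizing, and distortions are left out. Kept boxes stay in original image coordinates.

```python
import random


def sample_patch(rng):
    """Draw a patch whose sides are 0.1 to 1 times the image size and whose
    aspect ratio lies in [1/2, 2]."""
    while True:
        w = rng.uniform(0.1, 1.0)
        h = rng.uniform(0.1, 1.0)
        if 0.5 <= w / h <= 2.0:
            break
    x = rng.uniform(0.0, 1.0 - w)
    y = rng.uniform(0.0, 1.0 - h)
    return (x, y, x + w, y + h)


def crop_boxes(boxes, patch):
    """Keep boxes whose center lies in the patch, clipped to the patch."""
    px0, py0, px1, py1 = patch
    kept = []
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if px0 <= cx <= px1 and py0 <= cy <= py1:
            kept.append((max(x0, px0), max(y0, py0), min(x1, px1), min(y1, py1)))
    return kept


def hflip_boxes(boxes):
    """Mirror boxes horizontally in normalized coordinates."""
    return [(1.0 - x1, y0, 1.0 - x0, y1) for x0, y0, x1, y1 in boxes]


# Example usage with a seeded generator so the draw is repeatable.
rng = random.Random(0)
patch = sample_patch(rng)
boxes = crop_boxes([(0.1, 0.1, 0.5, 0.5)], patch)
if rng.random() < 0.5:
    boxes = hflip_boxes(boxes)
```

A fuller pipeline would redraw the patch until the chosen minimum overlap is met and then re-express the kept boxes relative to the patch before resizing.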
