[Neural Networks] Object Detection: R-CNN

April 8, 2023, 12:02 PM • Object Detection

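Test phase:

Use Selective Search to generate roughly 2,000 region proposals per image. Each proposal is padded with 16 pixels of surrounding context and then warped to 227*227*3, which is the AlexNet input size, so that the pre-trained AlexNet can be reused directly.
Extract features with the CNN (2000*4096). The network is AlexNet: the input is 227*227*3, the fifth layer (pool5) outputs 6*6*256, the sixth layer (fc6) outputs 4096, and the seventh layer (fc7) outputs 4096. The fc7 output is used as the feature vector.
For AlexNet, see:
https://blog.csdn.net/zyqdragon/article/details/72353420
Train a separate SVM for each class and score every proposal (2000*20 for the 20 PASCAL VOC classes). The final detections are then selected from the scored region proposals with greedy non-maximum suppression.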
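As a rough illustration of the scoring step, the sketch below computes one linear SVM score per proposal and class, which amounts to a (proposals * features) by (features * classes) matrix product. The function name `score_proposals`, the bias terms, and the toy sizes are assumptions made for illustration; the sizes in the text are 2000 proposals, 4096-d features, and 20 classes.

```python
import random


def score_proposals(features, svm_weights, svm_biases):
    """Return scores[i][c] = w_c . f_i + b_c for proposal i and class c."""
    return [
        [sum(f * w for f, w in zip(feat, w_c)) + b_c
         for w_c, b_c in zip(svm_weights, svm_biases)]
        for feat in features
    ]


# Scaled-down toy sizes (assumption); random values stand in for real
# fc7 features and trained SVM weights.
rng = random.Random(0)
num_proposals, feat_dim, num_classes = 5, 8, 3
features = [[rng.gauss(0, 1) for _ in range(feat_dim)]
            for _ in range(num_proposals)]
svm_weights = [[rng.gauss(0, 1) for _ in range(feat_dim)]
               for _ in range(num_classes)]
svm_biases = [0.0] * num_classes

scores = score_proposals(features, svm_weights, svm_biases)
print(len(scores), len(scores[0]))  # one row per proposal, one column per class
```

Each column of `scores` would then go through non-maximum suppression independently, one class at a time.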

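Greedy non-maximum suppression:
For each class, process the region proposals in descending order of score. Keep the highest-scoring proposal, discard every remaining proposal whose IoU (intersection over union) with it exceeds the threshold, and repeat on what is left until all region proposals with a score above a given threshold have been visited.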
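The procedure above can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the default values of `iou_threshold` and `score_threshold` are placeholders chosen only for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.3, score_threshold=0.0):
    """Return indices of the boxes kept for one class, highest score first."""
    # Only proposals scoring above score_threshold are considered at all.
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

In this toy example the second box overlaps the first with an IoU of 81/119, so it is suppressed, while the third box does not overlap anything and is kept.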

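Training:

pre-training: train the AlexNet model on the ILSVRC 2012 dataset, which has image-level labels only and no bounding box labels. The learning rate is 0.01.
fine-tuning: train on warped region proposals. Bounding box annotations are available at this stage.
positive: every region proposal whose IoU with a ground-truth box is at least 0.5
negative: everything else
The learning rate is set to 0.001, smaller than in pre-training, because the goal is only to fine-tune and not to change the pre-trained weights much.
Each mini-batch has 128 samples with a positive-to-negative ratio of 1:3, i.e. 32 positives and 96 negatives. We bias the sampling towards positive windows because they are extremely rare compared to background.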
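To make the labeling rule and the 32/96 split concrete, here is a small sketch. It assumes the maximum IoU of each proposal with any ground-truth box has already been computed (`max_ious`); the helper names and the fallback to sampling with replacement when a pool is too small are my own assumptions, not something stated in the paper.

```python
import random


def split_by_iou(max_ious, pos_threshold=0.5):
    """Split proposal indices into positives (IoU >= 0.5) and negatives."""
    positives = [i for i, v in enumerate(max_ious) if v >= pos_threshold]
    negatives = [i for i, v in enumerate(max_ious) if v < pos_threshold]
    return positives, negatives


def draw(pool, k, rng):
    """Draw k indices from a non-empty pool."""
    if len(pool) >= k:
        return rng.sample(pool, k)
    # Assumption: sample with replacement when there are too few candidates.
    return rng.choices(pool, k=k)


def sample_minibatch(positives, negatives, rng, num_pos=32, num_neg=96):
    """Return (positive indices, negative indices) for one 128-sample batch."""
    return draw(positives, num_pos, rng), draw(negatives, num_neg, rng)


rng = random.Random(0)
# Toy IoU values: positives are rare, as the quoted sentence points out.
max_ious = [0.7 if i % 50 == 0 else 0.1 for i in range(2000)]
positives, negatives = split_by_iou(max_ious)
batch_pos, batch_neg = sample_minibatch(positives, negatives, rng)
print(len(positives), len(negatives), len(batch_pos), len(batch_neg))
```

With only 40 of the 2000 toy proposals positive, uniform sampling would yield about 2 or 3 positives per batch, which is why the sampling is biased towards positive windows.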
When training the SVMs:
positive: only the ground-truth boxes
negative: proposals with less than 0.3 IoU overlap with all instances of a class (i.e. all ground-truth boxes of that class)
Proposals that fall into the grey zone (more than 0.3 IoU overlap, but are not ground truth) are ignored.
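The three-way rule for SVM training labels can be written as a small function. This is only a sketch: `svm_label` is a hypothetical helper, it expects the IoU values between one box and all ground-truth boxes of one class to be precomputed, and treating an IoU of exactly 0.3 as grey zone is my assumption because the rule above does not cover that case.

```python
def svm_label(ious_with_class_gt, is_ground_truth, neg_threshold=0.3):
    """Return +1 (positive), -1 (negative), or None (ignored) for one class.

    ious_with_class_gt: IoU of the box with every ground-truth box of the class.
    is_ground_truth: True if the box itself is a ground-truth box of the class.
    """
    if is_ground_truth:
        return 1
    # No ground-truth box of this class in the image means zero overlap.
    max_iou = max(ious_with_class_gt, default=0.0)
    if max_iou < neg_threshold:
        return -1
    return None  # grey zone: overlaps an object but is not ground truth


print(svm_label([1.0], True))          # ground-truth box
print(svm_label([0.10, 0.25], False))  # background for this class
print(svm_label([0.45], False))        # grey zone, left out of training
```

Note that this differs from the fine-tuning labels: a proposal with IoU 0.6 is a positive for fine-tuning but is ignored when training the SVM.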

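The paper compares using the outputs of the fifth, sixth, and seventh layers (pool5, fc6, fc7) as features. Without fine-tuning, the resulting models reach similar mAP (mean average precision), which suggests that most of the CNN's representational power comes from the convolutional layers, even though they hold few of the parameters. With fine-tuning, fc7 and fc6 perform much better than pool5, which suggests that the fully connected layers learn features specific to the task's training samples.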

bounding box regression
To improve localization performance, a class-specific bounding-box regression is needed.
After each proposal has been scored, the CNN features are used to adjust the position and size of its bounding box.
we only learn from a proposal P if it is nearby at least one ground-truth box. We implement “nearness” by assigning P to the ground-truth box G with which it has maximum IoU overlap (in case it overlaps more than one) if and only if the overlap is greater than a threshold (which we set to 0.6 using a validation set).
For a proposal that is too far from every ground-truth box, regressing it is not meaningful, so only proposals whose IoU with some ground-truth box is greater than 0.6 are used for training.
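Input: the CNN (pool5) features of the proposal, $\phi_{5}(P)$
Output:

$$d_{x}(P),\quad d_{y}(P),\quad d_{w}(P),\quad d_{h}(P)$$

From these outputs, the adjusted position of the proposal is obtained as:

$$
\begin{aligned}
\hat{G}_{x} &= P_{w}\, d_{x}(P) + P_{x} \\
\hat{G}_{y} &= P_{h}\, d_{y}(P) + P_{y} \\
\hat{G}_{w} &= P_{w} \exp(d_{w}(P)) \\
\hat{G}_{h} &= P_{h} \exp(d_{h}(P))
\end{aligned}
$$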
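The four update formulas translate directly into code. In this sketch a box is assumed to be a `(center_x, center_y, width, height)` tuple, and `apply_deltas` is a name chosen here for illustration.

```python
import math


def apply_deltas(proposal, deltas):
    """Apply predicted (dx, dy, dw, dh) to a proposal (Px, Py, Pw, Ph)."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    gx = pw * dx + px        # shift the center, scaled by proposal width
    gy = ph * dy + py        # shift the center, scaled by proposal height
    gw = pw * math.exp(dw)   # rescale the width in log space
    gh = ph * math.exp(dh)   # rescale the height in log space
    return gx, gy, gw, gh


proposal = (50.0, 40.0, 20.0, 10.0)
print(apply_deltas(proposal, (0.0, 0.0, 0.0, 0.0)))   # unchanged box
print(apply_deltas(proposal, (0.5, -1.0, 0.0, 0.0)))  # center moves, size stays
```

Because the center shift is scaled by the proposal size and the size change goes through `exp`, the same outputs mean the same relative correction for small and large proposals, and the predicted width and height are always positive.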

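Objective function:

$$\sum_{i=1}^{N} \left( t_{*}^{i} - \hat{w}_{*}^{T} \phi_{5}(P^{i}) \right)^{2} + \lambda \left\| \hat{w}_{*} \right\|^{2}$$

where * stands for x, y, w, or h (center position, width, and height), and the regression targets are:

$$
\begin{aligned}
t_{x} &= (G_{x} - P_{x}) / P_{w} \\
t_{y} &= (G_{y} - P_{y}) / P_{h} \\
t_{w} &= \log(G_{w} / P_{w}) \\
t_{h} &= \log(G_{h} / P_{h})
\end{aligned}
$$

$\lambda$ is set to 1000.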
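The targets and the regularized least-squares objective can be checked with a short sketch. Boxes are again assumed to be `(center_x, center_y, width, height)` tuples; `regression_targets` and `ridge_loss` are illustrative names, and the tiny feature vectors stand in for real pool5 features.

```python
import math


def regression_targets(proposal, ground_truth):
    """Compute (tx, ty, tw, th) for a proposal P and its matched box G."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return (
        (gx - px) / pw,
        (gy - py) / ph,
        math.log(gw / pw),
        math.log(gh / ph),
    )


def ridge_loss(w, features, targets, lam=1000.0):
    """Squared error of w . phi against the targets plus lam * ||w||^2."""
    residual = sum(
        (t - sum(wi * fi for wi, fi in zip(w, phi))) ** 2
        for phi, t in zip(features, targets)
    )
    return residual + lam * sum(wi * wi for wi in w)


proposal = (50.0, 40.0, 20.0, 10.0)
ground_truth = (60.0, 30.0, 40.0, 10.0)
t = regression_targets(proposal, ground_truth)
print(t)

# One regressor per coordinate: here only the x target, with toy features.
print(ridge_loss([0.0, 0.0], [[1.0, 2.0]], [t[0]]))
```

Plugging the targets back into the update formulas recovers the ground-truth box, for example `20 * 0.5 + 50 = 60` for the x center, which is a quick way to check that the two sets of equations agree.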

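Object proposal transformations
1. tightest square with context. Crop the tightest square around the proposal from the original image (presumably by extending it on both sides of the shorter dimension), and fill any part that falls outside the image with the image mean.
2. tightest square without context. Fill the area around the proposal directly with the image mean instead of the surrounding image content.
3. warp. Ignore the aspect ratio and simply resize the proposal to the input size, accepting the distortion.
There is also context padding; the authors chose padding = 16.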
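One way to read the padding option is that the box is enlarged before warping so that, at the 227*227 output size, exactly 16 pixels of context surround the original box. The sketch below computes the enlarged box under that reading. The function name `dilate_for_context` and the `(x1, y1, x2, y2)` box convention are assumptions, and filling parts that fall outside the image with the image mean is not shown.

```python
def dilate_for_context(box, out_size=227, padding=16):
    """Enlarge (x1, y1, x2, y2) so the warped crop has `padding` px of context."""
    x1, y1, x2, y2 = box
    # After warping, the original box occupies out_size - 2 * padding pixels.
    inner = out_size - 2 * padding
    # Convert the padding from warped pixels back to image pixels per axis.
    pad_x = padding * (x2 - x1) / inner
    pad_y = padding * (y2 - y1) / inner
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)


box = (100, 100, 295, 490)  # 195 wide, 390 tall
print(dilate_for_context(box))
```

Because the warp scales the two axes differently, the context added in image pixels differs per axis: here it is 16 pixels horizontally and 32 pixels vertically, and both become 16 pixels after warping.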