
OpenCV contrib-master cnn_3dobj module doc notes

2016-02-19 16:15

Simultaneous Recognition and Homography Extraction of Local Patches with a Simple Linear Classifier

Introduction

We show that recognizing keypoint identities and estimating their poses simultaneously, in real time, is more reliable than treating them as two independent steps, as previous methods do. A simple linear classifier and a set of linear predictors trained during a learning phase are sufficient for this task. Thanks to the linear predictors, the recovered pose reaches sub-pixel accuracy, and a single keypoint is enough to estimate the pose of an object. This lets us handle objects with little distinctive texture in real time.

Recovering the poses of keypoints and matching them is a critical step in many applications, such as vision-based robot localization, object recognition, and image retrieval. The usual approach decouples the matching process from the keypoint pose estimation: the standard pipeline detects regions with some ad hoc affine-region detector and relies on SIFT descriptors computed over the rectified regions for matching. Because this pipeline is limited by the detector, this work proceeds the other way around: we first recover the keypoint identities with a Ferns-based classifier, and then estimate their poses. This improves both speed and reliability.

Although errors are unavoidable, this paper shows that performing keypoint matching and pose estimation together in real time makes the problem simpler and leads to a more reliable method.

As a result, a simple and fast linear classifier coupled with linear predictors is sufficient to handle the problem.

We first build a linear classifier and a set of linear predictors during a training phase. The classifier produces hypotheses on the keypoint identity and pose, which are then constrained and refined by the linear predictors; this lets us recover the correct identity together with an accurately estimated pose. A class consists of a keypoint identity and a quantized pose. The linear classifier returns only a small number of candidate hypotheses, and we select the correct one by checking how well the pose refined by the linear predictors fits the input. The output is a set of reliable matches with accurate pose estimates, where the pose is represented by a perspective transformation. This makes it possible to handle very weakly textured targets. The paper first discusses related work on affine-region detectors, then describes the proposed method and compares it with existing approaches, and finally presents an application to TLD.

Related work

Affine-region detectors are attractive for many applications because they remove most of the perspective distortion of image regions. Many different methods have been proposed, and the evaluation by Mikolajczyk and Schmid showed that the Hessian-Affine detector and the MSER detector are among the most reliable.

For the Hessian-Affine detector, the affine transformation is recovered from the second-moment matrix of the image. It is normalized for rotation, with the rotation angle computed from the dominant gradient orientation. This normalization step relies on an ad hoc heuristic, such as taking the peak of the gradient-orientation histogram, and applying this heuristic to deformed image patches tends to make the method less stable. In the case of the MSER detector, several different ways of exploiting the region shape are possible; a common approach computes the transformation from the region covariance matrix and resolves the remaining degrees of freedom using local maxima of curvature and bitangents. After normalization, SIFT descriptors are computed and used to match the regions.

A previously proposed approach works the other way around: it first obtains the patch identity together with a coarse approximation of the pose, and then applies a dedicated linear predictor to recover an accurate affine transformation. That approach was shown to perform better than the two methods above, mainly because it can exploit a training phase. In this paper, however, we present a new approach that works even better. We use linear predictors to refine the pose, because they perform well on this task. The main contribution is to show that a simple linear classifier combined with linear predictors can recover keypoint identities and poses in real time, with even better results.

Proposed Approach

Given an image patch, we want to match it against a database of possible patches defined around keypoints in reference images, and accurately estimate its pose, represented by a homography.

A linear classifier gives us a set of hypotheses on the identity of the corresponding patch and the pose. We then select the correct one by refining the pose using linear predictors and comparing the rectified patch and the predicted ones with normalized cross-correlation.

The classification step requires the quantization of the pose space, and because a careful quantization method significantly improves the result, we also describe our quantization method here.

Linear classifier

The classification step applies a matrix A to a vector p that contains the intensities of the patch we want to recognize:

$$A^\top p = y$$

The result y will give us a set of hypotheses.
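A minimal sketch of this step in NumPy, assuming the columns of A are the trained vectors $a_{i,j}$ ordered by keypoint and then pose; the threshold used to turn scores into hypotheses is an assumption, since the text only says that y yields a set of hypotheses:

```python
import numpy as np

# Sketch of the classification step: one score per (keypoint i, quantized pose j)
# class. The thresholding rule is an assumption of this sketch.
def classify_patch(A, p, num_keypoints, num_poses, thresh=0.0):
    y = A.T @ p                                   # scores for all (i, j) classes
    scores = y.reshape(num_keypoints, num_poses)  # rows: keypoints, cols: poses
    # Gamma[i] = candidate pose indices for keypoint i
    return {i: np.flatnonzero(scores[i] > thresh) for i in range(num_keypoints)}
```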

$a_{i,j}$ is now taken as the solution of the problem:

$$\forall j', k: \quad a_{i,j} \cdot p_{i,j',k} = \begin{cases} +1 & \text{if } j = j' \\ -1 & \text{otherwise} \end{cases}$$

Because there are far fewer positive examples than negative ones, we give a weight $w = N - 1$ to the equations of the positive examples, where N is the number of possible poses, and a weight of 1 to the equations of the negative examples. The row vector $a_{i,j}$ can therefore be computed as:

$$a_{i,j} = \left( P W W^\top P^\top \right)^{-1} P W W^\top y$$

where P is a matrix whose columns are the patch vectors $p_{i',j',k}$, y is a vector made of +1 and -1 values, and W is a diagonal matrix containing the equation weights. In practice, y and P are large, and computing $a_{i,j}$ directly from the above equation can become costly. We therefore decompose its computation into:

$$a_{i,j} = \left( [P_1, \dots, P_L]\,[w_1^2 P_1, \dots, w_L^2 P_L]^\top \right)^{-1} \left( [w_1^2 P_1, \dots, w_L^2 P_L]\,[y_1, \dots, y_L]^\top \right) = \left( \sum_l w_l^2 P_l P_l^\top \right)^{-1} \left( \sum_l w_l^2\, y_l\, P_l \right)$$

Because the two terms of the products can be computed incrementally, this form can be computed without the full training set present in the computer memory.
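A sketch of this incremental computation, assuming the training patches arrive in chunks; the small ridge term added before the inversion is an assumption for numerical stability, not part of the method:

```python
import numpy as np

# Incrementally accumulate the two sums of the decomposition above.
# Each chunk provides patch vectors P (columns), labels y (+1/-1) and weights w.
def train_row_vector(chunks, dim, reg=1e-6):
    M = np.zeros((dim, dim))   # accumulates sum_l w_l^2 * P_l P_l^T
    b = np.zeros(dim)          # accumulates sum_l w_l^2 * y_l * P_l
    for P, y, w in chunks:     # P: (dim, n), y: (n,), w: (n,)
        Pw = P * (w ** 2)      # scale each column by its squared weight
        M += Pw @ P.T
        b += Pw @ y
    # small ridge term (an assumption) keeps the inversion well conditioned
    return np.linalg.solve(M + reg * np.eye(dim), b)
```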

Best Hypothesis Selection

For a given input patch p, and for each keypoint in the database, the previous step gives us a list $\Gamma_i$ of possible pose indices. We select for each keypoint i the best pose j that maximizes the normalized cross-correlation between p and the mean of the training examples of keypoint i under pose j. Since the patch intensities are already normalized, this can be written as looking for:

$$\forall i: \quad \arg\max_{j \in \Gamma_i} \; p \cdot \bar{p}_{i,j}$$

where $\bar{p}_{i,j}$ is computed as:

$$\bar{p}_{i,j} = \frac{1}{K} \sum_{k=1}^{K} p_{i,j,k}$$

Thus, for each keypoint i, we get the best quantized pose $\bar{H}_i$.
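A sketch of this selection, assuming `mean_patches` holds the precomputed $\bar{p}_{i,j}$ vectors and `Gamma` is the per-keypoint candidate list from the classification step; since the patches are normalized, the NCC reduces to a dot product:

```python
import numpy as np

# mean_patches: array of shape (num_keypoints, num_poses, dim) with the
# normalized mean patches; Gamma: dict mapping keypoint -> candidate pose indices.
def select_best_poses(p, mean_patches, Gamma):
    best = {}
    for i, candidates in Gamma.items():
        if len(candidates) == 0:
            continue
        scores = mean_patches[i, candidates] @ p   # NCC as a dot product
        best[i] = candidates[int(np.argmax(scores))]
    return best   # best quantized pose index per keypoint
```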

Final Keypoint Selection and Pose Extraction

For each best hypothesis, consisting of keypoint i and the best quantized homography $H_i$, we use the hyperplane approximation of [4] to obtain an estimate of the corrective homography parameters $x_i$ using the following equation:

$$x_i = B_i \left( p(H_i) - p^*_i \right)$$

where:

1. $B_i$ is the matrix of our linear predictor, which depends on the patch identity i;

2. $p(H_i)$ is a vector that contains the intensities of the original patch p warped by the current estimate $H_i$ of the transformation;

3. $p^*_i$ is a vector that contains the intensity values of the reference patch, i.e. the image patch centered on keypoint i in a reference image.
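A sketch of how such a corrective update could be applied iteratively; `warp_patch` and `update_homography` are assumed placeholder helpers, and the fixed number of iterations is an assumption rather than the paper's stopping criterion:

```python
import numpy as np

# One possible refinement loop around the linear-predictor update above.
# warp_patch(image, H): samples the patch intensities under homography H.
# update_homography(H, x): applies the corrective parameters x to H.
def refine_pose(B_i, H_init, image, p_star_i, warp_patch, update_homography,
                num_iters=5):
    H = H_init.copy()
    for _ in range(num_iters):
        p_H = warp_patch(image, H)     # intensities of the patch warped by H
        x = B_i @ (p_H - p_star_i)     # corrective homography parameters
        H = update_homography(H, x)    # apply the correction
    return H
```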

Homography Space Quantization

We still have to explain how we quantize the homography space. This is done based on the formula:

$$H = K \left( \Delta R + \frac{\delta t \cdot n^\top}{d} \right) K^{-1}$$

which is the expression of the homography H relating two views of a 3-D plane, where K is the matrix of the camera internal parameters, n and d the parameters of the plane in the first view, and $\Delta R$ and $\delta t$ the camera displacement between the two views. For simplification, we assume we have a frontal view of the reference patches.

We first tried discretizing the motion between the views by simply discretizing the rotation angles around the three axes. However, for the linear predictors of Eq. (8) to work well, they must be initialized as close as possible to the correct solution, so we use another solution that improves their convergence rates. As shown by the left image of Fig. 3, we found that the vertices of (almost) regular polyhedrons provide a more regular sampling, which is useful to discretize the angle that the second view in Eq. (10) makes with the patch plane. Unfortunately, there exist only a few convex regular polyhedrons - the Platonic solids - with the icosahedron being the one with the largest number of vertices, 12. As the right image of Fig. 3 illustrates, we obtain a finer sampling by recursively subdividing each triangle into four almost equilateral triangles. The vertices of the resulting polyhedron give us the two out-of-plane rotation angles of the sampled pose, that is, around the x- and y-axes of Fig. 3. We discretize the in-plane rotation angle to cover the 360° range with 10° steps.

We still have to sample the scale changes. For that, we simply fix the translation and the plane equation but multiply the homography matrix obtained with Eq. (10) by a scaling matrix to cover three scale levels: 1/2, 1, and 2.
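A sketch of this quantization, assuming the (possibly subdivided) icosahedron vertices and the camera matrix K are given; applying the scale change by left-multiplying with a scaling matrix is also an assumption of the sketch:

```python
import numpy as np

# Enumerate quantized homographies: out-of-plane rotations from polyhedron
# vertices, in-plane rotations in 10-degree steps, and three scale levels.
def quantized_homographies(K, vertices, d=1.0, scales=(0.5, 1.0, 2.0)):
    n = np.array([0.0, 0.0, 1.0])            # frontal reference plane normal
    Hs = []
    for v in vertices:                       # sampled viewing direction
        R_out = rotation_between(n, v / np.linalg.norm(v))
        for theta in np.deg2rad(np.arange(0.0, 360.0, 10.0)):
            c, s = np.cos(theta), np.sin(theta)
            R_in = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
            dR = R_out @ R_in                # out-of-plane followed by in-plane
            t = np.zeros(3)                  # translation kept fixed
            H = K @ (dR + np.outer(t, n) / d) @ np.linalg.inv(K)
            for sc in scales:                # scale applied as a separate matrix
                Hs.append(np.diag([sc, sc, 1.0]) @ H)
    return Hs

def rotation_between(a, b):
    # Rodrigues formula for the rotation taking unit vector a onto unit vector b.
    v = np.cross(a, b); c = float(np.dot(a, b))
    if np.isclose(c, -1.0):                  # opposite directions
        return np.diag([1.0, -1.0, -1.0])
    Vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + Vx + Vx @ Vx / (1.0 + c)
```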

Learning Descriptors for Object Recognition and 3D Pose Estimation

CVPR 2015

We train a Convolutional Neural Network to compute these descriptors by enforcing simple similarity and dissimilarity constraints between the descriptors.

We therefore seek to learn a descriptor with the two following properties:

a) The Euclidean distance between descriptors from two different objects should be large;

b) The Euclidean distance between descriptors from the same object should be representative of the similarity between their poses.

Training the CNN

In order to train the network we need a set $S_{\text{train}}$ of training samples, where each sample $s = (x, c, p)$ is made of: an image x of the object, which can be a color or grayscale image, a depth map, or a combination of the two; the identity c of the object; and the 3D pose p of the object relative to the camera.

Additionally, we define a set $S_{db}$ of templates, where each element is defined in the same way as a training sample. Descriptors for these templates are computed and stored with the classifier for k-nearest-neighbor search. The template set can be a subset of the training set, the whole training set, or a separate set. Details on the creation of the training and template data are given in the implementation section.
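A minimal sketch of this data layout; the field names and the `cnn` callable are assumptions, the paper only requires an image, an object identity, and a 3D pose per sample, plus descriptors of the template set stored for k-nearest-neighbor lookup:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    x: np.ndarray   # color/grayscale image, depth map, or both stacked
    c: int          # object identity
    p: np.ndarray   # 3D pose relative to the camera (e.g. a quaternion)

def build_template_db(cnn, templates):
    # Descriptors of the template set, kept alongside identities and poses
    # so that a k-NN search over descriptors returns both class and pose.
    descs = np.stack([cnn(t.x) for t in templates])
    return descs, [t.c for t in templates], [t.p for t in templates]
```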

Defining the Cost Function

We argue that a good mapping from the images to the descriptors should be such that the Euclidean distance between two descriptors of the same object under similar poses is small, while in every other case (either different objects or different poses) the distance should be large. In particular, each descriptor of a training sample should have a small distance to the one template descriptor from the same class with the most similar pose, and a larger distance to all descriptors of templates from other classes, or from the same class but with a less similar pose.

We enforce these requirements by minimizing the following objective function over the parameters w of the CNN:

$$\mathcal{L} = \mathcal{L}_{\text{triplets}} + \mathcal{L}_{\text{pairs}} + \lambda \|w\|_2^2$$

Triplet-wise terms

We first define a set T of triplets $(s_i, s_j, s_k)$ of training samples. Each triplet in T is selected such that one of the two following conditions is fulfilled:

• either $s_i$ and $s_j$ are from the same object and $s_k$ from another object, or

• the three samples $s_i$, $s_j$, and $s_k$ are from the same object, but the poses $p_i$ and $p_j$ are more similar than the poses $p_i$ and $p_k$.

These triplets can therefore be seen as made of a pair of similar samples ($s_i$ and $s_j$) and a pair of dissimilar ones ($s_i$ and $s_k$). We introduce a cost function for such a triplet:

$$c(s_i, s_j, s_k) = \max\left(0,\; 1 - \frac{\|f_w(x_i) - f_w(x_k)\|_2}{\|f_w(x_i) - f_w(x_j)\|_2 + m}\right)$$

where $f_w(x)$ is the output of the CNN for an input image x, and thus our descriptor for x, and m is a margin. We can now define the term $\mathcal{L}_{\text{triplets}}$ as the sum of this cost function over all the triplets in T:

$$\mathcal{L}_{\text{triplets}} = \sum_{(s_i, s_j, s_k) \in T} c(s_i, s_j, s_k)$$

The margin m serves two purposes. First, it introduces a margin for the classification. Second, it defines a minimum ratio between the Euclidean distance of the dissimilar pair of samples and that of the similar pair. This counterbalances the weight regularization term, which naturally contracts the output of the network and thus the descriptor space. We set m to 0.01 in all our experiments.
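A sketch of the triplet term, assuming a callable `cnn` that maps an image to its descriptor $f_w(x)$:

```python
import numpy as np

# Cost of one triplet (s_i, s_j, s_k): push the dissimilar distance to be at
# least as large as the similar distance plus the margin m.
def triplet_cost(cnn, x_i, x_j, x_k, m=0.01):
    d_sim = np.linalg.norm(cnn(x_i) - cnn(x_j))   # similar pair (s_i, s_j)
    d_dis = np.linalg.norm(cnn(x_i) - cnn(x_k))   # dissimilar pair (s_i, s_k)
    return max(0.0, 1.0 - d_dis / (d_sim + m))

def L_triplets(cnn, triplets, m=0.01):
    return sum(triplet_cost(cnn, xi, xj, xk, m) for xi, xj, xk in triplets)
```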

Pair-wise terms

In addition to the triplet-wise terms, we also use pair-wise terms. These terms make the descriptor robust to noise and other distracting artifacts such as changing illumination. We consider the set P of pairs $(s_i, s_j)$ of samples from the same object under very similar poses, ideally the same, and we define the $\mathcal{L}_{\text{pairs}}$ term as the sum of the squared Euclidean distances between the descriptors of these samples:

$$\mathcal{L}_{\text{pairs}} = \sum_{(s_i, s_j) \in P} \|f_w(x_i) - f_w(x_j)\|_2^2$$

This term therefore enforces the fact that, for two images of the same object under the same pose, we want to obtain two descriptors that are as close as possible to each other, even if they come from different imaging conditions: ideally we want the same descriptor even if the two images have different backgrounds or different illuminations.
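A sketch assembling the full objective from the triplet sketch above, the pair-wise term, and the weight regularizer; `weights` stands for the CNN parameters w, the value of λ is an assumption, and in practice the loss would be minimized with a gradient-based optimizer rather than evaluated this way:

```python
import numpy as np

# Pair-wise term: squared descriptor distances of same-object, same-pose pairs.
def L_pairs(cnn, pairs):
    return sum(np.sum((cnn(xi) - cnn(xj)) ** 2) for xi, xj in pairs)

# Total objective L = L_triplets + L_pairs + lambda * ||w||_2^2,
# reusing L_triplets from the sketch above.
def total_loss(cnn, triplets, pairs, weights, lam=1e-3, m=0.01):
    reg = lam * sum(np.sum(w ** 2) for w in weights)
    return L_triplets(cnn, triplets, m) + L_pairs(cnn, pairs) + reg
```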

Note that, unlike work on learning keypoint descriptors for example, we do not use dissimilar pairs. With dissimilar pairs, the question arises of how strongly a given distance between the two samples should be penalized, given their individual labels. Using triplets instead makes it possible to consider only relative dissimilarity.