FREAK: Fast Retina Keypoint

Alexandre Alahi, Raphael Ortiz, Pierre Vandergheynst

École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Abstract

A large number of vision applications rely on matching

keypoints across images. The last decade featured

an arms-race towards faster and more robust keypoints

and association algorithms: Scale Invariant Feature Transform

(SIFT) [17], Speeded-Up Robust Features (SURF) [4], and more recently Binary Robust Invariant Scalable Keypoints (BRISK) [16], to name a few. These days, the deployment

of vision algorithms on smartphones and embedded devices with limited memory and computational power has upped the ante: the goal is to make descriptors faster to compute and more compact, while remaining robust to scale, rotation, and noise.

To best address the current requirements, we propose a

novel keypoint descriptor inspired by the human visual system

and more precisely the retina, coined Fast Retina Keypoint

(FREAK). A cascade of binary strings is computed by

efficiently comparing image intensities over a retinal sampling

pattern. Our experiments show that FREAKs are in

general faster to compute with lower memory load and also

more robust than SIFT, SURF, or BRISK. They are thus competitive alternatives to existing keypoints, in particular for

embedded applications.

1. Introduction

Visual correspondence, object matching, and many other vision applications rely on representing images with a sparse set of keypoints. A real challenge is to efficiently describe keypoints, i.e. image patches, with stable, compact, and robust representations invariant to scale, rotation, affine transformation, and noise. The past decades have witnessed several key attempts to describe keypoints efficiently and match them.

The most popular descriptor is the histogram of oriented gradients proposed by Lowe [17] to describe Scale Invariant Feature Transform (SIFT) keypoints. Most of the effort in recent years has gone into performing as well as SIFT [14] at lower computational complexity. The Speeded-Up Robust Features (SURF) descriptor by Bay et al. [4] is a good example.

It has similar matching rates with much faster performance, describing keypoints with the responses of a few Haar-like filters.

Figure 1: Illustration of our FREAK descriptor. A series of Differences of Gaussians (DoG) over a retinal pattern is quantized to one bit each.

In general, Alahi et al. show in [2] that a grid of descriptors,

similar to SIFT and SURF, is better than a single

one to match an image region. Typically, a grid of covariance

matrices [30] attains a high detection rate but remains

computationally too expensive for real-time applications.

The deployment of cameras on every phone coupled with

the growing computing power of mobile devices has enabled

a new trend: vision algorithms need to run on mobile

devices with low computing power and memory capacity.

Images obtained by smart phones can be used to

perform structure from motion [27], image retrieval [22],

or object recognition [15]. As a result, new algorithms

are needed where fixed-point operations and low memory

load are preferred. Binary Robust Independent Elementary Features (BRIEF) [5], Oriented FAST and Rotated BRIEF (ORB) [26], and Binary Robust Invariant Scalable Keypoints (BRISK) [16] are good examples. In the

next section, we will briefly present these descriptors. Their

stimulating contribution is that a binary string obtained by

simply comparing pairs of image intensities can efficiently

describe a keypoint, i.e. an image patch. However, several problems remain: how do we efficiently select the ideal pairs within an image patch? How do we match them? Interestingly, this trend is in line with nature's strategy of describing complex observations with simple rules. We propose to address

such unknowns by designing a descriptor inspired by

the Human Visual System, and more precisely the retina.

We propose the Fast Retina Keypoint (FREAK) as a fast,


compact and robust keypoint descriptor. A cascade of binary

strings is computed by efficiently comparing pairs of

image intensities over a retinal sampling pattern. Interestingly,

selecting pairs to reduce the dimensionality of the descriptor

yields a highly structured pattern that mimics the

saccadic search of the human eyes.

2. Related work

Keypoint descriptors are often coupled with their detection.

Tuytelaars et al. in [29] and Gauglitz et al. in [11] presented

a detailed survey. We briefly present state-of-the-art

detectors and mainly focus on descriptors.

2.1. Keypoint detectors

A first solution is to consider corners as keypoints. Harris

and Stephens in [12] proposed the Harris corner detector.

Mikolajczyk and Schmid made it scale invariant in [20].

Another solution is to use local extrema of the responses

of certain filters as potential keypoints. Lowe in [17] filtered

the image with differences of Gaussians. Bay et al.

in [4] used a Fast Hessian detector. Agrawal et al. in [1]

proposed simplified center-surround filters to approximate

the Laplacian. Ebrahimi and Mayol-Cuevas in [7] accelerated

the process by skipping the computation of the filter

response if the response for the previous pixel is very low.

Rosten and Drummond proposed in [25] the FAST criterion

for corner detection, improved by Mair et al. in [18]

with their AGAST detector. The latter is a fast algorithm to

locate keypoints. The detector used in BRISK by Leutenegger

et al. in [16] is a multi-scale AGAST. They search for

maxima in scale-space using the FAST score as a measure

of saliency. We use the same detector for our evaluation of

FREAK.

2.2. SIFT-like descriptors

Once keypoints are located, we are interested in describing

the image patch with a robust feature vector. The most

well-known descriptor is SIFT [17]. A 128-dimensional

vector is obtained from a grid of histograms of oriented gradient.

Its high descriptive power and robustness to illumination

change have ranked it as the reference keypoint descriptor

for the past decade. A family of SIFT-like descriptors has emerged in recent years. PCA-SIFT [14] reduces the description vector from 128 to 36 dimensions using principal

component analysis. The matching time is reduced, but the

time to build the descriptor is increased leading to a small

gain in speed and a loss of distinctiveness. The GLOH descriptor

[21] is an extension of the SIFT descriptor that is

more distinctive, but also more expensive to compute. The

robustness to change of viewpoint is improved in [31] by

simulating multiple deformations to the descriptive patch.

A good compromise between performance and the number of simulated patches leads to an algorithm twice as slow as SIFT. Ambai and Yoshida proposed Compact And Real-time Descriptors (CARD) in [3] to extract the histogram of oriented gradients from the grid binning of SIFT or the log-polar

binning of GLOH. The computation of the histograms

is simplified by using lookup tables.

One of the widely used keypoints at the moment is

clearly SURF [4]. It has similar matching performance to SIFT but is much faster. It also relies on local gradient histograms.

The Haar-wavelet responses are efficiently computed

with integral images, leading to 64- or 128-dimensional

vectors. However, the dimensionality of the feature vector

is still too high for large-scale applications such as image

retrieval or 3D reconstruction. Often, Principal Component

Analysis (PCA), or hashing functions are used to reduce the

dimensionality of the descriptors [24]. Such steps involve

time-consuming computation and hence affect the real-time

performance.
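As a sketch of that reduction step (not the method of any cited paper; the data and dimensions are hypothetical), a plain eigendecomposition-based PCA projects high-dimensional descriptors onto their top principal components:

```python
import numpy as np

def pca_reduce(X, k):
    """Project row-vector descriptors X (n x d) onto their top-k principal
    components, computed from the covariance of the centered data."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k components
    return Xc @ top

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))    # 200 toy 64-dimensional descriptors
Y = pca_reduce(X, 16)
assert Y.shape == (200, 16)
```

The projection itself is cheap, but computing the components and projecting every descriptor is exactly the extra cost the text refers to.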

2.3. Binary descriptors

Calonder et al. in [5] showed that it is possible to shortcut

the dimensionality reduction step by directly building

a short binary descriptor in which each bit is independent, called BRIEF. A clear advantage of binary descriptors

is that the Hamming distance (bitwise XOR followed

by a bit count) can replace the usual Euclidean distance.
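To make this advantage concrete, here is a minimal sketch (toy 8-bit descriptors, not any library's code) of Hamming-distance matching as an XOR followed by a bit count:

```python
def hamming(d1: int, d2: int) -> int:
    """Hamming distance between two binary descriptors stored as integers:
    XOR the bit strings, then count the set bits."""
    return bin(d1 ^ d2).count("1")

# Two toy 8-bit descriptors differing in exactly two bit positions.
a = 0b10110100
b = 0b10010110
assert hamming(a, b) == 2
```

On real hardware this maps to a single XOR plus a popcount instruction per machine word, which is why binary descriptors match so much faster than floating-point ones under Euclidean distance.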

The descriptor vector is obtained by comparing the intensity

of 512 pairs of pixels after applying a Gaussian smoothing

to reduce the noise sensitivity. The positions of the pixels

are pre-selected randomly according to a Gaussian distribution

around the patch center. The obtained descriptor

is not invariant to scale and rotation changes unless coupled with a detector providing them. Calonder et al. also highlighted

in their work that orientation detection usually reduces the recognition rate and should therefore be avoided when not required by the target application. Rublee et

al. in [26] proposed the Oriented FAST and Rotated BRIEF

(ORB) descriptor. Their binary descriptor is invariant to

rotation and robust to noise. Similarly, Leutenegger et al.

in [16] proposed a binary descriptor invariant to scale and

rotation called BRISK. To build the descriptor bit-stream,

a limited number of points in a specific sampling pattern

is used. Each point contributes to many pairs. The pairs

are divided into short-distance and long-distance subsets. The long-distance subset is used to estimate the direction of the keypoint, while the short-distance subset is used to build the binary descriptor after rotating the sampling pattern.
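The orientation step can be sketched as follows; this is only an illustration of the idea with hypothetical sample positions and intensities, not the BRISK implementation: each long-distance pair contributes an intensity-difference-weighted displacement vector, and the keypoint direction is the angle of the average.

```python
import math

def estimate_orientation(points, intensities, long_pairs):
    """Estimate the keypoint direction from long-distance pairs: accumulate
    intensity-difference-weighted displacement vectors, return their mean angle."""
    gx = gy = 0.0
    for i, j in long_pairs:
        dx = points[j][0] - points[i][0]
        dy = points[j][1] - points[i][1]
        norm2 = dx * dx + dy * dy
        w = (intensities[j] - intensities[i]) / norm2
        gx += w * dx
        gy += w * dy
    n = len(long_pairs)
    return math.atan2(gy / n, gx / n)

# Toy pattern: intensity increases along +x, so the estimated direction is ~0 rad.
pts = [(-1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
ints = [0.0, 1.0, 0.5, 0.5]
angle = estimate_orientation(pts, ints, [(0, 1), (2, 3)])
```

Rotating the sampling pattern by this angle before evaluating the short-distance pairs is what makes the resulting binary string rotation invariant.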

In Section 5, we compare our proposed FREAK descriptor

with the descriptors presented above. But first, based on the study of the human retina, we present a possible intuition for why these binary descriptors can work.


Figure 2: From human retina to computer vision: the biological pathways leading to action potentials are emulated by simple binary tests over pixel regions. [Upper part of the image is a courtesy of the book Avian Visual Cognition by R. Cook.]

3. Human retina

3.1. Motivations

In the presented literature, we have seen that recent

progress in image representation has shown that simple intensity

comparison of several pairs of pixels can be good

enough to describe and match image patches [5, 26, 16].

However, some open questions remain about the ideal selection of pairs. How should we sample and compare them? How can we be robust to noise? Should we smooth with a single Gaussian kernel? In this work, we show how

to gain performance by selecting a solution inspired by the

human retina, while enforcing low computational complexity.

Neuroscience has made much progress in understanding

the visual system and how images are transmitted to the

brain [8]. It is believed that the human retina extracts details

from images using Difference of Gaussians (DoG) of

various sizes and encodes such differences with action potentials.

The topology of the retina plays an important role.

We propose to mimic the same strategy to design our image

descriptor.
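The Difference-of-Gaussians encoding mentioned above can be sketched in one dimension (the kernel sizes and the step-edge signal here are illustrative assumptions): subtracting a coarse smoothing from a fine one responds strongly near detail, such as an edge, and not at all in flat regions.

```python
import math

def gaussian_kernel(sigma, radius):
    """Discrete 1-D Gaussian kernel, normalized to sum to 1."""
    k = [math.exp(-0.5 * (x / sigma) ** 2) for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def convolve(signal, kernel):
    """Same-size 1-D convolution with zero padding at the borders."""
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - r
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

# A step edge: the DoG response peaks near the edge and vanishes in flat areas.
signal = [0.0] * 8 + [1.0] * 8
fine = convolve(signal, gaussian_kernel(1.0, 3))
coarse = convolve(signal, gaussian_kernel(2.0, 6))
dog = [f - c for f, c in zip(fine, coarse)]
```

FREAK's bits are exactly the signs of such DoG responses, computed between pairs of smoothed receptive fields instead of along a 1-D signal.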

3.2. Analogy: from retinal photoreceptors to pixels

The topology and spatial encoding of the retina are quite fascinating. First, several photoreceptors influence a ganglion cell. The region where light influences the response of a ganglion cell is its receptive field. Its size, and the dendritic field, increase with radial distance from the foveola (Figure 3). The spatial density of ganglion cells decreases exponentially with distance from the foveola. They are segmented into four areas: the foveola, fovea, parafoveal, and perifoveal.

Each area plays an interesting role in the process of detecting and recognizing objects, since a higher-resolution image is captured in the fovea whereas a low-acuity image is formed in the perifoveal area.

Figure 3: Illustration of the distribution of ganglion cells over the retina: (a) density of ganglion cells over the retina [10]; (b) retina areas [13]. The density is clustered into four areas: (a) the foveola, (b) fovea, (c) parafoveal, and (d) perifoveal.

as a body resource optimization. Let us now turn these

insights into an actual keypoint descriptor. Figure 2 presents

the proposed analogy.

4. FREAK

4.1. Retinal sampling pattern

Many sampling grids can be used to compare pairs of pixel intensities. BRIEF and ORB use random pairs. BRISK uses a circular pattern where points are equally spaced on concentric circles, similar to DAISY [28]. We propose to use a retinal sampling grid, which is also circular but has a higher density of points near the center. The density of points drops exponentially, as can be seen in Figure 3.

Each sample point needs to be smoothed to be less sensitive to noise. BRIEF and ORB use the same kernel for all points in the patch. To match the retina model, we use a different kernel size for every sample point, similar to BRISK. The differences with BRISK are the exponential change in size and the overlapping receptive fields. Figure 4 illustrates the topology of the receptive fields. Each circle represents the standard deviation of the Gaussian kernel applied to the corresponding sampling point.
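A pattern of this kind can be sketched as follows; the ring count, points per ring, and sigma scaling below are illustrative assumptions, not the paper's exact parameters:

```python
import math

def retinal_pattern(n_rings=7, points_per_ring=6, r_max=1.0, rho=0.7):
    """Build a FREAK-style sampling pattern: concentric rings whose radius
    shrinks geometrically toward the center, with a smoothing sigma
    proportional to each ring's radius, plus one central (foveal) point."""
    pattern = []  # list of (x, y, sigma)
    for ring in range(n_rings):
        r = r_max * (rho ** ring)      # exponentially decreasing radius
        sigma = 0.5 * r                # receptive field grows with eccentricity
        offset = (math.pi / points_per_ring) * (ring % 2)  # stagger alternate rings
        for k in range(points_per_ring):
            theta = 2 * math.pi * k / points_per_ring + offset
            pattern.append((r * math.cos(theta), r * math.sin(theta), sigma))
    pattern.append((0.0, 0.0, 0.05 * r_max))  # central point, smallest kernel
    return pattern

pattern = retinal_pattern()
```

The geometric radius schedule gives the exponential density increase toward the center, and tying sigma to the radius makes the outer receptive fields large enough to overlap, as discussed below.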

We have experimentally observed that changing the size of the Gaussian kernels with respect to the log-polar retinal pattern leads to better performance. In addition, overlapping the receptive fields also increases the performance. A possible reason is that, with the overlap presented in Figure 4, more information is captured. We add redundancy that brings more discriminative power. Let us consider the intensities I_i measured at the receptive fields A, B, and C, where:

I_A > I_B,  I_B > I_C,  and  I_A > I_C.  (1)

If the fields do not overlap, then the last test I_A > I_C adds no discriminant information. However, if the fields overlap, partially new information can be encoded.

Figure 4: Illustration of the FREAK sampling pattern, similar to the retinal ganglion cell distribution, with the corresponding receptive fields. Each circle represents a receptive field where the image is smoothed with its corresponding Gaussian kernel.

In general, adding redundancy allows us to use fewer

receptive fields which is a known strategy employed in compressed

sensing or dictionary learning [6]. According to Olshausen

and Field in [23], such redundancy also exists in the

receptive fields of the retina.
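The redundancy in (1) can be checked directly for non-overlapping fields (a toy enumeration, purely illustrative): transitivity forbids some bit patterns, so the third comparison carries less than a full bit of new information.

```python
from itertools import permutations

# Enumerate all orderings of three distinct non-overlapping intensities and
# record the resulting bit pattern (I_A > I_B, I_B > I_C, I_A > I_C).
patterns = set()
for I_A, I_B, I_C in permutations([1, 2, 3]):
    patterns.add((I_A > I_B, I_B > I_C, I_A > I_C))

assert len(patterns) == 6                        # only 6 of 8 patterns occur
assert (True, True, False) not in patterns       # forbidden by transitivity
```

With overlapping fields the measured intensities are correlated but not identical, so such "forbidden" patterns become possible and the extra comparisons regain discriminative value.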

4.2. Coarse-to-fine descriptor

We construct our binary descriptor F by thresholding the

difference between pairs of receptive fields with their corresponding

Gaussian kernel. In other words, F is a binary

string formed by a sequence of one-bit Difference of Gaussians

(DoG):

F = \sum_{0 \le a < N} 2^a T(P_a),  (2)

where P_a is a pair of receptive fields, N is the desired size of the descriptor, and

T(P_a) = 1 if (I(P_a^{r_1}) - I(P_a^{r_2})) > 0, and 0 otherwise,

where I(P_a^{r_1}) is the smoothed intensity of the first receptive field of the pair P_a.
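Equation (2) can be sketched in a few lines; the receptive-field intensities and pair list below are toy values, and the smoothed intensities are assumed to be precomputed:

```python
def freak_descriptor(intensities, pairs):
    """Build the binary descriptor F of Eq. (2): bit a is 1 when the
    smoothed-intensity difference of receptive-field pair P_a is positive."""
    F = 0
    for a, (r1, r2) in enumerate(pairs):
        if intensities[r1] - intensities[r2] > 0:
            F |= 1 << a          # contribute 2^a * T(P_a)
    return F

# Toy example: 4 receptive fields, N = 3 pairs.
I = [10.0, 4.0, 7.0, 9.0]
pairs = [(0, 1), (1, 2), (3, 2)]
# bits: (10-4>0) -> 1, (4-7>0) -> 0, (9-7>0) -> 1, so F = 0b101
assert freak_descriptor(I, pairs) == 0b101
```

Because each bit depends only on the sign of one difference, the descriptor is computed with comparisons alone, and matching reduces to the Hamming distance discussed in Section 2.3.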