
Feature Series: GIST

2014-02-25 11:04
Original article: Feature Series - GIST, by lowrank
http://ilab.usc.edu/siagian/Research/Gist/Gist.html




Gist/Context of a Scene



We describe and validate a simple context-based scene recognition algorithm using a multiscale set of early-visual features, which capture the "gist" of the scene into a low-dimensional signature vector. Distinct from previous approaches, the algorithm presents the advantage of being biologically plausible and of having low computational complexity, sharing its low-level features with a model for visual attention that may operate concurrently on a vision system.

We compare classification accuracy using scenes filmed at three outdoor sites on campus (13,965 to 34,711 frames per site). Dividing each site into nine segments, we obtain segment classification rates between 84.21% and 88.62%. Combining scenes from all sites (75,073 frames in total) yields 86.45% correct classification, demonstrating generalization and scalability of the approach.

Index Terms: Gist of a scene, saliency, scene recognition, computational neuroscience, image classification, image statistics, robot vision, robot localization.


Papers

Main paper for model details:

C. Siagian, L. Itti, Rapid Biologically-Inspired Scene Classification Using Features Shared with Visual Attention, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 2, pp. 300-312, Feb 2007.

Comparison with 3 other gist models (Renninger and Malik [2004vr], Oliva and Torralba [2001ijcv], and Torralba et al. [2003iccv]):

C. Siagian, L. Itti, Comparison of gist models in rapid scene categorization tasks, In: Proc. Vision Science Society Annual Meeting (VSS08), May 2008.


Source Code and Dataset

The code is integrated into the iLab Neuromorphic Vision C++ Toolkit. In order to gain code access, please follow the download instructions there.

Special instructions to access the gist code can be found here.

The dataset can be found here.


Introduction

A significant number of mobile-robotics approaches address the fundamental problem of localization by utilizing sonar, laser, or other range sensors [Fox1999, Thrun1998a]. They are particularly effective indoors due to many spatial and structural regularities such as flat walls and narrow corridors. In the outdoors, however, these sensors become less robust given all the protrusions and surface irregularities [Lingemann2004]. For example, a slight change in pose can result in large jumps in range reading because of tree trunks, moving branches, and leaves.

These difficulties with traditional robot sensors have prompted research towards vision. Within computer vision, lighting (especially in the outdoors), dynamic backgrounds, and view-invariant matching become major hurdles to overcome.

Object-based approaches [Abe1999, Thrun1998b] recognize physical locations by identifying sets of pre-determined landmark objects (and their configuration) known to be present at a location. This typically involves intermediate steps such as segmentation, feature grouping, and object recognition. Such a layered approach is prone to carrying over and amplifying low-level errors along the stream of processing.

It should also be pointed out that this approach may be environment-specific, in that the objects are hand-picked, as selecting reliable landmarks is an open problem.

Region-based approaches [Katsura2003, Matsumoto2000, Murrieta-Cid2002] use segmented image regions and their relationships to form a signature of a location. This requires robust segmentation of individual regions, which is hard for unconstrained environments such as a park where vegetation dominates.

Context-based approaches ([Renninger and Malik 2004], [Ulrich and Nourbakhsh 2000], [Oliva and Torralba 2001], [Torralba 2003]), on the other hand, bypass the above traditional processing steps: they consider the input image as a whole and extract a low-dimensional signature that summarizes the image's statistics and/or semantics. One motivation for such an approach is that it yields more robust solutions, because random noise, which may catastrophically influence local processing, tends to average out globally.

Despite recent advances in computer vision and robotics, humans still perform orders of magnitude better in outdoor localization and navigation than the best available systems. It is thus inspiring to examine the low-level mechanisms as well as the system-level computational architecture according to which human vision is organized (figure 1).



Figure 1. Biological Vision Model

Early on, the human visual processing system already makes decisions to focus attention and processing resources onto small regions which look more interesting. The mechanism by which very rapid holistic image analysis gives rise to a small set of candidate salient locations in a scene has recently been the subject of comprehensive research efforts and is fairly well understood [Treisman_Gelade80, Wolfe94, Itti_etal98, Itti_Koch01].

In parallel with attention guidance and mechanisms for saliency computation, humans demonstrate the ability to capture the "gist" of a scene; for example, following presentation of a photograph for just a fraction of a second, an observer may report that it is an indoor kitchen scene with numerous colorful objects on the countertop [Potter1975, Biederman82, Tversky1983, Oliva1997]. Such a report at first glance (brief exposures of 100 ms or below) onto an image is remarkable considering that it summarizes the quintessential characteristics of an image, a process previously expected to require much analysis: general semantic attributes (e.g., indoors, outdoors, office, kitchen), recognition of places with a restricted spatial layout [Epstein_Kanwisher00], and a coarse evaluation of the distribution of visual features (e.g., highly colorful, grayscale, several large masses, many small objects) [Sanocki_Epstein97, Rensink00].

The idea that saliency and gist run in parallel is further strengthened by a psychophysics experiment showing that humans can answer specific questions about a scene even when the subject's attention is simultaneously engaged by another concurrent visual discrimination task [Li_etal02]. From the point of view of desired results, gist and saliency appear to be complementary opposites: finding salient locations requires finding those image regions which stand out by significantly differing from their neighbors, while computing gist involves accumulating image statistics over the entire scene. Yet, despite these differences, there is only one visual cortex in the primate brain, which must serve both saliency and gist computations. Part of our contribution is to make the connection between these two crucial components of biological mid-level vision. To this end, we here explicitly explore whether it is possible to devise a working system where the low-level feature extraction mechanisms (coarsely corresponding to cortical visual areas V1 through V4 and MT) are shared, as opposed to computed separately by two different machine vision modules. The divergence comes at a later stage, in how the low-level vision features are further processed before being utilized. In our neural simulation of posterior parietal cortex along the dorsal or "where" stream of visual processing [Ungerleider_Mishkin82], a saliency map is built through spatial competition of low-level feature responses throughout the visual field. This competition quiets down locations which may initially yield strong local feature responses but resemble their neighbors, while amplifying locations which have distinctive appearances. In contrast, in our neural simulation of inferior temporal cortex along the ventral or "what" stream of visual processing, responses from the low-level feature detectors are combined to produce the gist vector as a holistic low-dimensional signature of the entire input image. The two models, when run in parallel, can help each other and provide a more complete description of the scene in question.

While exploitation of the saliency map has been extensively described previously for a number of vision tasks [Itti_etal98pami, Itti_Koch00vr, Itti_Koch01nrn, Itti04tip], we describe here how our algorithm computes gist in an inexpensive manner by using the same low-level visual front-end as the saliency model. In what follows, we use the term gist in a more specific sense than its broad psychological definition (what observers can gather from a scene over a single glance), by formalizing it as a relatively low-dimensional scene representation which is acquired over very short time frames and used to classify scenes as belonging to a given category. We extensively test the gist model in three challenging outdoor environments across multiple days and times of day, where the dominating shadows, vegetation, and other ephemeral phenomena are expected to defeat landmark-based and region-based approaches. Our success in achieving reliable performance in each environment is further generalized by showing that performance does not degrade when combining all three environments. These results support our hypothesis that gist can reliably be extracted at very low computational cost, using very simple visual features shared with an attention system in an overall biologically-plausible framework.


Design and Implementation

The core of our present research focuses on the process of extracting the gist of an image using features from several domains, calculating its holistic characteristics but still taking into account coarse spatial information. The starting point for the proposed new model is the existing saliency model of Itti et al. [Itti_etal98pami], freely available on the World-Wide-Web.

Please see the iLab Neuromorphic Vision C++ Toolkit for all the source code.


Visual Feature Extraction

In the saliency model, an input image is filtered in a number of low-level visual feature channels (color, intensity, orientation, flicker, and motion) at multiple spatial scales. Some channels, like color, orientation, or motion, have several sub-channels, one for each color type, orientation, or direction of motion. Each sub-channel has a nine-scale pyramidal representation of filter outputs. Within each sub-channel, the model performs center-surround operations between filter outputs at different scales to produce feature maps. The different feature maps for each type allow the system to pick up regions at several scales, with the added benefit of lighting invariance. The intensity channel output for the illustration image in the figure below shows different-sized regions being emphasized according to their respective center-surround parameters.
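As an illustration of the center-surround idea, the snippet below builds a nine-level Gaussian pyramid and takes across-scale differences between a fine "center" level and a coarser "surround" level. This is only a minimal NumPy sketch, not the iLab C++ implementation; the particular center/surround pairs (centers 2-4, surrounds 3-4 levels coarser, six maps per sub-channel) are assumed from the Itti et al. saliency model.

```python
# Minimal NumPy sketch of center-surround feature maps; not the iLab C++ code.
# Scale pairs (center c in {2,3,4}, surround s = c + delta, delta in {3,4})
# are assumed from the Itti et al. saliency model.
import numpy as np
from scipy import ndimage

def gaussian_pyramid(img, levels=9):
    """Nine-scale pyramid: repeatedly blur and subsample by two."""
    pyr = [img.astype(np.float64)]
    for _ in range(levels - 1):
        blurred = ndimage.gaussian_filter(pyr[-1], sigma=1.0)
        pyr.append(blurred[::2, ::2])
    return pyr

def center_surround(pyr, centers=(2, 3, 4), deltas=(3, 4)):
    """Across-scale absolute differences |center - surround|, one map per pair."""
    maps = []
    for c in centers:
        for d in deltas:
            up = ndimage.zoom(pyr[c + d], 2 ** d, order=1)  # surround back to center scale
            h, w = pyr[c].shape
            maps.append(np.abs(pyr[c] - up[:h, :w]))        # crop any rounding overhang
    return maps  # six feature maps for this sub-channel

# Example: intensity sub-channel on a stand-in image
feature_maps = center_surround(gaussian_pyramid(np.random.rand(480, 640)))
print(len(feature_maps))  # -> 6
```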





Figure 2. Gist Model


The saliency model uses the feature maps to detect conspicuous regions in each channel through additional winner-take-all mechanisms, yielding a saliency map which emphasizes locations that substantially differ from their neighbors [Itti_etal98pami]. To re-use the same intermediate maps for gist as for attention, our gist model uses the already available orientation, color, and intensity channels (flicker and motion are here assumed to be more dominantly determined by the robot's egomotion and hence unreliable in forming a gist signature of a given location). The basic approach is to exploit statistics of color and texture measurements in predetermined region subdivisions.

We incorporate information from the orientation channel, applying Gabor filters to the greyscale input image at four different angles and at four spatial scales, for a subtotal of sixteen sub-channels. We do not perform center-surround on the Gabor filter outputs because these filters are already differential by nature. The color and intensity channels combine to compose three pairs of color opponents derived from Ewald Hering's color-opponency theories [Turner1994]: the color channels' red-green and blue-yellow opponency pairs along with the intensity channel's dark-bright opponency. Each of the opponent pairs is used to construct six center-surround scale combinations. These eighteen sub-channels, together with the sixteen Gabor combinations, add up to a total of thirty-four sub-channels. Because the present gist model is not specific to any domain, other channels such as stereo could be used as well.
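A plain bookkeeping sketch of this channel layout follows. The sub-channel names, the orientation values, and the particular center-surround scale pairs are illustrative assumptions; the counts (16 Gabor combinations plus 18 opponency maps, 34 in total) follow the text above.

```python
# Bookkeeping sketch of the 34 sub-channels described above; names and scale
# pairs are illustrative, not taken from the iLab toolkit.
from itertools import product

orientations = [0, 45, 90, 135]                              # 4 Gabor orientations (degrees)
gabor_scales = range(4)                                      # 4 spatial scales
opponencies = ["red-green", "blue-yellow", "dark-bright"]    # 3 opponency pairs
cs_pairs = [(c, c + d) for c in (2, 3, 4) for d in (3, 4)]   # 6 assumed center-surround combos

gabor_subchannels = [f"gabor_o{o}_s{s}" for o, s in product(orientations, gabor_scales)]
opponency_subchannels = [f"{op}_c{c}s{s}" for op, (c, s) in product(opponencies, cs_pairs)]

subchannels = gabor_subchannels + opponency_subchannels
print(len(gabor_subchannels), len(opponency_subchannels), len(subchannels))  # 16 18 34
```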


Gist Feature Extraction

After the center-surround features are computed, each sub-channel extracts a gist vector from its corresponding feature map. We apply averaging operations (the simplest neurally-plausible computation) over a fixed four-by-four grid of sub-regions covering the map; see a sub-channel in the figure below for a visualization of the process. This is in contrast with the winner-take-all competition used to compute saliency; hence, saliency and gist emphasize two complementary aspects of the data in the feature maps: saliency focuses on the most salient peaks of activity, while gist estimates overall activation in different image regions.
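A minimal sketch of this per-map pooling step is shown below: each feature map is partitioned into a 4x4 grid and the mean activation of each cell is kept, giving 16 values per map. The grid boundaries and the stand-in map are illustrative.

```python
# Sketch of per-map gist extraction: average the feature map over a fixed 4x4
# grid of sub-regions, yielding 16 values per map.
import numpy as np

def grid_average(feature_map, grid=4):
    """Mean activation in each cell of a grid x grid partition of the map."""
    h, w = feature_map.shape
    ys = np.linspace(0, h, grid + 1, dtype=int)   # row boundaries
    xs = np.linspace(0, w, grid + 1, dtype=int)   # column boundaries
    cells = [feature_map[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
             for i in range(grid) for j in range(grid)]
    return np.array(cells)                        # 16-dimensional vector

# Concatenating this over all 34 sub-channel maps gives the 544-D raw gist vector.
fmap = np.random.rand(60, 80)
print(grid_average(fmap).shape)  # -> (16,)
```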





Figure 3. Gist Extraction



PCA/ICA Dimension Reduction

The total number of raw gist feature dimensions is 544: 34 feature maps times 16 regions per map (see figure 2). We reduce the dimensionality using Principal Component Analysis (PCA) followed by Independent Component Analysis (ICA) with FastICA, down to a more practical number of 80, while still preserving up to 97% of the variance for a set of upwards of 30,000 campus scenes.
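The snippet below sketches this reduction using scikit-learn's PCA and FastICA as stand-ins for the original implementation. The data, sample count, and random seeds are placeholders; the roughly 97% preserved variance quoted above applies to the real campus data, not to this random stand-in.

```python
# Sketch of the dimension reduction: PCA then FastICA, 544-D raw gist -> 80-D.
# scikit-learn is used here only for illustration.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 544))      # stand-in for the ~30,000 campus gist vectors

pca = PCA(n_components=80).fit(X)
X_pca = pca.transform(X)              # decorrelated 80-D projection
print(f"variance preserved: {pca.explained_variance_ratio_.sum():.1%}")  # ~97% on the real data

ica = FastICA(n_components=80, random_state=0)
X_gist = ica.fit_transform(X_pca)     # final 80-D gist signature per frame
print(X_gist.shape)                   # -> (5000, 80)
```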


Scene Classification

For scene classification, we use a three-layer neural network (with intermediate layers of 200 and 100 nodes), trained with the back-propagation algorithm. The complete process is illustrated in figure 2.
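A rough stand-in for this classifier, using scikit-learn's MLPClassifier rather than the original network code, is sketched below. The sigmoid activation, solver defaults, and random training data are assumptions; the hidden-layer sizes (200 and 100) and the nine-segment output follow the text.

```python
# Stand-in classifier sketch: 80-D gist vectors -> segment label, with hidden
# layers of 200 and 100 nodes as described above. Not the original network code.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 80))     # placeholder 80-D gist vectors
y_train = rng.integers(0, 9, size=2000)   # nine segment labels for one site

clf = MLPClassifier(hidden_layer_sizes=(200, 100),  # two intermediate layers
                    activation="logistic",          # sigmoid units (assumption)
                    max_iter=300)
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))           # predicted segment indices
```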


Testing and Results

We test the system using this dataset.

The results for each site are shown in Tables 1 to 6, in columnar and confusion-matrix format. Tables 7 and 8 are explained below. For Tables 1, 3, 5, and 7, the term "False +" (false positive) for segment x means the percentage of incorrect segment-x guesses given that the correct answer is another segment, while "False -" (false negative) is the percentage of incorrect guesses given that the correct answer is segment x.
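Under one reading of these definitions, the per-segment rates can be derived from the segment confusion matrix as sketched below; the example counts are made up, and the exact normalization used in the published tables may differ.

```python
# Illustrative computation of per-segment "False +" and "False -" rates from a
# confusion matrix C, where C[i, j] counts frames with true segment i that were
# guessed as segment j. This follows one plausible reading of the definitions.
import numpy as np

def false_rates(C):
    C = np.asarray(C, dtype=float)
    fp, fn = [], []
    for x in range(C.shape[0]):
        guessed_x = C[:, x].sum()   # all frames guessed as segment x
        actual_x = C[x, :].sum()    # all frames whose true segment is x
        fp.append((guessed_x - C[x, x]) / guessed_x if guessed_x else 0.0)
        fn.append((actual_x - C[x, x]) / actual_x if actual_x else 0.0)
    return np.array(fp), np.array(fn)

C = np.array([[90,  5,  5],
              [10, 80, 10],
              [ 0, 15, 85]])        # made-up 3-segment example
false_pos, false_neg = false_rates(C)
print(false_pos.round(3), false_neg.round(3))
```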

The system is able to classify the ACB segments with an overall 87.96% correctness, while AnF is marginally lower (84.21%). Given the challenges presented by the scenes in the second site (dominated by vegetation), it is quite an accomplishment to lose less than 4 percent in performance with no calibration done in moving from the first environment to the second. An increase in segment length also does not markedly affect the results, as FDF (86.38%), which has the longest segments among the experiments, does better than AnF. As a performance reference, when we test the system with a set of data taken back-to-back with the training data, the classification rate is about 89 to 91 percent. On the other hand, when the lighting conditions of the testing data are not included in training, the error can triple to thirty to forty percent, which suggests that lighting coverage in the training phase is critical.


Ahmanson Center for Biological Science (ACB)

A video of a test run for Ahmanson Center for Biological Science can be viewed here.






Associate and Founders Park (AnF)

A video of a test run for Associate and Founders Park can be viewed here.






Frederick D. Fagg park (FDF)

A video of a test run for Frederick D. Fagg park can be viewed here.






Combined Sites

As a way to gauge the system's scalability, we combine scenes from all three sites and train the classifier to distinguish twenty-seven different segments. We use the same procedure as well as the same training and testing data (175,406 and 75,073 frames, respectively). The only difference is in the neural-network classifier: the output layer now consists of twenty-seven nodes. The number of input and hidden nodes remains the same. During training we print the confusion matrix periodically to analyze the process and find that the network first converges on inter-site classification before going further and eliminating the intra-site errors. We organize the results into segment-level (Table 7) and site-level (Table 8) statistics. For segment-level classification, the overall success rate is 84.61%, not much worse than in the previous three experiments. Notice also that the success rates for the individual sites change as well. From the site-level confusion matrix (Table 8), we see that the system can reliably pin the scene to the correct site (higher than 94 percent). This is encouraging because the classifier can provide various levels of output. That is, when the system is unsure about the actual segment location, it can at least rely on being at the right site.






Model Comparisons

We also compared our model with three other models:

Renninger and Malik [2004] use a set of texture descriptors as histogram entries.
Oliva and Torralba [2001] perform 2D Fourier Transform analysis (followed by PCA) in a sub-region grid.
Torralba et al. [2003] use steerable wavelet pyramids.

The results are reported in the VSS2008 poster.





Discussion

We have shown that the gist features succeed in classifying a large set of images without the help of temporal filtering (one-shot recognition), which reduces noise significantly [Torralba2003]. In terms of robustness, the features are able to handle translational and angular change. Because they are computed from large image sub-regions, it takes a large translational shift to affect the values. As for angular stability, the natural perturbation of a camera carried along a bumpy road during training seems to aid the demonstrated invariance. In addition, the gist features are also invariant to scale, because the majority of the scenes (background) are stationary and the system is trained with all viewing distances. The combined-sites experiment shows that the number of differentiable scenes can be quite high: twenty-seven segments can make up a detailed map of a large area. Lastly, the gist features achieve solid illumination invariance when trained with different lighting conditions.

A drawback of the current system is that it cannot carry out partial background matching for scenes in which large parts are occluded by dynamic foreground objects. As mentioned earlier, the videos are filmed during off-peak hours when few people (or vehicles) are on the road. Nevertheless, such objects can still create problems when moving too close to the camera. In our system, these images could be filtered out using motion cues from the not-yet-incorporated motion channel as a preprocessing step, detecting significant occlusion by thresholding the sum of the motion-channel feature maps [Itti04tip]. Furthermore, a wide-angle lens (with software distortion correction) can help to see more of the background scene and, in comparison, decrease the size of the moving foreground objects.
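A minimal sketch of that pre-filter follows; the map contents and the threshold value are placeholders, since the motion channel was not incorporated in the reported system.

```python
# Sketch of the suggested occlusion pre-filter: sum the motion-channel feature
# maps and discard a frame when total motion energy exceeds a threshold.
# Map contents and the threshold are illustrative only.
import numpy as np

def significant_occlusion(motion_maps, threshold):
    """True if summed motion-channel activity suggests a large moving occluder."""
    total_energy = sum(np.asarray(m, dtype=float).sum() for m in motion_maps)
    return total_energy > threshold

# Example: three stand-in motion feature maps for one frame
maps = [np.random.rand(60, 80) for _ in range(3)]
if significant_occlusion(maps, threshold=7500.0):
    print("frame skipped: likely occluded by a foreground object")
else:
    print("frame kept for gist extraction")
```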


Conclusion

The current gist model is able to provide high-level context information (a segment within a site) for various large and difficult outdoor environments despite using coarse features. We find that scenes from differing segments contrast in a global manner, and gist automatically exploits these contrasts, thus reducing the need for detailed calibration in which a robot has to rely on the ad-hoc knowledge of the designer for reliable landmarks. And because the raw features can be shared with the saliency model, the system can efficiently increase localization resolution: it can use salient cues to create distinct signatures of individual scenes (finer points of reference within a segment) that may not be differentiable by gist alone. The salient cues can even help guide localization in the areas between segments, which we did not try to classify.

Copyright © 2000 by the University of Southern California, iLab and Prof. Laurent Itti