Rich feature hierarchies for accurate object detection and semantic segmentation. Girshick, R. B., Donahue, J., Darrell, T., & Malik, J. ArXiv e-prints, November 2013.
@article{Girshick:2013vu,
author = {Girshick, Ross B. and Donahue, Jeff and Darrell, Trevor and Malik, Jitendra},
title = {{Rich feature hierarchies for accurate object detection and semantic segmentation}},
journal = {ArXiv e-prints},
year = {2013},
eprint = {1311.2524},
primaryclass = {cs.CV},
month = nov,
annote = {General idea: region proposals + CNN features + SVM classifiers. It's kind of a patchwork, so pure-CNN approaches were proposed later on (rough sketch below).
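
A rough test-time sketch of that pipeline (my paraphrase in Python; every callable here is a placeholder for the real component, which the paper builds from selective search, a Caffe AlexNet, and per-class SVMs):

```python
# Minimal sketch of the R-CNN test-time pipeline: class-agnostic region
# proposals -> warp each crop -> CNN features -> per-class SVM scores ->
# per-class non-maximum suppression. All arguments are placeholder callables.
def rcnn_detect(image, propose, warp, cnn_features, svms, nms):
    detections = []
    for box in propose(image):       # ~2000 selective-search proposals
        patch = warp(image, box)     # anisotropic warp to the 227x227 input
        feat = cnn_features(patch)   # e.g. 4096-d fc7 activations
        scores = {cls: svm(feat) for cls, svm in svms.items()}
        detections.append((box, scores))
    return nms(detections)           # greedy per-class NMS on scored boxes
```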

Hoiem may have a good analysis tool for detection errors. Might be worth looking at.

Relationship to OverFeat (section 4.6):

OverFeat can be seen (roughly) as a special case of R-CNN, as mentioned in Section 4.6. But it performs worse, maybe because of details such as SVM training, region proposals, etc.


p. 2, left

> However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32 × 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

As in OverFeat, multi-scale inputs + bbox regression could be combined with this approach to get more precise localization. Maybe the authors think this would make the system too complicated, or it just doesn't work well in practice, as OverFeat's results suggest.
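
The quoted receptive-field and stride numbers are easy to verify with the standard recurrence rf += (k - 1) * jump, jump *= s over the (kernel, stride) specs of the conv/pool stack (the layer list below is my assumption about the exact AlexNet-style architecture):

```python
# Sanity check of the quoted pool5 receptive field (195x195) and stride (32).
layers = [(11, 4),  # conv1
          (3, 2),   # pool1
          (5, 1),   # conv2
          (3, 2),   # pool2
          (3, 1),   # conv3
          (3, 1),   # conv4
          (3, 1),   # conv5
          (3, 2)]   # pool5

rf, jump = 1, 1          # receptive field size and cumulative stride
for k, s in layers:
    rf += (k - 1) * jump # each layer widens the field by (k-1) input-strides
    jump *= s            # and multiplies the effective stride
print(rf, jump)          # -> 195 32, matching the paper
```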

p. 6, end of 3.2

> The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features.

This seems to be the case for all CNNs.

p. 6, 3.2

> Much of the CNN's representational power comes from its convolutional layers, rather than from the much larger densely connected layers.

This is a striking point, given that the fully connected layers hold the vast majority of the weights (rough count below).
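
A back-of-the-envelope weight count makes the "much larger" contrast concrete (grouped convs as in the original AlexNet; biases ignored, numbers approximate):

```python
# Weight counts: conv1-conv5 vs fc6+fc7 in the AlexNet-style network.
conv = [(3, 96, 11, 1),     # (in_ch, out_ch, kernel, groups) per conv layer
        (96, 256, 5, 2),
        (256, 384, 3, 1),
        (384, 384, 3, 2),
        (384, 256, 3, 2)]
conv_w = sum(o * (i // g) * k * k for i, o, k, g in conv)
fc_w = 9216 * 4096 + 4096 * 4096   # fc6 + fc7; pool5 is 6*6*256 = 9216-d
print(conv_w, fc_w)                # ~2.3M conv weights vs ~54.5M fc weights
```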

p. 7, top

> which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.

This is a good insight as well (see the toy sketch below).
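
The practical recipe this suggests: freeze everything up to pool5 and learn only a linear classifier per class on top. A toy sketch (extract_pool5 is a placeholder for any frozen feature extractor; the paper's actual SVM training adds hard-negative mining, which is omitted here):

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_class_svm(extract_pool5, images, labels):
    # Stack frozen pool5 features; only the linear classifier is trained.
    X = np.stack([extract_pool5(img) for img in images])  # (N, 9216)
    clf = LinearSVC()   # one binary SVM per class in the paper
    clf.fit(X, labels)  # labels: 1 for the class, 0 for background
    return clf
```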

p. 7, above 3.3

> when compared internally to their private DPM baselines—both use non-public implementations of DPM that underperform the open source version [20]

Well, probably this is a way to get papers published...



p. 10, Section 5

Here they first use CPMC <http://www.maths.lth.se/matematiklth/personal/sminchis/code/cpmc/> to get segments (which are not rectangular) in the image, and then find a bbox containing each segment to send to the CNN. Since segments of two different objects may have nearly the same bbox (say, for a person holding a very tall tripod, the person and the tripod may have very similar bboxes), they replace the background inside the bbox with the mean color to disambiguate the two (sketch below). See <https://people.eecs.berkeley.edu/~rbg/slides/rcnn-cvpr14-slides.pdf>.
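
A sketch of that masking trick (my reconstruction, assuming numpy arrays; the mean color is the dataset mean, so the masked pixels become zero after the CNN's mean subtraction):

```python
import numpy as np

def crop_with_mean_background(image, mask, box, mean_color):
    """image: HxWx3, mask: HxW bool segment mask, box: (y0, x0, y1, x1)."""
    y0, x0, y1, x1 = box
    patch = image[y0:y1, x0:x1].copy()
    seg = mask[y0:y1, x0:x1]
    patch[~seg] = mean_color  # two segments sharing a bbox now look different
    return patch              # warp to 227x227 and feed to the CNN as usual
```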



},
keywords = {classics, deep learning},
read = {Yes},
rating = {5},
date-added = {2017-02-19T15:46:40GMT},
date-modified = {2017-02-19T19:13:46GMT},
url = {http://arxiv.org/abs/1311.2524},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2013/Girshick/arXiv%202013%20Girshick.pdf},
file = {{arXiv 2013 Girshick.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2013/Girshick/arXiv 2013 Girshick.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/0EC70D5F-CEA5-43DE-A6E2-AF820F2E22A4}}
}
