Rich feature hierarchies for accurate object detection and semantic segmentation. Girshick, R., Donahue, J., Darrell, T., & Malik, J. arXiv:1311.2524 [cs], October 2014.
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012—achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
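
The detection pipeline the abstract describes (bottom-up region proposals, each proposal warped to a fixed size, CNN feature extraction on the warped crop, then class-specific scoring) can be sketched roughly as below. This is a minimal illustration, not the authors' released system (available at the URL above): the propose_regions helper is a hypothetical stand-in for a proposal method such as selective search, the torchvision AlexNet backbone substitutes for the paper's fine-tuned ImageNet network, and the 224x224 warp size follows torchvision rather than the paper's 227x227.

	import torch
	import torchvision
	from torchvision import transforms
	from PIL import Image

	def propose_regions(image):
	    """Hypothetical stand-in for a bottom-up proposal method such as
	    selective search; returns (left, upper, right, lower) boxes."""
	    w, h = image.size
	    return [(0, 0, w // 2, h // 2), (w // 4, h // 4, w, h)]  # dummy boxes

	# ImageNet-pretrained backbone; the paper additionally fine-tunes the CNN
	# on the detection data, which is omitted in this sketch.
	backbone = torchvision.models.alexnet(weights="IMAGENET1K_V1")
	backbone.eval()

	# Each proposal is warped to the fixed input size the CNN expects.
	warp = transforms.Compose([
	    transforms.Resize((224, 224)),
	    transforms.ToTensor(),
	    transforms.Normalize(mean=[0.485, 0.456, 0.406],
	                         std=[0.229, 0.224, 0.225]),
	])

	def region_features(image_path):
	    image = Image.open(image_path).convert("RGB")
	    feats = []
	    with torch.no_grad():
	        for box in propose_regions(image):
	            crop = image.crop(box)                     # region proposal
	            x = warp(crop).unsqueeze(0)                # warp to fixed size
	            f = backbone.features(x)                   # conv feature maps
	            f = torch.flatten(backbone.avgpool(f), 1)  # pooled CNN features
	            feats.append((box, f))
	    # In R-CNN these per-region features feed class-specific linear SVMs
	    # and a bounding-box regressor; that scoring stage is not shown here.
	    return feats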
@article{girshick_rich_2014,
	title = {Rich feature hierarchies for accurate object detection and semantic segmentation},
	url = {http://arxiv.org/abs/1311.2524},
	abstract = {Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30\% relative to the previous best result on VOC 2012---achieving a mAP of 53.3\%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/{\textasciitilde}rbg/rcnn.},
	urldate = {2022-03-02},
	journal = {arXiv:1311.2524 [cs]},
	author = {Girshick, Ross and Donahue, Jeff and Darrell, Trevor and Malik, Jitendra},
	month = oct,
	year = {2014},
	note = {arXiv: 1311.2524},
	keywords = {Computer Science - Computer Vision and Pattern Recognition},
}
