Rich feature hierarchies for accurate object detection and semantic segmentation. Girshick, R., Donahue, J., Darrell, T., & Malik, J. October 2014. arXiv:1311.2524 [cs], version 5
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012—achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
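The abstract's core idea, classifying bottom-up region proposals with CNN features, can be sketched in a few lines. The following is a minimal illustration only, not the authors' released code: it assumes torchvision's pretrained AlexNet as the feature extractor, uses a placeholder `propose_regions` function standing in for selective search, and omits the fine-tuning stage and the per-class SVM training described in the paper.

```python
# Minimal R-CNN-style sketch: warp each region proposal to a fixed size,
# extract a CNN feature vector for it, and hand the vectors to downstream
# classifiers (not shown). Assumes torchvision; propose_regions is a stand-in.
import torch
import torchvision
from torchvision import transforms

# Pretrained backbone used as a fixed feature extractor (fine-tuning omitted).
backbone = torchvision.models.alexnet(weights="DEFAULT")
# Drop the final 1000-way ImageNet layer to expose the 4096-d fc7 features.
backbone.classifier = torch.nn.Sequential(*list(backbone.classifier.children())[:-1])
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((227, 227)),           # warp each proposal to a fixed size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def propose_regions(image):
    """Placeholder for a bottom-up proposal method such as selective search.
    Returns (x1, y1, x2, y2) boxes; here just a fixed 2-box split for illustration."""
    h, w = image.shape[1:]
    return [(0, 0, w // 2, h // 2), (w // 2, h // 2, w, h)]

def rcnn_features(image):
    """Crop and warp each proposal, then extract one CNN feature vector per region."""
    feats = []
    with torch.no_grad():
        for (x1, y1, x2, y2) in propose_regions(image):
            crop = image[:, y1:y2, x1:x2]
            feats.append(backbone(preprocess(crop).unsqueeze(0)).squeeze(0))
    return torch.stack(feats)  # one feature vector per proposal, fed to per-class SVMs

if __name__ == "__main__":
    img = torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8)  # dummy image
    print(rcnn_features(img).shape)  # e.g. torch.Size([2, 4096])
```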
@misc{girshick_rich_2014,
	title = {Rich feature hierarchies for accurate object detection and semantic segmentation},
	url = {http://arxiv.org/abs/1311.2524},
	doi = {10.48550/arXiv.1311.2524},
	abstract = {Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30\% relative to the previous best result on VOC 2012---achieving a mAP of 53.3\%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/{\textasciitilde}rbg/rcnn.},
	language = {en},
	urldate = {2023-08-14},
	publisher = {arXiv},
	author = {Girshick, Ross and Donahue, Jeff and Darrell, Trevor and Malik, Jitendra},
	month = oct,
	year = {2014},
	note = {arXiv:1311.2524 [cs] version: 5},
	keywords = {\#CNN, \#CVPR{\textgreater}14, \#Deep Learning, \#Detection, \#Vision, Computer Science - Computer Vision and Pattern Recognition},
}
