DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; and Darrell, T. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 647–655, 2014. JMLR.org.
@inproceedings{Donahue:2014ta,
author = {Donahue, Jeff and Jia, Yangqing and Vinyals, Oriol and Hoffman, Judy and Zhang, Ning and Tzeng, Eric and Darrell, Trevor},
title = {{DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition}},
booktitle = {Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014},
year = {2014},
pages = {647--655},
publisher = {JMLR.org},
annote = {Maybe the most influential paper showing the usefulness of CNN features in many tasks. It's famous because it's the predecessor of Caffe.

DeCAF just means the feature vector taken from a layer of the CNN, say fc7 (DeCAF7 in the paper).
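
To make that concrete, here is a rough sketch of the recipe in my own code, using torchvision's pretrained AlexNet as a stand-in for their network (the preprocessing values and the classifier slice index are my assumptions, not from the paper):

import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.linear_model import LogisticRegression

# Pretrained AlexNet used as a fixed feature extractor (no fine-tuning).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def decaf7(pil_images):
    # Run the conv layers, then the classifier up to and including the ReLU
    # after fc7; in torchvision's AlexNet that is classifier[0..5], giving a
    # 4096-d vector per image.
    x = torch.stack([preprocess(im) for im in pil_images])
    with torch.no_grad():
        h = model.avgpool(model.features(x)).flatten(1)
        for layer in model.classifier[:6]:
            h = layer(h)
    return h.numpy()

# Then train any shallow classifier on the frozen features, e.g.:
# clf = LogisticRegression(max_iter=1000).fit(decaf7(train_imgs), train_labels)
# print(clf.score(decaf7(test_imgs), test_labels))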

p. 2 has the main result:


"Our main result is the empirical validation that a generic visual feature based on a convolutional network weights trained on ImageNet outperforms a host of conventional visual representations on standard benchmark object recognition tasks"


Section 3.1

They use an AlexNet model trained by themselves, getting a 42.9% error rate. It's the same as the bvlc_alexnet model of Caffe. See <https://github.com/BVLC/caffe/tree/rc4/models/bvlc_alexnet>

I think in practice they use 227x227 input, as indicated by the decaf GitHub project's wiki <https://github.com/UCBAIR/decaf-release/wiki/imagenet>

as well as code in <https://github.com/UCBAIR/decaf-release/blob/6fa4cdfbd0d0b8d486d7146bf1e32edd3662fec4/decaf/scripts/imagenet.py>

As for 224 vs. 227, they think 224 was just a trick to speed up GPU computation. See <https://github.com/UCBAIR/decaf-release/wiki/imagenet>:

"The Decaf implementation uses input images of size 227x227, while the cuda-convnet code uses images of size 224x224. We did 227x227 simply to have a full convolution (if the size is 224x224, the last row/column will only have height/width 8 instead of 11). We believe that cuda-convnet chose 224 for speed consideration as that creates good performance for GPUs. The performance difference should not be big.

Since we trained our network using GPU and are running on CPU, we actually observed some performance differences between them. We are not clear yet what caused it (it might be a bug in our code, admittedly).
"


p. 4, see footnote 5: for t-SNE to work on these high-dimensional features, some random projection might be needed first.
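
Something like this, I imagine (not their code; scikit-learn used as a stand-in, and the projected dimensionality of 128 is an arbitrary choice of mine):

import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.manifold import TSNE

feats = np.random.rand(1000, 4096)  # placeholder for DeCAF features (e.g. fc7 activations)
reduced = GaussianRandomProjection(n_components=128).fit_transform(feats)
points_2d = TSNE(n_components=2).fit_transform(reduced)  # 2-D embedding for plotting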

p. 4, Figure 3 shows that the fully connected layers take most of the computation time. I wonder whether that still applies.
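
A back-of-the-envelope parameter count (my numbers for the standard grouped AlexNet; parameters are not the same as runtime, but they hint at why fc6 can dominate a memory-bound CPU forward pass):

# Weight counts only, biases ignored; conv2/4/5 use 2 groups as in the original AlexNet.
conv = {
    "conv1": 3 * 11 * 11 * 96,
    "conv2": 96 * 5 * 5 * 256 // 2,
    "conv3": 256 * 3 * 3 * 384,
    "conv4": 384 * 3 * 3 * 384 // 2,
    "conv5": 384 * 3 * 3 * 256 // 2,
}
fc = {"fc6": 9216 * 4096, "fc7": 4096 * 4096, "fc8": 4096 * 1000}
print(sum(conv.values()))  # ~2.3M weights across all conv layers
print(sum(fc.values()))    # ~58.6M weights in the fully connected layers, most of them in fc6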

p. 6, Section 4.3: Deformable Part Descriptors sounds like averaging features from different parts, with some learned weights. But anyway, this is not important here.

},
keywords = {deep learning},
read = {Yes},
rating = {3},
date-added = {2017-02-14T20:38:53GMT},
date-modified = {2017-02-17T20:55:28GMT},
url = {http://jmlr.org/proceedings/papers/v32/donahue14.html},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Donahue/ICML%202014%202014%20Donahue.pdf},
file = {{ICML 2014 2014 Donahue.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Donahue/ICML 2014 2014 Donahue.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/F1F8335F-D616-413E-B778-A41B7EB95AB5}}
}