DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. In 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS 2019, 2019.


As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
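The triple loss named in the abstract can be illustrated with a minimal sketch for a single token position. This is not the authors' implementation. The function names, the softmax temperature of 2.0, and the equal loss weights are assumptions chosen for illustration, and the distillation term is written as a cross-entropy against the teacher's softened distribution, which is one common formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def language_modeling_loss(student_logits, target_index):
    """Negative log-likelihood of the true token under the student."""
    return -math.log(softmax(student_logits)[target_index])


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution.

    The temperature value is an assumption for this sketch.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


def cosine_distance_loss(student_hidden, teacher_hidden):
    """One minus the cosine similarity of two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are hypothetical."""
    w_lm, w_distil, w_cos = weights
    return (w_lm * language_modeling_loss(student_logits, target_index)
            + w_distil * distillation_loss(student_logits, teacher_logits)
            + w_cos * cosine_distance_loss(student_hidden, teacher_hidden))


if __name__ == "__main__":
    # Toy values for one token position over a three-word vocabulary.
    loss = triple_loss(
        student_logits=[2.0, 0.5, -1.0],
        teacher_logits=[2.5, 0.3, -1.2],
        target_index=0,
        student_hidden=[0.1, 0.4, -0.2],
        teacher_hidden=[0.2, 0.5, -0.1],
    )
    print(f"triple loss: {loss:.4f}")
```

In a real training setup each term would be averaged over the masked positions of a batch and computed with a tensor library. The scalar version above is only meant to show how the three terms combine.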

@inproceedings{Sanh2019,
abstract = {As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.},
archivePrefix = {arXiv},
arxivId = {1910.01108},
author = {Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
booktitle = {5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS 2019},
eprint = {1910.01108},
keywords = {model},
title = {{DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter}},
url = {http://arxiv.org/abs/1910.01108},
year = {2019}
}

