Deduplicating Training Data Makes Language Models Better

Deduplicating Training Data Makes Language Models Better. Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. 2021. Publisher: arXiv. Version number: 2.

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets – for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.

@article{lee_deduplicating_2021,
	title = {Deduplicating {Training} {Data} {Makes} {Language} {Models} {Better}},
	copyright = {arXiv.org perpetual, non-exclusive license},
	url = {https://arxiv.org/abs/2107.06499},
	doi = {10.48550/ARXIV.2107.06499},
	abstract = {We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1\% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4\% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.},
	urldate = {2022-06-07},
	author = {Lee, Katherine and Ippolito, Daphne and Nystrom, Andrew and Zhang, Chiyuan and Eck, Douglas and Callison-Burch, Chris and Carlini, Nicholas},
	year = {2021},
	note = {Publisher: arXiv, Version Number: 2},
	keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, Machine Learning (cs.LG)},
}
