Toy Models of Superposition

Toy Models of Superposition. Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. September, 2022. arXiv:2209.10652 [cs]


Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
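The idea of storing sparse features in "superposition" can be made concrete with a small numerical sketch. The example below is an illustration under stated assumptions, not the authors' code: it assumes a reconstruction of the form ReLU(WᵀWx + b), uses hand-picked rather than trained weights (five unit vectors at the vertices of a regular pentagon), and picks an illustrative bias value. The names `W`, `b`, and `reconstruct` are hypothetical.

```python
import numpy as np

n_features, n_hidden = 5, 2

# Hand-picked weights (an assumption, not trained): five unit vectors
# spaced evenly around a circle, i.e. the vertices of a regular pentagon.
angles = 2 * np.pi * np.arange(n_features) / n_features
W = np.stack([np.cos(angles), np.sin(angles)])  # shape (n_hidden, n_features)

# Negative bias (illustrative value) chosen to exceed the largest positive
# interference between two different features, cos(72 degrees) ~= 0.309.
b = -0.35


def reconstruct(x):
    """Compress x into n_hidden dimensions, then decode with W.T, bias, and ReLU."""
    h = W @ x                            # 5 features squeezed into 2 dimensions
    return np.maximum(W.T @ h + b, 0.0)  # ReLU removes the small interference terms


# Feed in one active feature at a time (the sparse case).
for i in range(n_features):
    x = np.zeros(n_features)
    x[i] = 1.0
    print(i, np.round(reconstruct(x), 3))
```

With a single active feature, each printed vector is nonzero only at the position of that feature, even though five features share two hidden dimensions. The sketch covers only this one-feature-at-a-time case; it makes no claim about the trained models or results reported in the paper.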

@misc{elhage_toy_2022,
	title = {Toy {Models} of {Superposition}},
	url = {http://arxiv.org/abs/2209.10652},
	doi = {10.48550/arXiv.2209.10652},
	abstract = {Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.},
	urldate = {2025-05-20},
	publisher = {arXiv},
	author = {Elhage, Nelson and Hume, Tristan and Olsson, Catherine and Schiefer, Nicholas and Henighan, Tom and Kravec, Shauna and Hatfield-Dodds, Zac and Lasenby, Robert and Drain, Dawn and Chen, Carol and Grosse, Roger and McCandlish, Sam and Kaplan, Jared and Amodei, Dario and Wattenberg, Martin and Olah, Christopher},
	month = sep,
	year = {2022},
	note = {arXiv:2209.10652 [cs]},
	keywords = {Computer Science - Machine Learning},
}
