Language Models (Mostly) Know What They Know. Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., & Kaplan, J. November 2022. arXiv:2207.05221 [cs]
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
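The P(True) self-evaluation described in the abstract reduces to a simple two-option scoring step: show the model its own proposed answer, then read off the probability it assigns to "True" versus "False". Below is a minimal sketch of how one might compute this, assuming access to a model's token log-probabilities. The prompt wording is paraphrased from the paper's self-evaluation format, and logprob_of is a hypothetical helper standing in for whatever log-probability interface a given model exposes, not a real API.

import math

def p_true(question: str, proposed_answer: str, logprob_of) -> float:
    """Estimate P(True) for a proposed answer via self-evaluation.

    logprob_of(prompt, continuation) is a hypothetical callable that
    returns the model's log-probability of `continuation` given `prompt`.
    """
    prompt = (
        f"Question: {question}\n"
        f"Proposed Answer: {proposed_answer}\n"
        "Is the proposed answer:\n"
        " (A) True\n"
        " (B) False\n"
        "The proposed answer is:"
    )
    # Score the two options and renormalize so they sum to 1.
    # (Normalizing over just these two continuations is a sketch-level
    # choice; the paper reads P(True) from the model's probability of
    # the "True" option in this format.)
    lp_true = logprob_of(prompt, " (A)")
    lp_false = logprob_of(prompt, " (B)")
    z_true, z_false = math.exp(lp_true), math.exp(lp_false)
    return z_true / (z_true + z_false)

As the abstract notes, the paper finds this calibrates better when the model is also shown several of its own samples for the same question before scoring one of them; that variant just prepends the extra samples to the same prompt.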
@misc{kadavath_language_2022,
	title = {Language {Models} ({Mostly}) {Know} {What} {They} {Know}},
	url = {http://arxiv.org/abs/2207.05221},
	doi = {10.48550/arXiv.2207.05221},
	abstract = {We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.},
	urldate = {2023-07-18},
	publisher = {arXiv},
	author = {Kadavath, Saurav and Conerly, Tom and Askell, Amanda and Henighan, Tom and Drain, Dawn and Perez, Ethan and Schiefer, Nicholas and Hatfield-Dodds, Zac and DasSarma, Nova and Tran-Johnson, Eli and Johnston, Scott and El-Showk, Sheer and Jones, Andy and Elhage, Nelson and Hume, Tristan and Chen, Anna and Bai, Yuntao and Bowman, Sam and Fort, Stanislav and Ganguli, Deep and Hernandez, Danny and Jacobson, Josh and Kernion, Jackson and Kravec, Shauna and Lovitt, Liane and Ndousse, Kamal and Olsson, Catherine and Ringer, Sam and Amodei, Dario and Brown, Tom and Clark, Jack and Joseph, Nicholas and Mann, Ben and McCandlish, Sam and Olah, Chris and Kaplan, Jared},
	month = nov,
	year = {2022},
	note = {arXiv:2207.05221 [cs]},
	keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning},
}