Non-Determinism of “Deterministic” LLM System Settings in Hosted Environments. Atil, B., Aykent, S., Chittams, A., Fu, L., Passonneau, R. J., Radcliffe, E., Rajan Rajagopal, G., Sloan, A., Tudrej, T., Ture, F., Wu, Z., Xu, L., & Baldwin, B. In 5th Workshop on Evaluation and Comparison of NLP Systems, pages 135–148, Mumbai, India, 2025. ACL.
LLM (large language model) users of hosted providers commonly notice that outputs can vary for the same inputs under settings expected to be deterministic. While it is difficult to get exact statistics, recent reports on specialty news sites and discussion boards suggest that among users in all communities, the majority of LLM usage today is through cloud-based APIs. Yet the questions of how pervasive non-determinism is, and how much it affects performance results, have not to our knowledge been systematically investigated. We apply five API-based LLMs configured to be deterministic to eight diverse tasks across 10 runs. Experiments reveal accuracy variations of up to 15% across runs, with a gap of up to 70% between best possible performance and worst possible performance. No LLM consistently delivers the same outputs or accuracies, regardless of task. We speculate about the sources of non-determinism such as input buffer packing across multiple jobs. To better quantify our observations, we introduce metrics focused on quantifying determinism, TARr@N for the total agreement rate at N runs over raw output, and TARa@N for total agreement rate of parsed-out answers. Our code and data will be publicly available at https://github.com/Anonymous.
@inproceedings{atil_non-determinism_2025,
address = {Mumbai, India},
title = {Non-{Determinism} of “{Deterministic}” {LLM} {System} {Settings} in {Hosted} {Environments}},
url = {https://aclanthology.org/2025.eval4nlp-1.12/},
abstract = {LLM (large language model) users of hosted providers commonly notice that outputs can vary for the same inputs under settings expected to be deterministic. While it is difficult to get exact statistics, recent reports on specialty news sites and discussion boards suggest that among users in all communities, the majority of LLM usage today is through cloud-based APIs. Yet the questions of how pervasive non-determinism is, and how much it affects performance results, have not to our knowledge been systematically investigated. We apply five API-based LLMs configured to be deterministic to eight diverse tasks across 10 runs. Experiments reveal accuracy variations of up to 15\% across runs, with a gap of up to 70\% between best possible performance and worst possible performance. No LLM consistently delivers the same outputs or accuracies, regardless of task. We speculate about the sources of non-determinism such as input buffer packing across multiple jobs. To better quantify our observations, we introduce metrics focused on quantifying determinism, TARr@N for the total agreement rate at N runs over raw output, and TARa@N for total agreement rate of parsed-out answers. Our code and data will be publicly available at https://github.com/Anonymous.},
booktitle = {5th {Workshop} on {Evaluation} and {Comparison} of {NLP} {Systems}},
publisher = {ACL},
author = {Atil, Berk and Aykent, Sarp and Chittams, Alexa and Fu, Lisheng and Passonneau, Rebecca J. and Radcliffe, Evan and Rajan Rajagopal, Guru and Sloan, Adam and Tudrej, Tomasz and Ture, Ferhan and Wu, Zhe and Xu, Lixinyu and Baldwin, Breck},
year = {2025},
pages = {135--148},
}
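The abstract defines TARr@N and TARa@N as total agreement rates over N runs. The paper's exact formulation is not reproduced here, but a minimal sketch under one plausible reading — the fraction of test items whose output is identical across all N runs — looks like this (the function name `tar_at_n` is illustrative, not from the paper):

```python
def tar_at_n(runs):
    """Total agreement rate at N runs.

    runs: a list of N lists, each holding one run's per-item outputs
          in the same item order. Returns the fraction of items for
          which all N runs produced exactly the same output.
    """
    n_items = len(runs[0])
    assert all(len(r) == n_items for r in runs), "runs must be aligned"
    # An item "agrees" when all N runs yield one unique output for it.
    agree = sum(1 for outputs in zip(*runs) if len(set(outputs)) == 1)
    return agree / n_items

# TARr@N would apply this to raw output strings; TARa@N to answers
# parsed out of those strings (e.g. just the multiple-choice letter).
raw = [["A", "B", "C"],
       ["A", "B", "C"],
       ["A", "X", "C"]]  # run 3 disagrees on item 2
print(tar_at_n(raw))  # 2 of 3 items fully agree across the 3 runs
```

Comparing TARr@N against TARa@N separates surface-level variation (wording, formatting) from variation that changes the extracted answer and hence measured accuracy.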