PIE: Performance Interval Estimation for Free-Form Generation Tasks. Hsu, C., Braylan, A., Su, Y., Lease, M., & Alonso, O. January, 2026. arXiv:2509.07309 [cs]
@misc{hsu_pie_2026,
	title = {{PIE}: {Performance} {Interval} {Estimation} for {Free}-{Form} {Generation} {Tasks}},
	shorttitle = {{PIE}},
	url = {http://arxiv.org/abs/2509.07309},
	doi = {10.48550/arXiv.2509.07309},
	abstract = {Confidence estimation infers a probability for whether each model output is correct or not. While predicting such binary correctness is sensible for tasks with exact answers, free-form generation tasks are often more nuanced, with output quality being both fine-grained and multi-faceted. We thus propose Performance Interval Estimation (PIE) to predict both: 1) point estimates for any arbitrary set of continuous-valued evaluation metrics; and 2) calibrated uncertainty intervals around these point estimates. We then compare two approaches: LLM-as-judge vs. classic regression with confidence estimation features. Evaluation over 11 datasets spans summarization, translation, code generation, function-calling, and question answering. Regression is seen to achieve both: i) lower error point estimates of metric scores; and ii) well-calibrated uncertainty intervals. To support reproduction and follow-on work, we share our data and code.},
	language = {en},
	urldate = {2026-03-05},
	publisher = {arXiv},
	author = {Hsu, Chi-Yang and Braylan, Alexander and Su, Yiheng and Lease, Matthew and Alonso, Omar},
	month = jan,
	year = {2026},
	note = {arXiv:2509.07309 [cs]},
	keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning},
}