PIE: Performance Interval Estimation for Free-Form Generation Tasks. Hsu, C., Braylan, A., Su, Y., Lease, M., & Alonso, O. January, 2026. arXiv:2509.07309 [cs]
@misc{hsu_pie_2026,
	title = {{PIE}: {Performance} {Interval} {Estimation} for {Free}-{Form} {Generation} {Tasks}},
	shorttitle = {{PIE}},
	url = {http://arxiv.org/abs/2509.07309},
	doi = {10.48550/arXiv.2509.07309},
	abstract = {Confidence estimation infers a probability for whether each model output is correct or not. While predicting such binary correctness is sensible for tasks with exact answers, free-form generation tasks are often more nuanced, with output quality being both fine-grained and multi-faceted. We thus propose Performance Interval Estimation (PIE) to predict both: 1) point estimates for any arbitrary set of continuous-valued evaluation metrics; and 2) calibrated uncertainty intervals around these point estimates. We then compare two approaches: LLM-as-judge vs. classic regression with confidence estimation features. Evaluation over 11 datasets spans summarization, translation, code generation, function-calling, and question answering. Regression is seen to achieve both: i) lower error point estimates of metric scores; and ii) well-calibrated uncertainty intervals. To support reproduction and follow-on work, we share our data and code.},
	language = {en},
	urldate = {2026-03-05},
	publisher = {arXiv},
	author = {Hsu, Chi-Yang and Braylan, Alexander and Su, Yiheng and Lease, Matthew and Alonso, Omar},
	month = jan,
	year = {2026},
	note = {arXiv:2509.07309 [cs]},
	keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning},
}