Instance-level Performance Prediction for Long-form Generation Tasks. Hsu, C., Braylan, A., Su, Y., Alonso, O., & Lease, M. September, 2025. arXiv:2509.07309 [cs]
We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multifaceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.
@misc{hsu_instance-level_2025,
	title = {Instance-level {Performance} {Prediction} for {Long}-form {Generation} {Tasks}},
	url = {http://arxiv.org/abs/2509.07309},
	doi = {10.48550/arXiv.2509.07309},
	abstract = {We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multifaceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.},
	language = {en},
	urldate = {2025-10-02},
	publisher = {arXiv},
	author = {Hsu, Chi-Yang and Braylan, Alexander and Su, Yiheng and Alonso, Omar and Lease, Matthew},
	month = sep,
	year = {2025},
	note = {arXiv:2509.07309 [cs]},
	keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning},
}
