Instance-level Performance Prediction for Long-form Generation Tasks. Hsu, C., Braylan, A., Su, Y., Alonso, O., & Lease, M. September, 2025. arXiv:2509.07309 [cs]
We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multifaceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.
@misc{hsu_instance-level_2025,
	title = {Instance-level {Performance} {Prediction} for {Long}-form {Generation} {Tasks}},
	url = {http://arxiv.org/abs/2509.07309},
	doi = {10.48550/arXiv.2509.07309},
	abstract = {We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multifaceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.},
	language = {en},
	urldate = {2025-10-02},
	publisher = {arXiv},
	author = {Hsu, Chi-Yang and Braylan, Alexander and Su, Yiheng and Alonso, Omar and Lease, Matthew},
	month = sep,
	year = {2025},
	note = {arXiv:2509.07309 [cs]},
	keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning},
}
