{"_id":"pMi8X3pgpZnRecdZB","bibbaseid":"hsu-braylan-su-alonso-lease-instancelevelperformancepredictionforlongformgenerationtasks-2025","author_short":["Hsu, C.","Braylan, A.","Su, Y.","Alonso, O.","Lease, M."],"bibdata":{"bibtype":"misc","type":"misc","title":"Instance-level Performance Prediction for Long-form Generation Tasks","url":"http://arxiv.org/abs/2509.07309","doi":"10.48550/arXiv.2509.07309","abstract":"We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multifaceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.","language":"en","urldate":"2025-10-02","publisher":"arXiv","author":[{"propositions":[],"lastnames":["Hsu"],"firstnames":["Chi-Yang"],"suffixes":[]},{"propositions":[],"lastnames":["Braylan"],"firstnames":["Alexander"],"suffixes":[]},{"propositions":[],"lastnames":["Su"],"firstnames":["Yiheng"],"suffixes":[]},{"propositions":[],"lastnames":["Alonso"],"firstnames":["Omar"],"suffixes":[]},{"propositions":[],"lastnames":["Lease"],"firstnames":["Matthew"],"suffixes":[]}],"month":"September","year":"2025","note":"arXiv:2509.07309 [cs]","keywords":"Computer Science - Computation and Language, Computer Science - Machine Learning","bibtex":"@misc{hsu_instance-level_2025,\n\ttitle = {Instance-level {Performance} {Prediction} for {Long}-form {Generation} 
{Tasks}},\n\turl = {http://arxiv.org/abs/2509.07309},\n\tdoi = {10.48550/arXiv.2509.07309},\n\tabstract = {We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multifaceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.},\n\tlanguage = {en},\n\turldate = {2025-10-02},\n\tpublisher = {arXiv},\n\tauthor = {Hsu, Chi-Yang and Braylan, Alexander and Su, Yiheng and Alonso, Omar and Lease, Matthew},\n\tmonth = sep,\n\tyear = {2025},\n\tnote = {arXiv:2509.07309 [cs]},\n\tkeywords = {Computer Science - Computation and Language, Computer Science - Machine Learning},\n}\n","author_short":["Hsu, C.","Braylan, A.","Su, Y.","Alonso, O.","Lease, M."],"key":"hsu_instance-level_2025","id":"hsu_instance-level_2025","bibbaseid":"hsu-braylan-su-alonso-lease-instancelevelperformancepredictionforlongformgenerationtasks-2025","role":"author","urls":{"Paper":"http://arxiv.org/abs/2509.07309"},"keyword":["Computer Science - Computation and Language","Computer Science - Machine Learning"],"metadata":{"authorlinks":{}},"downloads":0},"bibtype":"misc","biburl":"https://bibbase.org/zotero-group/pratikmhatre/5933976","dataSources":["yJr5AAtJ5Sz3Q4WT4"],"keywords":["computer science - computation and language","computer science - machine 
learning"],"search_terms":["instance","level","performance","prediction","long","form","generation","tasks","hsu","braylan","su","alonso","lease"],"title":"Instance-level Performance Prediction for Long-form Generation Tasks","year":2025}