AdaGen: Workload-Adaptive Cluster Scheduler for Latency-Optimal LLM Inference Serving

AdaGen: Workload-Adaptive Cluster Scheduler for Latency-Optimal LLM Inference Serving. Shubha, S. S., Goel, A., Tootaghaj, D. Z., Diab, K., Soni, H., Ramakrishnan, K. K., Sharma, P., & Shen, H. In The Proceedings of the 21st European Conference on Computer Systems, pages 1111–1127, McEwan Hall, The University of Edinburgh, Edinburgh, Scotland, UK, April 2026. ACM.

The inference workloads of Large Language Models (LLMs) pose significant latency and cost challenges due to increasing model sizes and demand for real-time responses. Existing cluster schedulers for multi-instance LLM serving primarily focus on load balancing to optimize memory usage, which is insufficient for workloads with diverse request characteristics. In such cases, the compute layout—the arrangement of tokens across iterations within each instance—plays a crucial role in determining latency. We propose AdaGen, a workload-adaptive cluster scheduler that minimizes latency and thus maximizes SLO attainment by optimizing compute layouts across instances. AdaGen employs a multistep scheduling strategy: it first classifies requests based on prefill and decode lengths, then balances load, and finally performs selective distributed execution across instances. Each step incrementally refines the scheduling based on the compute layouts derived from the decision of the previous step. To avoid the overhead of actual execution to generate the layouts, AdaGen introduces a novel simulation-based estimator. Extensive experiments using production workloads show that AdaGen achieves up to 3.6× higher SLO attainment and 2× better cost-efficiency compared to the existing systems, while ensuring scalability.

@inproceedings{shubha_adagen_2026,
	address = {McEwan Hall, The University of Edinburgh, Edinburgh, Scotland, UK},
	title = {{AdaGen}: {Workload}-{Adaptive} {Cluster} {Scheduler} for {Latency}-{Optimal} {LLM} {Inference} {Serving}},
	isbn = {979-8-4007-2212-7},
	shorttitle = {{AdaGen}},
	url = {https://dl.acm.org/doi/10.1145/3767295.3769345},
	doi = {10.1145/3767295.3769345},
	abstract = {The inference workloads of Large Language Models (LLMs) pose significant latency and cost challenges due to increasing model sizes and demand for real-time responses. Existing cluster schedulers for multi-instance LLM serving primarily focus on load balancing to optimize memory usage, which is insufficient for workloads with diverse request characteristics. In such cases, the compute layout—the arrangement of tokens across iterations within each instance—plays a crucial role in determining latency. We propose AdaGen, a workload-adaptive cluster scheduler that minimizes latency and thus maximizes SLO attainment by optimizing compute layouts across instances. AdaGen employs a multistep scheduling strategy: it first classifies requests based on prefill and decode lengths, then balances load, and finally performs selective distributed execution across instances. Each step incrementally refines the scheduling based on the compute layouts derived from the decision of the previous step. To avoid the overhead of actual execution to generate the layouts, AdaGen introduces a novel simulation-based estimator. Extensive experiments using production workloads show that AdaGen achieves up to 3.6× higher SLO attainment and 2× better cost-efficiency compared to the existing systems, while ensuring scalability.},
	language = {en},
	urldate = {2026-05-12},
	booktitle = {The {Proceedings} of the 21st {European} {Conference} on {Computer} {Systems}},
	publisher = {ACM},
	author = {Shubha, Sudipta Saha and Goel, Ayush and Tootaghaj, Diman Zad and Diab, Khaled and Soni, Hardik and Ramakrishnan, K. K. and Sharma, Puneet and Shen, Haiying},
	month = apr,
	year = {2026},
	keywords = {Foundational, SYS: CosmicAI Contact Author, WG: Explorable},
	pages = {1111--1127},
}
