Degree-Based Scheduling and Memory Management for Large-Scale Exact Online GNN Inference. Namazi, A., Shen, H., Sen, T., & Zhang, M. In 2025 IEEE International Conference on Big Data (BigData), pages 1–10, Macau, China, December, 2025. IEEE.
Graph Neural Networks (GNNs) have demonstrated remarkable performance on tasks operating on graph-structured data. Some applications require exact online inference, which processes all neighbors rather than sampling a random subset to avoid stochasticity and accuracy degradation, while still meeting tight latency constraints. In this paper, we analyze exact online inference and identify two key challenges: head-of-line blocking caused by heterogeneous request processing times, and highly variable memory usage that complicates cache size optimization for balancing computation and CPU–GPU communication. To reduce the average request latency, we propose GSM, an exact online GNN inference serving system with efficient scheduling and memory management. GSM integrates three main components. The parameter tuner optimizes the tradeoff between computation and CPU–GPU communication overhead by finding near-optimal parameters, such as batch in-degree sum and batch subgraph size, for the target workload. The degree-based scheduler groups requests with similar in-degrees into batches, ensuring each batch’s in-degree sum approaches the optimal value and prioritizing batches with shorter processing times. The batch divider partitions a batch subgraph to achieve the optimal subgraph size and improve processing efficiency. Large-scale experiments on graphs with billions of edges (Papers100M and Friendster) demonstrate that GSM achieves up to 89% lower latency and 71% higher throughput compared with state-of-the-art systems. The implementation of GSM is publicly available.
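The degree-based scheduling idea from the abstract can be sketched in a few lines: group requests with similar in-degrees so that each batch's in-degree sum approaches a tuned target, then serve batches with smaller sums (shorter estimated processing time) first. The function name, the greedy fill policy, and the shortest-sum-first ordering below are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Hypothetical sketch of degree-based batching with shortest-batch-first
# ordering. The greedy policy and all names are assumptions; GSM's real
# scheduler and parameter tuner are described in the paper.

def degree_based_batches(requests, target_sum):
    """requests: list of (request_id, in_degree) pairs.
    Returns a list of batches, each a list of (request_id, in_degree),
    ordered so batches with smaller in-degree sums come first."""
    # Sort by in-degree so members of a batch have similar degrees.
    ordered = sorted(requests, key=lambda r: r[1])
    batches, current, current_sum = [], [], 0
    for req_id, deg in ordered:
        # Close the batch once adding this request would push the batch's
        # in-degree sum past the target (unless the batch is empty, so
        # oversized requests are still served).
        if current and current_sum + deg > target_sum:
            batches.append(current)
            current, current_sum = [], 0
        current.append((req_id, deg))
        current_sum += deg
    if current:
        batches.append(current)
    # Prioritize batches with smaller in-degree sums, i.e. shorter
    # expected processing times.
    batches.sort(key=lambda b: sum(d for _, d in b))
    return batches

reqs = [("a", 5), ("b", 90), ("c", 7), ("d", 40), ("e", 6)]
batches = degree_based_batches(reqs, target_sum=50)
# batches: [[("a",5),("e",6),("c",7)], [("d",40)], [("b",90)]]
```

In this toy run, the three low-degree requests share one batch (sum 18), while the two high-degree requests each form their own batch; serving the low-sum batch first avoids the head-of-line blocking the abstract describes.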
@inproceedings{namazi_degree-based_2025,
	address = {Macau, China},
	title = {Degree-{Based} {Scheduling} and {Memory} {Management} for {Large}-{Scale} {Exact} {Online} {GNN} {Inference}},
	copyright = {https://doi.org/10.15223/policy-029},
	isbn = {979-8-3315-9447-3},
	url = {https://ieeexplore.ieee.org/document/11402115/},
	doi = {10.1109/BigData66926.2025.11402115},
	abstract = {Graph Neural Networks (GNNs) have demonstrated remarkable performance on tasks operating on graph-structured data. Some applications require exact online inference, which processes all neighbors rather than sampling a random subset to avoid stochasticity and accuracy degradation, while still meeting tight latency constraints. In this paper, we analyze exact online inference and identify two key challenges: head-of-line blocking caused by heterogeneous request processing times, and highly variable memory usage that complicates cache size optimization for balancing computation and CPU–GPU communication. To reduce the average request latency, we propose GSM, an exact online GNN inference serving system with efficient scheduling and memory management. GSM integrates three main components. The parameter tuner optimizes the tradeoff between computation and CPU–GPU communication overhead by finding near-optimal parameters, such as batch in-degree sum and batch subgraph size, for the target workload. The degree-based scheduler groups requests with similar in-degrees into batches, ensuring each batch’s in-degree sum approaches the optimal value and prioritizing batches with shorter processing times. The batch divider partitions a batch subgraph to achieve the optimal subgraph size and improve processing efficiency. Large-scale experiments on graphs with billions of edges (Papers100M and Friendster) demonstrate that GSM achieves up to 89\% lower latency and 71\% higher throughput compared with state-of-the-art systems. The implementation of GSM is publicly available.},
	language = {en},
	urldate = {2026-03-19},
	booktitle = {2025 {IEEE} {International} {Conference} on {Big} {Data} ({BigData})},
	publisher = {IEEE},
	author = {Namazi, Alireza and Shen, Haiying and Sen, Tanmoy and Zhang, Minjia},
	month = dec,
	year = {2025},
	pages = {1--10},
}
