ZACK: Zero-Overhead LLM Inference Acceleration via Dimensionality Compression of the Key-Value Cache. Zhang, Z. & Shen, H. February 2025. arXiv:2408.04107 [cs].
Abstract: In large language models, memory constraints in the Key-Value Cache (KVC) pose a challenge during inference. In this work, we propose ZACK, the first KV dimensionality compression system that achieves zero-overhead compression and decompression and also reduces attention computation time. It complements and can be combined with eviction-based and quantization-based methods to further enhance KV compression. Moreover, ZACK employs adaptive compression, tailoring KV compression rates across heads and layers based on their contributions to inference, to maximize overall compression while maintaining an accuracy-loss constraint. Additionally, ZACK enhances the self-attention kernel to balance the uneven workloads caused by the adaptive compression approach, further reducing attention computation latency. Comprehensive experiments demonstrate that, when combined with ZACK, state-of-the-art eviction-based and quantization-based methods for KV compression further reduce KV size by up to 68%, Time-To-First-Token (TTFT) by up to 44%, and Time-Between-Tokens (TBT) by up to 55%, and achieve up to 1.72× throughput under the same latency while maintaining 99% of the baseline accuracy. We open-sourced the code.
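The core mechanism the abstract describes, compressing K and V along the head dimension and computing attention directly in the compressed space so the cache is never decompressed, can be illustrated with a small sketch. The snippet below is not the paper's implementation; the SVD-derived calibration basis, the function names, and the single fixed rank are illustrative assumptions (ZACK itself chooses ranks adaptively per head and layer).

```python
# Minimal sketch of low-rank KV dimensionality compression for one
# attention head. Keys/values are cached in a rank-r subspace; queries
# are projected with the same key basis, so attention scores are
# computed without decompressing the cache. Illustrative only.
import torch

def lowrank_basis(x_calib: torch.Tensor, rank: int) -> torch.Tensor:
    """Orthonormal (d, rank) basis from calibration activations (n, d)."""
    # Top-`rank` right singular vectors capture most of the variance.
    _, _, vh = torch.linalg.svd(x_calib, full_matrices=False)
    return vh[:rank].T

def compressed_attention(q, k_c, v_c, basis_k):
    """q: (n_q, d); k_c, v_c: (n_kv, r) compressed cache; basis_k: (d, r).

    With an orthonormal basis, q_c @ k_c.T approximates q @ k.T, so the
    usual 1/sqrt(d) scale is kept. The output stays compressed; a value
    basis can be folded into the output projection downstream.
    """
    d = q.shape[-1]
    q_c = q @ basis_k                      # project queries once per step
    scores = (q_c @ k_c.T) / d ** 0.5      # scores in compressed space
    return torch.softmax(scores, dim=-1) @ v_c

# Toy usage: head dim d=64 compressed to r=16 (4x smaller KV cache).
torch.manual_seed(0)
d, r = 64, 16
basis_k = lowrank_basis(torch.randn(1024, d), r)
basis_v = lowrank_basis(torch.randn(1024, d), r)
k_c = torch.randn(128, d) @ basis_k        # cache stores (n_kv, r) keys
v_c = torch.randn(128, d) @ basis_v        # and (n_kv, r) values
out = compressed_attention(torch.randn(4, d), k_c, v_c, basis_k)
print(out.shape)  # torch.Size([4, 16])
```

On this reading, "zero overhead" would come from folding the projections into existing weight matrices (keys and values are produced directly in the subspace, and the value basis is absorbed into the output projection), so no separate compress/decompress pass runs at inference time; that folding is an inference from the abstract, not a detail this entry spells out.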
@misc{zhang_zack_2025,
title = {{ZACK}: {Zero}-{Overhead} {LLM} {Inference} {Acceleration} via {Dimensionality} {Compression} of the {Key}-{Value} {Cache}},
shorttitle = {{ZACK}},
url = {http://arxiv.org/abs/2408.04107},
doi = {10.48550/arXiv.2408.04107},
abstract = {In large language models, memory constraints in the Key-Value Cache (KVC) pose a challenge during inference. In this work, we propose ZACK, the first KV dimensionality compression system that achieves zero-overhead compression and decompression and also reduces attention computation time. It complements and can be combined with eviction-based and quantization-based methods to further enhance KV compression. Moreover, ZACK employs adaptive compression, tailoring KV compression rates across heads and layers based on their contributions to inference to maximize overall compression while maintaining an accuracy loss constraint. Additionally, ZACK enhances the self-attention kernel to balance the uneven workloads caused by the adaptive compression approach to further reduce attention computation latency. Comprehensive experiments demonstrate that when combined with ZACK, state-of-the-art eviction-based and quantization-based methods for KV compression further reduce KV size by up to 68\%, Time-To-First-Token (TTFT) by up to 44\%, and Time-Between-Tokens (TBT) by up to 55\% and achieve up to 1.72× throughput under the same latency, while maintaining 99\% of the baseline accuracy. We open-sourced the code.},
language = {en},
urldate = {2025-05-28},
publisher = {arXiv},
author = {Zhang, Zeyu and Shen, Haiying},
month = feb,
year = {2025},
note = {arXiv:2408.04107 [cs]},
keywords = {Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning},
}
{"_id":"EnfinDjtKq7346ujK","bibbaseid":"zhang-shen-zackzerooverheadllminferenceaccelerationviadimensionalitycompressionofthekeyvaluecache-2025","author_short":["Zhang, Z.","Shen, H."],"bibdata":{"bibtype":"misc","type":"misc","title":"ZACK: Zero-Overhead LLM Inference Acceleration via Dimensionality Compression of the Key-Value Cache","shorttitle":"ZACK","url":"http://arxiv.org/abs/2408.04107","doi":"10.48550/arXiv.2408.04107","abstract":"In large-language models, memory constraints in the Key-Value Cache (KVC) pose a challenge during inference. In this work, we propose ZACK, the first KV dimensionality compression system that achieves zero-overhead compression and decompression and also reduces attention computation time. It complements and can be combined with evictionbased and quantization-based methods to further enhance KV compression. Moreover, ZACK employs adaptive compression, tailoring KV compression rates across heads and layers based on their contributions to inference to maximize overall compression while maintaining an accuracy loss constraint. Additionally, ZACK enhances the self-attention kernel to balance the uneven workloads caused by the adaptive compression approach to further reduce attention computation latency. Comprehensive experiments demonstrate that when combined with ZACK, state-of-the-art eviction-based and quantizationbased methods for KV compression further reduce KV size by up to 68%, Time-To-First-Token (TTFT) by up to 44%, and Time-Between-Tokens (TBT) by up to 55% and achieve up to 1.72× throughput under the same latency, while maintaining 99% of the baseline accuracy. We open-sourced the code.","language":"en","urldate":"2025-05-28","publisher":"arXiv","author":[{"propositions":[],"lastnames":["Zhang"],"firstnames":["Zeyu"],"suffixes":[]},{"propositions":[],"lastnames":["Shen"],"firstnames":["Haiying"],"suffixes":[]}],"month":"February","year":"2025","note":"arXiv:2408.04107 [cs]","keywords":"Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning","bibtex":"@misc{zhang_zack_2025,\n\ttitle = {{ZACK}: {Zero}-{Overhead} {LLM} {Inference} {Acceleration} via {Dimensionality} {Compression} of the {Key}-{Value} {Cache}},\n\tshorttitle = {{ZACK}},\n\turl = {http://arxiv.org/abs/2408.04107},\n\tdoi = {10.48550/arXiv.2408.04107},\n\tabstract = {In large-language models, memory constraints in the Key-Value Cache (KVC) pose a challenge during inference. In this work, we propose ZACK, the first KV dimensionality compression system that achieves zero-overhead compression and decompression and also reduces attention computation time. It complements and can be combined with evictionbased and quantization-based methods to further enhance KV compression. Moreover, ZACK employs adaptive compression, tailoring KV compression rates across heads and layers based on their contributions to inference to maximize overall compression while maintaining an accuracy loss constraint. Additionally, ZACK enhances the self-attention kernel to balance the uneven workloads caused by the adaptive compression approach to further reduce attention computation latency. Comprehensive experiments demonstrate that when combined with ZACK, state-of-the-art eviction-based and quantizationbased methods for KV compression further reduce KV size by up to 68\\%, Time-To-First-Token (TTFT) by up to 44\\%, and Time-Between-Tokens (TBT) by up to 55\\% and achieve up to 1.72× throughput under the same latency, while maintaining 99\\% of the baseline accuracy. 
We open-sourced the code.},\n\tlanguage = {en},\n\turldate = {2025-05-28},\n\tpublisher = {arXiv},\n\tauthor = {Zhang, Zeyu and Shen, Haiying},\n\tmonth = feb,\n\tyear = {2025},\n\tnote = {arXiv:2408.04107 [cs]},\n\tkeywords = {Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning},\n}\n\n\n\n\n\n\n\n\n\n\n\n","author_short":["Zhang, Z.","Shen, H."],"key":"zhang_zack_2025","id":"zhang_zack_2025","bibbaseid":"zhang-shen-zackzerooverheadllminferenceaccelerationviadimensionalitycompressionofthekeyvaluecache-2025","role":"author","urls":{"Paper":"http://arxiv.org/abs/2408.04107"},"keyword":["Computer Science - Distributed","Parallel","and Cluster Computing","Computer Science - Machine Learning"],"metadata":{"authorlinks":{}}},"bibtype":"misc","biburl":"https://bibbase.org/zotero-group/pratikmhatre/5933976","dataSources":["yJr5AAtJ5Sz3Q4WT4"],"keywords":["computer science - distributed","parallel","and cluster computing","computer science - machine learning"],"search_terms":["zack","zero","overhead","llm","inference","acceleration","via","dimensionality","compression","key","value","cache","zhang","shen"],"title":"ZACK: Zero-Overhead LLM Inference Acceleration via Dimensionality Compression of the Key-Value Cache","year":2025}