Throughput-Oriented LLM Inference via KV-Activation Hybrid Caching with A Single GPU
Sanghyeon Lee, Hongbeen Kim, Soojin Hwang, Guseul Heo, Minwoo Noh, Jaehyuk Huh
Abstract

Large language models (LLMs) with growing model sizes use many GPUs to meet memory capacity requirements, incurring substantial costs for token generation. To provide cost-effective LLM inference with relaxed latency constraints, recent studies proposed expanding GPU memory by leveraging host memory. However, in such host memory offloading, the host-GPU interconnect becomes a performance bottleneck for the transfer of weights and key-value (KV) cache entries, causing underutilization of GPUs. To address this challenge, we introduce Capture, an LLM inference system with KV-activation hybrid caching. The activation cache stores activation checkpoints instead of keys and values during intermediate inference stages, requiring half the memory capacity of the conventional KV cache. While model parameters are transferred from host memory to the GPU, idle GPU resources can be utilized for partial recomputation. To balance the latency of activation recomputation against that of parameter loading, our KV-activation hybrid caching scheme determines the optimal ratio between the KV and activation caches to manage both recomputation time and data transfer time. Capture achieves a 2.19× throughput improvement over the state-of-the-art prior work that offloads both model weights and the KV cache.

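The core mechanism in the abstract is a split of the cached context between full KV entries and activation checkpoints, sized so that recomputing KV from the checkpoints overlaps with host-to-GPU weight transfers. The sketch below illustrates that balancing idea only; the function name, the constants, and the simple linear cost model are assumptions for illustration, not the paper's actual policy or code.

```python
# Minimal sketch (assumed names and constants, not the paper's implementation) of
# the ratio-balancing idea: keep a fraction of the cached context as activation
# checkpoints (roughly half the bytes of full KV) and size that fraction so that
# recomputing KV from the checkpoints hides behind the weight transfer.

def activation_cache_fraction(context_tokens: int,
                              weight_bytes_per_layer: float,
                              interconnect_bytes_per_s: float,
                              recompute_tokens_per_s: float) -> float:
    """Fraction of cached tokens to keep as activation checkpoints (rest stay as KV)."""
    # Window during which the GPU would otherwise idle: streaming one layer's weights.
    transfer_s = weight_bytes_per_layer / interconnect_bytes_per_s
    # Tokens whose KV can be rebuilt from activation checkpoints within that window.
    recomputable = recompute_tokens_per_s * transfer_s
    # Never checkpoint more tokens than can be recomputed behind the transfer.
    return min(1.0, recomputable / max(context_tokens, 1))


if __name__ == "__main__":
    # Illustrative numbers only: 8K-token context, 1 GiB of weights per layer,
    # 16 GB/s effective interconnect bandwidth, 100K tokens/s partial-recompute rate.
    frac = activation_cache_fraction(context_tokens=8192,
                                     weight_bytes_per_layer=float(1 << 30),
                                     interconnect_bytes_per_s=16e9,
                                     recompute_tokens_per_s=100_000)
    kv_only_bytes = 8192 * 256 * 1024  # assumed per-token KV footprint
    # Activation checkpoints take about half the bytes of the KV they replace.
    hybrid_bytes = kv_only_bytes * (1 - frac) + kv_only_bytes * 0.5 * frac
    print(f"activation-cache fraction: {frac:.2f}")
    print(f"cache footprint: {kv_only_bytes / 2**20:.0f} MiB -> {hybrid_bytes / 2**20:.0f} MiB")
```

Under these assumed numbers the scheme checkpoints most of the context and cuts the cache footprint well below the KV-only baseline, while keeping recomputation within the weight-loading window.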
Keywords
Inference · Computer science · Distributed computing · Artificial intelligence
Type
preprint
IF / Citations
- / 0
Publication year
2025