Throughput-Oriented LLM Inference via KV-Activation Hybrid Caching with A Single GPU
Sanghyeon Lee, Hongbeen Kim, Soojin Hwang, Guseul Heo, Minwoo Noh, Jaehyuk Huh
Abstract

Large language models (LLMs) with growing model sizes use many GPUs to meet memory capacity requirements, incurring substantial costs for token generation. To provide cost-effective LLM inference with relaxed latency constraints, recent studies proposed expanding GPU memory by leveraging host memory. However, in such host memory offloading, the host-GPU interconnect becomes a performance bottleneck for the transfer of weights and key-value (KV) cache entries, causing underutilization of GPUs. To address this challenge, we introduce Capture, an LLM inference system with KV-activation hybrid caching. The activation cache stores activation checkpoints instead of keys and values during intermediate inference stages, requiring half the memory capacity of the conventional KV cache. While model parameters are transferred from host memory to the GPU, idle GPU resources can be utilized for partial recomputation. To balance the latency of activation recomputation against that of parameter loading, our KV-activation hybrid caching scheme determines the optimal ratio between the KV and activation caches to manage both recomputation time and data transfer time. Capture achieves a 2.19× throughput improvement over the state-of-the-art prior work that offloads both model weights and the KV cache.

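The core mechanism in the abstract is a split of the cached context between full KV entries and activation checkpoints, sized so that recomputing KV from the checkpoints overlaps with host-to-GPU weight transfers. The sketch below illustrates that balancing idea only; the function name, the constants, and the simple linear cost model are assumptions for illustration, not the paper's actual policy or code.

```python
# Minimal sketch (assumed names and constants, not the paper's implementation) of
# the ratio-balancing idea: keep a fraction of the cached context as activation
# checkpoints (roughly half the bytes of full KV) and size that fraction so that
# recomputing KV from the checkpoints hides behind the weight transfer.

def activation_cache_fraction(context_tokens: int,
                              weight_bytes_per_layer: float,
                              interconnect_bytes_per_s: float,
                              recompute_tokens_per_s: float) -> float:
    """Fraction of cached tokens to keep as activation checkpoints (rest stay as KV)."""
    # Window during which the GPU would otherwise idle: streaming one layer's weights.
    transfer_s = weight_bytes_per_layer / interconnect_bytes_per_s
    # Tokens whose KV can be rebuilt from activation checkpoints within that window.
    recomputable = recompute_tokens_per_s * transfer_s
    # Never checkpoint more tokens than can be recomputed behind the transfer.
    return min(1.0, recomputable / max(context_tokens, 1))


if __name__ == "__main__":
    # Illustrative numbers only: 8K-token context, 1 GiB of weights per layer,
    # 16 GB/s effective interconnect bandwidth, 100K tokens/s partial-recompute rate.
    frac = activation_cache_fraction(context_tokens=8192,
                                     weight_bytes_per_layer=float(1 << 30),
                                     interconnect_bytes_per_s=16e9,
                                     recompute_tokens_per_s=100_000)
    kv_only_bytes = 8192 * 256 * 1024  # assumed per-token KV footprint
    # Activation checkpoints take about half the bytes of the KV they replace.
    hybrid_bytes = kv_only_bytes * (1 - frac) + kv_only_bytes * 0.5 * frac
    print(f"activation-cache fraction: {frac:.2f}")
    print(f"cache footprint: {kv_only_bytes / 2**20:.0f} MiB -> {hybrid_bytes / 2**20:.0f} MiB")
```

Under these assumed numbers the scheme checkpoints most of the context and cuts the cache footprint well below the KV-only baseline, while keeping recomputation within the weight-loading window.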
Keywords
Inference · Computer science · Distributed computing · Artificial intelligence
Type
preprint
IF / Citations
- / 0
Publication year
2025