Article · Citations: 0 · 2025
AiDE: Attention-FFN Disaggregated Execution for Cost-Effective LLM Decoding on CXL-PNM
Kyung-Soo Kim, Oh‐Hoon Kwon, Yeonhong Park, Jae W. Lee
IF 1.4 · IEEE Computer Architecture Letters
Abstract

Disaggregating the prefill and decode phases has recently emerged as a promising strategy in large language model (LLM) serving systems, driven by the distinct resource demands of each phase. Inspired by this coarse-grained disaggregation, we identify a similar opportunity within the decode phase itself: the feedforward network (FFN) is compute-intensive, whereas attention is constrained by memory bandwidth and capacity due to its key-value (KV) cache. To exploit this heterogeneity, we introduce AiDE, a heterogeneous decoding cluster that executes FFN operations on GPUs while offloading attention computations to Compute Express Link-based Processing Near Memory (CXL-PNM) devices. CXL-PNM provides scalable memory bandwidth and capacity, making it well-suited for attention-heavy workloads. In addition, we propose a batch-level pipelining approach enhanced with request scheduling to optimize the utilization of heterogeneous resources. Our AiDE architecture achieves up to 3.87× higher throughput, 2.72× lower p90 time per output token (TPOT), and a 2.31× reduction in decode latency compared to a GPU-only baseline, at comparable cost, demonstrating the significant potential of fine-grained disaggregation for cost-effective LLM inference.
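
The contrast the abstract draws between compute-intensive FFN and memory-bound attention during decoding can be illustrated with a back-of-the-envelope arithmetic-intensity estimate. The sketch below is an illustration only and is not taken from the paper: the function names, the model dimensions, and the simplification of counting only weight and KV-cache traffic (ignoring activations) are all assumptions.

```python
def ffn_intensity(batch, d_model, d_ff, bytes_per_elem=2):
    """Approximate FLOPs per byte for one FFN layer in a decode step.

    Assumption: two weight matrices (d_model x d_ff and d_ff x d_model)
    are read once and shared by every request in the batch.
    """
    flops = 2 * 2 * batch * d_model * d_ff           # 2 matmuls, 2 FLOPs per multiply-add
    bytes_moved = 2 * d_model * d_ff * bytes_per_elem
    return flops / bytes_moved


def attention_intensity(batch, context_len, d_model, bytes_per_elem=2):
    """Approximate FLOPs per byte for decode attention over the KV cache.

    Assumption: every request reads its own K and V entries, so memory
    traffic grows with the batch just as fast as the compute does.
    """
    flops = 2 * 2 * batch * context_len * d_model    # QK^T and AV, 2 FLOPs per multiply-add
    bytes_moved = 2 * batch * context_len * d_model * bytes_per_elem
    return flops / bytes_moved


if __name__ == "__main__":
    # Hypothetical dimensions chosen only for illustration.
    for batch in (1, 16, 64, 256):
        print(batch,
              ffn_intensity(batch, d_model=4096, d_ff=16384),
              attention_intensity(batch, context_len=2048, d_model=4096))
```

Under these assumptions, FFN intensity grows linearly with batch size because the weights are reused across requests, while attention intensity stays flat because each request brings its own KV cache. That is consistent with the abstract's motivation for placing FFN on GPUs and attention on bandwidth-rich near-memory devices.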

Keywords
Decoding methods, Computer science, Parallel computing, Algorithm
Type
article
IF / Citations
1.4 / 0
Publication year
2025