AiDE: Attention-FFN Disaggregated Execution for Cost-Effective LLM Decoding on CXL-PNM | Prof. Oh-Hoon Kwon's Lab | Department of Chemistry, KAIST

AiDE: Attention-FFN Disaggregated Execution for Cost-Effective LLM Decoding on CXL-PNM

Kyung-Soo Kim, Oh‐Hoon Kwon, Yeonhong Park, Jae W. Lee

IEEE Computer Architecture Letters (IF 1.4)

Abstract

Disaggregating the prefill and decode phases has recently emerged as a promising strategy in large language model (LLM) serving systems, driven by the distinct resource demands of each phase. Inspired by this coarse-grained disaggregation, we identify a similar opportunity within the decode phase itself: the feedforward network (FFN) is compute-intensive, whereas attention is constrained by memory bandwidth and capacity because of its key-value (KV) cache. To exploit this heterogeneity, we introduce AiDE, a heterogeneous decoding cluster that executes FFN operations on GPUs while offloading attention computations to Compute Express Link-based Processing Near Memory (CXL-PNM) devices. CXL-PNM provides scalable memory bandwidth and capacity, making it well suited to attention-heavy workloads. In addition, we propose a batch-level pipelining approach, enhanced with request scheduling, to optimize the utilization of the heterogeneous resources. At comparable cost, AiDE achieves up to 3.87× higher throughput, 2.72× lower p90 time per output token (TPOT), and a 2.31× reduction in decode latency compared to a GPU-only baseline, demonstrating the significant potential of fine-grained disaggregation for cost-effective LLM inference.
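
The contrast the abstract draws between a compute-intensive FFN and a memory-bound attention step can be illustrated with a back-of-the-envelope arithmetic-intensity estimate (FLOPs per byte moved) for a single decode step. The sketch below is not taken from the paper; it assumes 2-byte (FP16) elements, a simplified two-matrix, non-gated FFN, and standard multi-head attention in which every request reads its own full KV cache. The function names and model dimensions are hypothetical.

```python
def ffn_intensity(batch, d_model, d_ff, bytes_per_elem=2):
    """Approximate FLOPs per byte for one decode step of a two-matrix FFN.

    Assumption: non-gated FFN (up and down projection), weights read once
    per step and shared by every request in the batch.
    """
    # Two matmuls, each 2 * batch * d_model * d_ff FLOPs (multiply + add).
    flops = 2 * 2 * batch * d_model * d_ff
    # Weight traffic does not depend on the batch size.
    weight_bytes = 2 * d_model * d_ff * bytes_per_elem
    # Activations: read input, write and re-read hidden, write output.
    act_bytes = batch * (2 * d_model + 2 * d_ff) * bytes_per_elem
    return flops / (weight_bytes + act_bytes)


def attention_intensity(batch, context_len, d_model, bytes_per_elem=2):
    """Approximate FLOPs per byte for one decode step of attention.

    Assumption: each request reads its own K and V cache; the small query
    and output vectors are ignored.
    """
    # Q.K^T and scores.V each cost 2 * context_len * d_model FLOPs per request.
    flops = batch * 4 * context_len * d_model
    # K and V are both streamed from memory for every request.
    kv_bytes = batch * 2 * context_len * d_model * bytes_per_elem
    return flops / kv_bytes


if __name__ == "__main__":
    for b in (1, 16, 64):
        print(b, round(ffn_intensity(b, 4096, 16384), 2),
              attention_intensity(b, 4096, 4096))
```

Under these assumptions, the FFN's intensity grows roughly in proportion to the batch size because the weights are reused across requests, while attention stays at about one FLOP per byte no matter how large the batch is, since each request brings its own KV cache. That is the heterogeneity that motivates placing the two operators on different hardware.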

Keywords

Decoding methods, Computer science, Parallel computing, Algorithm

Type

article

IF / Citations

1.4 / 0

Original article

https://doi.org/10.1109/lca.2025.3597323

Publication year

2025