Recently, multiple ASICs [1]–[6] have been proposed to accelerate large language models (LLMs). However, the enormous number of LLM parameters leads to significant energy consumption from external memory access (EMA). Normalizing the system energy required to process 1024 input tokens by the parameter count, previous ASICs [1]–[5] consumed 79-222pJ/param on small models with 336-682M parameters, as shown in Fig. 23.9.1. Even an ASIC [6] designed to reduce EMA still consumed 26pJ/param on a 708M-parameter model. Because of this large EMA, prior works [1]–[6] were limited to small models such as GPT-2 [7] and could not handle highly accurate models with billions of parameters, such as Llama [8]. To overcome this, we adopt binary or ternary quantization together with new hardware optimizations, enabling the proposed ASIC to consume 9pJ/param and less than 5mW of power when running a billion-parameter Llama model.
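This excerpt does not detail the exact quantization scheme, so the following is a minimal sketch assuming an absmean ternary quantizer of the kind popularized by BitNet b1.58: each weight collapses to {-1, 0, +1} plus one per-tensor scale, shrinking weight storage from 16b to roughly 1.58b per parameter (1b in the binary case) and cutting EMA traffic proportionally. The function names below are illustrative, not from the paper.

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-6):
    """Absmean ternary quantization (BitNet-b1.58 style; illustrative,
    not necessarily the scheme used by this ASIC). Maps each weight to
    {-1, 0, +1} with a single floating-point scale per tensor."""
    scale = float(np.mean(np.abs(w))) + eps    # per-tensor absmean scale
    w_t = np.clip(np.round(w / scale), -1, 1)  # ternary codes {-1, 0, +1}
    return w_t.astype(np.int8), scale

def ternary_dequantize(w_t: np.ndarray, scale: float) -> np.ndarray:
    return w_t.astype(np.float32) * scale

# Toy check: a matmul against ternary weights needs only sign-controlled
# adds/subtracts per weight; the one multiply is the final rescale.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
x = rng.normal(size=(8,)).astype(np.float32)
w_t, s = ternary_quantize(w)
y_ref = w @ x                                  # full-precision reference
y_q = (w_t.astype(np.float32) @ x) * s         # ternary approximation
print(np.abs(y_ref - y_q).max())               # quantization error
```

Beyond the storage savings, this is why binary/ternary weights suit a low-power ASIC: the inner products reduce to additions and subtractions, eliminating per-weight multipliers. As a sanity check on the reported metric, 9pJ/param over a 1B-parameter model corresponds to roughly 9mJ of system energy for one 1024-token pass.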