MaxiMoff: Designing Matrix Multiplication Accelerator for Effective Multiply-Add Operations Offloading
S. Kim, Dongho Ha, Seunghwan Sung, Won Woo Ro
IF 5.4
IEEE Transactions on Emerging Topics in Computing
Contemporary GPU architectures integrate specialized computing units for matrix multiplication, named matrix multiplication units (MXUs), to effectively process neural network applications. However, since MXUs are limited to matrix multiplications, GPUs show inefficiencies in computing resource utilization while applications are unrelated to matrix multiplications. Furthermore, despite prior work to leverage MXUs in general-purpose computing, they are constrained by static analysis, limiting their adaptability and hardware utilization efficiency. This study observes that the techniques emulating high-bitwidth multiplication with low-bitwidth ones transform a single high-bitwidth Multiply-and-ADd (MAD) operation into a low-bitwidth dot-product operation. Leveraging this observation, we propose <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">MaxiMoff</i>, a novel GPU architecture to utilize general-purpose cores and MXUs while computing MAD instructions dynamically. With this extended design, MaxiMoff achieves an average speedup of 1.39× and reduces total energy consumption by 17.3%.
q-Point: A Numeric Format for Quantum Circuit Simulation Using Polar Form Complex Numbers
Seungwoo Choi, Enhyeok Jang, Youngmin Kim, Sungwoo Ahn, Won Woo Ro
IF 5.4
IEEE Transactions on Emerging Topics in Computing
Quantum circuit simulation is playing a critical role in the current era of quantum computing. However, quantum circuit simulation suffers from huge memory requirements that scale exponentially according to the number of qubits. Our observation reveals that the conventional complex number representation using real and imaginary values adds to the memory overhead beyond the intrinsic cost of simulating quantum states. Instead, using the radius and phase value of a complex number better reflects the properties of the complex values used in the quantum circuit simulation providing better memory efficiency. This paper proposes q-Point, a compact numeric format for quantum circuit simulation that utilizes polar form representation instead of rectangular form representation to store complex numbers. The proposed q-Point format consists of three fields: [i)] exponent bits for radius value mantissa bits for radius value mantissa bits for phase value. However, a naive application of the q-Point format has the potential to cause issues with both simulation accuracy and simulation speed. To preserve simulation accuracy with fewer bits, we use a multi-level encoding scheme that employs different mantissa bits depending on the exponent range. Additionally, to prevent possible slowdown due to the add operation in polar form complex numbers, we use a technique that adaptively applies both polar and rectangular forms. Equipped with these optimizations, the proposed q-Point format demonstrates reasonable simulation accuracy while using only half of the memory requirement using the baseline format. Additionally, the q-Point format enables an average of 1.37× and 1.16× faster simulation for QAOA and VQE benchmark circuits.
MaxiMoff: Designing Matrix Multiplication Accelerator for Effective Multiply-Add Operations Offloading
S. Kim, Dongho Ha, Seunghwan Sung, Won Woo Ro
IF 5.4
IEEE Transactions on Emerging Topics in Computing
Contemporary GPU architectures integrate specialized computing units for matrix multiplication, named matrix multiplication units (MXUs), to effectively process neural network applications. However, since MXUs are limited to matrix multiplications, GPUs show inefficiencies in computing resource utilization while applications are unrelated to matrix multiplications. Furthermore, despite prior work to leverage MXUs in general-purpose computing, they are constrained by static analysis, limiting their adaptability and hardware utilization efficiency. This study observes that the techniques emulating high-bitwidth multiplication with low-bitwidth ones transform a single high-bitwidth Multiply-and-ADd (MAD) operation into a low-bitwidth dot-product operation. Leveraging this observation, we propose <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">MaxiMoff</i>, a novel GPU architecture to utilize general-purpose cores and MXUs while computing MAD instructions dynamically. With this extended design, MaxiMoff achieves an average speedup of 1.39× and reduces total energy consumption by 17.3%.
q-Point: A Numeric Format for Quantum Circuit Simulation Using Polar Form Complex Numbers
Seungwoo Choi, Enhyeok Jang, Youngmin Kim, Sungwoo Ahn, Won Woo Ro
IF 5.4
IEEE Transactions on Emerging Topics in Computing
Quantum circuit simulation is playing a critical role in the current era of quantum computing. However, quantum circuit simulation suffers from huge memory requirements that scale exponentially according to the number of qubits. Our observation reveals that the conventional complex number representation using real and imaginary values adds to the memory overhead beyond the intrinsic cost of simulating quantum states. Instead, using the radius and phase value of a complex number better reflects the properties of the complex values used in the quantum circuit simulation providing better memory efficiency. This paper proposes q-Point, a compact numeric format for quantum circuit simulation that utilizes polar form representation instead of rectangular form representation to store complex numbers. The proposed q-Point format consists of three fields: [i)] exponent bits for radius value mantissa bits for radius value mantissa bits for phase value. However, a naive application of the q-Point format has the potential to cause issues with both simulation accuracy and simulation speed. To preserve simulation accuracy with fewer bits, we use a multi-level encoding scheme that employs different mantissa bits depending on the exponent range. Additionally, to prevent possible slowdown due to the add operation in polar form complex numbers, we use a technique that adaptively applies both polar and rectangular forms. Equipped with these optimizations, the proposed q-Point format demonstrates reasonable simulation accuracy while using only half of the memory requirement using the baseline format. Additionally, the q-Point format enables an average of 1.37× and 1.16× faster simulation for QAOA and VQE benchmark circuits.
Plutarch: Toward Scalable Operational Parallelism on Racetrack-Shaped Trapped-Ion Processors
Enhyeok Jang, H J Kim, Yongju Lee, Jaewon Kwon, Yipeng Huang, Won Woo Ro
ArXiv.org
A recent advancement in quantum computing shows a quantum advantage of certified randomness on the racetrack processor. This work investigates the execution efficiency of this architecture for general-purpose programs. We first explore the impact of increasing zones on runtime efficiency. Counterintuitively, our evaluations using variational programs reveal that expanding zones may degrade runtime performance under the existing scheduling policy. This degradation may be attributed to the increase in track length, which increases ion circulation overhead, offsetting the benefits of enhanced parallelism. To mitigate this, the proposed \textit{Plutarch} exploits 3 strategies: (i) unitary decomposition and translation to maximize zone utilization, (ii) prioritizing the execution of nearby gates over ion circulation, and (iii) implementing shortcuts to provide the alternative path.
Enhyeok Jang, Zihan Chen, Dongho Ha, Seungwoo Choi, Yongju Lee, Jaewon Kwon, Eddy Z. Zhang, Yipeng Huang, Won Woo Ro
arXiv (Cornell University)
The quantum approximate optimization algorithm (QAOA) is a hybrid quantum-classical algorithm for solving combinatorial optimization problems. Multi-angle QAOA (MA-QAOA), which assigns independent parameters to each Hamiltonian operator term, achieves superior approximation performance even with fewer layers than standard QAOA. Unfortunately, this increased expressibility can raise the classical computational cost due to a greater number of parameters. The recently proposed Layerwise MA-QAOA (LMA-QAOA) reduces this overhead by training one layer at a time, but it may suffer from obtaining the precise solution due to the previously fixed parameters. This work addresses two questions for efficient MA-QAOA training: (i) What is the optimal granularity for parameter updates per epoch, and (ii) How can we get precise final cost function results while only partially updating the parameters per epoch? Despite the benefit of reducing the parameters that update per epoch can reduce the classical computation overhead, too fine or coarse a granularity of Hamiltonian update can degrade the MA-QAOA training efficiency. We find that optimizing one complete layer per epoch is an efficient granularity. Moreover, selectively retraining each layer by tracking gradient variations can achieve a final cost function equivalent to the standard MA-QAOA while lowering the parameter update overhead. Based on these insights, we propose Orbit-QAOA, which cyclically revisits layers and selectively freezes stabilized parameters. Across diverse graph benchmarks, Orbit-QAOA reduces training steps by up to 81.8%, reduces approximation ratio error by up to 72x compared to the unified stop condition-applied enhanced LMA-QAOA, and achieves equivalent approximation performance compared to the standard MA-QAOA.
Plutarch: Toward Scalable Operational Parallelism on Racetrack-Shaped Trapped-Ion Processors
Enhyeok Jang, H J Kim, Yongju Lee, Jaewon Kwon, Yipeng Huang, Won Woo Ro
arXiv (Cornell University)
A recent advancement in quantum computing shows a quantum advantage of certified randomness on the racetrack processor. This work investigates the execution efficiency of this architecture for general-purpose programs. We first explore the impact of increasing zones on runtime efficiency. Counterintuitively, our evaluations using variational programs reveal that expanding zones may degrade runtime performance under the existing scheduling policy. This degradation may be attributed to the increase in track length, which increases ion circulation overhead, offsetting the benefits of enhanced parallelism. To mitigate this, the proposed \textit{Plutarch} exploits 3 strategies: (i) unitary decomposition and translation to maximize zone utilization, (ii) prioritizing the execution of nearby gates over ion circulation, and (iii) implementing shortcuts to provide the alternative path.
Toward Scalable Gate-Level Parallelism on Trapped-Ion Processors with Racetrack Electrodes
E Jang, Dongjo Kim, Yongju Lee, Jaewon Kwon, Yipeng Huang, Won Woo Ro
A recent advancement in quantum computing shows a quantum advantage of certified randomness on the racetrack processor. This work investigates the execution efficiency of this architecture for general-purpose programs. We first explore the impact of increasing zones on runtime efficiency. Counterintuitively, our evaluations using variational programs reveal that expanding zones may degrade runtime performance under the existing scheduling policy. This degradation may be attributed to the increase in track length, which increases ion circulation overhead, offsetting the benefits of enhanced parallelism. To mitigate this, the proposed Plutarch exploits 3 strategies: (i) unitary decomposition and translation to maximize zone utilization, (ii) prioritizing the execution of nearby gates over ion circulation, and (iii) implementing shortcuts to provide the alternative path.
Leveraging Phase Polynomials for Quantum Circuits Optimization
Zihan Chen, Henry Chen, Xiangyu Gao, Yuwei Jin, Minghao Guo, Enhyeok Jang, Jiakang Li, Chun Man Chan, Won Woo Ro, Eddy Z. Zhang
Quantum computing has transformative computational power to make classically intractable computing feasible. As the algorithms that achieve practical quantum advantage are beyond manual tuning, quantum circuit optimization has become extremely important and integrated into today's quantum software stack. This paper focuses on a critical type of quantum circuit optimization - phase-polynomial optimization. Phase polynomials represents a class of building-block circuits that appear frequently in quantum modular exponentials (the most time-consuming component in Shor's factoring algorithm), in quantum approximation optimization algorithms (QAOA), and in Hamiltonian simulations. Compared to prior work on phase polynomials, we focus more on the impact of phase polynomial synthesis in the context of whole-circuit optimization, from single-block phase polynomials to multiple block phase polynomials, from greedy equivalent sub-circuit replacement strategies to a systematic parity matrix optimization approach, and from hardware-oblivious logical circuit optimization to potential hardware and fault-tolerant friendly circuit optimization. Our experiments demonstrate improvements of up to 48.58 % with an average total gate reduction of 35.83%-and reductions in the CNOT gate count of up to 40.63%, averaging 27.9%, for logical circuits across a representative set of important benchmarks.
Study of Remote Page Table Entry Caching to Reduce Page Walk Latency in Multi-chip GPUs
Gun Ko, Won Woo Ro
Journal of the Institute of Electronics and Information Engineers
최근 트랜지스터 집적도 향상이 점차 어려워짐에 따라, 단일 GPU의 연산 성능을 지속적으로 향상시키기 위한 방안으로 multi-chip-module (MCM) GPU가 제안되었다. 그러나, MCM GPU는 여러 칩렛이 속도가 느린 오프-칩 인터커넥트를 통해 연결되어 칩렛 간 데이터 전송 시 성능 저하를 초래한다. 본 논문은 가상 메모리 환경에서 페이지 테이블 워크 과정 중 빈번한 리모트 메모리 접근으로 인한 문제를 해결하고자, MCM GPU의 페이지 워크 시 리모트 페이지 테이블 엔트리 (PTE)를 로컬 L2 캐시에 캐싱하는 구조를 제안한다. 제안하는 구조는 리모트 PTE를 로컬 캐시에만 저장하는 방식과 로컬 및 리모트 캐시에 모두 저장하는 두 가지 방법으로 구현되었으며, 이를 바탕으로 각 방식이 페이지 워크 지연 감소에 미치는 효과를 실험을 통해 분석한다. 실험 결과, 두 방식 모두 기존 구조 대비 페이지 워크 지연 시간을 51.8% 이상 단축하였으며 최대 1.7배의 성능 향상을 달성함을 확인하였다. 이러한 결과는 리모트 페이지 워크가 MCM GPU의 주소 변환 성능에 미치는 영향을 시사하며, 효과적인 PTE 캐싱 구조가 성능 개선에 기여할 수 있음을 보여준다.