Reconciling Time Slice Conflicts of Virtual Machines With Dual Time Slice for Clouds
Taeklim Kim, Chang Hyun Park, Jaehyuk Huh, Jeongseob Ahn
IF 6
IEEE Transactions on Parallel and Distributed Systems
The proliferation of system virtualization poses a new challenge for coarse-grained time-sharing techniques for consolidation, since operating systems now run on virtual CPUs. The current system stack was designed under the assumption that operating systems can seize CPU resources at any moment. However, for a guest operating system on a virtual machine (VM), such an assumption cannot be guaranteed, since the virtual CPUs of VMs share a limited number of physical cores. Due to the time-sharing of physical cores, the execution of a virtual CPU is not contiguous, creating a gap between the virtual and real time spaces. This virtual time discontinuity problem leads to significant inefficiency in lock and interrupt handling, which rely on the immediate availability of CPUs whenever the operating system requires computation. To reduce the scheduling latencies of virtual CPUs, shortening time slices is a straightforward strategy, but it can increase context switching costs across virtual machines for some workloads. It is therefore challenging to determine a single time slice that satisfies all VMs. In this article, we propose a dual time slice scheme to resolve the time slice conflicts that occur across different types of virtual machines.
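The dual time slice idea described in the abstract can be illustrated with a toy policy. This is our own sketch, not the paper's mechanism: the `avg_used_fraction` field and the 0.5 threshold are illustrative assumptions, standing in for whatever behavior signal a hypervisor might track.

```python
# Hypothetical dual time slice policy: latency-sensitive VMs receive a
# short slice (low scheduling latency for locks/interrupts), batch VMs
# a long slice (fewer cross-VM context switches). Illustrative only.

SHORT_SLICE_MS = 1.0   # favors interrupt and lock responsiveness
LONG_SLICE_MS = 6.0    # favors cache warmth, amortizes switch cost

def pick_slice(vm):
    """Choose a time slice from a VM's observed slice usage."""
    # A VM that usually yields its vCPU early (I/O waits, interrupts,
    # lock handoffs) is treated as latency-sensitive; one that always
    # exhausts its slice is treated as batch/CPU-bound.
    if vm["avg_used_fraction"] < 0.5:
        return SHORT_SLICE_MS
    return LONG_SLICE_MS

vms = [
    {"name": "web-frontend", "avg_used_fraction": 0.2},
    {"name": "batch-analytics", "avg_used_fraction": 0.97},
]
slices = {vm["name"]: pick_slice(vm) for vm in vms}
```

The point of the two-valued policy is that neither slice length alone satisfies both workload types, which is the time slice conflict the paper targets.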
ZeroKernel: Secure Context-isolated Execution on Commodity GPUs
Ohmin Kwon, Yonggon Kim, Jaehyuk Huh, Hyunsoo Yoon
IF 7.5
IEEE Transactions on Dependable and Secure Computing
In the last decade, the dedicated graphics processing unit (GPU) has emerged as an architecture for high-performance computing workloads. Recently, researchers have also focused on the isolation property of a dedicated GPU and suggested GPU-based secure computing environments with several promising applications. However, despite the security analyses conducted by prior studies, it has been unclear whether a dedicated GPU can be leveraged as a secure processor in the presence of a kernel-privileged attacker. In this paper, we first analyze the security of dedicated GPUs through a comprehensive study of the context information used for GPU execution. The paper shows that a kernel-privileged attacker can manipulate GPU contexts to redirect memory accesses or execute arbitrary GPU code within a running GPU kernel. Based on this security analysis, the paper proposes a new on-chip execution model for the dedicated GPU and a novel defense mechanism that secures the on-chip execution. Through comprehensive evaluation, the paper shows that the proposed solutions effectively isolate sensitive data in on-chip storage and defend against known attack vectors from a privileged attacker, demonstrating that commodity GPUs can be leveraged as secure processors.
https://doi.org/10.1109/tdsc.2019.2946250
GVTS: Global Virtual Time Fair Scheduling to Support Strict Fairness on Many Cores
Changdae Kim, Seungbeom Choi, Jaehyuk Huh
IF 6
IEEE Transactions on Parallel and Distributed Systems
Proportional fairness in CPU scheduling has been widely adopted to distribute CPU shares to threads in proportion to their weights. With the emergence of cloud environments, proportionally fair scheduling has been extended to groups of threads, or nested groups, to support virtual machines and containers. Such proportional fairness is supported by popular schedulers, such as the Linux Completely Fair Scheduler (CFS), through virtual time scheduling. However, CFS, with a distributed runqueue per CPU, implements virtual time scheduling only locally: across different queues, the virtual times of threads are not strictly maintained, to avoid potential scalability bottlenecks. The uneven fluctuation of CPU shares caused by this limitation not only violates the fairness of CPU assignments but also significantly increases the tail latencies of latency-sensitive applications. To mitigate the limitations of CFS, this paper proposes a global virtual time fair scheduler (GVTS), which enforces global virtual time fairness for threads and thread groups even when they run across many physical cores. The new scheduler employs hierarchical, CPU-topology-aware enforcement of target virtual time to enhance scheduler scalability. We implemented GVTS in Linux kernel 4.6.4 with several optimizations to provide global virtual time efficiently. Our experimental results show that GVTS can almost eliminate the fairness violations of CFS for both non-grouped and grouped executions. Furthermore, GVTS can curtail tail latency when latency-sensitive applications co-run with batch tasks.
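The virtual time fairness that GVTS enforces globally can be sketched in a few lines. This is a simplified single-queue model, not the kernel implementation: each thread's virtual time advances as wall time divided by weight, and the scheduler always picks the thread furthest behind in virtual time.

```python
# Minimal virtual time fairness model (our sketch, not GVTS itself).

class Thread:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.vtime = 0.0    # virtual time (ms)
        self.cpu_ms = 0.0   # actual CPU time received (ms)

    def run(self, wall_ms):
        # A heavier thread's virtual time advances more slowly,
        # so it is scheduled proportionally more often.
        self.cpu_ms += wall_ms
        self.vtime += wall_ms / self.weight

def pick_next(threads):
    """Globally fair choice: the thread furthest behind in virtual time."""
    return min(threads, key=lambda t: t.vtime)

threads = [Thread("a", weight=1), Thread("b", weight=2)]
for _ in range(300):
    pick_next(threads).run(1.0)  # one 1 ms scheduling tick
```

With weights 1:2 the heavier thread ends up with twice the CPU time while both virtual times stay equal; maintaining this invariant across many per-CPU runqueues, rather than within one queue, is the hard part GVTS addresses.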
Virtual Snooping Coherence for Multi-Core Virtualized Systems
Daehoon Kim, Chang Hyun Park, Hwanju Kim, Jaehyuk Huh
IF 6
IEEE Transactions on Parallel and Distributed Systems
The proliferation of virtualized systems opens a new opportunity to improve the scalability of multi-core architectures. Among the scalability bottlenecks in multi-cores, cache coherence has been one of the most critical problems. Although snoop-based protocols have dominated commercial multi-core designs, it has been difficult to scale them to more cores, as snooping protocols require high network bandwidth and power consumption to snoop all the caches. In this paper, we propose a novel snoop-based cache coherence protocol, called virtual snooping, for virtualized multi-core architectures. Virtual snooping exploits memory isolation across virtual machines and prevents unnecessary snoop requests from crossing virtual machine boundaries. Each virtual machine becomes a virtual snoop domain, consisting of a subset of the cores in a system. Although the majority of virtual machine memory is isolated, sharing of cachelines across VMs still occurs. To address such data sharing, this paper investigates three factors: data sharing through the hypervisor, virtual machine relocation, and content-based sharing. We explore the design space of virtual snooping with experiments on emulated and real virtualized systems, including the mechanisms and overheads of the hypervisor. In addition, the paper discusses the impact of scheduling on the effectiveness of virtual snooping.
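The core filtering idea behind virtual snooping can be captured in a short conceptual sketch (our simplification, not the hardware protocol): a snoop request is delivered only to cores in the requester's VM domain instead of being broadcast to every cache.

```python
# Conceptual virtual snoop domain filter (illustrative sketch).

def snoop_targets(requester, vm_of_core, cores):
    """Return the cores that must be snooped: same VM domain only."""
    domain = vm_of_core[requester]
    return [c for c in cores if c != requester and vm_of_core[c] == domain]

# Four cores, two VMs of two vCPUs each (hypothetical mapping).
vm_map = {0: "vmA", 1: "vmA", 2: "vmB", 3: "vmB"}
targets = snoop_targets(0, vm_map, [0, 1, 2, 3])
```

A conventional broadcast from core 0 would snoop three remote caches; domain filtering snoops only core 1. The abstract's three sharing cases (hypervisor data, VM relocation, content-based sharing) are exactly the situations where this simple filter would be incorrect and the real protocol needs extra handling.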
A NUCA Substrate for Flexible CMP Cache Sharing
Jaehyuk Huh, Changkyu Kim, Hazim Shafi, Lixin Zhang, Doug Burger, Stephen W. Keckler
IF 6
IEEE Transactions on Parallel and Distributed Systems
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.
Enabling Computation and Communication Overlap in PIMs for on-device LLM Inference
Siu Jeong, Suhwan Kim, Changyong Eom, Jaehyuk Huh
IF 1.4
IEEE Computer Architecture Letters
DRAM-based processing-in-memory (PIM) exploits all-bank execution to accelerate memory-bound workloads, but limited PIM capacity makes runtime data movement unavoidable and often performance-critical. Existing PIM architectures tightly couple computation and memory access across all banks, serializing computation and data transfer and limiting opportunities to hide communication latency. This paper proposes a bank-level computation-communication overlap mechanism that selectively assigns a subset of bank resources to communication, enabling concurrent data transfer without disrupting ongoing computation while preserving the unified control model of all-bank PIM. Evaluations on mixture-of-experts (MoE) and multi-small-language-model (multi-SLM) inference workloads using a simulated LPDDR5-based PIM system show execution time reductions of up to 19% and 16%, respectively, along with consistently improved channel bandwidth utilization compared to all-bank and channel-level overlap baselines.
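A back-of-envelope model shows why reserving banks for communication can pay off. The numbers and the linear compute-slowdown assumption are ours, not the paper's; the sketch only illustrates the serialize-versus-overlap tradeoff.

```python
# Toy timing model for bank-level compute/communication overlap.
# Assumption: compute throughput scales with the fraction of banks
# kept for computation; transfer runs concurrently on reserved banks.

def serialized_time(compute_ms, transfer_ms):
    # All banks coupled: data transfer must wait for computation.
    return compute_ms + transfer_ms

def overlapped_time(compute_ms, transfer_ms, comm_bank_frac):
    # Reserving a fraction of banks slows compute slightly, but the
    # transfer now hides behind it instead of adding to the total.
    slowed_compute = compute_ms / (1.0 - comm_bank_frac)
    return max(slowed_compute, transfer_ms)

base = serialized_time(80.0, 20.0)                 # 100 ms end-to-end
over = overlapped_time(80.0, 20.0, comm_bank_frac=0.125)
```

In this example, giving up one eighth of the banks turns a 100 ms serialized run into roughly 91 ms, since the 20 ms transfer is fully hidden behind the (slightly slower) computation.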
Large language models (LLMs) with growing model sizes use many GPUs to meet memory capacity requirements, incurring substantial costs for token generation. To provide cost-effective LLM inference under relaxed latency constraints, recent studies proposed expanding GPU memory by leveraging host memory. However, in such host memory offloading, the host-GPU interconnect becomes a performance bottleneck for the transfer of weights and key-value (KV) cache entries, causing GPU underutilization. To address this challenge, we introduce Capture, an LLM inference system with KV-activation hybrid caching. The activation cache stores activation checkpoints instead of keys and values during intermediate inference stages, requiring half the memory capacity of the conventional KV cache. While model parameters are transferred to the GPU from host memory, idle GPU resources can be utilized for partial recomputation. To balance the latency of activation recomputation and parameter loading, our KV-activation hybrid caching scheme determines the optimal ratio between the KV and activation caches to manage both recomputation and data transfer times. Capture achieves a 2.19× throughput improvement over the state-of-the-art prior work that offloads both model weights and the KV cache.
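The ratio-balancing idea can be reduced to a small calculation. This is our simplified model of the tradeoff, not Capture's actual policy, and the per-token recomputation cost and transfer window below are invented numbers: pick the largest activation-cache fraction whose recomputation still fits inside the weight-transfer window, so recomputation is free on otherwise idle GPU cycles.

```python
# Toy model of the KV/activation cache split (hypothetical parameters).

def best_act_fraction(tokens, recompute_per_token_ms, transfer_ms):
    """Largest activation-cache fraction whose recomputation hides
    behind the weight-transfer window; beyond it, recomputation
    itself becomes the bottleneck."""
    budget = transfer_ms / (tokens * recompute_per_token_ms)
    return min(1.0, budget)

# Example: 1024 cached tokens, 0.02 ms recompute per token,
# 10.24 ms to load the next layer's weights over the interconnect.
frac = best_act_fraction(tokens=1024,
                         recompute_per_token_ms=0.02,
                         transfer_ms=10.24)
```

Here half the cache can be kept as activation checkpoints (halving its footprint per the abstract) while the recomputation of those tokens, 512 × 0.02 ms = 10.24 ms, exactly overlaps the parameter transfer.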
Unified Memory Protection with Multi-granular MAC and Integrity Tree for Heterogeneous Processors
S. R. Lee, Seonjin Na, Jeongwon Choi, Jinwon Pyo, Jaehyuk Huh
Recent system-on-a-chip (SoC) architectures for edge systems incorporate a variety of processing units, such as CPUs, GPUs, and NPUs. Although hardware-based memory protection is crucial for the security of edge systems, conventional mechanisms suffer significant performance degradation in such heterogeneous SoCs due to increased memory traffic with diverse access patterns from different processing units. To mitigate the overheads, recent studies targeting a specific domain, such as machine learning software or accelerators, proposed techniques based on custom granularities applicable either to counters or to MACs, but not both. In response to this challenge, we propose a unified mechanism to support both multi-granular MACs and counters in a device-independent way. It supports a granularity-aware integrity tree that adapts to various access patterns. The multi-granular tree architecture stores both coarse-grained and fine-grained counters at different levels of the tree. Combined with the multi-granularity technique for MACs, our optimization, termed multi-granular MAC&tree, supports four different levels of granularity. Its dynamic detection mechanism selects the most appropriate granularity for the different memory regions accessed by heterogeneous processing units. In addition, we combine the multi-granularity support with prior subtree approaches to further reduce the overheads. Our simulation-based evaluation shows that the multi-granular MAC and tree reduce execution time by 14.2% compared to the conventional fixed-granular MAC&tree. By incorporating prior subtree techniques, they reduce execution time by 21.1% compared to the conventional fixed-granular MAC&tree.
Supporting Secure Multi-GPU Computing with Dynamic and Batched Metadata Management
Seonjin Na, Jung-Woo Kim, Sunho Lee, Jaehyuk Huh
With growing problem sizes for GPU computing, multi-GPU systems with fine-grained memory sharing have emerged to improve on the current coarse-grained unified memory support based on page migration. Such multi-GPU systems with shared memory pose a new challenge in securing CPU-GPU and inter-GPU communication, as the cost of secure data transfers adds a significant performance overhead. Secure communication in multi-GPU systems incurs two overheads: first, extra latency to generate one-time pads (OTPs) for authenticated encryption; second, security metadata such as MACs and counters, passed along with encrypted data, consume precious network bandwidth. This study investigates the performance impact of secure communication in multi-GPU systems and evaluates prior CPU-oriented OTP precomputation schemes adapted to multi-GPU systems. Our investigation identifies the challenge posed by limited OTP buffers for inter-GPU communication and the opportunity to reduce security metadata traffic given the bursty communication of GPUs. Based on this analysis, the paper proposes a new dynamic OTP buffer allocation technique, which adjusts the buffer assignment for each source-destination pair to reflect communication patterns. To address the bandwidth cost of extra security metadata, the study employs a dynamic batching scheme that transfers only a single set of metadata for each batched group of data responses. The proposed design constantly tracks the communication pattern from each GPU, periodically adjusts the allocated buffer sizes, and dynamically forms batches of data transfers. Our evaluation shows that in a 16-GPU system, the proposed scheme improves performance by 13.2% and 17.5% on average over the prior cached and private schemes, respectively.
https://doi.org/10.1109/hpca57654.2024.00025
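A traffic-proportional buffer split conveys the flavor of the dynamic OTP allocation. This sketch is our assumption, not the paper's algorithm: a fixed pool of OTP buffer entries is divided among source-destination pairs by recently observed traffic, with a one-entry floor so no link starves.

```python
# Illustrative OTP buffer allocation by observed traffic share.

def allocate_otp_buffers(traffic, total_entries):
    """traffic: {(src, dst): bytes observed in the last epoch}.
    Give every pair one entry, then split the rest proportionally
    (floor division; any leftover entries stay in reserve)."""
    alloc = {pair: 1 for pair in traffic}
    remaining = total_entries - len(traffic)
    total = sum(traffic.values())
    for pair, bytes_seen in traffic.items():
        alloc[pair] += int(remaining * bytes_seen / total)
    return alloc

# Hypothetical epoch: one hot link and two cold ones, 32-entry pool.
alloc = allocate_otp_buffers(
    {("gpu0", "gpu1"): 800, ("gpu0", "gpu2"): 150, ("gpu0", "gpu3"): 50},
    total_entries=32,
)
```

Rerunning this each epoch is the "periodically adjusts the allocated buffer sizes" step in the abstract: the hot gpu0-gpu1 link captures most of the pool while idle pairs keep a minimal allocation.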
Privacy-Preserving Consumer Churn Prediction in Telecommunication Through Federated Machine Learning
Jaehyuk Huh, Woongsup Lee
In the competitive telecommunications industry, understanding and predicting customer churn (customers discontinuing service) is crucial for revenue and subscriber retention. Traditional customer churn prediction (CCP) methods require extensive user data, raising privacy concerns when data is shared across different companies. This paper introduces a novel federated learning (FL) framework for CCP that enhances prediction accuracy while safeguarding privacy.
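The privacy-preserving aggregation at the heart of such an FL framework can be sketched with a FedAvg-style merge (a generic illustration, not this paper's exact protocol): each operator trains locally, and only model weights, never raw subscriber records, are combined by the server.

```python
# Minimal FedAvg-style aggregation sketch (weights as flat lists).

def fed_avg(client_updates):
    """client_updates: list of (num_samples, weights) pairs.
    Returns the sample-weighted average of the client models."""
    total = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    merged = [0.0] * dim
    for n, weights in client_updates:
        for i in range(dim):
            merged[i] += (n / total) * weights[i]
    return merged

# Two hypothetical operators with 100 and 300 local samples.
global_model = fed_avg([(100, [1.0, 2.0]), (300, [3.0, 4.0])])
```

The operator with more subscribers pulls the global model toward its local weights, while the raw churn records that motivated the privacy concern never leave either company.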