Compressed textures are indispensable in most 3D graphics applications because they reduce memory traffic and increase performance. For higher-quality graphics, the number and size of textures in an application have continuously increased. Additionally, the ETC2 texture format, which is mandatory in OpenGL ES 3.0, OpenGL 4.3, and Android 4.3 (and later versions), requires more complex texture compression than the traditional ETC1 format. As a result, texture compression has become increasingly time-consuming. To accelerate ETC2 compression, we introduce two new compression techniques, named QuickETC2. The first technique is an early compression-mode decision scheme. Instead of testing all ETC1/2 modes to compress a texel block, we select appropriate modes for each block by exploiting the block's luma difference, reducing unnecessary compression overhead. The second technique is a fast luma-based T- and H-mode compression method. When clustering the texels of a block into two groups, we replace the 3D RGB space with the 1D luma space and quickly find the two groups that have the minimum luma differences. We also selectively perform the T- or H-mode and reduce its distance candidates according to the luma differences of each group. We have implemented both techniques with AVX2 intrinsics to exploit SIMD parallelism. According to our experiments, QuickETC2 can compress more than 2000 1K×1K-sized images per second on an octa-core CPU.
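As a hedged illustration of the early mode-decision idea, the sketch below scores a 4×4 block by its luma range and prunes the candidate ETC1/2 modes accordingly. The BT.601 luma weights and both thresholds are illustrative assumptions, not the constants used in QuickETC2 itself:

```python
# Sketch of a luma-based early compression-mode decision for a 4x4 texel block.
# The BT.601 luma weights and the two thresholds are illustrative assumptions,
# not the constants used in QuickETC2 itself.

def luma(r, g, b):
    """Approximate perceptual luma of one texel (BT.601 weights)."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def select_modes(block):
    """block: list of 16 (r, g, b) tuples for a 4x4 texel block.
    Returns the subset of ETC1/2 modes worth testing for this block."""
    lumas = [luma(*texel) for texel in block]
    luma_range = max(lumas) - min(lumas)
    if luma_range < 8:        # nearly flat block: ETC1-style modes suffice
        return ["individual", "differential"]
    elif luma_range < 64:     # moderate contrast: add the planar mode
        return ["individual", "differential", "planar"]
    else:                     # high contrast: T/H modes become worthwhile
        return ["individual", "differential", "T", "H"]

flat_block = [(100, 100, 100)] * 16
sharp_block = [(0, 0, 0)] * 8 + [(255, 255, 255)] * 8
print(select_modes(flat_block))   # few candidate modes for a flat block
print(select_modes(sharp_block))  # high-contrast block also tries T and H
```

A flat block skips the expensive T/H search entirely, which is where most of the speedup of an early mode decision comes from.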
Z<sup>2</sup> traversal order: An interleaving approach for VR stereo rendering on tile-based GPUs
Jae‐Ho Nah, Yeongkyu Lim, Sunho Ki, Chulho Shin
IF 18.3
Computational Visual Media
With the increasing demands of virtual reality (VR) applications, efficient VR rendering techniques are becoming essential. VR stereo rendering incurs extra computational cost because the views for the left and right eyes must be rendered separately. To reduce this cost, we present a novel traversal order for tile-based mobile GPU architectures: Z<sup>2</sup> traversal order. In tile-based mobile GPU architectures, a tile traversal order that maximizes spatial locality can increase GPU cache efficiency. For VR applications, our approach improves upon the traditional Z-order curve: we render corresponding screen tiles in the left and right views in turn, or simultaneously, and can thereby exploit the spatial adjacency of the two tiles. To evaluate our approach, we conducted a trace-driven hardware simulation using Mesa and a hardware simulator. Our experimental results show that Z<sup>2</sup> traversal order can reduce external memory bandwidth requirements and increase rendering performance.
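The interleaved schedule can be sketched as follows. The Morton-code helper and the simple left-then-right pairing per tile are illustrative assumptions about one possible software realization, not the paper's hardware implementation:

```python
# Sketch of a Z^2-style tile schedule: tiles are visited in Morton (Z-curve)
# order, and the corresponding left- and right-eye tiles are rendered back to
# back so their shared texture and geometry data stay cache-resident.
# The grid size and the per-tile L/R pairing are illustrative assumptions.

def morton_index(x, y, bits=16):
    """Interleave the bits of x and y to get the Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def z2_schedule(tiles_x, tiles_y):
    """Yield (eye, x, y) work items: each Z-ordered tile is processed for the
    left eye, then immediately for the right eye."""
    tiles = sorted(((x, y) for x in range(tiles_x) for y in range(tiles_y)),
                   key=lambda t: morton_index(*t))
    for x, y in tiles:
        yield ("L", x, y)
        yield ("R", x, y)

schedule = list(z2_schedule(2, 2))
# Tile (0, 0) is rendered for both eyes before the schedule moves on.
```

Because the left- and right-eye views of the same tile reference nearly identical scene data, rendering them adjacently maximizes reuse in the GPU caches.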
HART: A Hybrid Architecture for Ray Tracing Animated Scenes
Jae‐Ho Nah, Jinwoo Kim, Junho Park, Won‐Jong Lee, Jeong‐Soo Park, Seokyoon Jung, Woo-Chan Park, Dinesh Manocha, Tack‐Don Han
IF 6.5
IEEE Transactions on Visualization and Computer Graphics
We present a hybrid architecture, inspired by asynchronous BVH construction [1], for ray tracing animated scenes. Our hybrid architecture utilizes heterogeneous hardware resources: dedicated ray-tracing hardware for BVH updates and ray traversal, and a CPU for BVH reconstruction. We also present a traversal scheme using a primitive's axis-aligned bounding box (PrimAABB). This scheme reduces ray-primitive intersection tests by reusing existing BVH traversal units and the PrimAABB data for tree updates; it enables the use of shallow trees, which reduces tree build times, tree sizes, and bus bandwidth requirements. Furthermore, we present a cache scheme that exploits consecutive memory accesses by reusing data in an L1 cache block. We performed cycle-accurate simulations to verify our architecture, and the simulation results indicate that the proposed architecture can achieve real-time Whitted ray tracing of animated scenes at 1,920 × 1,200 resolution. This result comes from our high-performance hardware architecture and the minimized resource requirements for tree updates.
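The culling step that a PrimAABB-style scheme reuses is the standard slab-test ray/AABB intersection, the same logic a BVH traversal unit already applies to interior nodes. A minimal sketch (the ray/box representation here is an illustrative assumption):

```python
# Slab-test ray/AABB intersection: before a full ray-triangle test, the ray is
# tested against the primitive's axis-aligned bounding box with the same logic
# used for BVH-node traversal, so a miss skips the expensive triangle test.

def hit_aabb(origin, inv_dir, box_min, box_max):
    """Returns True if the ray (origin, precomputed 1/direction) hits the AABB."""
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        t0 = (box_min[axis] - origin[axis]) * inv_dir[axis]
        t1 = (box_max[axis] - origin[axis]) * inv_dir[axis]
        if t0 > t1:
            t0, t1 = t1, t0                      # order slab entry/exit
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False                         # slabs do not overlap: miss
    return True

inf = float("inf")
unit_box = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
# A ray along +x through the unit box hits; the same ray reversed misses.
print(hit_aabb((-2.0, 0.5, 0.5), (1.0, inf, inf), *unit_box))
print(hit_aabb((-2.0, 0.5, 0.5), (-1.0, inf, inf), *unit_box))
```

Precomputing the reciprocal direction once per ray, as above, is the usual trick that keeps the per-box cost to a few multiplies and compares.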
Jae‐Ho Nah, Hyuck-Joo Kwon, Dongseok Kim, Cheol-Ho Jeong, Jin‐Hong Park, Tack‐Don Han, Dinesh Manocha, Woo-Chan Park
IF 9.5
ACM Transactions on Graphics
We present RayCore, a mobile ray-tracing hardware architecture. RayCore facilitates high-quality rendering effects, such as reflection, refraction, and shadows, on mobile devices by performing real-time Whitted ray tracing. RayCore consists of two major components: ray-tracing units (RTUs) based on a unified traversal and intersection pipeline and a tree-building unit (TBU) for dynamic scenes. The overall RayCore architecture offers considerable benefits in terms of die area, memory access, and power consumption. We have evaluated our architecture based on FPGA and ASIC evaluations and demonstrate its performance on different benchmarks. According to the results, our architecture demonstrates high performance per unit area and unit energy, making it highly suitable for use in mobile devices.
Jae‐Ho Nah, Jeong‐Soo Park, Chanmin Park, Jin‐Woo Kim, Yun-Hye Jung, Woo-Chan Park, Tack‐Don Han
IF 9.5
ACM Transactions on Graphics
Ray tracing naturally supports high-quality global illumination effects, but it is computationally costly, and traversal and intersection operations dominate its computation. To accelerate these two operations, we propose a hardware architecture that integrates three novel approaches. First, we present an ordered depth-first layout and a traversal architecture using this layout to reduce the required memory bandwidth. Second, we propose a three-phase ray-triangle intersection architecture that takes advantage of early exit. Third, we propose a latency-hiding architecture defined as the ray accumulation unit. Cycle-accurate simulation results indicate that our architecture can achieve interactive distributed ray tracing.
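The idea behind a depth-first layout can be illustrated with a small sketch: flattening the tree in depth-first order places each node's first child at the very next array index, so downward traversal becomes near-sequential memory access. The node representation and the "skip" field below are illustrative assumptions, not the paper's exact node format:

```python
# Sketch of a depth-first acceleration-structure layout: nodes are flattened in
# depth-first order so each interior node's first child sits at the next array
# index; a 'skip' index records where traversal resumes when a subtree is missed.

def flatten_depth_first(node, out):
    """node: (left, right) for interiors or ('leaf', data) for leaves.
    Appends flat entries {'leaf': data_or_None, 'skip': index_after_subtree}."""
    entry = {"leaf": None, "skip": None}
    out.append(entry)
    if node[0] == "leaf":
        entry["leaf"] = node[1]
    else:
        left, right = node
        flatten_depth_first(left, out)   # left child lands at the next index
        flatten_depth_first(right, out)
    entry["skip"] = len(out)             # where traversal resumes on a miss
    return out

tree = ((("leaf", "A"), ("leaf", "B")), ("leaf", "C"))
flat = flatten_depth_first(tree, [])
# flat[1] is the left subtree's root, stored directly after the root at flat[0].
```

With this layout, a hit continues at `index + 1` and a miss jumps to `skip`, so no per-node child pointers need to be fetched on the common downward path.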
Considerations for the Acceleration Structure of Sound Propagation on Mobile Devices: Kd-Trees Versus Multi-Bounding Volume Hierarchies
Hyeon-ki Lee, Hyeju Kim, Dong-Yun Kim, Woo-Chan Park, Jae‐Ho Nah
IF 3.6
IEEE Access
Sound propagation algorithms can provide immersive auditory experiences for users in various domains, such as games and virtual/augmented reality. Recently, there has been a growing trend to apply spatial sound effects not only on desktops but also on mobile devices with limited computational resources. In geometric-acoustic (GA) sound propagation, which uses ray tracing for spatial audio effects, the choice of acceleration structure is an important factor that can either degrade or enhance performance. From this perspective, we seek to provide insights into the essential considerations for selecting acceleration structures for sound rendering on mobile devices. In this paper, we propose guidelines for mobile devices that address both how to select acceleration structures for sound rendering depending on scene characteristics and how to optimize the selected structures. We used kd-trees and multi-bounding volume hierarchies (MBVHs), both of which are widely used acceleration structures for ray tracing. According to our experiments on a Google Pixel 8a, when compared against baseline kd-trees, our optimization approach achieved performance improvements of up to 1.33× for optimized kd-trees and 1.44× for MBVHs with minimal increases in power consumption, and it also enabled an analysis of the advantages and disadvantages of each acceleration structure in various test scenes. We expect that our research will serve as a valuable reference for future studies on sound propagation and the broader multimedia community.
A Practical Encoding Approach for Texture Compression: Combining Multi-Processing and Multi-Threading
Hyeon-ki Lee, Jae-Ho Nah
High-resolution textures are critical for delivering immersive graphics. In game development, these textures are typically stored in compressed formats and encoded offline. However, when encoding a large number of textures in parallel, the performance benefits of multi-threading can be limited by bottlenecks, including image loading and decoding (e.g., PNG).
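The bottleneck the abstract describes can be sketched as a two-stage pipeline, so that slow image loading does not serialize the encoders. The file names and the fake load/encode stand-ins below are illustrative assumptions; in a real encoder the CPU-bound encode stage would typically run in a `ProcessPoolExecutor` to sidestep the GIL, while the I/O-bound load stage stays on threads:

```python
# Sketch of a two-stage batch-encoding pipeline: one pool loads/decodes source
# images while another compresses them. The file names and the stand-in
# load/encode functions are illustrative assumptions, not a real codec.
from concurrent.futures import ThreadPoolExecutor

def load_texture(name):
    """Stand-in for reading and decoding a PNG from disk (I/O-bound)."""
    return (name, [0] * 16)           # pretend 4x4 texel payload

def encode_texture(loaded):
    """Stand-in for the CPU-bound block-compression pass."""
    name, texels = loaded
    return (name, len(texels) // 16)  # pretend: number of compressed blocks

def encode_batch(names, load_workers=4, encode_workers=4):
    with ThreadPoolExecutor(load_workers) as loaders, \
         ThreadPoolExecutor(encode_workers) as encoders:
        loaded = loaders.map(load_texture, names)          # stage 1: I/O
        return list(encoders.map(encode_texture, loaded))  # stage 2: compute

results = encode_batch([f"tex_{i}.png" for i in range(8)])
```

Separating the stages lets the loader pool keep the encoder pool fed, which is the essence of combining multi-processing with multi-threading for batch texture encoding.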
Efficient Haze Removal from a Single Image Using a DCP-Based Lightweight U-Net Neural Network Model
Yunho Han, Jiyoung Kim, Jinyoung Lee, Jae‐Ho Nah, Yo‐Sung Ho, Woo-Chan Park
IF 3.5
Sensors
In this paper, we propose a lightweight U-Net neural network model based on the Dark Channel Prior (DCP) for efficient haze (fog) removal from a single input image. The conventional DCP involves operations of high computational complexity that are difficult to accelerate; the problem is exacerbated for high-resolution images and videos, making the method very difficult to apply in general-purpose applications. Our proposed model addresses this issue by employing a two-stage neural network structure, replacing the computationally complex operations of the conventional DCP with easily accelerated convolution operations to achieve high-quality fog removal. Furthermore, our proposed model is designed with an intuitive structure and a relatively small number of parameters (2M), utilizing resources efficiently. These features demonstrate the effectiveness and efficiency of the proposed model for fog removal. The experimental results show that the proposed model achieves an average Peak Signal-to-Noise Ratio (PSNR) of 26.65 dB and a Structural Similarity Index Measure (SSIM) of 0.88, an improvement of 11.5 dB in average PSNR and 0.22 in SSIM over the conventional DCP. This shows that the proposed network achieves results comparable to CNN-based networks with SOTA-class performance, despite its intuitive structure and relatively small parameter count.
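For context, the computation that DCP-based dehazing starts from is the dark channel itself: for every pixel, the minimum over the RGB channels within a local patch. A minimal pure-Python sketch (the patch size is an illustrative choice, and nested lists stand in for a real image array):

```python
# Sketch of the dark-channel computation underlying DCP-based dehazing:
# for every pixel, take the minimum over R, G, B within a local patch.
# Pure Python over nested lists for clarity; patch size is an illustrative choice.

def dark_channel(image, patch=3):
    """image: H x W list of (r, g, b) tuples with values in [0, 255].
    Returns an H x W list of dark-channel values."""
    h, w = len(image), len(image[0])
    half = patch // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = []
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        vals.append(min(image[ny][nx]))  # min over R, G, B
            out[y][x] = min(vals)                        # min over the patch
    return out

hazy = [[(200, 210, 220)] * 4 for _ in range(4)]   # bright, hazy region
clear = [[(10, 120, 200)] * 4 for _ in range(4)]   # region with a dark channel
# Haze-free regions tend toward low dark-channel values; hazy ones stay high,
# which is the prior the estimation of the transmission map builds on.
```

The per-pixel patch minimum is exactly the kind of irregular, data-dependent operation that is hard to accelerate, which motivates replacing it with convolutions in the proposed network.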
An Architecture and Implementation of Real-Time Sound Propagation Hardware for Mobile Devices
Eunjae Kim, Sukwon Choi, J.K. Kim, Jae‐Ho Nah, Woonam Jung, Tae-Hyeong Lee, Yeon-Kug Moon, Woo-Chan Park
This paper presents a high-performance, low-power hardware architecture for real-time sound rendering on mobile devices. Traditional sound rendering algorithms require high-performance CPUs or GPUs because of the high computational complexity of realizing ultra-realistic 3D audio; thus, it has been hard to achieve real-time rates on low-power mobile devices. To overcome this limitation, we propose a hardware architecture that adopts hardware-friendly sound-propagation-path calculation algorithms. We verified the function and performance of our architecture through its implementation on an FPGA board. According to an ASIC evaluation with 8-nm process technology, it achieves high performance (120 FPS), low power consumption (50 mW), and a small silicon area (0.31 mm<sup>2</sup>), enabling real-time sound rendering on mobile devices.