Efficient Occupancy Prediction with Instance-Level Attention
Sungjin Park, Jaeha Song, Soonmin Hwang
Occupancy prediction is a critical task in autonomous driving, enabling better understanding of 3D environments for downstream tasks. Previous methods often rely on dense back-projection to extract 3D features from 2D images by distributing information across all voxels. While effective, these approaches are computationally expensive and inefficient due to the dense nature of 3D voxel representations. Inspired by recent works, we address this challenge with instance-level attention, which utilizes representative queries for groups of voxels, reducing computational cost while maintaining competitive performance. By applying attention mechanisms to this instance-level representation, we achieve an mIoU of 35.26 with a latency of 0.04 s on the Occ3D dataset on an RTX 4090. These results demonstrate that focusing on instance-level representations provides an efficient and practical solution for real-time occupancy prediction.
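The core idea of the abstract above — attending over a few representative queries instead of every voxel — can be illustrated with a minimal sketch. This is not the paper's architecture; the mean pooling, the single attention head, and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def instance_level_attention(voxel_feats, assign, num_queries, d):
    """Illustrative sketch: pool voxels into representative queries,
    run attention at the query level, then scatter results back.

    voxel_feats: (N, d) features for N voxels
    assign:      (N,) index of the query/instance each voxel belongs to
    """
    # 1) Pool each voxel group into one representative query (mean pooling
    #    here, a stand-in for however groups are actually formed).
    queries = np.zeros((num_queries, d))
    counts = np.bincount(assign, minlength=num_queries).clip(min=1)
    np.add.at(queries, assign, voxel_feats)
    queries /= counts[:, None]

    # 2) Self-attention among the (few) queries instead of all voxels:
    #    cost is O(Q^2) rather than O(N^2), which is the efficiency gain.
    attn = softmax(queries @ queries.T / np.sqrt(d))
    updated = attn @ queries

    # 3) Broadcast each updated query back to its member voxels.
    return voxel_feats + updated[assign]
```

The payoff is in step 2: with a few thousand voxel groups replaced by a handful of queries, the quadratic attention cost shrinks accordingly.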
Jeongmin Shin, Jiwon Kim, S. M. Kwon, Namil Kim, Soonmin Hwang, Yukyung Choi
IF 4.5
IEEE Sensors Journal
Multispectral image alignment plays a crucial role in exploiting complementary information between different spectral images. Homography-based image alignment can be a practical solution considering the tradeoff between runtime and accuracy. Existing methods, however, have difficulty with multispectral images due to the additional spectral gap, or require expensive human labels to train models. To solve these problems, this paper presents a comprehensive study on multispectral homography estimation in an unsupervised manner. We propose curriculum data augmentation, an effective solution for models learning spectrum-agnostic representations by providing diverse input pairs. We also propose to use a phase congruency loss that explicitly evaluates the reconstruction between images based on low-level structural information in the frequency domain. To encourage multispectral alignment research, we introduce a novel FLIR correspondence dataset that has manually labeled local correspondences between multispectral images. Our model achieves state-of-the-art alignment performance on the proposed FLIR correspondence dataset among supervised and unsupervised methods while running at 151 FPS. Furthermore, our model shows good generalization ability on the M3FD dataset without finetuning.
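The intuition behind a frequency-domain structural loss like the one described above can be sketched with a toy example: phase spectra carry edge and structure information while being largely insensitive to intensity differences between spectra. This is a simplified phase-comparison illustration, not the paper's actual phase congruency formulation.

```python
import numpy as np

def phase_loss(img_a, img_b):
    """Toy frequency-domain loss: compare only the phase of each image's
    spectrum. Because the magnitude is normalized away, a global intensity
    change (e.g. a spectrum-dependent brightness shift) barely affects it.
    Simplified stand-in for phase congruency, for illustration only.
    """
    fa, fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    # Normalize each frequency bin to unit magnitude: only phase remains.
    pa = fa / (np.abs(fa) + 1e-8)
    pb = fb / (np.abs(fb) + 1e-8)
    return float(np.mean(np.abs(pa - pb)))
```

For identical structure the loss is zero regardless of contrast, which is the property that makes phase-based comparisons attractive across spectral gaps.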
TransDSSL: Transformer Based Depth Estimation via Self-Supervised Learning
Daechan Han, Jeongmin Shin, Namil Kim, Soonmin Hwang, Yukyung Choi
IF 5.3
IEEE Robotics and Automation Letters
Recently, transformers have been widely adopted for various computer vision tasks and show promising results due to their ability to encode long-range spatial dependencies in an image effectively. However, very few studies on adopting transformers in self-supervised depth estimation have been conducted. When replacing the CNN architecture with a transformer in self-supervised learning of depth, we encounter several problems, such as the multi-scale photometric loss becoming problematic when used with transformers and an insufficient ability to capture local details. In this letter, we propose an attention-based decoder module, Pixel-Wise Skip Attention (PWSA), to enhance fine details in feature maps while keeping global context from transformers. In addition, we propose utilizing a self-distillation loss with a single-scale photometric loss to alleviate the instability of transformer training by providing correct training signals. We demonstrate that the proposed model performs accurate predictions on large objects and thin structures that require global context and local details. Our model achieves state-of-the-art performance among self-supervised monocular depth estimation methods on the KITTI and DDAD benchmarks.
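The idea of an attention-gated skip connection, as in the PWSA module described above, can be sketched minimally. This is in the spirit of the module rather than its exact design: the scalar per-pixel gate and the weight vector `w` are hypothetical simplifications.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pixel_wise_skip_attention(decoder_feat, skip_feat, w):
    """Sketch of an attention-gated skip connection: a per-pixel gate
    computed from the decoder feature decides how much local detail from
    the encoder skip feature passes through, while the transformer's
    global context in decoder_feat is preserved via the residual path.

    decoder_feat, skip_feat: (C, H, W); w: (C,) hypothetical gate weights.
    """
    # Per-pixel scalar gate in (0, 1), a stand-in for the learned attention.
    gate = sigmoid(np.einsum('c,chw->hw', w, decoder_feat))
    # Residual path keeps global context; gated path adds local detail.
    return decoder_feat + gate[None] * skip_feat
```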
Boosting Cross-Spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation
S. M. Kwon, Jeongmin Shin, Namil Kim, Soonmin Hwang, Yukyung Choi
In autonomous driving, thermal image semantic segmentation has emerged as a critical research area, owing to its ability to provide robust scene understanding under adverse visual conditions. In particular, unsupervised domain adaptation (UDA) for thermal image segmentation can be an efficient solution to address the lack of labeled thermal datasets. Nevertheless, since these methods do not effectively utilize the complementary information between RGB and thermal images, they suffer significant performance drops during domain adaptation. In this paper, we present a comprehensive study on cross-spectral UDA for thermal image semantic segmentation. We first propose a novel masked mutual learning strategy that promotes complementary information exchange by selectively transferring results between each spectral model while masking out uncertain regions. Additionally, we introduce a novel prototypical self-supervised loss designed to enhance the performance of the thermal segmentation model in nighttime scenarios. This approach addresses the limitations of RGB pre-trained networks, which cannot effectively transfer knowledge under low illumination due to the inherent constraints of RGB sensors. In experiments, our method outperforms previous UDA methods and achieves performance comparable to state-of-the-art supervised methods.
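The masking step in the mutual learning strategy described above can be sketched as follows. The confidence threshold and the ignore-label convention are illustrative assumptions, not the paper's values.

```python
import numpy as np

def masked_pseudo_labels(probs, conf_thresh=0.9):
    """Sketch of the masking step in masked mutual learning: one spectral
    model's softmax output (H, W, K) becomes pseudo-labels for the other
    model, but pixels whose maximum probability falls below a confidence
    threshold are masked out (label -1 = ignore), so only reliable
    predictions are exchanged between the RGB and thermal branches.
    """
    conf = probs.max(axis=-1)          # per-pixel confidence
    labels = probs.argmax(axis=-1)     # per-pixel class prediction
    labels[conf < conf_thresh] = -1    # mask uncertain regions
    return labels
```

Each branch would train on the other branch's masked labels, exchanging complementary information while uncertain regions contribute no gradient.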
DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation
Jisoo Kim, Jungbin Cho, Joonho Park, Soonmin Hwang, Da Eun Kim, Geon Kim, Youngjae Yu
Proceedings of the AAAI Conference on Artificial Intelligence
Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of applications. Despite recent advancements in achieving realistic lip motion, current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion. These limitations result in blunt and repetitive facial animations, reducing user engagement and hindering their applicability. To address these challenges, we introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs. To achieve this, we first train DEE (Dynamic Emotion Embedding), which employs probabilistic contrastive learning to forge a joint emotion embedding space for both speech and facial motion. This probabilistic framework captures the uncertainty in interpreting emotions from speech and facial motion, enabling the derivation of emotion vectors from its multifaceted space. Moreover, to generate dynamic facial motion, we design TH-VQVAE (Temporally Hierarchical VQ-VAE) as an expressive and robust motion prior overcoming limitations of VAEs and VQ-VAEs. Utilizing these strong priors, we develop DEEPTalk, a talking head generator that non-autoregressively predicts codebook indices to create dynamic facial motion, incorporating a novel emotion consistency loss. Extensive experiments on various datasets demonstrate the effectiveness of our approach in creating diverse, emotionally expressive talking faces that maintain accurate lip-sync. Our project page is available at https://whwjdqls.github.io/deeptalk.github.io/.
Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation
S. M. Kwon, Jeongmin Shin, Namil Kim, Soonmin Hwang, Yukyung Choi
ArXiv.org
RoCaRS: Robust Camera-Radar BEV Segmentation for Sensor Failure Scenarios
B. Park, Jeongtae Kim, Yunseol Cho, Soonmin Hwang
While camera–radar fusion has led to notable progress in autonomous driving, many existing approaches overlook the risk of sensor failures, which can critically compromise system safety. To address this limitation, we propose RoCaRS, a robust camera–radar fusion model designed for bird’s-eye view (BEV) segmentation under sensor failure scenarios. RoCaRS incorporates two key components—Radar-aware Backbone (RB) and Feature Spreading (FS)—to enhance BEV feature representation, along with a Dynamic Input Dropout Strategy (DIDS) and Bidirectional Feature Refinement (BFR) to address missing sensor inputs. Experiments on the nuScenes benchmark show that RoCaRS not only outperforms state-of-the-art fusion models under normal conditions but also maintains high performance under various sensor failure settings. Notably, in the complete absence of camera input, RoCaRS exceeds the baseline by +23.2 mIoU for map and +30.0 IoU for vehicle. Furthermore, it retains 99% of the radar-only model’s performance and achieves 103% of the camera-only model’s performance when either all cameras or all radars are disabled—without any retraining. These results highlight the potential of intermediate fusion to match the robustness of late fusion, while more effectively leveraging complementary modalities.
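A Dynamic Input Dropout Strategy as described above can be sketched minimally: during training, one modality's features are randomly zeroed so the fusion network learns to cope with missing sensors. The drop probabilities and the zeroing scheme are illustrative assumptions, not the paper's exact DIDS.

```python
import numpy as np

def dynamic_input_dropout(cam_feat, radar_feat, rng, p_drop=0.25):
    """Sketch of modality-level input dropout for robust fusion training:
    with probability p_drop drop the camera features, with probability
    p_drop drop the radar features, otherwise keep both. Simulates sensor
    failure at train time so inference tolerates a missing modality.
    """
    r = rng.random()
    if r < p_drop:                      # simulate full camera failure
        cam_feat = np.zeros_like(cam_feat)
    elif r < 2 * p_drop:                # simulate full radar failure
        radar_feat = np.zeros_like(radar_feat)
    # otherwise: both modalities pass through unchanged
    return cam_feat, radar_feat
```

Training against such simulated failures is what lets an intermediate-fusion model retain single-sensor performance without retraining when a modality drops out.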
Kyeong-Ju Cha, Hyunwoo Park, W. Choe, Soonmin Hwang, Sunwoo Kim
This paper proposes an end-to-end radio simultaneous localization and mapping (SLAM) algorithm that directly leverages channel impulse response (CIR) to overcome fundamental limitations in existing approaches. Traditional radio SLAM algorithms assume pre-estimated channel parameters, making performance highly sensitive to estimation accuracy, while recent end-to-end methods jointly perform parameter estimation and SLAM but suffer from high computational complexity and model mismatch vulnerability. The proposed algorithm minimizes information loss by operating directly on raw CIR measurements and utilizes end-to-end learning for enhanced robustness. Simulation results in the 3GPP TR 38.857 indoor factory scenario demonstrate that the proposed algorithm achieves comparable performance to conventional radio SLAM while reducing computational time by less than 2%, confirming its strong potential for practical deployment.
Leveraging Camera-Based Methods for Enhanced Feature-to-World Mapping
Jaeha Song, Sungjin Park, Soonmin Hwang
Scene representation in autonomous driving relies heavily on extracting meaningful features from images and accurately mapping them to 3D world coordinates. Traditional methods, such as ResNet-based backbones pretrained on ImageNet, provide a robust foundation for feature extraction but are increasingly viewed as limited when it comes to aligning features with the 3D world. This paper explores the integration of advanced segmentation models as backbones, focusing on how feature quality at the extraction stage directly impacts downstream scene representation tasks. Preliminary experiments demonstrate the potential for improved feature alignment and semantic consistency, highlighting the importance of robust backbone design in modern 3D perception pipelines.
SafeShift: Safety-Informed Distribution Shifts for Robust Trajectory Prediction in Autonomous Driving
Benjamin Stoler, Ingrid Navarro, Meghdeep Jana, Soonmin Hwang, Jonathan Francis, Jean Oh
As autonomous driving technology matures, the safety and robustness of its key components, including trajectory prediction, are vital. Although real-world datasets such as Waymo Open Motion provide recorded real scenarios, the majority of the scenes appear benign, often lacking the diverse safety-critical situations that are essential for developing robust models against nuanced risks. However, generating safety-critical data using simulation faces a severe simulation-to-real gap, and using real-world environments is even less desirable due to safety risks. In this context, we propose an approach that utilizes existing real-world datasets by identifying safety-relevant scenarios that are naively overlooked, e.g., near misses and proactive maneuvers. Our approach expands the spectrum of safety-relevance, allowing us to study trajectory prediction models under a safety-informed, distribution-shift setting. We contribute a versatile scenario characterization method, a novel scoring scheme that re-evaluates a scene using counterfactual scenarios to find hidden risky scenarios, and an evaluation of trajectory prediction models in this setting. We further contribute a remediation strategy, achieving a 10% average reduction in predicted trajectories' collision rates. To facilitate future research, we release our code for the overall SafeShift framework to the public: github.com/cmubig/SafeShift