Visual Hindsight Self-Imitation Learning for Interactive Navigation
K. M. Kim, Moonhoen Lee, Min Whoo Lee, Kisung Shin, Minsu Lee, Byoung‐Tak Zhang
IF 3.6
IEEE Access
Interactive visual navigation tasks, which involve following instructions to reach and interact with specific targets, are challenging not only because successful experiences are very rare but also because complex visual inputs require a substantial number of samples. Previous methods for these tasks often rely on intricately designed dense rewards or the use of expensive expert data for imitation learning. To tackle these challenges, we propose a novel approach, Visual Hindsight Self-Imitation Learning (VHS), which enables re-labeling in vision-based and partially observable environments through Prototypical Goal (PG) embedding. We introduce the PG embeddings, which are derived from experienced goal observations, as opposed to handling instructions as word embeddings. This embedding technique allows the agent to visually reinterpret its unsuccessful attempts, enabling vision-based goal re-labeling and self-imitation from enhanced successful experiences. Experimental results show that VHS outperforms existing techniques in interactive visual navigation tasks, confirming its superior performance, sample efficiency, and generalization.
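The re-labeling idea described above can be sketched as follows: build a prototypical goal (PG) embedding per target by averaging embeddings of experienced goal observations, then reinterpret a failed episode as a success for whichever prototype best matches its final observation. The 2-D embeddings and class names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def prototypical_goal_embedding(goal_observations):
    """Average the embeddings of experienced goal observations for one
    target class to form its prototypical goal (PG) embedding."""
    return np.mean(goal_observations, axis=0)

def hindsight_relabel(trajectory, prototypes):
    """Re-label a failed episode: find the prototype most similar to the
    final observation embedding and treat it as the goal actually reached."""
    final_obs = trajectory["observations"][-1]
    names = list(prototypes)
    # cosine similarity of the final observation against every prototype
    sims = [
        float(np.dot(final_obs, prototypes[n])
              / (np.linalg.norm(final_obs) * np.linalg.norm(prototypes[n])))
        for n in names
    ]
    relabeled_goal = names[int(np.argmax(sims))]
    return {**trajectory, "goal": relabeled_goal, "success": True}
```

The relabeled episodes can then feed a self-imitation buffer as additional "successful" experience.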
CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision
Gi-Cheon Kang, Jung-Hyun Kim, Kyu-Hwan Shim, Jun Hyuk Lee, Byoung‐Tak Zhang
Teaching robots desired skills in real-world environments remains challenging, especially for non-experts. A key bottleneck is that collecting robotic data often requires expertise or specialized hardware, limiting accessibility and scalability. We posit that natural language offers an intuitive and accessible interface for robot learning. To this end, we study two aspects: (1) enabling non-experts to collect robotic data through natural language supervision (e.g., "move the arm to the right") and (2) training robot policies directly from this supervision. Specifically, we introduce a data collection framework that collects robot demonstrations based on natural language supervision and further augments these demonstrations. We then present CLIP-RT, a new vision-language-action (VLA) model that learns language-conditioned visuomotor policies from this supervision. CLIP-RT adapts the pretrained CLIP model and learns to predict language-based motion primitives via contrastive imitation learning. We train CLIP-RT on the Open X-Embodiment dataset and fine-tune it on in-domain data collected by our framework. In real-world evaluations, CLIP-RT demonstrates strong capabilities in learning novel manipulation skills, outperforming OpenVLA (7B parameters) by 24% in average success rates, while using 7x fewer parameters (1B). We further assess CLIP-RT's capabilities in few-shot generalization and collaborative scenarios involving large pretrained models or humans. In simulated environments, CLIP-RT also yields strong performance, achieving a 92.8% average success rate on the LIBERO benchmark with an inference throughput of 163 Hz.
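The prediction step of such a policy can be sketched as contrastive scoring: embed the visual/instruction context, score every candidate language motion primitive against it, and pick the best. The dot-product scoring and toy 2-D embeddings below stand in for CLIP image/text encoder outputs and are assumptions for illustration:

```python
import numpy as np

def select_motion_primitive(context_embedding, primitive_embeddings):
    """Contrastive-style action selection: score each language motion
    primitive against the visual/instruction context embedding and
    return the highest-scoring one."""
    names = list(primitive_embeddings)
    scores = [float(np.dot(context_embedding, primitive_embeddings[n]))
              for n in names]
    return names[int(np.argmax(scores))]
```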
Sujin Jeon, Hyundo Lee, E. T. Kim, Sanghack Lee, Byoung‐Tak Zhang, Inwoo Hwang
ArXiv.org
Concept bottleneck models (CBMs) are inherently interpretable models that make predictions based on human-understandable visual cues, referred to as concepts. As obtaining dense concept annotations with human labeling is demanding and costly, recent approaches utilize foundation models to determine the concepts existing in the images. However, such label-free CBMs often fail to localize concepts in relevant regions, attending to visually unrelated regions when predicting concept presence. To this end, we propose a framework, coined Locality-aware Concept Bottleneck Model (LCBM), which utilizes rich information from foundation models and adopts prototype learning to ensure accurate spatial localization of the concepts. Specifically, we assign one prototype to each concept, promoted to represent a prototypical image feature of that concept. These prototypes are learned by encouraging them to encode similar local regions, leveraging foundation models to assure the relevance of each prototype to its associated concept. Then we use the prototypes to facilitate the learning process of identifying the proper local region from which each concept should be predicted. Experimental results demonstrate that LCBM effectively identifies present concepts in the images and exhibits improved localization while maintaining comparable classification performance.
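The localization idea can be illustrated in miniature: with one learned prototype per concept, concept presence is scored from the single local region whose feature best matches that prototype, rather than from the whole image. The feature shapes below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def concept_presence(feature_map, prototype):
    """Score a concept from its best-matching local region.
    feature_map: (num_regions, D) array of local region features;
    prototype: (D,) learned prototype for one concept.
    Returns the localized region index and its similarity score."""
    sims = feature_map @ prototype   # similarity of every region to the prototype
    region = int(np.argmax(sims))    # localize: the most relevant region
    return region, float(sims[region])
```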
From Sparse to Dense: Toddler-inspired Reward Transition in Goal-Oriented Reinforcement Learning
Junseok Park, Hyeonseo Yang, Min Whoo Lee, Wonseok Choi, Minsu Lee, Byoung‐Tak Zhang
ArXiv.org
Reinforcement learning (RL) agents often face challenges in balancing exploration and exploitation, particularly in environments where sparse or dense rewards bias learning. Biological systems, such as human toddlers, naturally navigate this balance by transitioning from free exploration with sparse rewards to goal-directed behavior guided by increasingly dense rewards. Inspired by this natural progression, we investigate the Toddler-Inspired Reward Transition in goal-oriented RL tasks. Our study focuses on transitioning from sparse to potential-based dense (S2D) rewards while preserving optimal strategies. Through experiments on dynamic robotic arm manipulation and egocentric 3D navigation tasks, we demonstrate that effective S2D reward transitions significantly enhance learning performance and sample efficiency. Additionally, using a Cross-Density Visualizer, we show that S2D transitions smooth the policy loss landscape, resulting in wider minima that improve generalization in RL models. In addition, we reinterpret Tolman's maze experiments, underscoring the critical role of early free exploratory learning in the context of S2D rewards.
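The S2D transition can be sketched with standard potential-based shaping, which adds a dense term F = gamma * phi(s') - phi(s) and is known to preserve the optimal policy (Ng et al., 1999). The switching schedule below is a minimal illustration, not the paper's exact training setup:

```python
def shaped_reward(sparse_r, phi_s, phi_s_next, step, transition_step, gamma=0.99):
    """Toddler-inspired S2D reward: sparse reward only during the early
    free-exploration phase, then sparse plus a potential-based dense term
    once training passes the transition step."""
    if step < transition_step:
        return sparse_r  # free-exploration phase: sparse reward only
    # goal-directed phase: add dense shaping F = gamma*phi(s') - phi(s)
    return sparse_r + gamma * phi_s_next - phi_s
```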
CraftGround: A Flexible Reinforcement Learning Environment Based on the Latest Minecraft
Hyeonseo Yang, Minsu Lee, Byoung‐Tak Zhang
Journal of KIISE
This paper introduces CraftGround, a new reinforcement learning environment based on the latest version of Minecraft (1.21). CraftGround provides flexible experimental configurations and supports reinforcement learning in complex 3D environments. It supplies diverse observations, including visual data, sound signals, biome information, and in-game statistics, enabling multifaceted evaluation of agent performance. We evaluate VPT, PPO, RecurrentPPO, and DQN agents on several tasks, such as tree chopping, evading hostile monsters, and fishing. VPT achieved high performance and efficiency thanks to pretraining, while online learning algorithms such as PPO and RecurrentPPO adapted to environmental changes and improved over time. These results demonstrate CraftGround's potential to facilitate research on adaptive agent behavior in dynamic 3D simulations.
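Environments of this kind typically follow a reset/step interaction loop that yields multimodal observations each tick. The toy stand-in below only illustrates that loop; it is not the actual CraftGround API, and all names in it are invented for illustration:

```python
import random

class ToyCraftEnv:
    """Toy stand-in for a CraftGround-style environment: multimodal
    observations (vision, sound, biome, stats) and a simple task reward.
    Purely illustrative; the real CraftGround API differs."""
    def reset(self):
        self.steps = 0
        return self._obs()

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == "chop" else 0.0  # reward tree chopping
        done = self.steps >= 5                     # short fixed-length episode
        return self._obs(), reward, done

    def _obs(self):
        return {"rgb": [0] * 12, "sound": [], "biome": "forest",
                "stats": {"steps": self.steps}}

# a random agent interacting for one episode
env = ToyCraftEnv()
obs, done, total_reward = env.reset(), False, 0.0
rng = random.Random(0)
while not done:
    obs, reward, done = env.step(rng.choice(["chop", "move"]))
    total_reward += reward
```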
Relaxed HLP: A Metric for Flexible High-Level Plan Evaluation in Embodied Environments
Suyeon Shin, Byoung‐Tak Zhang
As the field of robotics advances, Embodied Instruction Following (EIF) has emerged as a key challenge in artificial intelligence. EIF tasks require agents to interpret and execute natural language instructions by predicting and completing sequences of subgoals within a physical environment. Traditional evaluation metrics—such as Success Rate (SR), Goal Condition (GC), Path Length Weighted SR (PLWSR), and Path Length Weighted GC (PLWGC)—primarily focus on task success following low-level control actions. However, these metrics inadequately assess the accuracy of high-level planning, which is critical for overall task performance. Existing methods for evaluating high-level planning often rely on comparing predicted plans to a single human-annotated ground truth trajectory, implicitly assuming the existence of only one correct solution. In practice, many instructions allow for multiple valid trajectories that can achieve the same goal. To address these limitations, we propose Relaxed HLP, a novel metric designed to evaluate high-level planning more flexibly by accounting for alternative valid plans. Relaxed HLP introduces three key considerations: temporal agnosticism, spatial agnosticism, and interchangeable actions, thus enabling a more comprehensive assessment of high-level plan accuracy. We validate the effectiveness of Relaxed HLP through human evaluations, demonstrating that it aligns more closely with human judgment compared to traditional ground truth-based metrics. Our results underscore the robustness of Relaxed HLP in capturing diverse, semantically equivalent plans, offering a more accurate assessment of high-level planning in EIF tasks.
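Two of the three relaxations can be illustrated with a small plan comparator: order-insensitive matching captures temporal agnosticism, and a canonicalization map captures interchangeable actions. This is a sketch under assumed plan encodings, not the actual Relaxed HLP metric, which additionally models spatial agnosticism:

```python
from collections import Counter

def relaxed_plan_match(predicted, reference, interchangeable=None):
    """Relaxed comparison of high-level plans: ignore subgoal order
    (temporal agnosticism) and treat actions listed in `interchangeable`
    as equivalent before comparing."""
    interchangeable = interchangeable or {}
    canon = lambda step: interchangeable.get(step, step)  # map to canonical action
    return Counter(map(canon, predicted)) == Counter(map(canon, reference))
```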
Exploring Ordinal Bias in Action Recognition for Instructional Videos
Joo Chan Kim, Minjoon Jung, Byoung‐Tak Zhang
ArXiv.org
Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos.
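The two manipulations can be sketched directly; the frame/label encoding and mask token below are illustrative assumptions:

```python
import random

def action_masking(frames, labels, frequent_actions, mask_token="<MASK>"):
    """Action Masking: mask the frames belonging to frequently
    co-occurring actions so a model cannot lean on memorized sequences."""
    return [mask_token if lbl in frequent_actions else frame
            for frame, lbl in zip(frames, labels)]

def sequence_shuffling(segments, seed=None):
    """Sequence Shuffling: randomize the order of action segments to
    break dataset-specific ordinal patterns."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```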
Generative models excel at producing images, but they still fall short at generating sound, in particular foley sound, the effect sounds inserted during video post-production. To address this, we rebuilt a video dataset for foley sound generation. VGG-Sound is a video dataset that provides YouTube video IDs and labels, but its classes are not well suited to foley sound. In this work, we first extracted thumbnail images using the YouTube IDs in VGG-Sound and fed them to the large multimodal model LLaVA to predict each video's dominant material. We also predicted the material of the audio using CLAP. We then combined the two predictions to build a dataset suited to foley sound generation. Evaluation showed that using both visual and auditory information produced labels more similar to the audio, and fine-tuning a sound generation model on this dataset improved performance on some quantitative metrics.
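The combination step can be sketched as merging a visually predicted material label with an audio-predicted one. The agreement-then-confidence rule below is our illustrative assumption; the abstract states only that the two predictions are combined, not the exact rule:

```python
def combine_material_labels(visual_pred, audio_pred):
    """Merge a material label predicted from the video thumbnail with one
    predicted from the audio. Each argument is a (label, confidence) pair.
    Rule (assumed): keep the label when both modalities agree, otherwise
    fall back to the more confident prediction."""
    v_label, v_conf = visual_pred
    a_label, a_conf = audio_pred
    if v_label == a_label:
        return v_label  # both modalities agree
    return v_label if v_conf >= a_conf else a_label
```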