Visual Hindsight Self-Imitation Learning for Interactive Navigation
K. M. Kim, Moonhoen Lee, Min Whoo Lee, Kisung Shin, Minsu Lee, Byoung‐Tak Zhang
IF 3.6
IEEE Access
Interactive visual navigation tasks, which involve following instructions to reach and interact with specific targets, are challenging not only because successful experiences are very rare but also because complex visual inputs require a substantial number of samples. Previous methods for these tasks often rely on intricately designed dense rewards or the use of expensive expert data for imitation learning. To tackle these challenges, we propose a novel approach, Visual Hindsight Self-Imitation Learning (VHS), which enables re-labeling in vision-based and partially observable environments through Prototypical Goal (PG) embedding. We introduce the PG embeddings, which are derived from experienced goal observations, as opposed to handling instructions as word embeddings. This embedding technique allows the agent to visually reinterpret its unsuccessful attempts, enabling vision-based goal re-labeling and self-imitation from enhanced successful experiences. Experimental results show that VHS outperforms existing techniques in interactive visual navigation tasks, confirming its superior performance, sample efficiency, and generalization.
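The re-labeling idea described above can be sketched as follows: build a prototypical goal (PG) embedding per target by averaging embeddings of experienced goal observations, then reinterpret a failed episode as a success for whichever prototype best matches its final observation. The 2-D embeddings and class names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def prototypical_goal_embedding(goal_observations):
    """Average the embeddings of experienced goal observations for one
    target class to form its prototypical goal (PG) embedding."""
    return np.mean(goal_observations, axis=0)

def hindsight_relabel(trajectory, prototypes):
    """Re-label a failed episode: find the prototype most similar to the
    final observation embedding and treat it as the goal actually reached."""
    final_obs = trajectory["observations"][-1]
    names = list(prototypes)
    # cosine similarity of the final observation against every prototype
    sims = [
        float(np.dot(final_obs, prototypes[n])
              / (np.linalg.norm(final_obs) * np.linalg.norm(prototypes[n])))
        for n in names
    ]
    relabeled_goal = names[int(np.argmax(sims))]
    return {**trajectory, "goal": relabeled_goal, "success": True}
```

The relabeled episodes can then feed a self-imitation buffer as additional "successful" experience.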
CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision
Gi-Cheon Kang, Jung-Hyun Kim, Kyu-Hwan Shim, Jun Hyuk Lee, Byoung‐Tak Zhang
Teaching robots desired skills in real-world environments remains challenging, especially for non-experts. A key bottleneck is that collecting robotic data often requires expertise or specialized hardware, limiting accessibility and scalability. We posit that natural language offers an intuitive and accessible interface for robot learning. To this end, we study two aspects: (1) enabling non-experts to collect robotic data through natural language supervision (e.g., "move the arm to the right") and (2) training robot policies directly from this supervision. Specifically, we introduce a data collection framework that collects robot demonstrations based on natural language supervision and further augments these demonstrations. We then present CLIP-RT, a new vision-language-action (VLA) model that learns language-conditioned visuomotor policies from this supervision. CLIP-RT adapts the pretrained CLIP model and learns to predict language-based motion primitives via contrastive imitation learning. We train CLIP-RT on the Open X-Embodiment dataset and fine-tune it on in-domain data collected by our framework. In real-world evaluations, CLIP-RT demonstrates strong capabilities in learning novel manipulation skills, outperforming OpenVLA (7B parameters) by 24% in average success rates, while using 7x fewer parameters (1B). We further assess CLIP-RT's capabilities in few-shot generalization and collaborative scenarios involving large pretrained models or humans. In simulated environments, CLIP-RT also yields strong performance, achieving a 92.8% average success rate on the LIBERO benchmark with an inference throughput of 163 Hz.
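The prediction step of such a policy can be sketched as contrastive scoring: embed the visual/instruction context, score every candidate language motion primitive against it, and pick the best. The dot-product scoring and toy 2-D embeddings below stand in for CLIP image/text encoder outputs and are assumptions for illustration:

```python
import numpy as np

def select_motion_primitive(context_embedding, primitive_embeddings):
    """Contrastive-style action selection: score each language motion
    primitive against the visual/instruction context embedding and
    return the highest-scoring one."""
    names = list(primitive_embeddings)
    scores = [float(np.dot(context_embedding, primitive_embeddings[n]))
              for n in names]
    return names[int(np.argmax(scores))]
```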
Sujin Jeon, Hyundo Lee, E. T. Kim, Sanghack Lee, Byoung‐Tak Zhang, Inwoo Hwang
ArXiv.org
Concept bottleneck models (CBMs) are inherently interpretable models that make predictions based on human-understandable visual cues, referred to as concepts. As obtaining dense concept annotations with human labeling is demanding and costly, recent approaches utilize foundation models to determine the concepts existing in the images. However, such label-free CBMs often fail to localize concepts in relevant regions, attending to visually unrelated regions when predicting concept presence. To this end, we propose a framework, coined Locality-aware Concept Bottleneck Model (LCBM), which utilizes rich information from foundation models and adopts prototype learning to ensure accurate spatial localization of the concepts. Specifically, we assign one prototype to each concept, promoted to represent a prototypical image feature of that concept. These prototypes are learned by encouraging them to encode similar local regions, leveraging foundation models to assure the relevance of each prototype to its associated concept. Then we use the prototypes to facilitate the learning process of identifying the proper local region from which each concept should be predicted. Experimental results demonstrate that LCBM effectively identifies present concepts in the images and exhibits improved localization while maintaining comparable classification performance.
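The localization idea can be illustrated in miniature: with one learned prototype per concept, concept presence is scored from the single local region whose feature best matches that prototype, rather than from the whole image. The feature shapes below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def concept_presence(feature_map, prototype):
    """Score a concept from its best-matching local region.
    feature_map: (num_regions, D) array of local region features;
    prototype: (D,) learned prototype for one concept.
    Returns the localized region index and its similarity score."""
    sims = feature_map @ prototype   # similarity of every region to the prototype
    region = int(np.argmax(sims))    # localize: the most relevant region
    return region, float(sims[region])
```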
From Sparse to Dense: Toddler-inspired Reward Transition in Goal-Oriented Reinforcement Learning
Junseok Park, Hyeonseo Yang, Min Whoo Lee, Wonseok Choi, Minsu Lee, Byoung‐Tak Zhang
ArXiv.org
Reinforcement learning (RL) agents often face challenges in balancing exploration and exploitation, particularly in environments where sparse or dense rewards bias learning. Biological systems, such as human toddlers, naturally navigate this balance by transitioning from free exploration with sparse rewards to goal-directed behavior guided by increasingly dense rewards. Inspired by this natural progression, we investigate the Toddler-Inspired Reward Transition in goal-oriented RL tasks. Our study focuses on transitioning from sparse to potential-based dense (S2D) rewards while preserving optimal strategies. Through experiments on dynamic robotic arm manipulation and egocentric 3D navigation tasks, we demonstrate that effective S2D reward transitions significantly enhance learning performance and sample efficiency. Additionally, using a Cross-Density Visualizer, we show that S2D transitions smooth the policy loss landscape, resulting in wider minima that improve generalization in RL models. In addition, we reinterpret Tolman's maze experiments, underscoring the critical role of early free exploratory learning in the context of S2D rewards.
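The S2D transition can be sketched with standard potential-based shaping, which adds a dense term F = gamma * phi(s') - phi(s) and is known to preserve the optimal policy (Ng et al., 1999). The switching schedule below is a minimal illustration, not the paper's exact training setup:

```python
def shaped_reward(sparse_r, phi_s, phi_s_next, step, transition_step, gamma=0.99):
    """Toddler-inspired S2D reward: sparse reward only during the early
    free-exploration phase, then sparse plus a potential-based dense term
    once training passes the transition step."""
    if step < transition_step:
        return sparse_r  # free-exploration phase: sparse reward only
    # goal-directed phase: add dense shaping F = gamma*phi(s') - phi(s)
    return sparse_r + gamma * phi_s_next - phi_s
```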
CraftGround: A Flexible Reinforcement Learning Environment Based on the Latest Minecraft
Hyeonseo Yang, Minsu Lee, Byoung‐Tak Zhang
Journal of KIISE
This paper introduces CraftGround, a new reinforcement learning environment based on the latest version of Minecraft (1.21). CraftGround provides flexible experimental configurations and supports reinforcement learning in complex 3D environments. It supplies diverse observations, including visual data, sound signals, biome information, and in-game statistics, enabling multifaceted evaluation of agent performance. We evaluate VPT, PPO, RecurrentPPO, and DQN agents on several tasks, such as tree chopping, evading hostile monsters, and fishing. VPT achieved high performance and efficiency thanks to pretraining, while online learning algorithms such as PPO and RecurrentPPO adapted to environmental changes and improved over time. These results demonstrate CraftGround's potential to facilitate research on adaptive agent behavior in dynamic 3D simulations.
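Environments of this kind typically follow a reset/step interaction loop that yields multimodal observations each tick. The toy stand-in below only illustrates that loop; it is not the actual CraftGround API, and all names in it are invented for illustration:

```python
import random

class ToyCraftEnv:
    """Toy stand-in for a CraftGround-style environment: multimodal
    observations (vision, sound, biome, stats) and a simple task reward.
    Purely illustrative; the real CraftGround API differs."""
    def reset(self):
        self.steps = 0
        return self._obs()

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == "chop" else 0.0  # reward tree chopping
        done = self.steps >= 5                     # short fixed-length episode
        return self._obs(), reward, done

    def _obs(self):
        return {"rgb": [0] * 12, "sound": [], "biome": "forest",
                "stats": {"steps": self.steps}}

# a random agent interacting for one episode
env = ToyCraftEnv()
obs, done, total_reward = env.reset(), False, 0.0
rng = random.Random(0)
while not done:
    obs, reward, done = env.step(rng.choice(["chop", "move"]))
    total_reward += reward
```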
Relaxed HLP: A Metric for Flexible High-Level Plan Evaluation in Embodied Environments
Suyeon Shin, Byoung‐Tak Zhang
As the field of robotics advances, Embodied Instruction Following (EIF) has emerged as a key challenge in artificial intelligence. EIF tasks require agents to interpret and execute natural language instructions by predicting and completing sequences of subgoals within a physical environment. Traditional evaluation metrics—such as Success Rate (SR), Goal Condition (GC), Path Length Weighted SR (PLWSR), and Path Length Weighted GC (PLWGC)—primarily focus on task success following low-level control actions. However, these metrics inadequately assess the accuracy of high-level planning, which is critical for overall task performance. Existing methods for evaluating high-level planning often rely on comparing predicted plans to a single human-annotated ground truth trajectory, implicitly assuming the existence of only one correct solution. In practice, many instructions allow for multiple valid trajectories that can achieve the same goal. To address these limitations, we propose Relaxed HLP, a novel metric designed to evaluate high-level planning more flexibly by accounting for alternative valid plans. Relaxed HLP introduces three key considerations: temporal agnosticism, spatial agnosticism, and interchangeable actions, thus enabling a more comprehensive assessment of high-level plan accuracy. We validate the effectiveness of Relaxed HLP through human evaluations, demonstrating that it aligns more closely with human judgment compared to traditional ground truth-based metrics. Our results underscore the robustness of Relaxed HLP in capturing diverse, semantically equivalent plans, offering a more accurate assessment of high-level planning in EIF tasks.
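Two of the three relaxations can be illustrated with a small plan comparator: order-insensitive matching captures temporal agnosticism, and a canonicalization map captures interchangeable actions. This is a sketch under assumed plan encodings, not the actual Relaxed HLP metric, which additionally models spatial agnosticism:

```python
from collections import Counter

def relaxed_plan_match(predicted, reference, interchangeable=None):
    """Relaxed comparison of high-level plans: ignore subgoal order
    (temporal agnosticism) and treat actions listed in `interchangeable`
    as equivalent before comparing."""
    interchangeable = interchangeable or {}
    canon = lambda step: interchangeable.get(step, step)  # map to canonical action
    return Counter(map(canon, predicted)) == Counter(map(canon, reference))
```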
Exploring Ordinal Bias in Action Recognition for Instructional Videos
Joo Chan Kim, Minjoon Jung, Byoung‐Tak Zhang
ArXiv.org
Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos.
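The two manipulations can be sketched directly; the frame/label encoding and mask token below are illustrative assumptions:

```python
import random

def action_masking(frames, labels, frequent_actions, mask_token="<MASK>"):
    """Action Masking: mask the frames belonging to frequently
    co-occurring actions so a model cannot lean on memorized sequences."""
    return [mask_token if lbl in frequent_actions else frame
            for frame, lbl in zip(frames, labels)]

def sequence_shuffling(segments, seed=None):
    """Sequence Shuffling: randomize the order of action segments to
    break dataset-specific ordinal patterns."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```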
Generative models excel at producing images, but they still fall short at generating sound, in particular foley sound, the effect sounds inserted during video post-production. To address this, we rebuilt a video dataset for foley sound generation. VGG-Sound is a video dataset that provides YouTube video IDs and labels, but its classes are not well suited to foley sound. In this work, we first extracted thumbnail images using the YouTube IDs in VGG-Sound and fed them to the large multimodal model LLaVA to predict each video's dominant material. We also predicted the material of the audio using CLAP. We then combined the two predictions to build a dataset suited to foley sound generation. Evaluation showed that using both visual and auditory information produced labels more similar to the audio, and fine-tuning a sound generation model on this dataset improved performance on some quantitative metrics.
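The combination step can be sketched as merging a visually predicted material label with an audio-predicted one. The agreement-then-confidence rule below is our illustrative assumption; the abstract states only that the two predictions are combined, not the exact rule:

```python
def combine_material_labels(visual_pred, audio_pred):
    """Merge a material label predicted from the video thumbnail with one
    predicted from the audio. Each argument is a (label, confidence) pair.
    Rule (assumed): keep the label when both modalities agree, otherwise
    fall back to the more confident prediction."""
    v_label, v_conf = visual_pred
    a_label, a_conf = audio_pred
    if v_label == a_label:
        return v_label  # both modalities agree
    return v_label if v_conf >= a_conf else a_label
```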