Human Scene Understanding Mechanism-Based Image Captioning for Blind Assistance
Jonghoon Kim, Sung-Wook Park, Jun‐Ho Huh, Se-Hoon Jung, Chun-Bo Sim
IF 3.6 · IEEE Access
Abstract

In this study, a model is proposed that generates descriptive sentences for input images to assist visually impaired individuals. For this purpose, a novel image captioning approach is introduced that integrates the principles of human visual understanding mechanisms with the Vision Transformer (ViT) architecture, further enhanced by deep reinforcement learning. First, features are extracted from the image based on human visual perception. Second, the image features are encoded through the encoding block of the ViT and fed into a long short-term memory (LSTM) network to generate annotations for the image. Finally, reinforcement learning is applied to further optimize the accuracy of the generated captions. Evaluations were performed on the MSR-VTT benchmark dataset, which is widely used for captioning tasks. Experimental results demonstrate that the proposed model achieves BLEU-4 of 43.0, METEOR of 29.1, ROUGE-L of 62.7, and CIDEr-D of 54.9, surpassing state-of-the-art baseline models across all evaluation metrics. The proposed model can be applied to video annotation applications for the visually impaired. In contrast to prior works that rely primarily on conventional convolutional architectures, the proposed model uniquely incorporates human-inspired visual perception principles and Vision Transformer-based global encoding, offering a novel and interpretable framework tailored for assistive image captioning.
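The three-stage pipeline described in the abstract (perception-based feature extraction, ViT-style encoding, then step-by-step LSTM caption generation, with a reinforcement learning refinement stage) could be sketched roughly as below. This is a purely illustrative toy, not the authors' implementation: every function body, the vocabulary, and the fixed decoding rule are stand-in assumptions, and the reinforcement learning stage is omitted.

```python
# Illustrative stand-ins for the three stages named in the abstract:
# (1) perception-driven feature extraction, (2) ViT-style encoding,
# (3) step-by-step LSTM-style decoding. All logic here is hypothetical.

VOCAB = ["<bos>", "a", "person", "crossing", "the", "street", "<eos>"]

def extract_perceptual_features(image):
    # Stand-in for perception-based feature extraction:
    # here simply the mean intensity of each image region.
    return [sum(region) / len(region) for region in image]

def encode_features(features):
    # Stand-in for the ViT encoder block: a fixed affine mixing
    # of the per-region features into a context vector.
    return [f * 0.5 + 0.1 for f in features]

def lstm_decode(context, max_len=6):
    # Toy stand-in for the LSTM decoder: emits a fixed token sequence
    # one step at a time; a real decoder would condition each step on
    # `context` and its recurrent hidden state.
    caption = []
    for t in range(1, max_len + 1):
        token = VOCAB[min(t, len(VOCAB) - 1)]
        caption.append(token)
        if token == "<eos>":
            break
    return caption

image = [[0.2, 0.4], [0.6, 0.8]]  # two toy image "regions"
caption = lstm_decode(encode_features(extract_perceptual_features(image)))
print(" ".join(caption))
```

In the actual model, the reinforcement learning stage would then fine-tune the decoder using a sequence-level reward (e.g., a caption-quality metric) rather than token-level supervision alone.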

Keywords
Closed captioning · Computer science · Computer vision · Mechanism (biology) · Artificial intelligence · Image (mathematics)
Type
article
IF / Citations
3.6 / 2
Publication year
2025