In this study, a model is proposed that generates descriptive sentences for input images to assist visually impaired individuals. For this purpose, a novel image captioning approach is introduced that integrates principles of human visual understanding with the Vision Transformer (ViT) architecture, further enhanced by deep reinforcement learning. First, features are extracted from the image based on human visual perception. Second, the image features are encoded by the ViT encoder block and passed to a long short-term memory (LSTM) network to generate captions for the image. Finally, reinforcement learning is employed to further optimize the accuracy of the generated captions. Evaluations were performed on the MSR-VTT benchmark dataset, which is widely used for captioning tasks. Experimental results demonstrate that the proposed model achieves BLEU-4 of 43.0, METEOR of 29.1, ROUGE-L of 62.7, and CIDEr-D of 54.9, surpassing state-of-the-art baseline models across all evaluation metrics. The proposed model can be applied to video annotation applications for the visually impaired. In contrast to prior works that rely primarily on conventional convolutional architectures, the proposed model incorporates human-inspired visual perception principles and Vision Transformer-based global encoding, offering a novel and interpretable framework tailored for assistive image captioning.
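To make the encoder-decoder pipeline described above concrete, the following is a minimal, hedged sketch of a ViT-style encoder feeding an LSTM caption decoder. It is not the authors' implementation; the class name, hyperparameters, and pooling strategy are illustrative assumptions, and the human-perception feature extraction and reinforcement learning stages are omitted.

```python
# Illustrative sketch only (assumed names and settings, not the paper's code):
# a ViT-style patch encoder whose pooled output initializes an LSTM decoder
# that is trained with teacher forcing to produce caption tokens.
import torch
import torch.nn as nn

class ViTLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=768, hidden=512, num_layers=4, num_heads=8):
        super().__init__()
        # Split the image into 16x16 patches and project each patch to d_model.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Word embedding and LSTM decoder initialized from the pooled image feature.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(d_model, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) token ids
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, d_model)
        feats = self.encoder(patches).mean(dim=1)                      # pooled image code
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)               # (1, B, hidden)
        c0 = torch.zeros_like(h0)
        dec, _ = self.lstm(self.embed(captions), (h0, c0))             # teacher forcing
        return self.out(dec)                                           # (B, T, vocab_size)

# Usage example with random tensors:
model = ViTLSTMCaptioner(vocab_size=10000)
scores = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
```

In the full method, the cross-entropy training implied by this sketch would be followed by a reinforcement learning stage that directly optimizes caption-level metrics; that stage is not shown here.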