Recent deep learning studies on image classification show that the Vision Transformer (ViT) performs well and may replace Convolutional Neural Networks. This study compares several ViT-based attention techniques and proposes a new MultiAttention method. In an examination of learning stability and overfitting, the model using the proposed MultiAttention technique showed a more stable training process and achieved better performance than the existing attention techniques. In particular, the gap between the training loss and the validation loss remained small, indicating that overfitting was effectively mitigated. In terms of classification accuracy, the MultiAttention technique recorded the best performance, achieving 28% accuracy, 2-5% higher than the baseline ViT model. Unlike existing techniques that improve performance only for specific classes such as airplane and deer, MultiAttention improves performance evenly across classes, confirming its effectiveness for diverse object recognition problems. The proposed MultiAttention technique improves the learning stability and generalization of ViT, showing balanced performance in object classification and strong potential for real-world use. Future research could explore additional attention structures and evaluate their performance; with further development, the approach could be applied in areas such as medical image analysis and autonomous driving. Since spatial attention proved crucial in this study, future work will strengthen the model's ability to learn spatial information for greater practicality.
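The text above does not specify the internals of the MultiAttention block. Purely as an illustration of how such a combination might be composed, the following minimal NumPy sketch pairs standard scaled dot-product self-attention over patch tokens with a CBAM-style spatial gate and a residual connection; the composition, the gate design, and all function names here are assumptions, not the authors' specification.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Standard scaled dot-product self-attention over patch tokens."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    scores = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return scores @ v

def spatial_attention(x):
    """Spatial gate: a per-token sigmoid weight computed from mean/max
    channel statistics (a CBAM-style design, assumed for illustration)."""
    stats = np.stack([x.mean(axis=-1), x.max(axis=-1)], axis=-1)  # (tokens, 2)
    gate = 1.0 / (1.0 + np.exp(-stats.mean(axis=-1)))             # (tokens,)
    return x * gate[:, None]

def multi_attention_block(x, wq, wk, wv):
    """Hypothetical MultiAttention block: self-attention followed by a
    spatial gate, with a residual connection (assumed composition)."""
    return x + spatial_attention(self_attention(x, wq, wk, wv))

# Toy usage on a few random patch tokens.
rng = np.random.default_rng(0)
tokens, dim = 4, 8
x = rng.standard_normal((tokens, dim))
wq, wk, wv = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
out = multi_attention_block(x, wq, wk, wv)
print(out.shape)  # → (4, 8)
```

The gate leaves the token shape unchanged, so the block can be stacked like an ordinary ViT encoder layer.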