Image Captioning Made Easy: Leveraging Vision Transformers and GPT-2 to Create Accurate and Coherent Descriptions From Images
Abstract
Image captioning, the generation of descriptive text from image content, has drawn considerable interest in computer vision and natural language processing (NLP). This research proposes a Python application that combines Vision Transformers (ViT) and GPT-2 for automatic image captioning. The system employs the pre-trained nlpconnect/vit-gpt2-image-captioning model from Hugging Face, coupled with a graphical user interface (GUI) built with Tkinter. The model efficiently extracts features from images and produces coherent, contextually appropriate captions, showing improvement over conventional Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) based models. This study describes the system's architecture and methodology and compares it with these earlier approaches, highlighting its applicability to real-world uses such as accessibility for visually impaired users, content management, and image retrieval. Performance measurements suggest that the model produces high-quality captions efficiently.
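The captioning step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the chapter's own code: it assumes the Hugging Face transformers library (with PyTorch) is installed, the helper names load_captioner and generate_captions are hypothetical, and the decoding settings (max_length=16, num_beams=4) are illustrative defaults rather than values reported by the authors. The Tkinter GUI is omitted.

```python
MODEL_ID = "nlpconnect/vit-gpt2-image-captioning"


def load_captioner(model_id=MODEL_ID):
    """Load the ViT encoder / GPT-2 decoder model with its image processor and tokenizer."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    processor = ViTImageProcessor.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model.eval()  # inference only, no training
    return model, processor, tokenizer


def generate_captions(images, model, processor, tokenizer, max_length=16, num_beams=4):
    """Return one caption string per input image (hypothetical helper).

    max_length and num_beams are assumed decoding settings, not the authors' values.
    """
    # ViT side: resize and normalise the images into a batch of pixel tensors.
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    # GPT-2 side: decode token ids conditioned on the encoded image features.
    output_ids = model.generate(pixel_values, max_length=max_length, num_beams=num_beams)
    captions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [caption.strip() for caption in captions]


# Example use (downloads the model on first run; "photo.jpg" is a placeholder path):
#   from PIL import Image
#   model, processor, tokenizer = load_captioner()
#   image = Image.open("photo.jpg").convert("RGB")
#   print(generate_captions([image], model, processor, tokenizer)[0])
```

Keeping model loading separate from caption generation means a GUI could load the model once at start-up and then call generate_captions for each image the user selects.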