Image Captioning Made Easy: Leveraging Vision Transformers and GPT-2 to Create Accurate and Coherent Descriptions From Images

Author(s): Ayesha Taranum (Vidyavardhaka College of Engineering, India) and Mohammed Ezhan (Northeastern University, USA)
Copyright: 2026
Pages: 14
Source title: AI-Based Data Mobility and Intelligent Modeling for Smart Cities
Source Author(s)/Editor(s): Sultan Ahmad (Prince Sattam Bin Abdulaziz University, Saudi Arabia), Sudan Jha (Kathmandu University, Nepal) and Md Alimul Haque (Veer Kunwar Singh University, India)
DOI: 10.4018/979-8-3373-4202-3.ch011

Keywords: Civil Engineering / Information Science Reference / Mobile and Wireless Computing / Science & Engineering

Abstract

Image captioning, the generation of descriptive text from image content, has drawn considerable interest in computer vision and natural language processing (NLP). This research proposes a Python application that combines a Vision Transformer (ViT) and GPT-2 for automatic image captioning. The system employs the pre-trained nlpconnect/vit-gpt2-image-captioning model from Hugging Face, coupled with a graphical user interface (GUI) built with Tkinter. The model efficiently extracts features from images and produces coherent, contextually appropriate captions, showing improvement over conventional Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) models. This study emphasises the architecture, methodology, and comparative evaluation of the system, highlighting its applicability to real-world uses such as accessibility for visually impaired users, content management, and image retrieval. Performance evaluation suggests that the model produces high-quality captions efficiently.
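
The pipeline the abstract describes (a ViT encoder feeding a GPT-2 decoder, wrapped in a Tkinter front end) might be sketched roughly as follows. This is a minimal illustration, not the chapter's actual code: the helper names (`load_captioner`, `caption_image`, `tidy_caption`, `run_gui`) and the generation settings (`max_length=16`, `num_beams=4`) are assumptions chosen for the example, and it assumes the `transformers`, `torch`, and `Pillow` packages are installed.

```python
# Hypothetical sketch: caption an image with the pre-trained ViT + GPT-2 checkpoint.
# Helper names and generation settings are illustrative assumptions.

MODEL_ID = "nlpconnect/vit-gpt2-image-captioning"


def load_captioner(model_id=MODEL_ID):
    """Load the encoder-decoder model, image processor, and tokenizer."""
    # Imported lazily so the module can be imported without the heavy dependencies.
    from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    processor = ViTImageProcessor.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model.eval()  # inference only
    return model, processor, tokenizer


def tidy_caption(text):
    """Collapse runs of whitespace and trim the decoded caption."""
    return " ".join(text.split())


def caption_image(path, model, processor, tokenizer, max_length=16, num_beams=4):
    """Return a caption for the image at `path` (beam search; settings are assumed)."""
    import torch
    from PIL import Image

    image = Image.open(path).convert("RGB")
    # The ViT processor resizes and normalises the image into a pixel tensor.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=max_length, num_beams=num_beams)
    # GPT-2 token ids are decoded back to text, dropping special tokens.
    return tidy_caption(tokenizer.decode(output_ids[0], skip_special_tokens=True))


def run_gui():
    """Open a small Tkinter window that captions a user-selected image."""
    import tkinter as tk
    from tkinter import filedialog

    model, processor, tokenizer = load_captioner()
    root = tk.Tk()
    root.title("Image captioning")
    result = tk.Label(root, text="Choose an image", wraplength=400)
    result.pack(padx=12, pady=12)

    def on_click():
        path = filedialog.askopenfilename(filetypes=[("Images", "*.png *.jpg *.jpeg")])
        if path:  # empty string means the dialog was cancelled
            result.config(text=caption_image(path, model, processor, tokenizer))

    tk.Button(root, text="Open image", command=on_click).pack(pady=(0, 12))
    root.mainloop()
```

Calling `run_gui()` would download the checkpoint on first use and then open the window. Loading the model once, outside the button callback, keeps each captioning request from repeating the slow initialisation step.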
