Image Caption Generator using ResNet-152 and RNN

This project implements an image captioning model that automatically generates descriptive captions for images. It leverages a deep learning architecture combining a pre-trained Convolutional Neural Network (CNN) as an encoder and a Recurrent Neural Network (RNN) as a decoder.

Project Overview

The goal of this project is to create a model that can "see" an image and generate a relevant, human-like description. This is achieved by:

Image Feature Extraction: Using a pre-trained ResNet-152 model to extract a rich, high-level feature vector from the input image.
Caption Generation: Feeding this feature vector into an RNN-based decoder that generates the caption word by word.

The model was trained on a subset of the Microsoft COCO (Common Objects in Context) dataset and evaluated using the BLEU (Bilingual Evaluation Understudy) score and Cosine Similarity to measure the quality of the generated captions against human-written references.

Model Architecture

The model follows a standard Encoder-Decoder framework, which is common for sequence-to-sequence tasks.

1. CNN Encoder

Model: ResNet-152 pre-trained on the ImageNet dataset.
Function: The CNN acts as the "eye" of the model. We remove the final fully connected (classification) layer of the ResNet-152. The output of the preceding global average pooling layer is a 2048-dimensional feature vector that serves as a rich numerical representation of the image's content. This vector is then passed to the decoder to initialize caption generation.

2. RNN Decoder

Model: A simple RNN with an embedding layer, an RNN layer, and a final linear layer.
Function: The RNN acts as the "language model."
- It takes the image feature vector from the encoder as its initial hidden state.
- An embedding layer converts the word tokens of the caption into dense vectors.
- The RNN layer processes the sequence of word embeddings to generate the next word in the caption.
- A final linear layer with a softmax activation function outputs a probability distribution over the entire vocabulary, and the word with the highest probability is chosen.
- This process repeats until an <end> token is generated or the maximum caption length is reached.
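
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the class name `DecoderRNN`, the embedding size of 256, and the linear projection `init_h` that maps the 2048-dimensional feature to the hidden size are all assumptions (the hidden size of 512 is only suggested by the checkpoint file name).

```python
import torch
import torch.nn as nn


class DecoderRNN(nn.Module):
    """Embedding -> RNN -> linear decoder (hypothetical name and sizes)."""

    def __init__(self, vocab_size, feature_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Assumption: a linear layer maps the image feature to the hidden size.
        self.init_h = nn.Linear(feature_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        """Teacher-forced pass; returns logits (batch, seq_len, vocab_size)."""
        h0 = torch.tanh(self.init_h(features)).unsqueeze(0)  # (1, batch, hidden)
        out, _ = self.rnn(self.embed(captions), h0)
        return self.fc(out)

    @torch.no_grad()
    def greedy_decode(self, feature, start_id, end_id, max_len=20):
        """Generate token ids for one image feature of shape (feature_dim,)."""
        h = torch.tanh(self.init_h(feature.unsqueeze(0))).unsqueeze(0)
        token = torch.tensor([[start_id]])
        ids = []
        for _ in range(max_len):
            out, h = self.rnn(self.embed(token), h)
            probs = torch.softmax(self.fc(out[:, -1]), dim=-1)
            next_id = int(probs.argmax(dim=-1))   # pick the most likely word
            if next_id == end_id:                 # stop at the <end> token
                break
            ids.append(next_id)
            token = torch.tensor([[next_id]])
        return ids
```

During training the softmax is usually folded into the loss (for example `nn.CrossEntropyLoss` on the raw logits), which is why `forward` returns logits and only `greedy_decode` applies the softmax explicitly.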

Dataset

Dataset: Microsoft COCO (Common Objects in Context) 2017 training/validation set.
Details: The COCO dataset is a large-scale object detection, segmentation, and captioning dataset. For this project, a subset was used, containing thousands of images, each with at least five reference captions.
Preprocessing:
- Images: Resized to 224x224 pixels and normalized using ImageNet's mean and standard deviation.
- Captions: Text was converted to lowercase, punctuation was removed, and a vocabulary was built from words appearing more than three times. Special tokens like <pad>, <start>, <end>, and <unk> were added.
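
The caption preprocessing can be sketched with the standard library alone. The function names `tokenize`, `build_vocab`, and `encode` are hypothetical, and "more than three times" is read here as a minimum count of four.

```python
import string
from collections import Counter

SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]


def tokenize(caption):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return caption.lower().translate(table).split()


def build_vocab(captions, min_count=4):
    """Map words seen at least `min_count` times to ids; special tokens first."""
    counts = Counter(word for c in captions for word in tokenize(c))
    words = sorted(w for w, n in counts.items() if n >= min_count)
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + words)}


def encode(caption, vocab):
    """Wrap a caption in <start>/<end>; unseen words become <unk>."""
    unk = vocab["<unk>"]
    body = [vocab.get(w, unk) for w in tokenize(caption)]
    return [vocab["<start>"]] + body + [vocab["<end>"]]
```

Placing `<pad>` at index 0 is a common convention because it lets padded positions be ignored through a fixed index, but the notebook's actual token order is not documented here.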

Results

The model's performance was evaluated on a held-out test set. The training and validation loss curves show that the model begins to overfit after roughly 20 to 25 epochs, so 25 epochs was chosen as the training duration.

Training & Validation Loss (25 Epochs)

Performance Metrics

Average BLEU-1 Score: Achieved an average score of approximately 0.6, indicating a strong unigram overlap between the generated and reference captions.
Average Cosine Similarity: The model also showed high cosine similarity between the embedding vectors of the generated and reference captions, signifying semantic closeness.
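
To make the two metrics concrete, here is a from-scratch sketch in plain Python. The project itself computes BLEU with NLTK, so scores may differ slightly depending on tokenization and smoothing settings. How the caption embedding vectors are obtained is not described in this README, so `cosine_similarity` simply takes two numeric vectors. Both function names are illustrative.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Unigram BLEU: clipped unigram precision times the brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0
    # A candidate word is credited at most as often as it occurs in any
    # single reference caption.
    max_ref = Counter()
    for r in refs:
        for word, n in Counter(r).items():
            max_ref[word] = max(max_ref[word], n)
    clipped = sum(min(n, max_ref[w]) for w, n in Counter(cand).items())
    precision = clipped / len(cand)
    # The brevity penalty uses the reference length closest to the candidate
    # length (ties go to the shorter reference).
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

BLEU-1 only rewards shared words, so a caption can score well while getting word order wrong; higher-order BLEU variants and the embedding-based similarity are complementary checks.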

Example Predictions

Example 1: High BLEU Score

| Image | Generated Caption | Reference Captions |
| --- | --- | --- |
| (image) | `a herd of sheep grazing in a grassy field` | - a herd of sheep are standing in a field<br>- a flock of sheep grazing in a green pasture<br>- a large flock of sheep in a large grassy field |

Example 2: Low BLEU Score

| Image | Generated Caption | Reference Captions |
| --- | --- | --- |
| (image) | `a man is holding a baseball bat` | - a baseball player swinging a bat at a ball<br>- a man in a baseball uniform swinging a bat<br>- a batter prepares to hit the ball at a game |

Installation

To set up and run this project locally, follow these steps:

Clone the repository:

```
git clone https://github.com/your-username/image-caption-generator.git
cd image-caption-generator
```

Create a virtual environment (recommended):

```
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Install the required libraries:
```
pip install -r requirements.txt
```
(Note: The repository does not include a requirements.txt file. Create one from your notebook environment first, for example with `pip freeze > requirements.txt`.)

Download the Dataset:
- Download the COCO 2017 dataset (train2017.zip, val2017.zip, annotations_trainval2017.zip).
- Place the data in a directory structure as expected by the notebook.

Usage

The Jupyter Notebook image_caption_notebook.ipynb contains all the code for data preprocessing, training, evaluation, and inference.

To generate a caption for a new image:

Place your test images in a folder (e.g., ./test_images/).
Load the trained model weights (epochs25hidden512adam.pt).
Use the generate_captions_for_folder function defined in the notebook to see the results.

```python
# Example snippet from the notebook
from PIL import Image

# (Ensure encoder, decoder, and vocab are loaded)

folder_path = "./test_images"
generate_captions_for_folder(folder_path, encoder, decoder, vocab)
```

Technologies Used

Python 3.x
PyTorch: For building and training the neural network.
Torchvision: For pre-trained models and image transformations.
Pandas: For data manipulation and handling annotations.
NLTK: For calculating the BLEU score.
Matplotlib: For visualizing images and results.
Jupyter Notebook: For code development and experimentation.

Future Improvements

Use a More Advanced Decoder: Replace the simple RNN with an LSTM or GRU to better handle long-term dependencies in sequences.
Implement Attention Mechanism: An attention mechanism would allow the decoder to focus on specific parts of the image when generating each word, which can significantly improve caption quality.
Use Beam Search: Instead of greedily picking the most likely next word, beam search explores multiple possible captions at each step and chooses the one with the highest overall probability.
Experiment with Different Encoders: Try more modern CNN architectures like EfficientNet or Vision Transformers (ViT) as the feature extractor.
Hyperparameter Tuning: Systematically tune hyperparameters like learning rate, embedding size, and hidden layer dimensions.
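
As an illustration of the beam search item above, here is a small self-contained sketch. It is not part of the current notebook. `next_log_probs` is a hypothetical callable standing in for the decoder, and `toy_model` is a made-up probability table used only to show how beam search can differ from greedy decoding.

```python
import math


def beam_search(next_log_probs, start, end, beam_width=3, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    `next_log_probs(prefix)` must return {token: log_probability} for the
    token that follows `prefix`.
    """
    beams = [([start], 0.0)]   # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = [
            (tokens + [tok], score + lp)
            for tokens, score in beams
            for tok, lp in next_log_probs(tokens).items()
        ]
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Sequences that emit the end token stop growing.
            (finished if tokens[-1] == end else beams).append((tokens, score))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])[0]


def toy_model(prefix):
    """Made-up next-word probabilities keyed on the last token only."""
    table = {
        "<start>": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.5, "cat": 0.5},
        "the": {"dog": 0.9, "cat": 0.1},
        "dog": {"<end>": 1.0},
        "cat": {"<end>": 1.0},
    }
    return {tok: math.log(p) for tok, p in table.get(prefix[-1], {}).items()}


greedy = beam_search(toy_model, "<start>", "<end>", beam_width=1)
wider = beam_search(toy_model, "<start>", "<end>", beam_width=2)
```

With this toy table, a beam width of 1 behaves like greedy decoding and commits to "a" (total probability 0.6 × 0.5 = 0.30), while a width of 2 keeps "the" alive and finds "the dog" (0.4 × 0.9 = 0.36). Practical implementations often also normalize scores by length so that short captions are not favored.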
