A production-ready REST API for image understanding using state-of-the-art multi-modal transformers. Built to demonstrate the bridge between ML research and cloud-native application development.
Context: As an ML engineer with a research background in computer vision and transformers, I built this project to demonstrate my approach to production-ready APIs using modern cloud-native patterns learned through AWS Developer Associate training.
What this project shows:
- Building RESTful APIs with FastAPI
- Containerizing ML applications with Docker
- Handling multi-modal transformer models (BLIP-2)
- Implementing proper error handling and validation
- Designing scalable API architectures
- Bridging research ML expertise with software engineering practices
- Image Captioning: Generate natural language descriptions of images
- Visual Question Answering (VQA): Answer questions about image content
- Multi-format Support: JPEG, PNG, BMP, GIF
- Configurable Generation: Control caption length and quality parameters
- ✅ RESTful API with OpenAPI/Swagger documentation
- ✅ Request validation with Pydantic
- ✅ Docker containerization with multi-stage builds
- ✅ Health checks and monitoring endpoints
- ✅ CORS support for frontend integration
- ✅ Comprehensive error handling
- ✅ Structured logging
- ✅ Non-root container user for security
```
┌─────────────┐
│   Client    │
│  (Browser/  │
│  App/cURL)  │
└──────┬──────┘
       │ HTTP/REST
       ▼
┌─────────────────────────────────┐
│       FastAPI Application       │
│  ┌───────────────────────────┐  │
│  │      Endpoint Layer       │  │
│  │   /caption /vqa /health   │  │
│  └─────────────┬─────────────┘  │
│                │                │
│  ┌─────────────▼─────────────┐  │
│  │     Validation Layer      │  │
│  │    (Pydantic Schemas)     │  │
│  └─────────────┬─────────────┘  │
│                │                │
│  ┌─────────────▼─────────────┐  │
│  │        Model Layer        │  │
│  │   BLIP-2 (2.7B params)    │  │
│  │   - Processor             │  │
│  │   - Inference Engine      │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘
          Docker Container
```
1. API Layer (main.py)
- FastAPI application with route definitions
- Request/response handling
- Middleware (CORS, logging)
- Error handling
2. Model Layer (models.py)
- Model initialization and management
- Device detection (CUDA/MPS/CPU)
- Inference logic for captioning and VQA
- Memory management
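Device detection along these lines is a common pattern (a sketch; it falls back to CPU when PyTorch or an accelerator is unavailable):

```python
# Pick the best available inference device: CUDA GPU, Apple Silicon MPS,
# or CPU. Sketch only; the real models.py may differ.
def detect_device() -> str:
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed
    if torch.cuda.is_available():
        return "cuda"
    # MPS backend only exists on newer PyTorch builds
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = detect_device()
```

The returned string can be passed straight to `model.to(device)` at load time.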
3. Schema Layer (schema.py)
- Pydantic models for validation
- Response formatting
- Type safety
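A response schema in this style might look like the following (field names mirror the example `/caption` response below; the actual `schema.py` may differ):

```python
# Pydantic models give validated, typed response payloads: bad values
# are rejected before they ever reach the client.
from typing import Tuple
from pydantic import BaseModel, Field

class CaptionResponse(BaseModel):
    caption: str
    filename: str
    image_size: Tuple[int, int]          # (width, height)
    max_length: int = Field(default=50, ge=1, le=200)
    num_beams: int = Field(default=5, ge=1, le=10)

resp = CaptionResponse(
    caption="a dog sitting on a bench in a park",
    filename="dog.jpg",
    image_size=(800, 600),
)
```

Declaring the model as a FastAPI `response_model` also feeds the OpenAPI/Swagger docs automatically.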
4. Containerization
- Multi-stage Docker build
- Security best practices (non-root user)
- Health checks
- Docker Compose for orchestration
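A multi-stage build with a non-root user could be sketched like this (illustrative; the repository's actual Dockerfile may differ):

```dockerfile
# Stage 1: build wheels so compilers and build tools don't ship in the final image
FROM python:3.10-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Stage 2: slim runtime image running as a non-root user
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY app/ ./app/
RUN useradd --create-home appuser
USER appuser
HEALTHCHECK --interval=30s --timeout=5s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

The two-stage split keeps build-only dependencies out of the runtime image, and the `HEALTHCHECK` reuses the `/health` endpoint so Docker can report container health.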
- Python 3.10+
- Docker & Docker Compose
- 8GB+ RAM recommended (model is ~5GB)
- (Optional) GPU for faster inference
- Clone the repository

```bash
git clone git@github.com:coffeedrunkpanda/multimodal-api.git
cd multimodal-api
```

- Create virtual environment

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Run the application

```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

- Access the API
- API Documentation: http://localhost:8000/docs
- Alternative Docs: http://localhost:8000/redoc
- Health Check: http://localhost:8000/health
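Beyond the browser and cURL, the API can be called programmatically; a minimal Python client sketch (assumes the third-party `requests` package and a server on localhost:8000; function names are illustrative):

```python
# Minimal Python client for the captioning and VQA endpoints.
# Assumes `pip install requests` and the API running locally.
import requests

BASE_URL = "http://localhost:8000"

def caption_image(path: str, max_length: int = 50, num_beams: int = 5) -> dict:
    """POST an image to /caption and return the JSON response."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/caption",
            files={"file": f},
            data={"max_length": max_length, "num_beams": num_beams},
        )
    resp.raise_for_status()
    return resp.json()

def ask_about_image(path: str, question: str, max_length: int = 50) -> dict:
    """POST an image and a question to /vqa and return the JSON response."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/vqa",
            files={"file": f},
            data={"question": question, "max_length": max_length},
        )
    resp.raise_for_status()
    return resp.json()
```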
- Build and run with Docker Compose

```bash
docker-compose up --build
```

- Or use Docker directly

```bash
docker build -t multimodal-api .
docker run -p 8000:8000 multimodal-api
```

`GET /`: Root endpoint with API information

```json
{
  "status": "healthy",
  "message": "Multi-Modal AI API is running",
  "version": "1.0.0",
  "endpoints": {...}
}
```

`POST /caption`: Generate image caption
Request:
- `file`: Image file (form-data)
- `max_length`: Maximum caption length (optional, default: 50)
- `num_beams`: Beam search beams (optional, default: 5)
Response:
```json
{
  "caption": "a dog sitting on a bench in a park",
  "filename": "dog.jpg",
  "image_size": [800, 600],
  "max_length": 50,
  "num_beams": 5
}
```

cURL Example:

```bash
curl -X POST "http://localhost:8000/caption" \
  -F "file=@dog.jpg" \
  -F "max_length=50" \
  -F "num_beams=5"
```

`POST /vqa`: Visual question answering
Request:
- `file`: Image file (form-data)
- `question`: Question about the image
- `max_length`: Maximum answer length (optional, default: 50)
Response:
```json
{
  "question": "What color is the dog?",
  "answer": "brown",
  "filename": "dog.jpg",
  "image_size": [800, 600],
  "max_length": 50
}
```

cURL Example:

```bash
curl -X POST "http://localhost:8000/vqa" \
  -F "file=@dog.jpg" \
  -F "question=What color is the dog?" \
  -F "max_length=50"
```

`GET /health`: Health check endpoint

```json
{
  "status": "healthy",
  "message": "All systems operational",
  "model_loaded": true
}
```

Get model information
{
"model_name": "Salesforce/blip2-opt-2.7b",
"model_type": "Multi-modal transformer (BLIP-2)",