
🎨 Multi-Modal AI API

A production-ready REST API for image understanding using state-of-the-art multi-modal transformers. Built to demonstrate the bridge between ML research and cloud-native application development.


🎯 Overview

Context: I am an ML engineer with a research background in computer vision and transformers; this project demonstrates my approach to building production-ready APIs using the cloud-native patterns I learned through AWS Developer Associate training.

What this project shows:

  • Building RESTful APIs with FastAPI
  • Containerizing ML applications with Docker
  • Handling multi-modal transformer models (BLIP-2)
  • Implementing proper error handling and validation
  • Designing scalable API architectures
  • Bridging research ML expertise with software engineering practices

✨ Features

Core Capabilities

  • Image Captioning: Generate natural language descriptions of images
  • Visual Question Answering (VQA): Answer questions about image content
  • Multi-format Support: JPEG, PNG, BMP, GIF
  • Configurable Generation: Control caption length and quality parameters

Technical Features

  • ✅ RESTful API with OpenAPI/Swagger documentation
  • ✅ Request validation with Pydantic
  • ✅ Docker containerization with multi-stage builds
  • ✅ Health checks and monitoring endpoints
  • ✅ CORS support for frontend integration
  • ✅ Comprehensive error handling
  • ✅ Structured logging
  • ✅ Non-root container user for security

πŸ—οΈ Architecture

┌─────────────┐
│   Client    │
│ (Browser/   │
│  App/cURL)  │
└──────┬──────┘
       │ HTTP/REST
       ▼
┌─────────────────────────────────┐
│       FastAPI Application       │
│  ┌───────────────────────────┐  │
│  │  Endpoint Layer           │  │
│  │  /caption  /vqa  /health  │  │
│  └────────────┬──────────────┘  │
│               │                 │
│  ┌────────────▼──────────────┐  │
│  │  Validation Layer         │  │
│  │  (Pydantic Schemas)       │  │
│  └────────────┬──────────────┘  │
│               │                 │
│  ┌────────────▼──────────────┐  │
│  │  Model Layer              │  │
│  │  BLIP-2 (2.7B params)     │  │
│  │  - Processor              │  │
│  │  - Inference Engine       │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘
         Docker Container

Component Breakdown

1. API Layer (main.py)

  • FastAPI application with route definitions
  • Request/response handling
  • Middleware (CORS, logging)
  • Error handling

2. Model Layer (models.py)

  • Model initialization and management
  • Device detection (CUDA/MPS/CPU)
  • Inference logic for captioning and VQA
  • Memory management
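
The device-detection step can be sketched as a small helper. In the real model layer the two flags would come from torch.cuda.is_available() and torch.backends.mps.is_available(); torch itself is left out here so the sketch stays dependency-free:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU.

    In the real model layer the flags come from
    torch.cuda.is_available() and torch.backends.mps.is_available().
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

The model and processor would then be loaded once at startup and moved to the chosen device.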

3. Schema Layer (schema.py)

  • Pydantic models for validation
  • Response formatting
  • Type safety
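
As an illustration, the /caption response documented below could be validated with a Pydantic model along these lines. The field names follow the README's example response; this is a sketch, not the project's actual schema.py:

```python
from typing import Tuple

from pydantic import BaseModel

class CaptionResponse(BaseModel):
    """Shape of a /caption response, mirroring the fields in the API docs."""
    caption: str
    filename: str
    image_size: Tuple[int, int]  # (width, height) in pixels
    max_length: int = 50
    num_beams: int = 5
```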

4. Containerization

  • Multi-stage Docker build
  • Security best practices (non-root user)
  • Health checks
  • Docker Compose for orchestration

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • Docker & Docker Compose
  • 8GB+ RAM recommended (model is ~5GB)
  • (Optional) GPU for faster inference

Local Development

  1. Clone the repository
     git clone git@github.com:coffeedrunkpanda/multimodal-api.git
     cd multimodal-api
  2. Create a virtual environment
     python -m venv venv
     source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies
     pip install -r requirements.txt
  4. Run the application
     uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
  5. Access the API at http://localhost:8000 (interactive Swagger docs at http://localhost:8000/docs)

Docker Deployment

  1. Build and run with Docker Compose
     docker-compose up --build
  2. Or use Docker directly
     docker build -t multimodal-api .
     docker run -p 8000:8000 multimodal-api

📚 API Documentation

Endpoints

GET /

Root endpoint with API information

{
  "status": "healthy",
  "message": "Multi-Modal AI API is running",
  "version": "1.0.0",
  "endpoints": {...}
}

POST /caption

Generate image caption

Request:

  • file: Image file (form-data)
  • max_length: Maximum caption length (optional, default: 50)
  • num_beams: Beam search beams (optional, default: 5)

Response:

{
  "caption": "a dog sitting on a bench in a park",
  "filename": "dog.jpg",
  "image_size": [800, 600],
  "max_length": 50,
  "num_beams": 5
}

cURL Example:

curl -X POST "http://localhost:8000/caption" \
  -F "file=@dog.jpg" \
  -F "max_length=50" \
  -F "num_beams=5"
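
Python Example:

The same request can be made from Python with the requests library. This is a hypothetical client helper (caption_image and caption_form_fields are not part of the project), assuming the API is running on localhost:8000:

```python
import requests

API_URL = "http://localhost:8000"

def caption_form_fields(max_length: int = 50, num_beams: int = 5) -> dict:
    """Optional generation parameters, sent as form fields."""
    return {"max_length": str(max_length), "num_beams": str(num_beams)}

def caption_image(path: str, max_length: int = 50, num_beams: int = 5) -> dict:
    """POST an image to /caption and return the parsed JSON response."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{API_URL}/caption",
            files={"file": f},
            data=caption_form_fields(max_length, num_beams),
        )
    resp.raise_for_status()
    return resp.json()
```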

POST /vqa

Visual question answering

Request:

  • file: Image file (form-data)
  • question: Question about the image
  • max_length: Maximum answer length (optional, default: 50)

Response:

{
  "question": "What color is the dog?",
  "answer": "brown",
  "filename": "dog.jpg",
  "image_size": [800, 600],
  "max_length": 50
}

cURL Example:

curl -X POST "http://localhost:8000/vqa" \
  -F "file=@dog.jpg" \
  -F "question=What color is the dog?" \
  -F "max_length=50"

GET /health

Health check endpoint

{
  "status": "healthy",
  "message": "All systems operational",
  "model_loaded": true
}

GET /model-info

Get model information

{
  "model_name": "Salesforce/blip2-opt-2.7b",
  "model_type": "Multi-modal transformer (BLIP-2)",
  ...
}
