A production-ready REST API for image understanding using state-of-the-art multi-modal transformers. Built to demonstrate the bridge between ML research and cloud-native application development.
Context: As an ML engineer with a research background in computer vision and transformers, I built this project to demonstrate my approach to production-ready APIs using modern cloud-native patterns learned through AWS Developer Associate training.
What this project shows:
- Building RESTful APIs with FastAPI
- Containerizing ML applications with Docker
- Handling multi-modal transformer models (BLIP-2)
- Implementing proper error handling and validation
- Designing scalable API architectures
- Bridging research ML expertise with software engineering practices
- Image Captioning: Generate natural language descriptions of images
- Visual Question Answering (VQA): Answer questions about image content
- Multi-format Support: JPEG, PNG, BMP, GIF
- Configurable Generation: Control caption length and quality parameters
- ✅ RESTful API with OpenAPI/Swagger documentation
- ✅ Request validation with Pydantic
- ✅ Docker containerization with multi-stage builds
- ✅ Health checks and monitoring endpoints
- ✅ CORS support for frontend integration
- ✅ Comprehensive error handling
- ✅ Structured logging
- ✅ Non-root container user for security
```
┌─────────────┐
│   Client    │
│  (Browser/  │
│  App/cURL)  │
└──────┬──────┘
       │ HTTP/REST
       ▼
┌─────────────────────────────────┐
│       FastAPI Application       │
│  ┌───────────────────────────┐  │
│  │      Endpoint Layer       │  │
│  │   /caption /vqa /health   │  │
│  └─────────────┬─────────────┘  │
│                │                │
│  ┌─────────────▼─────────────┐  │
│  │     Validation Layer      │  │
│  │    (Pydantic Schemas)     │  │
│  └─────────────┬─────────────┘  │
│                │                │
│  ┌─────────────▼─────────────┐  │
│  │        Model Layer        │  │
│  │   BLIP-2 (2.7B params)    │  │
│  │   - Processor             │  │
│  │   - Inference Engine      │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘
          Docker Container
```
1. API Layer (main.py)
- FastAPI application with route definitions
- Request/response handling
- Middleware (CORS, logging)
- Error handling
2. Model Layer (models.py)
- Model initialization and management
- Device detection (CUDA/MPS/CPU)
- Inference logic for captioning and VQA
- Memory management
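Device detection along these lines is a common pattern (a sketch; it falls back to CPU when PyTorch or an accelerator is unavailable):

```python
# Pick the best available inference device: CUDA GPU, Apple Silicon MPS,
# or CPU. Sketch only; the real models.py may differ.
def detect_device() -> str:
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed
    if torch.cuda.is_available():
        return "cuda"
    # MPS backend only exists on newer PyTorch builds
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = detect_device()
```

The returned string can be passed straight to `model.to(device)` at load time.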
3. Schema Layer (schema.py)
- Pydantic models for validation
- Response formatting
- Type safety
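A response schema in this style might look like the following (field names mirror the example `/caption` response below; the actual `schema.py` may differ):

```python
# Pydantic models give validated, typed response payloads: bad values
# are rejected before they ever reach the client.
from typing import Tuple
from pydantic import BaseModel, Field

class CaptionResponse(BaseModel):
    caption: str
    filename: str
    image_size: Tuple[int, int]          # (width, height)
    max_length: int = Field(default=50, ge=1, le=200)
    num_beams: int = Field(default=5, ge=1, le=10)

resp = CaptionResponse(
    caption="a dog sitting on a bench in a park",
    filename="dog.jpg",
    image_size=(800, 600),
)
```

Declaring the model as a FastAPI `response_model` also feeds the OpenAPI/Swagger docs automatically.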
4. Containerization
- Multi-stage Docker build
- Security best practices (non-root user)
- Health checks
- Docker Compose for orchestration
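A multi-stage build with a non-root user could be sketched like this (illustrative; the repository's actual Dockerfile may differ):

```dockerfile
# Stage 1: build wheels so compilers and build tools don't ship in the final image
FROM python:3.10-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Stage 2: slim runtime image running as a non-root user
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY app/ ./app/
RUN useradd --create-home appuser
USER appuser
HEALTHCHECK --interval=30s --timeout=5s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

The two-stage split keeps build-only dependencies out of the runtime image, and the `HEALTHCHECK` reuses the `/health` endpoint so Docker can report container health.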
- Python 3.10+
- Docker & Docker Compose
- 8GB+ RAM recommended (model is ~5GB)
- (Optional) GPU for faster inference
- Clone the repository

```bash
git clone git@github.com:coffeedrunkpanda/multimodal-api.git
cd multimodal-api
```

- Create virtual environment

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Run the application

```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

- Access the API
- API Documentation: http://localhost:8000/docs
- Alternative Docs: http://localhost:8000/redoc
- Health Check: http://localhost:8000/health
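Beyond the browser and cURL, the API can be called programmatically; a minimal Python client sketch (assumes the third-party `requests` package and a server on localhost:8000; function names are illustrative):

```python
# Minimal Python client for the captioning and VQA endpoints.
# Assumes `pip install requests` and the API running locally.
import requests

BASE_URL = "http://localhost:8000"

def caption_image(path: str, max_length: int = 50, num_beams: int = 5) -> dict:
    """POST an image to /caption and return the JSON response."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/caption",
            files={"file": f},
            data={"max_length": max_length, "num_beams": num_beams},
        )
    resp.raise_for_status()
    return resp.json()

def ask_about_image(path: str, question: str, max_length: int = 50) -> dict:
    """POST an image and a question to /vqa and return the JSON response."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/vqa",
            files={"file": f},
            data={"question": question, "max_length": max_length},
        )
    resp.raise_for_status()
    return resp.json()
```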
- Build and run with Docker Compose

```bash
docker-compose up --build
```

- Or use Docker directly

```bash
docker build -t multimodal-api .
docker run -p 8000:8000 multimodal-api
```

`GET /`: Root endpoint with API information

```json
{
  "status": "healthy",
  "message": "Multi-Modal AI API is running",
  "version": "1.0.0",
  "endpoints": {...}
}
```

`POST /caption`: Generate image caption
Request:
- `file`: Image file (form-data)
- `max_length`: Maximum caption length (optional, default: 50)
- `num_beams`: Beam search beams (optional, default: 5)
Response:
```json
{
  "caption": "a dog sitting on a bench in a park",
  "filename": "dog.jpg",
  "image_size": [800, 600],
  "max_length": 50,
  "num_beams": 5
}
```

cURL Example:

```bash
curl -X POST "http://localhost:8000/caption" \
  -F "file=@dog.jpg" \
  -F "max_length=50" \
  -F "num_beams=5"
```

`POST /vqa`: Visual question answering
Request:
- `file`: Image file (form-data)
- `question`: Question about the image
- `max_length`: Maximum answer length (optional, default: 50)
Response:
```json
{
  "question": "What color is the dog?",
  "answer": "brown",
  "filename": "dog.jpg",
  "image_size": [800, 600],
  "max_length": 50
}
```

cURL Example:

```bash
curl -X POST "http://localhost:8000/vqa" \
  -F "file=@dog.jpg" \
  -F "question=What color is the dog?" \
  -F "max_length=50"
```

`GET /health`: Health check endpoint

```json
{
  "status": "healthy",
  "message": "All systems operational",
  "model_loaded": true
}
```

Get model information
{
"model_name": "Salesforce/blip2-opt-2.7b",
"model_type": "Multi-modal transformer (BLIP-2)",