LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows

Zipstack/unstract

Unstract

Turn Unstructured Documents into Structured Data

Documentation | Enterprise

What is Unstract?

Unstract uses LLMs to extract structured JSON from documents such as PDFs, images, and scans. Define what you want to extract using natural-language prompts, then deploy the result as an API or ETL pipeline.

Built for teams in finance, insurance, healthcare, KYC/compliance, and more.

Current State vs. Unstract

| Task | Without Unstract | With Unstract |
|---|---|---|
| Schema definition | Write regex, build templates per vendor | Write a prompt once, handles variations |
| New document type | Days of development | Minutes in Prompt Studio |
| LLM integration | Build your own pipeline | Plug in any provider (OpenAI, Anthropic, Bedrock, Ollama) |
| Deployment | Custom infrastructure | ./run-platform.sh or managed cloud |
| Output | Unstructured text blobs | Clean JSON, ready for your database |

⭐ If Unstract helps you, star this repo!

✨ Key Features

Prompt Studio — Define document extraction schemas with natural language. Docs →

API Deployment — Send a document over REST API, get JSON back. Docs →
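As a rough illustration of "send a document, get JSON back", the sketch below builds a multipart upload using only the Python standard library. The endpoint URL, the `files` form field name, and the `Bearer` authorization scheme are assumptions made for this example, and `build_extraction_request` and `extract` are hypothetical helper names; check the API Deployment docs for the actual request format.

```python
import json
import os
import urllib.request
import uuid


def build_extraction_request(
    url: str, api_key: str, filename: str, content: bytes
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one document."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Assumption: the deployment expects the upload in a field named "files".
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        # Assumption: the API key is passed as a Bearer token.
        "Authorization": f"Bearer {api_key}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def extract(url: str, api_key: str, path: str) -> dict:
    """Upload the file at `path` and return the parsed JSON response."""
    with open(path, "rb") as fh:
        request = build_extraction_request(
            url, api_key, os.path.basename(path), fh.read()
        )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `extract(url, api_key, "invoice.pdf")` with the URL and key of your own deployment would return the response body as a Python dict, provided the deployment accepts this request shape.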

ETL Pipeline — Pull documents from a folder, process them, load to your warehouse. Docs →
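The pull, process, load pattern described here can be sketched in a few lines. This is not how Unstract implements its pipelines: `run_etl` and `fake_extract` are hypothetical names, a local folder stands in for the source connector, SQLite stands in for the warehouse, and the extraction step is a stub where an LLM-backed call would go.

```python
import json
import sqlite3
from pathlib import Path
from typing import Callable


def run_etl(
    source_dir: Path,
    extract: Callable[[Path], dict],
    conn: sqlite3.Connection,
) -> int:
    """Extract every file in source_dir and load the results as JSON rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extractions "
        "(filename TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    loaded = 0
    for path in sorted(source_dir.iterdir()):
        if not path.is_file():
            continue
        record = extract(path)  # process: document in, structured dict out
        conn.execute(
            "INSERT OR REPLACE INTO extractions (filename, payload) VALUES (?, ?)",
            (path.name, json.dumps(record)),
        )
        loaded += 1
    conn.commit()
    return loaded


def fake_extract(path: Path) -> dict:
    """Hypothetical stand-in for an LLM extraction call."""
    return {"source": path.name, "size_bytes": path.stat().st_size}
```

Keying rows on the filename with `INSERT OR REPLACE` makes reruns over the same folder idempotent, which is usually what a scheduled pipeline needs.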

MCP Server — Connect to AI agents (Claude, etc.) via Model Context Protocol. Docs →

n8n Node — Drop into existing automation workflows. Docs →

🚀 Quickstart (~5 mins)

System Requirements & Prerequisites

  • Linux or macOS (Intel or M-series)
  • Docker & Docker Compose
  • 8 GB RAM minimum
  • Git

Run Locally

# Clone and start
git clone https://github.com/Zipstack/unstract.git
cd unstract
./run-platform.sh

That's it!

📦 Other Deployment Options

Docker Compose

# Pull and run entire Unstract platform with default env config.
./run-platform.sh

# Pull and run docker containers with a specific version tag.
./run-platform.sh -v v0.1.0

# Upgrade existing Unstract platform setup by pulling the latest available version.
./run-platform.sh -u

# Upgrade existing Unstract platform setup by pulling a specific version.
./run-platform.sh -u -v v0.2.0

# Build docker images locally as a specific version tag.
./run-platform.sh -b -v v0.1.0

# Build docker images locally from working branch as `current` version tag.
./run-platform.sh -b -v current

# Display the help information.
./run-platform.sh -h

# Only do setup of environment files.
./run-platform.sh -e

# Only do docker images pull with a specific version tag.
./run-platform.sh -p -v v0.1.0

# Only prepare docker images, building them locally instead of pulling, with a specific version tag.
./run-platform.sh -p -b -v v0.1.0

# Upgrade existing Unstract platform setup with docker images built locally from working branch as `current` version tag.
./run-platform.sh -u -b -v current

# Pull and run docker containers in detached mode.
./run-platform.sh -d -v v0.1.0

🔐 Backup Encryption Key

Warning

This key encrypts adapter credentials — losing it makes existing adapters inaccessible!

Copy the value of ENCRYPTION_KEY from backend/.env or platform-service/.env to a secure location.
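One way to read the key programmatically for a backup script is sketched below. `read_env_value` is a hypothetical helper, and it assumes the `.env` file uses plain `KEY=value` lines with optional surrounding quotes.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_path: Path, key: str) -> Optional[str]:
    """Return the value assigned to `key` in a KEY=value style .env file."""
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values containing "=" stay intact.
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip("'\"")
    return None
```

For example, `read_env_value(Path("backend/.env"), "ENCRYPTION_KEY")` returns the key as a string, or `None` if the variable is not set. Store the result in a secrets manager or another secure location rather than printing it.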

🏗️ Unstract Architecture

┌────────────────────────────────────────────────────────────┐
│                          Unstract                          │
├─────────────┬─────────────┬─────────────┬──────────────────┤
│  Frontend   │   Backend   │   Worker    │ Platform Service │
│  (React)    │  (Django)   │  (Celery)   │   (FastAPI)      │
├─────────────┴─────────────┴─────────────┴──────────────────┤
│                      Cache (Redis)                         │
├────────────────────────────────────────────────────────────┤
│                  Message Queue (RabbitMQ)                  │
├────────────────────────────────────────────────────────────┤
│                   Database (PostgreSQL)                    │
├────────────────────────────────────────────────────────────┤
│  LLM Adapters    │  Vector DBs    │  Text Extractors       │
│  (OpenAI, etc.)  │ (Qdrant, etc.) │  (LLMWhisperer)        │
└────────────────────────────────────────────────────────────┘

Also see architecture.

📄 Document File Formats

| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, ODT, TXT, CSV, JSON |
| Spreadsheets | XLSX, XLS, ODS |
| Presentations | PPTX, PPT, ODP |
| Images | PNG, JPG, JPEG, TIFF, BMP, GIF, WEBP |
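A client can use the table above to reject unsupported files before uploading them. The sketch below is a hypothetical pre-flight check and not part of Unstract; `SUPPORTED_FORMATS` and `format_category` are names invented for this example, and the check looks only at the file extension, not the file contents.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the format table above, lowercased.
SUPPORTED_FORMATS = {
    "Documents": {"pdf", "docx", "doc", "odt", "txt", "csv", "json"},
    "Spreadsheets": {"xlsx", "xls", "ods"},
    "Presentations": {"pptx", "ppt", "odp"},
    "Images": {"png", "jpg", "jpeg", "tiff", "bmp", "gif", "webp"},
}


def format_category(filename: str) -> Optional[str]:
    """Return the table category for a filename, or None if it is not listed."""
    extension = Path(filename).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED_FORMATS.items():
        if extension in extensions:
            return category
    return None
```

Only the extensions listed in the table match, so a variant spelling such as `.tif` returns `None` under this sketch.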

🔌 Connectors & Adapters

LLM Providers

OpenAI, Azure OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, Mistral AI, Ollama (local), Anyscale

Vector Databases

Qdrant, Pinecone, Weaviate, PostgreSQL, Milvus

Text Extractors

LLMWhisperer, Unstructured.io, LlamaIndex Parse

ETL Sources & Destinations

Sources: AWS S3, MinIO, Google Cloud Storage, Azure Blob, Google Drive, Dropbox, SFTP

Destinations: Snowflake, Amazon Redshift, Google BigQuery, PostgreSQL, MySQL, MariaDB, SQL Server, Oracle

Full Connector List

🛠️ Development

Change Default Credentials

Follow these steps to change the default username and password.

Local Development

# Install pre-commit hooks
./dev-env-cli.sh -p

# Run pre-commit checks
./dev-env-cli.sh -r

Local Development Guide

🏢 Use Cases by Industry

Finance & Banking → | Insurance → | Healthcare → | Income Tax →

☁️ Cloud & Enterprise

For teams that need managed infrastructure, advanced accuracy features, or compliance certifications.

  • LLMChallenge — dual-LLM verification
  • SinglePass & Summarized Extraction — reduce LLM token costs
  • Human-in-the-Loop — review interface with document highlighting
  • SSO & Enterprise RBAC — SAML/OIDC integration with granular role-based access control
  • SOC 2, HIPAA, ISO 27001, GDPR Compliant — third-party audited security certifications
  • Priority Support with SLA — dedicated support team with response time guarantees

Book a Demo

📚 Cookbooks

🤝 Contributing

We welcome contributions! The easiest way to start:

  1. Pick an issue tagged good first issue
  2. Submit a PR

Report Bug → | Request Feature →

👋 Community

Join the LLM-powered document automation community:

Blog · LinkedIn · Slack · X

📊 A Note on Analytics

Unstract integrates Posthog to track minimal usage analytics. Disable by setting REACT_APP_ENABLE_POSTHOG=false in the frontend's .env file.

📜 License

Unstract is released under the AGPL-3.0 License.


Built with ❤️ by Zipstack

Website · Documentation · Pricing
