Ahmed-El-Zainy/Document-AI-From-OCR-to-Agentic-Doc-Extraction
📄 Document AI: From OCR to Agentic Document Extraction


A comprehensive learning journey through modern document intelligence techniques, from traditional OCR to advanced agentic extraction systems




🎯 Overview

This repository contains a comprehensive course on Document AI and Intelligent Document Processing (IDP), covering the complete evolution from traditional OCR to modern agentic document extraction systems. Learn how to build production-ready document understanding pipelines using cutting-edge technologies.

What You'll Learn

  • πŸ” Traditional OCR: Understanding Tesseract and foundational OCR techniques
  • 🧠 Deep Learning OCR: PaddleOCR and neural network-based text recognition
  • πŸ“ Layout Analysis: LayoutLM and LayoutReader for document structure understanding
  • πŸ€– Agentic Extraction: LandingAI's ADE (Agentic Document Extraction)
  • ☁️ Cloud Deployment: Building RAG pipelines with AWS Bedrock and Lambda
  • πŸ’¬ Conversational AI: Creating document-based chatbots with Strands Agents

✨ Key Features

| Feature | Description |
| --- | --- |
| Multi-Modal Processing | Handle PDFs, images, tables, and complex layouts |
| Visual Grounding | Maintain bounding-box information for precise chunk extraction |
| Production-Ready | AWS Lambda integration for scalable document processing |
| RAG Pipeline | Complete Retrieval-Augmented Generation system |
| Interactive Learning | Jupyter notebooks with hands-on examples |
| Real-World Use Cases | Medical documents, invoices, receipts, forms, and more |

πŸ—οΈ Architecture

Overall System Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Document AI Pipeline                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚      Input Document Processing          β”‚
        β”‚  (PDF, Images, Scanned Documents)       β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                                           β”‚
        β–Ό                                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Traditional OCR β”‚                    β”‚   Deep Learning  β”‚
β”‚    (Tesseract)   β”‚                    β”‚   OCR (Paddle)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚                                           β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚       Layout Understanding              β”‚
         β”‚    (LayoutLM, LayoutReader)             β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚      Agentic Document Extraction        β”‚
        β”‚           (LandingAI ADE)               β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                                           β”‚
        β–Ό                                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Structured     β”‚                    β”‚   RAG Pipeline   β”‚
β”‚     Output       β”‚                    β”‚   (AWS Bedrock)  β”‚
β”‚  (Markdown, JSON)β”‚                    β”‚                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                  β”‚
                                                  β–Ό
                                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                        β”‚   Chatbot Agent  β”‚
                                        β”‚ (Strands Agents) β”‚
                                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

AWS RAG Pipeline Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   S3 Bucket │────────▢│   Lambda     │────────▢│   LandingAI     β”‚
β”‚ (PDF Upload)β”‚         β”‚  Function    β”‚         β”‚      ADE        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                           β”‚
                                                           β”‚ Process
                                                           β–Ό
                                                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                                  β”‚  Extract Chunks β”‚
                                                  β”‚  + Grounding    β”‚
                                                  β”‚  + Metadata     β”‚
                                                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                           β”‚
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄
                        β”‚                                  β”‚
                        β–Ό                                  β–Ό
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚ Markdown Output β”‚              β”‚  Chunk JSONs    β”‚
              β”‚  (S3 Storage)   β”‚              β”‚ + Bounding Boxesβ”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                           β”‚
                                                           β–Ό
                                                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                                  β”‚     Bedrock     β”‚
                                                  β”‚ Knowledge Base  β”‚
                                                  β”‚   (Vector DB)   β”‚
                                                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                           β”‚
                                                           β–Ό
                                                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                                  β”‚ Strands Agents  β”‚
                                                  β”‚    Chatbot      β”‚
                                                  β”‚ + Visual Ground β”‚
                                                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Introduction

This repository contains a course/tutorial series focused on Document AI, progressing from basic OCR (Optical Character Recognition) to advanced Agentic Document Extraction (ADE) and RAG (Retrieval-Augmented Generation) pipelines.

Course Structure

The course is organized into several "Labs," each contained within its own subdirectory. The directory naming convention (L2, L4, etc.) suggests a step-by-step progression, though the internal Lab numbering differs slightly.

| Directory | Notebook File | Lab Title | Key Topics |
| --- | --- | --- | --- |
| L2 | L2.ipynb | Lab 1: Document Processing with OCR | Basic OCR with Tesseract; parsing sample documents; regex-based extraction; building a simple OCR agent; limitations of basic OCR |
| L4 | L4.ipynb | Lab 2: Document Processing with PaddleOCR | Advanced OCR with PaddleOCR (deep-learning based); text detection vs. recognition; layout detection; handling tables and handwriting |
| L6 | L6.ipynb | Lab 3: Building Agentic Document Understanding | LayoutReader for reading order; vision-language models (VLMs) for charts/tables; building custom tools (AnalyzeChart, AnalyzeTable); assembling a LangChain agent |
| L8 | L8.ipynb | Lab 4: Agentic Document Extraction (Part I) | Introduction to LandingAI's ADE framework; vision-first, data-centric, agentic approach; extracting key-value pairs; handling "difficult" documents (charts, handwritten forms) |
| L9 | L9.ipynb | Lab 4: Agentic Document Extraction (Part II) | Processing multiple document types; document categorization schemas; validation logic for extractions; building a full processing pipeline |
| L11 | L11.ipynb | Lab 5: Agentic Document Extraction for RAG | RAG (Retrieval-Augmented Generation) with documents; preprocessing, retrieval, and generation phases; vector database setup (ChromaDB); visual grounding in RAG |

Key Components

  • Notebooks (.ipynb): The core interactive lessons containing code, explanations, and exercises.

  • helper.py: A shared utility file present in multiple directories (and the root), likely containing helper functions for image processing, visualization, or API interaction.

  • rag_pipeline_aws/: A separate directory likely containing a more production-oriented, cloud-based RAG implementation (based on the name).
  • Images/Assets: Each Lab directory contains sample images (invoice.png, receipt.jpg, apple_10k.pdf) used for testing the document processing pipelines.

Getting Started

To begin, it is recommended to start with Lab 1 (L2/L2.ipynb) to understand the basics of OCR and agent construction before moving on to more advanced topics like PaddleOCR and agentic extraction.

📚 Course Structure

The course is organized into progressive lessons, each building upon previous concepts:

| Lesson | Topic | Technologies | Difficulty |
| --- | --- | --- | --- |
| L1 | Introduction to OCR | Tesseract | ⭐ Beginner |
| L2 | Document Processing | Tesseract, PaddleOCR | ⭐⭐ Beginner |
| L3 | Layout Analysis | LayoutLM | ⭐⭐ Intermediate |
| L4 | Advanced OCR | PaddleOCR | ⭐⭐ Intermediate |
| L6 | Reading Order | LayoutReader | ⭐⭐⭐ Intermediate |
| L8 | Agentic Extraction | LandingAI ADE | ⭐⭐⭐ Advanced |
| L9 | Batch Processing | LandingAI ADE | ⭐⭐⭐ Advanced |
| L11 | RAG with ChromaDB | ChromaDB, LangChain | ⭐⭐⭐⭐ Advanced |
| Lab 6 | AWS RAG Pipeline | AWS Bedrock, Lambda, Strands | ⭐⭐⭐⭐⭐ Expert |

🔧 Prerequisites

System Requirements

  • Python: Version 3.10 (recommended)
  • OS: Linux, macOS, or Windows (Linux x86_64 recommended for AWS Lambda)
  • Memory: 8GB RAM minimum (16GB recommended)
  • Storage: 5GB free space

Required Accounts

  1. LandingAI Account (Free tier available)

    • Sign up at LandingAI
    • Get your Vision Agent API key
  2. AWS Account (for Lab 6 only)

    • Required services: S3, Lambda, Bedrock, IAM
    • Estimated cost: ~$5-10/month for testing
  3. OpenAI Account (optional, for advanced features)


📦 Installation

Step 1: Clone the Repository

git clone https://github.com/Ahmed-El-Zainy/Document-AI-From-OCR-to-Agentic-Doc-Extraction.git
cd document_ai_from_OCR_to_agentic_doc_extraction

Step 2: Create Virtual Environment

# Using venv
python3.10 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using conda
conda create -n docai python=3.10.6
conda activate docai

Step 3: Install Dependencies

# Install core dependencies
pip install -r requirements.txt

# For AWS Lab 6 (optional)
pip install boto3 bedrock-agentcore strands-agents

Step 4: Install System Dependencies

For Tesseract OCR (L1, L2):

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki

For PaddleOCR (L2, L4):

pip install paddlepaddle==3.0.0 paddleocr

Step 5: Configure Environment Variables

Create a .env file in the project root:

# Most used in the notebooks
#### LandingAI Configuration
VISION_AGENT_API_KEY=your_landingai_api_key_here
HF_TOKEN=your_huggingface_api_key    # chat models & embedding models
GROQ_API_KEY=your_groq_api_key       # VLM models


# AWS Configuration (for Lab 6)
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=us-west-2
S3_BUCKET=your-bucket-name
BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-5-20250929-v1:0
BEDROCK_KB_ID=your_knowledge_base_id

🚀 Quick Start

Example 1: Basic OCR with Tesseract (Lesson 2)

import pytesseract
from PIL import Image

# Load and process image
image = Image.open("L2/invoice.png")
text = pytesseract.image_to_string(image)
print(text)

Example 2: Advanced OCR with PaddleOCR (Lesson 4)

from paddleocr import PaddleOCR

# Initialize OCR
ocr = PaddleOCR(use_angle_cls=True, lang='en')

# Process document
result = ocr.ocr('L4/bank_statement.png', cls=True)

# Extract text
for line in result:
    for word_info in line:
        print(word_info[1][0])  # Extracted text

Example 3: Agentic Document Extraction (Lesson 8)

from landingai.ade import ADEClient

# Initialize ADE client
client = ADEClient(api_key="your_api_key")

# Parse document with visual grounding
response = client.parse(
    document_path="document.pdf",
    extract_tables=True,
    extract_figures=True
)

# Access structured output
markdown_content = response.markdown
groundings = response.grounding  # Bounding box information

print(markdown_content)

Example 4: RAG Pipeline Query (Lesson 11)

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Load vector database
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings()
)

# Query documents
results = vectorstore.similarity_search(
    "What are the company's revenue figures?",
    k=5
)

for doc in results:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}\n")

📖 Lessons Overview

Lesson 2: Introduction to Document Processing

Focus: Traditional OCR with Tesseract and PaddleOCR

πŸ“ Directory: L2/

  • L2.ipynb - Main tutorial notebook
  • invoice.png, receipt.jpg, table.png - Sample documents

Key Concepts:

  • Image preprocessing techniques
  • Text extraction from various document types
  • Handling tables and forms
  • Comparing Tesseract vs. PaddleOCR performance
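As a concrete illustration of the preprocessing step: converting to grayscale, median-filtering speckle noise, and binarizing usually helps Tesseract on noisy scans. This sketch uses Pillow only; the function name and threshold value are illustrative, not taken from the course notebooks:

```python
from PIL import Image, ImageFilter, ImageOps

def preprocess_for_ocr(img: Image.Image, threshold: int = 150) -> Image.Image:
    """Grayscale -> denoise -> binarize, a common cleanup before Tesseract."""
    gray = ImageOps.grayscale(img)                    # drop color information
    gray = gray.filter(ImageFilter.MedianFilter(3))   # remove salt-and-pepper noise
    return gray.point(lambda p: 255 if p > threshold else 0)  # hard black/white
```

The cleaned image can then be passed straight to `pytesseract.image_to_string`.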

Use Cases:

  • ✅ Simple invoices and receipts
  • ✅ Clean, scanned documents
  • ⚠️ Limited table structure recognition
  • ❌ Complex layouts not well supported

Lesson 4: Advanced OCR with PaddleOCR

Focus: Deep learning-based OCR with better accuracy

πŸ“ Directory: L4/

  • L4.ipynb - Advanced OCR techniques
  • bank_statement.png, handwritten.jpg - Complex documents

Key Concepts:

  • Neural network-based text detection
  • Multi-language support
  • Angle classification for rotated text
  • Confidence scoring
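Because every recognized line carries a score, a small post-filter can discard low-confidence reads before any downstream parsing. A sketch assuming the classic `[box, (text, score)]` per-page layout returned by `ocr.ocr(...)`; the 0.80 cutoff is an arbitrary choice:

```python
def filter_by_confidence(page_result, min_score=0.80):
    """Keep only lines whose recognition score clears min_score.

    page_result: one page of PaddleOCR output, a list of [box, (text, score)].
    """
    kept = []
    for box, (text, score) in page_result:
        if score >= min_score:
            kept.append({"text": text, "score": score, "box": box})
    return kept
```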

Improvements over L2:

  • ✅ Better handling of curved or rotated text
  • ✅ Improved accuracy on low-quality scans
  • ✅ Multi-language text recognition
  • ✅ Handwriting recognition support

Lesson 6: Layout Analysis with LayoutReader

Focus: Understanding document structure and reading order

πŸ“ Directory: L6/

  • L6.ipynb - Layout understanding tutorial
  • layoutreader/ - LayoutReader implementation
  • architecture.png, report_layout.png - Visualization examples

Key Concepts:

  • Document layout analysis
  • Reading order determination
  • Relationship between text blocks
  • Visual structure recognition

Architecture:

Document Image
      β”‚
      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Layout Model β”‚  (LayoutLM/LayoutLMv2)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β”‚
      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Bounding   β”‚  (Text blocks, tables, figures)
β”‚    Boxes     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β”‚
      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Reading Orderβ”‚  (Sequence prediction)
β”‚ Determinationβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β”‚
      β–Ό
Structured Output
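For intuition, the reading-order step can be approximated by a purely geometric baseline: snap each block to a coarse row by its top edge, then sort left-to-right within rows. LayoutReader learns this ordering from data instead of hard-coding it; the sketch below (with a hypothetical `bbox` field and `row_tolerance` parameter) is only the heuristic it improves on:

```python
def reading_order(blocks, row_tolerance=10):
    """Sort layout blocks top-to-bottom, then left-to-right.

    blocks: dicts with "bbox" = (x0, y0, x1, y1) in pixel coordinates.
    """
    def sort_key(block):
        x0, y0, _, _ = block["bbox"]
        # Quantize y so blocks on roughly the same line compare by x.
        return (round(y0 / row_tolerance), x0)
    return sorted(blocks, key=sort_key)
```

Multi-column pages break this heuristic (it interleaves the columns), which is exactly the failure mode a learned reading-order model addresses.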

Lesson 8: Agentic Document Extraction

Focus: Modern AI-powered document understanding with LandingAI ADE

πŸ“ Directory: L8/

  • L8.ipynb - ADE comprehensive tutorial
  • helper.py - Visualization utilities
  • utility_example/ - Advanced examples
  • difficult_examples/ - Edge cases

Key Concepts:

  • Agentic approach to document extraction
  • Automatic chunk detection (text, tables, figures)
  • Visual grounding with bounding boxes
  • Markdown output with preserved structure
  • Confidence scoring for extractions

Chunk Types:

  • πŸ“ chunkText - Regular text paragraphs
  • πŸ“Š chunkTable - Structured tables
  • πŸ–ΌοΈ chunkFigure - Images and diagrams
  • 🏷️ chunkLogo - Company logos
  • πŸ“‡ chunkCard - Business cards
  • ✍️ chunkAttestation - Signatures
  • πŸ“± chunkScanCode - QR/Barcodes
  • πŸ“‹ chunkForm - Form fields
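When walking a parse result, it is often convenient to bucket chunks by these type labels first. A small sketch, assuming chunks arrive as plain dicts with a `type` key (adapt the key to the actual SDK response objects):

```python
from collections import defaultdict

def group_chunks_by_type(chunks):
    """Bucket chunks by their type label (chunkText, chunkTable, ...)."""
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk.get("type", "unknown")].append(chunk)
    return dict(groups)
```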

Visualization Example:

from helper import draw_bounding_boxes

# Parse document
response = ade_client.parse("document.pdf")

# Draw bounding boxes on chunks
draw_bounding_boxes(response, "document.pdf")

Output: Color-coded bounding boxes showing:

  • 🟢 Green: Text chunks
  • 🔵 Blue: Tables
  • 🟣 Purple: Marginalia
  • 🟠 Orange: Cards

Lesson 9: Batch Processing with ADE

Focus: Processing multiple documents efficiently

πŸ“ Directory: L9/

  • L9.ipynb - Batch processing workflow
  • input_folder/ - Sample documents for batch processing
  • results/ - Processed outputs
  • results_extracted/ - Extracted structured data

Key Concepts:

  • Batch document processing
  • Parallel processing strategies
  • Error handling and logging
  • Output organization
  • Performance optimization

Workflow:

import os
from pathlib import Path
from landingai.ade import ADEClient

client = ADEClient(api_key=os.getenv("VISION_AGENT_API_KEY"))
input_dir = Path("input_folder")
output_dir = Path("results")

for doc_path in input_dir.glob("*.pdf"):
    try:
        response = client.parse(doc_path)

        # Save markdown
        (output_dir / f"{doc_path.stem}.md").write_text(response.markdown)

        # Save grounding data
        # ... (save JSON with bounding boxes)

        print(f"✅ Processed: {doc_path.name}")
    except Exception as e:
        print(f"❌ Failed: {doc_path.name} - {e}")
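Since ADE calls are network-bound, the serial loop above parallelizes well with a thread pool. A sketch where the hypothetical `process_fn` stands in for a wrapper around `client.parse` plus the save steps; failures are collected per file instead of aborting the batch:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_batch(paths, process_fn, max_workers=4):
    """Run process_fn over many documents concurrently (thread pool).

    Returns (results, errors): per-path outputs and per-path exceptions.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_fn, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception as exc:
                errors[path] = exc   # keep going; report failures at the end
    return results, errors
```

Threads (not processes) fit here because the work is waiting on API responses, not CPU-bound parsing.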

Lesson 11: RAG with ChromaDB

Focus: Building a Retrieval-Augmented Generation system

πŸ“ Directory: L11/

  • L11.ipynb - RAG implementation tutorial
  • apple_10k.pdf - Sample financial document
  • chroma_db/ - Vector database storage
  • ade_outputs/ - Processed document chunks

Key Concepts:

  • Document chunking strategies
  • Vector embeddings
  • Semantic search
  • Context retrieval
  • LLM integration for Q&A

RAG Pipeline Flow:

Document (PDF)
      β”‚
      β–Ό
  ADE Parse  ──────> Markdown + Grounding
      β”‚
      β–Ό
  Chunking   ──────> Semantic segments
      β”‚
      β–Ό
  Embeddings ──────> Vector representations
      β”‚
      β–Ό
  ChromaDB   ──────> Vector storage
      β”‚
      β–Ό
  User Query
      β”‚
      β–Ό
 Similarity  ──────> Retrieve relevant chunks
   Search
      β”‚
      β–Ό
  LLM + Context ───> Generate answer
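One plausible implementation of a `create_semantic_chunks` helper for this pipeline splits the ADE markdown on headings, so each chunk stays topically coherent, and then caps chunk size. This is a sketch of the strategy, not the course's actual code:

```python
def create_semantic_chunks(markdown, max_chars=1500):
    """Split ADE markdown into heading-aligned chunks of bounded size."""
    sections, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))   # close the previous section
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks = []
    for sec in sections:
        # Hard-split any section that exceeds the size cap.
        for i in range(0, len(sec), max_chars):
            chunks.append(sec[i:i + max_chars])
    return chunks
```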

Example Query Flow:

# 1. Load and parse document
response = ade_client.parse("apple_10k.pdf")

# 2. Create chunks
chunks = create_semantic_chunks(response.markdown)

# 3. Store in vector DB
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"
)

# 4. Query
query = "What was Apple's revenue in 2023?"
docs = vectorstore.similarity_search(query, k=5)

# 5. Generate answer with LLM
context = "\n\n".join([doc.page_content for doc in docs])
answer = llm.invoke(f"Context: {context}\n\nQuestion: {query}")

☁️ RAG Pipeline with AWS

Overview

Lab 6 demonstrates a production-ready document intelligence system using AWS services.

πŸ“ Directory: rag_pipeline_aws/

Architecture Components

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    AWS Cloud Infrastructure                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  User Upload β”‚
β”‚  PDF to S3   β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              S3 Bucket Structure                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  input/                                             β”‚
β”‚    └── medical/                                     β”‚
β”‚         └── research_papers.pdf                     β”‚
β”‚                                                     β”‚
β”‚  output/                                            β”‚
β”‚    β”œβ”€β”€ medical/                  (Markdown)         β”‚
β”‚    β”œβ”€β”€ medical_grounding/        (Bounding boxes)   β”‚
β”‚    β”œβ”€β”€ medical_chunks/           (Chunk JSONs)      β”‚
β”‚    └── medical_chunk_images/     (Cropped images)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β”‚ S3 Event Trigger
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           Lambda Function                β”‚
β”‚      (ade_s3_handler.py)                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  • Triggered on S3 upload                β”‚
β”‚  • Calls LandingAI ADE API               β”‚
β”‚  • Processes document                    β”‚
β”‚  • Creates chunk JSONs                   β”‚
β”‚  • Saves to S3 output/                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚       AWS Bedrock Knowledge Base         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  • Indexes chunk JSONs                   β”‚
β”‚  • Maintains metadata                    β”‚
β”‚  • Vector embeddings                     β”‚
β”‚  • Semantic search                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         Strands Agent Framework          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  • Orchestrates conversation             β”‚
β”‚  • Queries Knowledge Base                β”‚
β”‚  • Visual grounding tool                 β”‚
β”‚  • Bedrock Memory Service                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          User Interaction                β”‚
β”‚  • Ask questions about documents         β”‚
β”‚  • Get answers with source citations     β”‚
β”‚  • View highlighted document regions     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Files

  • Lab-6.ipynb - Main tutorial notebook
  • ade_s3_handler.py - Lambda function for document processing
  • lambda_helpers.py - Deployment utilities
  • visual_grounding_helper.py - Chunk image extraction
  • medical/ - Sample medical research papers

Quick Setup

# 1. Configure AWS credentials
aws configure

# 2. Create S3 bucket
aws s3 mb s3://your-doc-bucket
aws s3api put-object --bucket your-doc-bucket --key input/
aws s3api put-object --bucket your-doc-bucket --key output/

# 3. Deploy Lambda (see Lab-6.ipynb for details)
# 4. Create Bedrock Knowledge Base
# 5. Upload documents and start chatting!
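Inside the Lambda function, the first task is extracting the uploaded object's location from the S3 event payload. A sketch of that piece (`parse_s3_event` is an illustrative helper name; the real ade_s3_handler.py would then download the file and call the ADE API):

```python
import urllib.parse

def parse_s3_event(event):
    """Pull (bucket, key) pairs out of an S3 put-event payload.

    Keys arrive URL-encoded in S3 notifications, so spaces come through
    as '+' and must be decoded before use.
    """
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs
```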

For detailed setup instructions, see rag_pipeline_aws/README.md


πŸ› οΈ Helper Utilities

The helper.py file provides essential utilities for document visualization and processing.

Key Functions

1. Document Display

from helper import print_document

# Display PDF or image in notebook
print_document("document.pdf")
print_document("image.png")

2. Bounding Box Visualization

from helper import draw_bounding_boxes

# Draw color-coded bounding boxes
parse_response = ade_client.parse("document.pdf")
annotated_image = draw_bounding_boxes(parse_response, "document.pdf")

Color Scheme:

  • 🟢 Green (40, 167, 69): Text chunks (chunkText)
  • 🔵 Blue (0, 123, 255): Tables (chunkTable)
  • 🟣 Purple (111, 66, 193): Marginalia (chunkMarginalia)
  • 🟣 Magenta (255, 0, 255): Figures (chunkFigure)
  • 🟢 Light Green (144, 238, 144): Logos (chunkLogo)
  • 🟠 Orange (255, 165, 0): Cards (chunkCard)
  • 🔵 Cyan (0, 255, 255): Attestations (chunkAttestation)
  • 🟡 Yellow (255, 193, 7): Scan codes (chunkScanCode)
  • 🔴 Red (220, 20, 60): Forms (chunkForm)
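In code, that legend maps naturally onto a lookup table keyed by chunk type; a sketch (the actual helper.py may structure this differently):

```python
# RGB colors per chunk type, matching the legend above.
CHUNK_COLORS = {
    "chunkText":        (40, 167, 69),    # green
    "chunkTable":       (0, 123, 255),    # blue
    "chunkMarginalia":  (111, 66, 193),   # purple
    "chunkFigure":      (255, 0, 255),    # magenta
    "chunkLogo":        (144, 238, 144),  # light green
    "chunkCard":        (255, 165, 0),    # orange
    "chunkAttestation": (0, 255, 255),    # cyan
    "chunkScanCode":    (255, 193, 7),    # yellow
    "chunkForm":        (220, 20, 60),    # red
}

def color_for(chunk_type):
    """RGB color for a chunk type; gray fallback for unknown labels."""
    return CHUNK_COLORS.get(chunk_type, (128, 128, 128))
```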

📂 Project Structure

document_ai_from_OCR_to_agentic_doc_extraction/
β”‚
β”œβ”€β”€ README.md                      # This comprehensive guide
β”œβ”€β”€ requirements.txt               # Python dependencies
β”œβ”€β”€ helper.py                      # Global utility functions
β”œβ”€β”€ .env                          # Environment variables (not in git)
β”œβ”€β”€ .gitignore                    # Git ignore rules
β”‚
β”œβ”€β”€ L2/                           # Lesson 2: Basic OCR
β”‚   β”œβ”€β”€ L2.ipynb                  # Jupyter notebook
β”‚   β”œβ”€β”€ l2_doc_processing.py      # Python utilities
β”‚   β”œβ”€β”€ invoice.png               # Sample invoice
β”‚   β”œβ”€β”€ receipt.jpg               # Sample receipt
β”‚   β”œβ”€β”€ table.png                 # Sample table
β”‚   └── requirements.txt          # Lesson-specific deps
β”‚
β”œβ”€β”€ L4/                           # Lesson 4: PaddleOCR
β”‚   β”œβ”€β”€ L4.ipynb
β”‚   β”œβ”€β”€ l4_doc_parsing_paddleocr.py
β”‚   β”œβ”€β”€ bank_statement.png
β”‚   β”œβ”€β”€ handwritten.jpg
β”‚   └── article.jpg
β”‚
β”œβ”€β”€ L6/                           # Lesson 6: Layout Analysis
β”‚   β”œβ”€β”€ L6.ipynb
β”‚   β”œβ”€β”€ architecture.png
β”‚   β”œβ”€β”€ report_layout.png
β”‚   └── layoutreader/             # LayoutReader implementation
β”‚       β”œβ”€β”€ README.md
β”‚       β”œβ”€β”€ main.py
β”‚       └── tools.py
β”‚
β”œβ”€β”€ L8/                           # Lesson 8: Agentic Extraction
β”‚   β”œβ”€β”€ L8.ipynb
β”‚   β”œβ”€β”€ helper.py
β”‚   β”œβ”€β”€ difficult_examples/       # Complex document samples
β”‚   └── utility_example/
β”‚
β”œβ”€β”€ L9/                           # Lesson 9: Batch Processing
β”‚   β”œβ”€β”€ L9.ipynb
β”‚   β”œβ”€β”€ helper.py
β”‚   β”œβ”€β”€ input_folder/             # Documents to process
β”‚   β”œβ”€β”€ results/                  # Markdown outputs
β”‚   └── results_extracted/        # Structured extractions
β”‚
β”œβ”€β”€ L11/                          # Lesson 11: RAG Pipeline
β”‚   β”œβ”€β”€ L11.ipynb
β”‚   β”œβ”€β”€ helper.py
β”‚   β”œβ”€β”€ apple_10k.pdf             # Sample financial document
β”‚   β”œβ”€β”€ ade_outputs/
β”‚   └── chroma_db/                # Vector database
β”‚
└── rag_pipeline_aws/             # Lab 6: AWS RAG System
    β”œβ”€β”€ Lab-6.ipynb
    β”œβ”€β”€ README.md                 # Detailed lab guide
    β”œβ”€β”€ ade_s3_handler.py         # Lambda function
    β”œβ”€β”€ lambda_helpers.py         # Deployment tools
    β”œβ”€β”€ visual_grounding_helper.py # Chunk extraction
    └── medical/                   # Sample medical PDFs

📚 Technical Resources

Core Technologies

  • OCR Engines: Tesseract, PaddleOCR
  • Layout Understanding: LayoutLM, LayoutReader
  • LandingAI Platform: Agentic Document Extraction (ADE), Vision Agent API
  • AWS Services: S3, Lambda, Bedrock Knowledge Bases, IAM
  • Python Libraries: pytesseract, Pillow, LangChain, ChromaDB, boto3, strands-agents

πŸ” Troubleshooting

Common Issues

1. Tesseract Not Found

Error: TesseractNotFoundError

Solution:

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install tesseract-ocr

# Verify installation
tesseract --version

# If needed, specify path
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

2. PaddlePaddle Installation Issues

Error: No module named 'paddle'

Solution:

# Uninstall any existing version
pip uninstall paddlepaddle paddlepaddle-gpu

# Install correct version
pip install paddlepaddle==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/

# For GPU (CUDA 11.2)
pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu112/

3. LandingAI API Key Issues

Error: Authentication failed

Solution:

# Verify .env file exists
cat .env | grep VISION_AGENT_API_KEY

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Or set directly (not recommended for production)
import os
os.environ['VISION_AGENT_API_KEY'] = 'your_key_here'

4. AWS Lambda Timeout

Error: Task timed out after 3.00 seconds

Solution:

# Increase Lambda timeout
lambda_client.update_function_configuration(
    FunctionName='doc-processor',
    Timeout=900,  # 15 minutes
    MemorySize=1024  # 1GB RAM
)

🤝 Contributing

We welcome contributions! Here's how you can help:

Ways to Contribute

  1. πŸ› Report Bugs - Use GitHub Issues
  2. ✨ Suggest Features - Propose new lessons or examples
  3. πŸ“– Improve Documentation - Fix typos, add clarifications
  4. πŸ’» Submit Code - Fork, create feature branch, submit PR

Development Setup

# Fork and clone
git clone https://github.com/YOUR_USERNAME/document_ai_from_OCR_to_agentic_doc_extraction.git
cd document_ai_from_OCR_to_agentic_doc_extraction

# Create branch
git checkout -b feature/your-feature-name

# Make changes and commit
git commit -m "Add: Brief description of changes"

# Push and create PR
git push origin feature/your-feature-name

📄 License

This project is licensed under the MIT License.


πŸ™ Acknowledgments

  • DeepLearning.AI for the course structure
  • LandingAI for the ADE platform and Vision Agent
  • AWS for cloud infrastructure support
  • PaddlePaddle team for PaddleOCR
  • Microsoft for LayoutLM research
  • Google for Tesseract OCR



Last Updated: February 2026