diff --git a/.agents/summary/.last_commit b/.agents/summary/.last_commit
new file mode 100644
index 00000000..f8cbf9ca
--- /dev/null
+++ b/.agents/summary/.last_commit
@@ -0,0 +1 @@
+8d6102bc644641c94f5a695a32ea50c19b3c8d68
diff --git a/.agents/summary/architecture.md b/.agents/summary/architecture.md
new file mode 100644
index 00000000..f005f8c2
--- /dev/null
+++ b/.agents/summary/architecture.md
@@ -0,0 +1,411 @@
+# System Architecture
+
+## High-Level Overview
+
+The PDF Accessibility Solutions system provides two independent but complementary approaches to making PDF documents accessible:
+
+1. **PDF-to-PDF Remediation**: Maintains PDF format while adding accessibility features
+2. **PDF-to-HTML Remediation**: Converts PDFs to accessible HTML
+
+Both solutions are serverless, event-driven architectures deployed on AWS.
+
+## Architecture Diagram
+
+```mermaid
+graph TB
+ subgraph "PDF-to-PDF Solution"
+ S3_PDF[S3 Bucket<br/>pdf/ folder]
+ Splitter[Lambda: PDF Splitter]
+ StepFn[Step Functions<br/>Orchestrator]
+ Adobe[ECS: Adobe Autotag]
+ AltText[ECS: Alt Text Generator]
+ TitleGen[Lambda: Title Generator]
+ PreCheck[Lambda: Pre-Check]
+ PostCheck[Lambda: Post-Check]
+ Merger[Lambda: PDF Merger<br/>Java]
+ S3_Result[S3: result/ folder]
+
+ S3_PDF -->|S3 Event| Splitter
+ Splitter -->|Trigger| StepFn
+ StepFn --> PreCheck
+ PreCheck --> Adobe
+ Adobe --> AltText
+ AltText --> TitleGen
+ TitleGen --> PostCheck
+ PostCheck --> Merger
+ Merger --> S3_Result
+ end
+
+ subgraph "PDF-to-HTML Solution"
+ S3_HTML[S3 Bucket<br/>uploads/ folder]
+ Lambda_HTML[Lambda: PDF2HTML<br/>Container]
+ BDA[Bedrock Data<br/>Automation]
+ Bedrock[Bedrock<br/>Nova Pro]
+ S3_Remediated[S3: remediated/ folder]
+
+ S3_HTML -->|S3 Event| Lambda_HTML
+ Lambda_HTML --> BDA
+ BDA -->|Parse PDF| Lambda_HTML
+ Lambda_HTML -->|Audit| Lambda_HTML
+ Lambda_HTML -->|Remediate| Bedrock
+ Bedrock -->|AI Fixes| Lambda_HTML
+ Lambda_HTML --> S3_Remediated
+ end
+
+ subgraph "Shared Services"
+ CW[CloudWatch<br/>Metrics & Logs]
+ Secrets[Secrets Manager<br/>Adobe Credentials]
+ Tagger[Lambda: S3 Tagger<br/>User Attribution]
+ end
+
+ StepFn -.->|Metrics| CW
+ Lambda_HTML -.->|Metrics| CW
+ Adobe -.->|Credentials| Secrets
+ S3_PDF -.->|Tag Objects| Tagger
+ S3_HTML -.->|Tag Objects| Tagger
+```
+
+## PDF-to-PDF Solution Architecture
+
+### Workflow
+
+```mermaid
+sequenceDiagram
+ participant User
+ participant S3
+ participant Splitter
+ participant StepFn
+ participant Adobe
+ participant AltText
+ participant TitleGen
+ participant Merger
+
+ User->>S3: Upload PDF to pdf/ folder
+ S3->>Splitter: S3 Event Notification
+ Splitter->>S3: Split into chunks (temp/)
+ Splitter->>StepFn: Start workflow
+
+ loop For each chunk
+ StepFn->>Adobe: ECS Task (Fargate)
+ Adobe->>Adobe: Auto-tag PDF structure
+ Adobe->>AltText: Pass tagged PDF
+ AltText->>AltText: Generate alt text (Bedrock)
+ AltText->>S3: Save processed chunk
+ end
+
+ StepFn->>TitleGen: Generate document title
+ TitleGen->>Merger: Merge all chunks
+ Merger->>S3: Save to result/ folder
+ S3->>User: Download compliant PDF
+```
+
+### Components
+
+#### 1. PDF Splitter Lambda
+- **Runtime**: Python 3.12
+- **Trigger**: S3 PUT event on `pdf/` folder
+- **Function**: Splits large PDFs into manageable chunks (pages)
+- **Output**: Individual page PDFs in `temp/` folder
+- **Metrics**: Pages processed, file sizes
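+
+A minimal sketch of how a handler might pick out qualifying uploads from an S3 event notification. The `Records[].s3.bucket.name` and `Records[].s3.object.key` fields are the standard S3 event shape; the `pdf_uploads` helper and the `.pdf` extension filter are assumptions for illustration, not the project's actual handler code.
+
+```python
+from urllib.parse import unquote_plus
+
+INPUT_PREFIX = "pdf/"  # trigger folder named in this document
+
+
+def pdf_uploads(event):
+    """Return (bucket, key) pairs for PDFs uploaded under the trigger prefix."""
+    uploads = []
+    for record in event.get("Records", []):
+        bucket = record["s3"]["bucket"]["name"]
+        # S3 URL-encodes object keys in event notifications.
+        key = unquote_plus(record["s3"]["object"]["key"])
+        # The extension filter is an assumption, not confirmed by the source.
+        if key.startswith(INPUT_PREFIX) and key.lower().endswith(".pdf"):
+            uploads.append((bucket, key))
+    return uploads
+```
+
+Decoding the key before use matters because a file name containing spaces arrives with `+` characters in the event payload.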
+
+#### 2. Step Functions Orchestrator
+- **Type**: Standard workflow
+- **Purpose**: Coordinates parallel processing of PDF chunks
+- **Features**:
+ - Parallel execution for multiple chunks
+ - Error handling and retries
+ - Progress tracking
+- **Timeout**: Configurable (default: 1 hour)
+
+#### 3. Adobe Autotag ECS Task
+- **Platform**: ECS Fargate
+- **Container**: Python-based
+- **Function**:
+ - Calls Adobe PDF Services API
+ - Adds PDF structure tags (headings, lists, tables)
+ - Extracts images and metadata
+- **API**: Adobe PDF Services API (Autotag and Extract operations)
+- **Credentials**: Stored in Secrets Manager
+
+#### 4. Alt Text Generator ECS Task
+- **Platform**: ECS Fargate
+- **Container**: Node.js-based
+- **Function**:
+ - Generates alt text for images using Bedrock
+ - Embeds alt text into PDF structure
+ - Uses vision-capable models
+- **Model**: Amazon Nova Pro (multimodal)
+
+#### 5. Title Generator Lambda
+- **Runtime**: Python 3.12
+- **Function**: Generates descriptive PDF title using Bedrock
+- **Model**: Amazon Nova Pro
+- **Input**: PDF text content
+- **Output**: Metadata with generated title
+
+#### 6. PDF Merger Lambda
+- **Runtime**: Java 11
+- **Function**: Merges processed chunks into single PDF
+- **Library**: Apache PDFBox
+- **Output**: Final compliant PDF with "COMPLIANT" prefix
+
+#### 7. Accessibility Checkers
+- **Pre-Remediation**: Audits original PDF
+- **Post-Remediation**: Validates compliance
+- **Output**: JSON reports with WCAG issues
+
+### Infrastructure
+
+#### VPC Configuration
+- **Subnets**: Public and Private with NAT Gateway
+- **VPC Endpoints**: ECR, ECR Docker, S3 (reduces cold start by 10-15s)
+- **Security**: Private subnets for ECS tasks
+
+#### ECS Cluster
+- **Launch Type**: Fargate
+- **CPU**: 2 vCPU (configurable)
+- **Memory**: 4 GB (configurable)
+- **Networking**: Private subnets with NAT egress
+
+#### S3 Bucket Structure
+```
+pdfaccessibility-{id}/
+├── pdf/ # Input PDFs (trigger)
+├── temp/ # Intermediate chunks
+└── result/ # Final compliant PDFs
+```
+
+## PDF-to-HTML Solution Architecture
+
+### Workflow
+
+```mermaid
+sequenceDiagram
+ participant User
+ participant S3
+ participant Lambda
+ participant BDA
+ participant Bedrock
+
+ User->>S3: Upload PDF to uploads/
+ S3->>Lambda: S3 Event Notification
+ Lambda->>BDA: Create parsing job
+ BDA->>BDA: Parse PDF structure
+ BDA->>Lambda: Return structured data
+ Lambda->>Lambda: Convert to HTML
+ Lambda->>Lambda: Audit accessibility
+
+ loop For each issue
+ Lambda->>Bedrock: Generate fix
+ Bedrock->>Lambda: AI-generated solution
+ Lambda->>Lambda: Apply remediation
+ end
+
+ Lambda->>Lambda: Generate report
+ Lambda->>S3: Save to remediated/
+ S3->>User: Download ZIP file
+```
+
+### Components
+
+#### 1. PDF2HTML Lambda Function
+- **Runtime**: Python 3.12 (container)
+- **Trigger**: S3 PUT event on `uploads/` folder
+- **Timeout**: 15 minutes
+- **Memory**: 3 GB
+- **Container**: Custom Docker image with dependencies
+
+#### 2. Bedrock Data Automation (BDA)
+- **Service**: Amazon Bedrock Data Automation
+- **Function**:
+ - Parses PDF structure (text, images, tables)
+ - Extracts layout information
+ - Identifies document elements
+- **Output**: Structured JSON with page elements
+
+#### 3. Accessibility Auditor
+- **Module**: `audit/auditor.py`
+- **Checks**:
+ - WCAG 2.1 Level AA criteria
+ - Document structure (headings, landmarks)
+ - Images (alt text, decorative vs. informative)
+ - Forms (labels, fieldsets)
+ - Tables (headers, captions, scope)
+ - Links (descriptive text)
+ - Color contrast
+- **Output**: Detailed issue list with locations
+
+#### 4. Remediation Engine
+- **Module**: `remediate/remediation_manager.py`
+- **Strategies**:
+ - Image remediation (alt text generation)
+ - Heading hierarchy fixes
+ - Table structure improvements
+ - Form label associations
+ - Landmark additions (main, nav, header, footer)
+ - Link text improvements
+ - Color contrast adjustments
+- **AI Integration**: Bedrock Nova Pro for complex fixes
+
+#### 5. Report Generator
+- **Formats**: HTML, JSON, CSV, TXT
+- **Content**:
+ - Issues found and fixed
+ - WCAG criteria mapping
+ - Before/after comparisons
+ - Usage statistics (tokens, API calls, costs)
+
+### Infrastructure
+
+#### Lambda Container
+- **Base Image**: `public.ecr.aws/lambda/python:3.12`
+- **Dependencies**:
+ - `beautifulsoup4`, `lxml` (HTML parsing)
+ - `boto3` (AWS SDK)
+ - `Pillow` (image processing)
+ - Custom accessibility utility library
+
+#### S3 Bucket Structure
+```
+pdf2html-bucket-{id}/
+├── uploads/ # Input PDFs (trigger)
+├── output/ # Temporary processing files
+└── remediated/ # Final ZIP files
+ └── final_{filename}.zip
+ ├── remediated.html
+ ├── result.html
+ ├── images/
+ ├── remediation_report.html
+ └── usage_data.json
+```
+
+## Shared Infrastructure
+
+### CloudWatch Monitoring
+
+```mermaid
+graph LR
+ Lambda[Lambda Functions] -->|Logs| CW[CloudWatch Logs]
+ ECS[ECS Tasks] -->|Logs| CW
+ Lambda -->|Metrics| CWM[CloudWatch Metrics]
+ ECS -->|Metrics| CWM
+ CWM --> Dashboard[Usage Dashboard]
+
+ Dashboard --> Pages[Pages Processed]
+ Dashboard --> Costs[Cost Estimates]
+ Dashboard --> Errors[Error Rates]
+ Dashboard --> Tokens[Token Usage]
+```
+
+#### Custom Metrics Namespace: `PDFAccessibility`
+
+**Metrics Published**:
+- `PagesProcessed`: Total pages remediated
+- `AdobeAPICalls`: Adobe API invocations
+- `BedrockInvocations`: Bedrock API calls
+- `BedrockTokensUsed`: Input/output tokens
+- `ProcessingDuration`: End-to-end time
+- `ErrorCount`: Failures by type
+- `FileSizeBytes`: Input/output file sizes
+- `EstimatedCost`: Per-user cost tracking
+
+**Dimensions**:
+- `Solution`: `PDF2PDF` or `PDF2HTML`
+- `UserId`: Cognito user ID (from S3 tags)
+- `Operation`: Specific operation type
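+
+As an illustration, one data point in this namespace could be assembled as below. The dictionary follows the shape of the CloudWatch `PutMetricData` request; the `build_metric` helper and the example dimension values are hypothetical and are not taken from the project's `metrics_helper` module.
+
+```python
+NAMESPACE = "PDFAccessibility"
+
+
+def build_metric(name, value, solution, user_id, operation, unit="Count"):
+    """Build keyword arguments in the shape CloudWatch PutMetricData expects."""
+    return {
+        "Namespace": NAMESPACE,
+        "MetricData": [
+            {
+                "MetricName": name,
+                "Dimensions": [
+                    {"Name": "Solution", "Value": solution},
+                    {"Name": "UserId", "Value": user_id},
+                    {"Name": "Operation", "Value": operation},
+                ],
+                "Value": float(value),
+                "Unit": unit,
+            }
+        ],
+    }
+
+
+# Hypothetical values; "split" is an assumed Operation name.
+kwargs = build_metric("PagesProcessed", 10, "PDF2PDF", "user123", "split")
+```
+
+The result could then be passed to `boto3.client("cloudwatch").put_metric_data(**kwargs)`.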
+
+### S3 Object Tagging
+- **Lambda**: `s3_object_tagger`
+- **Purpose**: Attribute usage to individual users
+- **Tags**: `user-id`, `upload-timestamp`
+- **Integration**: Cognito user pools (when UI deployed)
+
+### Secrets Manager
+- **Secret**: `adobe-pdf-services-credentials`
+- **Contents**:
+ - `client_id`: Adobe API client ID
+ - `client_secret`: Adobe API client secret
+- **Access**: Adobe Autotag ECS task only
+
+## Design Patterns
+
+### Event-Driven Architecture
+- S3 events trigger processing pipelines
+- Loose coupling between components
+- Asynchronous processing
+
+### Serverless-First
+- Lambda for lightweight operations
+- ECS Fargate for heavy processing
+- No server management
+
+### Infrastructure as Code
+- AWS CDK for all resources
+- Version-controlled infrastructure
+- Repeatable deployments
+
+### Observability
+- Comprehensive CloudWatch logging
+- Custom metrics for business KPIs
+- Cost tracking per user
+
+### Security
+- Least privilege IAM roles
+- VPC isolation for ECS tasks
+- Secrets Manager for credentials
+- SSL/TLS enforcement on S3
+
+## Scalability Considerations
+
+### PDF-to-PDF
+- **Parallel Processing**: Step Functions processes chunks concurrently
+- **ECS Capacity**: Fargate provisions capacity per task, so capacity follows the number of tasks Step Functions launches
+- **Bottleneck**: Adobe API rate limits
+
+### PDF-to-HTML
+- **Lambda Concurrency**: Configurable (default: 10)
+- **BDA Limits**: Project-level quotas
+- **Bedrock Throttling**: Model-specific limits
+
+### Cost Optimization
+- **VPC Endpoints**: Reduce data transfer costs
+- **zstd Compression**: Faster container startup (2-3x vs gzip)
+- **Fargate Spot**: Not used (on-demand Fargate chosen for reliability)
+- **S3 Lifecycle**: Automatic cleanup of temp files (optional)
+
+## Deployment Architecture
+
+```mermaid
+graph TB
+ Dev[Developer] -->|git push| Repo[GitHub Repo]
+ Repo -->|webhook| CodeBuild[CodeBuild Project]
+ CodeBuild -->|cdk synth| CFN[CloudFormation]
+ CFN -->|deploy| Stack1[PDF-to-PDF Stack]
+ CFN -->|deploy| Stack2[PDF-to-HTML Stack]
+ CFN -->|deploy| Stack3[Metrics Stack]
+
+ CodeBuild -->|docker build| ECR[ECR Repositories]
+ ECR --> Stack1
+ ECR --> Stack2
+```
+
+### Deployment Options
+1. **One-Click**: `deploy.sh` script (CloudShell)
+2. **CodeBuild**: Automated CI/CD pipeline
+3. **Manual**: `cdk deploy` commands
+4. **Local**: `deploy-local.sh` for development
+
+## Disaster Recovery
+
+### Backup Strategy
+- **S3 Versioning**: Enabled on all buckets
+- **CloudFormation**: Infrastructure recreatable from code
+- **Secrets**: Backed up in Secrets Manager
+
+### Recovery Time Objective (RTO)
+- **Infrastructure**: ~15 minutes (CDK redeploy)
+- **Data**: Immediate (S3 versioning)
+
+### Recovery Point Objective (RPO)
+- **Processing State**: Lost (stateless architecture)
+- **Input Files**: Zero data loss (S3 durability)
diff --git a/.agents/summary/codebase_info.md b/.agents/summary/codebase_info.md
new file mode 100644
index 00000000..91a19530
--- /dev/null
+++ b/.agents/summary/codebase_info.md
@@ -0,0 +1,110 @@
+# Codebase Information
+
+## Overview
+
+**Project**: PDF Accessibility Solutions
+**Organization**: Arizona State University's AI Cloud Innovation Center (AI CIC)
+**Purpose**: Automated PDF accessibility remediation using AWS services and generative AI
+
+## Statistics
+
+- **Total Files**: 140
+- **Lines of Code**: 27,949
+- **Primary Languages**: Python (95 files), JavaScript (3 files), Java (2 files), Shell (2 files)
+- **Size Category**: Medium (M)
+
+## Language Distribution
+
+| Language | Files | Functions | Classes | LOC |
+|----------|-------|-----------|---------|-----|
+| Python | 95 | 457 | 74 | ~25,000 |
+| JavaScript | 3 | 5 | 1 | ~700 |
+| Java | 2 | 7 | 2 | ~200 |
+| Shell | 2 | 16 | 0 | ~1,300 |
+
+## Technology Stack
+
+### Infrastructure & Deployment
+- **AWS CDK** (Python & JavaScript): Infrastructure as Code
+- **AWS CloudFormation**: Stack deployment
+- **CodeBuild**: CI/CD pipeline
+
+### AWS Services
+- **Compute**: Lambda, ECS Fargate
+- **Orchestration**: Step Functions
+- **Storage**: S3
+- **AI/ML**: Bedrock (Nova Pro model), Bedrock Data Automation
+- **Monitoring**: CloudWatch, CloudWatch Dashboards
+- **Security**: Secrets Manager, IAM
+- **Networking**: VPC, VPC Endpoints
+
+### External Services
+- **Adobe PDF Services API**: PDF auto-tagging and extraction
+
+### Python Dependencies
+- `aws-cdk-lib==2.147.2`
+- `boto3`: AWS SDK
+- `beautifulsoup4`: HTML parsing
+- `lxml`: XML/HTML processing
+- `pypdf`: PDF manipulation
+- `PyMuPDF (fitz)`: PDF text extraction
+- `Pillow`: Image processing
+
+### JavaScript Dependencies
+- `@aws-cdk/aws-lambda-python-alpha`: Lambda Python constructs
+- `pdf-lib`: PDF manipulation
+- `@aws-sdk/client-bedrock-runtime`: Bedrock API client
+
+## Repository Structure
+
+```
+PDF_Accessibility/
+├── .agents/ # AI assistant documentation
+├── cdk/ # CDK infrastructure (Python)
+│ ├── usage_metrics_stack.py # CloudWatch metrics dashboard
+│ └── cdk_stack.py # Base stack definition
+├── lambda/ # Lambda functions
+│ ├── pdf-splitter-lambda/ # Splits PDFs into chunks
+│ ├── pdf-merger-lambda/ # Merges processed PDFs (Java)
+│ ├── title-generator-lambda/ # Generates PDF titles
+│ ├── pre-remediation-accessibility-checker/
+│ ├── post-remediation-accessibility-checker/
+│ ├── s3_object_tagger/ # Tags S3 objects with user metadata
+│ └── shared/ # Shared utilities (metrics)
+├── pdf2html/ # PDF-to-HTML solution
+│ ├── cdk/ # CDK infrastructure (JavaScript)
+│ ├── content_accessibility_utility_on_aws/ # Core library
+│ │ ├── audit/ # Accessibility auditing
+│ │ ├── remediate/ # Accessibility remediation
+│ │ ├── pdf2html/ # PDF to HTML conversion
+│ │ ├── batch/ # Batch processing
+│ │ └── utils/ # Utilities
+│ ├── lambda_function.py # Lambda entry point
+│ └── Dockerfile # Lambda container image
+├── adobe-autotag-container/ # ECS container for Adobe API
+├── alt-text-generator-container/ # ECS container for alt text (Node.js)
+├── docs/ # Documentation
+├── app.py # Main CDK app (PDF-to-PDF)
+├── deploy.sh # Unified deployment script
+└── deploy-local.sh # Local deployment script
+
+```
+
+## Supported Standards
+
+- **WCAG 2.1 Level AA**: Web Content Accessibility Guidelines
+- **PDF/UA**: PDF Universal Accessibility (ISO 14289)
+
+## Development Environment
+
+- **Python**: 3.9+ (Lambda runtime: 3.12)
+- **Node.js**: 18+ (for the alt text generator container and the PDF-to-HTML CDK app)
+- **Java**: 11+ (for PDF merger Lambda)
+- **Docker**: Required for container builds
+- **AWS CLI**: Required for deployment
+
+## Build Artifacts
+
+- **CDK Output**: `cdk.out/` directory
+- **Docker Images**: Pushed to ECR
+- **Lambda Packages**: Zipped and uploaded to S3
+- **CloudFormation Templates**: Generated in `cdk.out/`
diff --git a/.agents/summary/components.md b/.agents/summary/components.md
new file mode 100644
index 00000000..a95e9487
--- /dev/null
+++ b/.agents/summary/components.md
@@ -0,0 +1,634 @@
+# System Components
+
+## Component Catalog
+
+This document provides detailed information about each major component in the PDF Accessibility Solutions system.
+
+## PDF-to-PDF Solution Components
+
+### 1. PDF Splitter Lambda
+
+**Location**: `lambda/pdf-splitter-lambda/main.py`
+
+**Purpose**: Splits large PDF files into individual pages for parallel processing.
+
+**Key Functions**:
+- `lambda_handler()`: Entry point, processes S3 events
+- `split_pdf_into_pages()`: Splits PDF using pypdf library
+- `log_chunk_created()`: Tracks chunk creation metrics
+
+**Dependencies**:
+- `pypdf`: PDF manipulation
+- `boto3`: S3 operations
+- `metrics_helper`: CloudWatch metrics
+
+**Metrics Published**:
+- `PagesProcessed`: Number of pages split
+- `FileSizeBytes`: Input file size
+- `ProcessingDuration`: Split operation time
+
+**Triggers**: S3 PUT event on `pdf/` folder
+
+**Output**: Individual page PDFs in `temp/` folder with naming pattern: `{original_name}_page_{n}.pdf`
+
+**Error Handling**:
+- Retries with exponential backoff
+- Logs errors to CloudWatch
+- Publishes error metrics
+
+---
+
+### 2. Adobe Autotag Container
+
+**Location**: `adobe-autotag-container/adobe_autotag_processor.py`
+
+**Purpose**: Adds accessibility tags to PDFs using Adobe PDF Services API.
+
+**Key Functions**:
+- `main()`: Entry point for ECS task
+- `autotag_pdf_with_options()`: Calls Adobe API
+- `extract_api()`: Extracts images and structure
+- `add_toc_to_pdf()`: Adds table of contents
+- `set_language_comprehend()`: Detects document language
+- `extract_images_from_extract_api()`: Extracts images for alt text
+
+**Adobe API Operations**:
+- **Autotag**: Adds structure tags (headings, paragraphs, lists, tables)
+- **Extract**: Extracts images, text, and layout information
+
+**Dependencies**:
+- `adobe.pdfservices.operation`: Adobe SDK
+- `boto3`: S3 and Secrets Manager
+- `openpyxl`: Excel parsing for image metadata
+- `sqlite3`: Image metadata database
+
+**Configuration**:
+- Credentials from Secrets Manager
+- Language detection via Amazon Comprehend
+- Configurable tagging options
+
+**Metrics**:
+- Adobe API calls
+- Processing duration
+- File sizes
+- Error tracking
+
+**Container Specs**:
+- **Base Image**: `python:3.9-slim`
+- **CPU**: 2 vCPU
+- **Memory**: 4 GB
+- **Timeout**: 30 minutes
+
+---
+
+### 3. Alt Text Generator Container
+
+**Location**: `alt-text-generator-container/alt_text_generator.js`
+
+**Purpose**: Generates alt text for images using Amazon Bedrock.
+
+**Key Functions**:
+- `startProcess()`: Entry point
+- `modifyPDF()`: Embeds alt text into PDF
+- `generateAltText()`: Calls Bedrock for image description
+- `generateAltTextForLink()`: Handles linked images
+
+**AI Model**: Amazon Nova Pro (multimodal vision model)
+
+**Process**:
+1. Reads PDF with existing tags
+2. Identifies images without alt text
+3. Extracts image context (surrounding text)
+4. Generates descriptive alt text via Bedrock
+5. Embeds alt text into PDF structure
+6. Saves modified PDF
+
+**Dependencies**:
+- `pdf-lib`: PDF manipulation
+- `@aws-sdk/client-bedrock-runtime`: Bedrock API
+- `@aws-sdk/client-s3`: S3 operations
+
+**Prompt Engineering**:
+- Includes image context from surrounding text
+- Distinguishes decorative vs. informative images
+- Generates concise, descriptive alt text
+
+**Container Specs**:
+- **Base Image**: `node:18-alpine`
+- **CPU**: 2 vCPU
+- **Memory**: 4 GB
+- **Timeout**: 30 minutes
+
+---
+
+### 4. Title Generator Lambda
+
+**Location**: `lambda/title-generator-lambda/title_generator.py`
+
+**Purpose**: Generates descriptive PDF titles using AI.
+
+**Key Functions**:
+- `lambda_handler()`: Entry point
+- `generate_title()`: Calls Bedrock for title generation
+- `extract_text_from_pdf()`: Extracts text using PyMuPDF
+- `set_custom_metadata()`: Embeds title in PDF metadata
+
+**AI Model**: Amazon Nova Pro
+
+**Process**:
+1. Extracts first few pages of text
+2. Sends to Bedrock with prompt
+3. Receives generated title
+4. Embeds in PDF metadata
+5. Saves updated PDF
+
+**Prompt**: Instructs model to create concise, descriptive title based on content
+
+**Dependencies**:
+- `pymupdf (fitz)`: PDF text extraction
+- `pypdf`: PDF metadata modification
+- `boto3`: S3 and Bedrock
+
+**Metrics**: Bedrock invocations, token usage, processing time
+
+---
+
+### 5. PDF Merger Lambda
+
+**Location**: `lambda/pdf-merger-lambda/PDFMergerLambda/src/main/java/com/example/App.java`
+
+**Purpose**: Merges processed PDF chunks into single compliant PDF.
+
+**Key Functions**:
+- `handleRequest()`: Lambda entry point
+- `downloadPDF()`: Downloads chunks from S3
+- `mergePDFs()`: Merges using Apache PDFBox
+- `uploadPDF()`: Uploads final PDF
+
+**Technology**: Java 11 with Apache PDFBox
+
+**Process**:
+1. Receives list of processed chunks
+2. Downloads all chunks from S3
+3. Merges in correct page order
+4. Adds "COMPLIANT" prefix to filename
+5. Uploads to `result/` folder
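+
+Step 3 is easy to get wrong, because a plain string sort places `_page_10` before `_page_2`. The sketch below orders chunk keys numerically. It is written in Python for consistency with the rest of this summary, although the merger itself is Java, and it assumes processed chunks keep the splitter's `{original_name}_page_{n}.pdf` naming; `page_number` and `in_page_order` are hypothetical helper names.
+
+```python
+import re
+
+_PAGE_RE = re.compile(r"_page_(\d+)\.pdf$")
+
+
+def page_number(key):
+    """Extract the numeric page index from a chunk key."""
+    match = _PAGE_RE.search(key)
+    if match is None:
+        raise ValueError(f"not a chunk key: {key}")
+    return int(match.group(1))
+
+
+def in_page_order(keys):
+    """Sort chunk keys by page number rather than lexicographically."""
+    return sorted(keys, key=page_number)
+```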
+
+**Dependencies**:
+- `org.apache.pdfbox:pdfbox`: PDF merging
+- `com.amazonaws:aws-lambda-java-core`: Lambda runtime
+- `software.amazon.awssdk:s3`: S3 operations
+
+**Memory**: 1 GB
+**Timeout**: 5 minutes
+
+---
+
+### 6. Pre/Post Remediation Accessibility Checkers
+
+**Locations**:
+- `lambda/pre-remediation-accessibility-checker/main.py`
+- `lambda/post-remediation-accessibility-checker/main.py`
+
+**Purpose**: Audit PDF accessibility before and after remediation.
+
+**Key Functions**:
+- `lambda_handler()`: Entry point
+- Calls external accessibility checking service/library
+- Generates JSON report with WCAG issues
+
+**Output**: JSON file with:
+- List of accessibility issues
+- WCAG criteria violations
+- Issue severity levels
+- Suggested fixes
+
+**Use Case**:
+- Pre-check: Baseline audit
+- Post-check: Validation of remediation
+
+---
+
+### 7. Step Functions Orchestrator
+
+**Definition**: `app.py` (CDK stack)
+
+**Purpose**: Coordinates parallel processing of PDF chunks.
+
+**Workflow**:
+```mermaid
+graph TD
+ Start[Start] --> PreCheck[Pre-Remediation Check]
+ PreCheck --> Map[Map State: Process Chunks]
+ Map --> Adobe[Adobe Autotag Task]
+ Adobe --> AltText[Alt Text Generator Task]
+ AltText --> MapEnd[Map Complete]
+ MapEnd --> TitleGen[Title Generator]
+ TitleGen --> PostCheck[Post-Remediation Check]
+ PostCheck --> Merge[PDF Merger]
+ Merge --> End[End]
+```
+
+**Features**:
+- **Map State**: Parallel execution of chunks
+- **Error Handling**: Retry logic with exponential backoff
+- **Timeouts**: Configurable per task
+- **Logging**: CloudWatch Logs integration
+
+**Configuration**:
+- Max concurrency: 10 (configurable)
+- Retry attempts: 3
+- Backoff rate: 2.0
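+
+To make the retry settings concrete, the sketch below computes the wait before each retry under Step Functions' exponential backoff rule, where each interval is the previous one multiplied by the backoff rate. The initial interval is not stated in this document, so the 5-second value is purely an assumption.
+
+```python
+def retry_delays(interval_seconds, max_attempts=3, backoff_rate=2.0):
+    """Return the wait in seconds before each retry attempt."""
+    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]
+
+
+# Assumed 5-second initial interval with the documented 3 attempts and 2.0 rate.
+delays = retry_delays(5)
+```
+
+With these inputs a task that keeps failing waits 5, 10, and 20 seconds across its three retries before the state reports the error.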
+
+---
+
+## PDF-to-HTML Solution Components
+
+### 8. PDF2HTML Lambda Function
+
+**Location**: `pdf2html/lambda_function.py`
+
+**Purpose**: Converts PDFs to accessible HTML with full remediation.
+
+**Key Functions**:
+- `lambda_handler()`: Entry point, orchestrates entire pipeline
+- Calls `process_pdf_accessibility()` from main API
+
+**Pipeline Stages**:
+1. **Conversion**: PDF → HTML via Bedrock Data Automation
+2. **Audit**: Identify accessibility issues
+3. **Remediation**: Fix issues using AI
+4. **Report Generation**: Create detailed reports
+5. **Packaging**: ZIP all outputs
+
+**Dependencies**:
+- `content_accessibility_utility_on_aws`: Core library
+- `boto3`: AWS services
+- `beautifulsoup4`, `lxml`: HTML processing
+
+**Container**: Custom Docker image with all dependencies
+
+**Timeout**: 15 minutes
+**Memory**: 3 GB
+
+---
+
+### 9. Bedrock Data Automation Client
+
+**Location**: `pdf2html/content_accessibility_utility_on_aws/pdf2html/services/bedrock_client.py`
+
+**Purpose**: Interface to Amazon Bedrock Data Automation for PDF parsing.
+
+**Key Classes**:
+- `BDAClient`: Base client for BDA operations
+- `ExtendedBDAClient`: Enhanced client with additional features
+
+**Key Functions**:
+- `create_project()`: Creates BDA project
+- `process_and_retrieve()`: Submits PDF and retrieves results
+- `_extract_html_from_result_json()`: Parses BDA output
+
+**BDA Capabilities**:
+- PDF structure parsing
+- Text extraction with layout preservation
+- Image extraction
+- Table detection
+- Element positioning
+
+**Output**: Structured JSON with page elements and HTML fragments
+
+---
+
+### 10. Accessibility Auditor
+
+**Location**: `pdf2html/content_accessibility_utility_on_aws/audit/auditor.py`
+
+**Purpose**: Comprehensive WCAG 2.1 Level AA accessibility audit.
+
+**Key Class**: `AccessibilityAuditor`
+
+**Key Functions**:
+- `audit()`: Main audit entry point
+- `_audit_page()`: Audits single HTML page
+- `_check_text_alternatives()`: Image alt text checks
+- `_generate_report()`: Creates audit report
+
+**Audit Checks** (from `audit/checks/`):
+
+#### Image Checks
+- Missing alt text
+- Empty alt text
+- Generic alt text (e.g., "image", "picture")
+- Long alt text (>150 characters)
+- Decorative image identification
+- Figure structure (figcaption)
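+
+The first four image checks can be sketched as a single classification function. This is an illustrative approximation rather than the code in `audit/checks/`: the issue identifiers are hypothetical, and only "image" and "picture" in the generic list come from this document.
+
+```python
+# "image" and "picture" are from the list above; the other entries are assumed.
+GENERIC_ALT = {"image", "picture", "photo", "graphic"}
+MAX_ALT_LENGTH = 150
+
+
+def alt_text_issues(alt):
+    """Return hypothetical issue identifiers for an img element's alt value."""
+    if alt is None:
+        return ["missing-alt-text"]
+    text = alt.strip()
+    if not text:
+        return ["empty-alt-text"]
+    issues = []
+    if text.lower() in GENERIC_ALT:
+        issues.append("generic-alt-text")
+    if len(text) > MAX_ALT_LENGTH:
+        issues.append("long-alt-text")
+    return issues
+```
+
+An empty `alt` is valid for decorative images, which is presumably why decorative image identification is listed as a separate check.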
+
+#### Heading Checks
+- Missing H1
+- Skipped heading levels
+- Empty heading content
+- Heading hierarchy
+
+#### Table Checks
+- Missing headers
+- Missing caption
+- Missing scope attributes
+- Irregular header structure
+- Missing thead/tbody
+
+#### Form Checks
+- Missing labels
+- Missing fieldsets for radio/checkbox groups
+- Missing required field indicators
+
+#### Link Checks
+- Empty link text
+- Generic link text ("click here", "read more")
+- URL as link text
+- New window without warning
+
+#### Structure Checks
+- Missing document language
+- Missing document title
+- Missing landmarks (main, nav, header, footer)
+- Missing skip links
+
+#### Color Contrast Checks
+- Insufficient contrast ratios
+- WCAG AA compliance (4.5:1 normal, 3:1 large text)
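+
+The thresholds above refer to the WCAG 2.1 contrast ratio, which is computed from the relative luminance of the two colors. A minimal sketch of that calculation follows; the function names are hypothetical, and the auditor's own implementation may differ.
+
+```python
+def _linear(channel):
+    """Convert an 8-bit sRGB channel to its linear value (WCAG 2.1 definition)."""
+    c = channel / 255
+    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
+
+
+def relative_luminance(rgb):
+    r, g, b = (_linear(v) for v in rgb)
+    return 0.2126 * r + 0.7152 * g + 0.0722 * b
+
+
+def contrast_ratio(foreground, background):
+    """Return the contrast ratio, always with the lighter color on top."""
+    lum = sorted((relative_luminance(foreground), relative_luminance(background)))
+    return (lum[1] + 0.05) / (lum[0] + 0.05)
+
+
+def meets_aa(foreground, background, large_text=False):
+    """Check the Level AA threshold: 4.5:1 normally, 3:1 for large text."""
+    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
+```
+
+Because the ratio is symmetric, it does not matter which of the two colors is passed as the foreground.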
+
+**Output**: `AuditReport` object with:
+- List of issues with locations
+- WCAG criteria mapping
+- Severity levels (critical, serious, moderate, minor)
+- Element selectors for precise location
+
+---
+
+### 11. Remediation Manager
+
+**Location**: `pdf2html/content_accessibility_utility_on_aws/remediate/remediation_manager.py`
+
+**Purpose**: Applies fixes to accessibility issues.
+
+**Key Class**: `RemediationManager`
+
+**Key Functions**:
+- `remediate_issues()`: Processes all issues
+- `remediate_issue()`: Fixes single issue
+- `_get_remediation_strategies()`: Maps issues to strategies
+
+**Remediation Strategies** (from `remediate/remediation_strategies/`):
+
+#### Image Remediation
+- `remediate_missing_alt_text()`: Generates alt text via Bedrock
+- `remediate_empty_alt_text()`: Adds descriptive alt text
+- `remediate_generic_alt_text()`: Improves generic descriptions
+- `remediate_long_alt_text()`: Shortens verbose alt text
+- `_is_decorative_image()`: Identifies decorative images
+
+#### Heading Remediation
+- `remediate_missing_h1()`: Adds H1 based on content
+- `remediate_skipped_heading_level()`: Fixes hierarchy
+- `remediate_empty_heading_content()`: Adds content or removes
+- `remediate_missing_headings()`: Adds structure
+
+#### Table Remediation
+- `remediate_table_missing_headers()`: Adds th elements
+- `remediate_table_missing_caption()`: Generates caption
+- `remediate_table_missing_scope()`: Adds scope attributes
+- `remediate_table_missing_thead()`: Adds thead structure
+- `remediate_table_irregular_headers()`: Fixes complex tables
+
+#### Form Remediation
+- `remediate_missing_form_labels()`: Associates labels
+- `remediate_missing_fieldsets()`: Groups related fields
+- `remediate_missing_required_indicators()`: Adds required markers
+
+#### Link Remediation
+- `remediate_empty_link_text()`: Adds descriptive text
+- `remediate_generic_link_text()`: Improves link text
+- `remediate_url_as_link_text()`: Replaces URLs with descriptions
+- `remediate_new_window_link_no_warning()`: Adds warnings
+
+#### Landmark Remediation
+- `remediate_missing_main_landmark()`: Adds main element
+- `remediate_missing_navigation_landmark()`: Adds nav
+- `remediate_missing_header_landmark()`: Adds header
+- `remediate_missing_footer_landmark()`: Adds footer
+- `remediate_missing_skip_link()`: Adds skip navigation
+
+#### Document Structure Remediation
+- `remediate_missing_document_title()`: Generates title
+- `remediate_missing_language()`: Adds lang attribute
+
+#### Color Contrast Remediation
+- `remediate_insufficient_color_contrast()`: Adjusts colors
+
+**AI Integration**:
+- Uses Bedrock Nova Pro for complex remediations
+- Prompt engineering for context-aware fixes
+- Fallback to rule-based fixes
+
+---
+
+### 12. Bedrock Client (Remediation)
+
+**Location**: `pdf2html/content_accessibility_utility_on_aws/remediate/services/bedrock_client.py`
+
+**Purpose**: Interface to Amazon Bedrock for AI-powered remediation.
+
+**Key Class**: `BedrockClient`
+
+**Key Functions**:
+- `generate_text()`: Text generation for fixes
+- `generate_alt_text_for_image()`: Image description generation
+
+**Models Used**:
+- Amazon Nova Pro (default)
+- Configurable model selection
+
+**Prompt Engineering**:
+- Context-aware prompts
+- Element context inclusion
+- WCAG criteria guidance
+
+---
+
+### 13. Report Generator
+
+**Location**: `pdf2html/content_accessibility_utility_on_aws/utils/report_generator.py`
+
+**Purpose**: Generates comprehensive accessibility reports.
+
+**Key Functions**:
+- `generate_report()`: Main entry point
+- `generate_html_report()`: Interactive HTML report
+- `generate_json_report()`: Machine-readable JSON
+- `generate_csv_report()`: Spreadsheet format
+- `generate_text_report()`: Plain text summary
+
+**HTML Report Features**:
+- Issue summary with counts
+- WCAG criteria breakdown
+- Before/after comparisons
+- Interactive filtering
+- Severity color coding
+
+**Report Contents**:
+- Total issues found
+- Issues fixed automatically
+- Issues requiring manual review
+- WCAG 2.1 criteria mapping
+- Element locations with selectors
+- Remediation actions taken
+- Usage statistics (tokens, costs)
+
+---
+
+### 14. Usage Tracker
+
+**Location**: `pdf2html/content_accessibility_utility_on_aws/utils/usage_tracker.py`
+
+**Purpose**: Tracks API usage and estimates costs.
+
+**Key Class**: `SessionUsageTracker` (Singleton)
+
+**Tracked Metrics**:
+- Bedrock invocations
+- Token usage (input/output)
+- BDA processing time
+- API call counts
+- Estimated costs
+
+**Cost Estimation**:
+- Bedrock: $0.0008/1K input tokens, $0.0032/1K output tokens
+- BDA: Per-page pricing
+- Lambda: Per-GB-second
+- S3: Storage and requests
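+
+The Bedrock line item reduces to simple arithmetic on token counts. The sketch below uses the per-1K-token rates quoted above; `bedrock_cost` is a hypothetical helper rather than a method of `SessionUsageTracker`, and the rates should be checked against current pricing before use.
+
+```python
+INPUT_PRICE_PER_1K = 0.0008   # USD per 1K input tokens, as listed above
+OUTPUT_PRICE_PER_1K = 0.0032  # USD per 1K output tokens, as listed above
+
+
+def bedrock_cost(input_tokens, output_tokens):
+    """Estimate Bedrock cost in USD from token counts."""
+    return (
+        (input_tokens / 1000) * INPUT_PRICE_PER_1K
+        + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
+    )
+```
+
+For example, 10,000 input tokens and 2,000 output tokens come to $0.008 + $0.0064 = $0.0144.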
+
+**Output**: `usage_data.json` with detailed breakdown
+
+---
+
+## Shared Components
+
+### 15. Metrics Helper
+
+**Location**: `lambda/shared/metrics_helper.py`
+
+**Purpose**: Centralized CloudWatch metrics publishing.
+
+**Key Class**: `MetricsContext` (Context Manager)
+
+**Key Functions**:
+- `emit_metric()`: Publishes metric to CloudWatch
+- `track_pages_processed()`: Pages metric
+- `track_adobe_api_call()`: Adobe API tracking
+- `track_bedrock_invocation()`: Bedrock tracking
+- `track_processing_duration()`: Timing metrics
+- `track_error()`: Error tracking
+- `track_file_size()`: File size metrics
+- `estimate_cost()`: Cost calculation
+
+**Usage Pattern**:
+```python
+with MetricsContext(user_id="user123", solution="PDF2PDF") as metrics:
+    metrics.track_pages_processed(10)
+    metrics.track_adobe_api_call()
+    # ... processing ...
+    metrics.estimate_cost(adobe_calls=1, pages=10)
+```
+
+**Namespace**: `PDFAccessibility`
+
+**Dimensions**: `Solution`, `UserId`, `Operation`
+
+---
+
+### 16. S3 Object Tagger
+
+**Location**: `lambda/s3_object_tagger/main.py`
+
+**Purpose**: Tags S3 objects with user metadata for attribution.
+
+**Key Functions**:
+- `lambda_handler()`: Processes S3 events
+- Tags objects with `user-id` and `upload-timestamp`
+
+**Integration**: Cognito user pools (when UI deployed)
+
+**Use Case**: Per-user usage tracking and cost allocation
+
+---
+
+### 17. CloudWatch Dashboard
+
+**Location**: `cdk/usage_metrics_stack.py`
+
+**Purpose**: Visualizes usage metrics and costs.
+
+**Dashboard Name**: `PDF-Accessibility-Usage-Metrics`
+
+**Widgets**:
+- Pages processed over time
+- Adobe API calls
+- Bedrock invocations
+- Token usage (input/output)
+- Error rates by type
+- Estimated costs by user
+- Processing duration percentiles
+
+**Refresh**: Near real-time (1-minute intervals)
+
+---
+
+## Component Dependencies
+
+```mermaid
+graph TD
+ Splitter[PDF Splitter] --> StepFn[Step Functions]
+ StepFn --> Adobe[Adobe Autotag]
+ StepFn --> AltText[Alt Text Generator]
+ StepFn --> TitleGen[Title Generator]
+ StepFn --> Merger[PDF Merger]
+
+ Adobe --> Secrets[Secrets Manager]
+ Adobe --> Metrics[Metrics Helper]
+ AltText --> Bedrock[Bedrock]
+ AltText --> Metrics
+ TitleGen --> Bedrock
+ TitleGen --> Metrics
+
+ PDF2HTML[PDF2HTML Lambda] --> BDA[BDA Client]
+ PDF2HTML --> Auditor[Auditor]
+ PDF2HTML --> Remediator[Remediation Manager]
+
+ Auditor --> Checks[Audit Checks]
+ Remediator --> Strategies[Remediation Strategies]
+ Remediator --> BedrockClient[Bedrock Client]
+
+ PDF2HTML --> Reporter[Report Generator]
+ PDF2HTML --> UsageTracker[Usage Tracker]
+
+ Metrics --> CloudWatch[CloudWatch]
+ UsageTracker --> CloudWatch
+```
+
+## Component Communication
+
+### Synchronous
+- Lambda → S3 (direct API calls)
+- Lambda → Bedrock (direct API calls)
+- Lambda → Secrets Manager (direct API calls)
+
+### Asynchronous
+- S3 → Lambda (event notifications)
+- Step Functions → ECS (task invocation)
+- Lambda → CloudWatch (metrics/logs)
+
+### Data Flow
+- **Input**: S3 buckets
+- **Processing**: Lambda/ECS
+- **Output**: S3 buckets
+- **Monitoring**: CloudWatch
diff --git a/.agents/summary/data_models.md b/.agents/summary/data_models.md
new file mode 100644
index 00000000..fde475d5
--- /dev/null
+++ b/.agents/summary/data_models.md
@@ -0,0 +1,629 @@
+# Data Models and Structures
+
+## Core Data Models
+
+### 1. Audit Models
+
+#### AuditReport
+**Location**: `pdf2html/content_accessibility_utility_on_aws/utils/report_models.py`
+
+```python
+@dataclass
+class AuditReport(BaseReport):
+    summary: AuditSummary
+    issues: List[AuditIssue]
+    wcag_summary: Dict[str, Dict[str, Any]]
+    config: Config
+```
+
+**Fields**:
+- `summary`: High-level statistics
+- `issues`: List of accessibility issues found
+- `wcag_summary`: Issues grouped by WCAG criteria
+- `config`: Audit configuration used
+
+---
+
+#### AuditSummary
+```python
+@dataclass
+class AuditSummary(BaseSummary):
+    total_issues: int
+    by_severity: Dict[Severity, int]
+    by_wcag_level: Dict[str, int]
+    pages_audited: int
+    elements_checked: int
+```
+
+**Severity Levels**:
+- `CRITICAL`: Blocks accessibility (e.g., missing alt text)
+- `SERIOUS`: Major barrier (e.g., skipped heading levels)
+- `MODERATE`: Significant issue (e.g., generic link text)
+- `MINOR`: Minor improvement (e.g., missing lang attribute)
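+
+The `by_severity` field can be derived by counting issues per level. The sketch below assumes issues arrive as dicts with a `severity` string, as in the report JSON; `count_by_severity` is a hypothetical helper, not the auditor's own function.
+
+```python
+from collections import Counter
+from enum import Enum
+
+
+class Severity(Enum):
+    CRITICAL = "critical"
+    SERIOUS = "serious"
+    MODERATE = "moderate"
+    MINOR = "minor"
+
+
+def count_by_severity(issues):
+    """Count issues per severity; `issues` is a list of dicts with a "severity" key."""
+    counts = Counter(Severity(issue["severity"]) for issue in issues)
+    # Report every level, including those with no issues.
+    return {level: counts.get(level, 0) for level in Severity}
+
+
+sample_issues = [
+    {"id": "img-001", "severity": "critical"},
+    {"id": "h-002", "severity": "serious"},
+    {"id": "img-003", "severity": "critical"},
+]
+by_severity = count_by_severity(sample_issues)
+```
+
+Converting through `Severity(...)` means an unknown severity string raises `ValueError` instead of being silently counted.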
+
+---
+
+#### AuditIssue
+```python
+@dataclass
+class AuditIssue(BaseIssue):
+    id: str
+    type: str
+    severity: Severity
+    wcag_criteria: List[str]
+    element: str
+    selector: str
+    location: Location
+    message: str
+    suggestion: str
+    context: Optional[str]
+    status: IssueStatus
+```
+
+**Issue Types**:
+- `missing_alt_text`
+- `empty_alt_text`
+- `generic_alt_text`
+- `long_alt_text`
+- `missing_h1`
+- `skipped_heading_level`
+- `empty_heading_content`
+- `table_missing_headers`
+- `table_missing_caption`
+- `table_missing_scope`
+- `form_missing_label`
+- `form_missing_fieldset`
+- `empty_link_text`
+- `generic_link_text`
+- `url_as_link_text`
+- `missing_main_landmark`
+- `missing_document_language`
+- `insufficient_color_contrast`
+
+---
+
+#### Location
+```python
+@dataclass
+class Location:
+    page: int
+    line: Optional[int]
+    column: Optional[int]
+    xpath: Optional[str]
+```
+
+---
+
+### 2. Remediation Models
+
+#### RemediationReport
+```python
+@dataclass
+class RemediationReport(BaseReport):
+    summary: RemediationSummary
+    fixes: List[RemediationFix]
+    manual_review_items: List[ManualReviewItem]
+    config: Config
+```
+
+---
+
+#### RemediationSummary
+```python
+@dataclass
+class RemediationSummary(BaseSummary):
+    total_issues: int
+    fixed_automatically: int
+    requires_manual_review: int
+    failed: int
+    by_method: Dict[str, int]  # ai_generated, rule_based, manual
+```
+
+---
+
+#### RemediationFix
+```python
+@dataclass
+class RemediationFix:
+    issue_id: str
+    issue_type: str
+    status: RemediationStatus
+    method: str
+    original_element: str
+    fixed_element: str
+    details: RemediationDetails
+    timestamp: datetime
+```
+
+**RemediationStatus**:
+- `FIXED`: Successfully remediated
+- `FAILED`: Remediation failed
+- `MANUAL_REVIEW`: Requires human review
+- `SKIPPED`: Intentionally skipped
+
+---
+
+#### RemediationDetails
+```python
+@dataclass
+class RemediationDetails:
+    ai_prompt: Optional[str]
+    ai_response: Optional[str]
+    ai_model: Optional[str]
+    tokens_used: Optional[int]
+    confidence: Optional[float]
+    fallback_used: bool
+    error_message: Optional[str]
+```
+
+---
+
+#### ManualReviewItem
+```python
+@dataclass
+class ManualReviewItem:
+    issue_id: str
+    issue_type: str
+    reason: str
+    element: str
+    selector: str
+    suggestion: str
+    priority: str  # high, medium, low
+```
+
+---
+
+### 3. Configuration Models
+
+#### Config
+```python
+from dataclasses import dataclass, field
+from typing import List
+
+
+@dataclass
+class Config:
+    wcag_level: str = "AA"  # AA or AAA
+    include_warnings: bool = True
+    check_color_contrast: bool = True
+    auto_remediate: bool = True
+    use_ai: bool = True
+    bedrock_model: str = "amazon.nova-pro-v1:0"
+    max_retries: int = 3
+    timeout_seconds: int = 300
+    output_formats: List[str] = field(default_factory=lambda: ["html", "json"])
+```
+
+---
+
+### 4. Usage Tracking Models
+
+#### UsageData
+**Location**: `pdf2html/content_accessibility_utility_on_aws/utils/usage_tracker.py`
+
+```python
+{
+ "session_id": "uuid",
+ "user_id": "user123",
+ "solution": "PDF2HTML",
+ "timestamp": "2026-03-02T15:00:00Z",
+ "pdf_info": {
+ "filename": "document.pdf",
+ "size_bytes": 1024000,
+ "pages": 10
+ },
+ "bedrock_usage": {
+ "invocations": 15,
+ "input_tokens": 5000,
+ "output_tokens": 2000,
+ "models_used": ["amazon.nova-pro-v1:0"]
+ },
+ "bda_usage": {
+ "pages_processed": 10,
+ "processing_time_seconds": 45
+ },
+ "processing_metrics": {
+ "total_duration_seconds": 120,
+ "conversion_time": 45,
+ "audit_time": 20,
+ "remediation_time": 55
+ },
+ "cost_estimates": {
+ "bedrock": 0.0104,
+ "bda": 0.50,
+ "lambda": 0.0015,
+ "s3": 0.0001,
+ "total": 0.512
+ }
+}
+```
+
+---
+
+### 5. BDA Models
+
+#### BDAElement
+**Location**: `pdf2html/content_accessibility_utility_on_aws/remediate/bda_integration/element_parser.py`
+
+```python
+{
+ "id": "element-001",
+ "type": "text",  # one of "text", "image", "table", "heading"
+ "page": 1,
+ "content": "Element content",
+ "bounding_box": {
+ "x": 100,
+ "y": 200,
+ "width": 300,
+ "height": 50
+ },
+ "confidence": 0.95,
+ "attributes": {
+ "font_size": 12,
+ "font_family": "Arial",
+ "color": "#000000"
+ },
+ "children": [] # Nested elements
+}
+```
+
+---
+
+#### BDAPage
+```python
+{
+ "page_number": 1,
+ "width": 612,
+ "height": 792,
+ "elements": [],  # list of BDAElement dicts
+ "images": [
+ {
+ "id": "img-001",
+ "s3_path": "s3://bucket/images/img-001.png",
+ "bounding_box": {...},
+ "alt_text": None
+ }
+ ]
+}
+```
+
+---
+
+### 6. Metrics Models
+
+#### MetricData
+**Location**: `lambda/shared/metrics_helper.py`
+
+```python
+{
+ "namespace": "PDFAccessibility",
+ "metric_name": "PagesProcessed",
+ "value": 10,
+ "unit": "Count",
+ "timestamp": "2026-03-02T15:00:00Z",
+ "dimensions": [
+ {"name": "Solution", "value": "PDF2PDF"},
+ {"name": "UserId", "value": "user123"},
+ {"name": "Operation", "value": "adobe_autotag"}
+ ]
+}
+```
+
+---
+
+### 7. Step Functions State
+
+#### ChunkProcessingState
+```python
+{
+ "chunk_id": "chunk-001",
+ "s3_key": "temp/document_page_1.pdf",
+ "page_number": 1,
+ "status": "processing",  # one of "processing", "completed", "failed"
+ "adobe_output": "temp/document_page_1_tagged.pdf",
+ "alttext_output": "temp/document_page_1_final.pdf",
+ "errors": []
+}
+```
+
+---
+
+#### WorkflowState
+```python
+{
+ "execution_id": "exec-uuid",
+ "original_file": "pdf/document.pdf",
+ "user_id": "user123",
+ "chunks": [],  # list of ChunkProcessingState dicts
+ "pre_check_results": {...},
+ "post_check_results": {...},
+ "final_output": "result/COMPLIANT_document.pdf",
+ "metrics": {
+ "total_pages": 10,
+ "processing_time": 120,
+ "adobe_calls": 10,
+ "bedrock_calls": 50
+ }
+}
+```
+
+---
+
+## WCAG Criteria Mapping
+
+### WCAG 2.1 Level AA Criteria
+
+```python
+WCAG_CRITERIA = {
+ "1.1.1": {
+ "name": "Non-text Content",
+ "level": "A",
+ "description": "All non-text content has text alternative",
+ "issue_types": ["missing_alt_text", "empty_alt_text"]
+ },
+ "1.3.1": {
+ "name": "Info and Relationships",
+ "level": "A",
+ "description": "Information, structure, and relationships can be programmatically determined",
+ "issue_types": ["table_missing_headers", "form_missing_label", "missing_headings"]
+ },
+ "1.3.2": {
+ "name": "Meaningful Sequence",
+ "level": "A",
+ "description": "Correct reading sequence can be programmatically determined",
+ "issue_types": ["skipped_heading_level"]
+ },
+ "1.4.3": {
+ "name": "Contrast (Minimum)",
+ "level": "AA",
+ "description": "Text has contrast ratio of at least 4.5:1",
+ "issue_types": ["insufficient_color_contrast"]
+ },
+ "2.4.1": {
+ "name": "Bypass Blocks",
+ "level": "A",
+ "description": "Mechanism to bypass blocks of repeated content",
+ "issue_types": ["missing_skip_link"]
+ },
+ "2.4.2": {
+ "name": "Page Titled",
+ "level": "A",
+ "description": "Web pages have titles that describe topic or purpose",
+ "issue_types": ["missing_document_title"]
+ },
+ "2.4.4": {
+ "name": "Link Purpose (In Context)",
+ "level": "A",
+ "description": "Purpose of each link can be determined from link text",
+ "issue_types": ["empty_link_text", "generic_link_text", "url_as_link_text"]
+ },
+ "2.4.6": {
+ "name": "Headings and Labels",
+ "level": "AA",
+ "description": "Headings and labels describe topic or purpose",
+ "issue_types": ["empty_heading_content", "generic_heading_text"]
+ },
+ "3.1.1": {
+ "name": "Language of Page",
+ "level": "A",
+ "description": "Default human language can be programmatically determined",
+ "issue_types": ["missing_document_language"]
+ },
+ "4.1.2": {
+ "name": "Name, Role, Value",
+ "level": "A",
+ "description": "Name and role can be programmatically determined",
+ "issue_types": ["missing_aria_labels", "invalid_aria_attributes"]
+ }
+}
+```
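+
+An auditor typically needs the reverse lookup, from an issue type to the criteria it violates. The sketch below inverts a two-entry excerpt of the mapping above; `index_by_issue_type` is a hypothetical helper, and the excerpt stands in for the full table.
+
+```python
+from collections import defaultdict
+
+# Two entries copied from the mapping above; the full table works the same way.
+WCAG_CRITERIA = {
+    "1.1.1": {"level": "A", "issue_types": ["missing_alt_text", "empty_alt_text"]},
+    "2.4.4": {
+        "level": "A",
+        "issue_types": ["empty_link_text", "generic_link_text", "url_as_link_text"],
+    },
+}
+
+
+def index_by_issue_type(criteria):
+    """Invert the mapping: issue type -> sorted list of criterion IDs."""
+    index = defaultdict(list)
+    for criterion_id, info in criteria.items():
+        for issue_type in info["issue_types"]:
+            index[issue_type].append(criterion_id)
+    return {issue_type: sorted(ids) for issue_type, ids in index.items()}
+
+
+ISSUE_TO_CRITERIA = index_by_issue_type(WCAG_CRITERIA)
+```
+
+Storing a list per issue type allows for one issue type mapping to several criteria, even though each type in this excerpt maps to exactly one.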
+
+---
+
+## File Formats
+
+### 1. Audit Report JSON
+```json
+{
+ "version": "1.0",
+ "timestamp": "2026-03-02T15:00:00Z",
+ "html_file": "document.html",
+ "summary": {
+ "total_issues": 42,
+ "by_severity": {
+ "critical": 5,
+ "serious": 15,
+ "moderate": 18,
+ "minor": 4
+ },
+ "by_wcag_level": {
+ "A": 25,
+ "AA": 17
+ },
+ "pages_audited": 10,
+ "elements_checked": 523
+ },
+ "issues": [
+ {
+ "id": "img-001",
+ "type": "missing_alt_text",
+ "severity": "critical",
+ "wcag_criteria": ["1.1.1"],
+ "element": "<img>",
+ "selector": "body > div.content > img:nth-child(3)",
+ "location": {
+ "page": 1,
+ "line": 45,
+ "column": 12
+ },
+ "message": "Image is missing alt attribute",
+ "suggestion": "Add descriptive alt text that conveys the purpose of the image",
+ "context": "Surrounding text: Lorem ipsum...",
+ "status": "open"
+ }
+ ],
+ "wcag_summary": {
+ "1.1.1": {
+ "count": 5,
+ "description": "Non-text Content",
+ "level": "A"
+ }
+ },
+ "config": {
+ "wcag_level": "AA",
+ "include_warnings": true,
+ "check_color_contrast": true
+ }
+}
+```
+
+---
+
+### 2. Remediation Report JSON
+```json
+{
+ "version": "1.0",
+ "timestamp": "2026-03-02T15:00:00Z",
+ "html_file": "document_remediated.html",
+ "summary": {
+ "total_issues": 42,
+ "fixed_automatically": 35,
+ "requires_manual_review": 7,
+ "failed": 0,
+ "by_method": {
+ "ai_generated": 20,
+ "rule_based": 15,
+ "manual": 0
+ }
+ },
+ "fixes": [
+ {
+ "issue_id": "img-001",
+ "issue_type": "missing_alt_text",
+ "status": "fixed",
+ "method": "ai_generated",
+ "original_element": "<img>",
+ "fixed_element": "<img alt=\"A graph showing sales trends over time\">",
+ "details": {
+ "ai_prompt": "Generate alt text for this image...",
+ "ai_response": "A graph showing sales trends over time",
+ "ai_model": "amazon.nova-pro-v1:0",
+ "tokens_used": 150,
+ "confidence": 0.92,
+ "fallback_used": false
+ },
+ "timestamp": "2026-03-02T15:01:23Z"
+ }
+ ],
+ "manual_review_items": [
+ {
+ "issue_id": "table-005",
+ "issue_type": "table_irregular_headers",
+ "reason": "Complex table structure with merged cells",
+ "element": "<table>",
+ "selector": "body > table:nth-child(5)",
+ "suggestion": "Manually verify header associations and add scope attributes",
+ "priority": "high"
+ }
+ ]
+}
+```
+
+---
+
+### 3. Usage Data JSON
+```json
+{
+ "session_id": "550e8400-e29b-41d4-a716-446655440000",
+ "user_id": "user123",
+ "solution": "PDF2HTML",
+ "timestamp": "2026-03-02T15:00:00Z",
+ "pdf_info": {
+ "filename": "document.pdf",
+ "size_bytes": 1024000,
+ "pages": 10
+ },
+ "bedrock_usage": {
+ "invocations": 15,
+ "input_tokens": 5000,
+ "output_tokens": 2000,
+ "models_used": ["amazon.nova-pro-v1:0"]
+ },
+ "bda_usage": {
+ "pages_processed": 10,
+ "processing_time_seconds": 45
+ },
+ "processing_metrics": {
+ "total_duration_seconds": 120,
+ "conversion_time": 45,
+ "audit_time": 20,
+ "remediation_time": 55
+ },
+ "cost_estimates": {
+ "bedrock": 0.0104,
+ "bda": 0.50,
+ "lambda": 0.0015,
+ "s3": 0.0001,
+ "total": 0.512
+ }
+}
+```
+
+---
+
+## Database Schemas
+
+### Image Metadata SQLite (Adobe Container)
+
+**Table**: `image_metadata`
+
+```sql
+CREATE TABLE image_metadata (
+ id INTEGER PRIMARY KEY AUTOINCREMENT,
+ image_path TEXT NOT NULL,
+ page_number INTEGER,
+ bounding_box TEXT, -- JSON string
+ alt_text TEXT,
+ is_decorative BOOLEAN DEFAULT 0,
+ context TEXT,
+ confidence REAL,
+ created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+```
+
+**Usage**: Stores extracted image information from Adobe Extract API for alt text generation.
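+
+As a rough illustration of how this table might be used, the sketch below creates it in an in-memory SQLite database, inserts two rows, and selects the images that still need alt text. The sample paths and the query are assumptions made for this example, not code from the Adobe container.
+
+```python
+import json
+import sqlite3
+
+SCHEMA = """
+CREATE TABLE image_metadata (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    image_path TEXT NOT NULL,
+    page_number INTEGER,
+    bounding_box TEXT,
+    alt_text TEXT,
+    is_decorative BOOLEAN DEFAULT 0,
+    context TEXT,
+    confidence REAL,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+"""
+
+conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
+conn.execute(SCHEMA)
+conn.execute(
+    "INSERT INTO image_metadata (image_path, page_number, bounding_box) VALUES (?, ?, ?)",
+    ("images/img-001.png", 1, json.dumps({"x": 100, "y": 200, "width": 300, "height": 50})),
+)
+conn.execute(
+    "INSERT INTO image_metadata (image_path, page_number, is_decorative) VALUES (?, ?, ?)",
+    ("images/border.png", 1, 1),
+)
+
+# Images that still need alt text: not decorative and no description yet.
+pending = conn.execute(
+    "SELECT image_path FROM image_metadata "
+    "WHERE is_decorative = 0 AND alt_text IS NULL ORDER BY id"
+).fetchall()
+```
+
+The bounding box is serialized with `json.dumps` because the column is declared as `TEXT` holding a JSON string.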
+
+---
+
+## Enumerations
+
+### Severity
+```python
+from enum import Enum
+
+
+class Severity(Enum):
+    CRITICAL = "critical"
+    SERIOUS = "serious"
+    MODERATE = "moderate"
+    MINOR = "minor"
+```
+
+### IssueStatus
+```python
+class IssueStatus(Enum):
+    OPEN = "open"
+    FIXED = "fixed"
+    MANUAL_REVIEW = "manual_review"
+    SKIPPED = "skipped"
+```
+
+### RemediationStatus
+```python
+class RemediationStatus(Enum):
+    FIXED = "fixed"
+    FAILED = "failed"
+    MANUAL_REVIEW = "manual_review"
+    SKIPPED = "skipped"
+```
+
+### RemediationMethod
+```python
+class RemediationMethod(Enum):
+    AI_GENERATED = "ai_generated"
+    RULE_BASED = "rule_based"
+    MANUAL = "manual"
+```
diff --git a/.agents/summary/dependencies.md b/.agents/summary/dependencies.md
new file mode 100644
index 00000000..1dd13c72
--- /dev/null
+++ b/.agents/summary/dependencies.md
@@ -0,0 +1,540 @@
+# Dependencies and External Services
+
+## External Service Dependencies
+
+### 1. Adobe PDF Services API
+
+**Purpose**: PDF structure tagging and content extraction
+
+**Service Type**: Third-party REST API
+
+**Authentication**: OAuth 2.0 client credentials
+
+**Required Credentials**:
+- Client ID
+- Client Secret
+
+**Pricing Model**: Enterprise contract or trial account
+
+**Rate Limits**: Contract-dependent
+
+**APIs Used**:
+- **Autotag API**: Adds accessibility tags
+- **Extract API**: Extracts images and structure
+
+**Failure Impact**:
+- **Critical**: PDF-to-PDF solution cannot function without it
+- **Mitigation**: Retry logic with exponential backoff
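+
+The retry mitigation can be sketched as follows. This is a generic exponential backoff with jitter under stated assumptions, not the project's retry code: `retry_with_backoff` and `flaky_call` are hypothetical names, and the attempt count and delays are placeholder values.
+
+```python
+import random
+import time
+
+
+def retry_with_backoff(operation, max_attempts=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
+    """Call `operation` until it succeeds or `max_attempts` is exhausted."""
+    for attempt in range(1, max_attempts + 1):
+        try:
+            return operation()
+        except Exception:
+            if attempt == max_attempts:
+                raise
+            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay,
+            # with random jitter so parallel workers do not retry in lockstep.
+            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
+            sleep(delay * random.uniform(0.5, 1.0))
+
+
+calls = {"count": 0}
+
+
+def flaky_call():
+    """Stand-in for an API request that fails twice, then succeeds."""
+    calls["count"] += 1
+    if calls["count"] < 3:
+        raise RuntimeError("transient failure")
+    return "tagged.pdf"
+
+
+# Sleeping is stubbed out here so the example finishes immediately.
+result = retry_with_backoff(flaky_call, sleep=lambda seconds: None)
+```
+
+The sketch retries on any exception for brevity. Production code would more likely retry only on transient errors such as timeouts or throttling responses.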
+
+**Documentation**: https://developer.adobe.com/document-services/docs/overview/pdf-services-api/
+
+---
+
+### 2. Amazon Bedrock
+
+**Purpose**: AI-powered content generation
+
+**Service Type**: AWS managed service
+
+**Authentication**: IAM role-based
+
+**Models Used**:
+- **Amazon Nova Pro** (`amazon.nova-pro-v1:0`)
+ - Multimodal (text + vision)
+ - Alt text generation
+ - Title generation
+ - Remediation suggestions
+
+**Pricing**:
+- Input tokens: $0.0008 per 1K tokens
+- Output tokens: $0.0032 per 1K tokens
+
+**Rate Limits**:
+- Requests per minute: Model-dependent
+- Tokens per minute: Model-dependent
+
+**Failure Impact**:
+- **High**: AI-powered features unavailable
+- **Mitigation**: Fall back to rule-based fixes
+
+**Required Permissions**:
+```json
+{
+ "Effect": "Allow",
+ "Action": [
+ "bedrock:InvokeModel"
+ ],
+ "Resource": "arn:aws:bedrock:*::foundation-model/amazon.nova-pro-v1:0"
+}
+```
+
+---
+
+### 3. Amazon Bedrock Data Automation
+
+**Purpose**: PDF parsing and structure extraction
+
+**Service Type**: AWS managed service
+
+**Authentication**: IAM role-based
+
+**Pricing**: Per-page processing fee
+
+**Rate Limits**: Project-level quotas
+
+**Failure Impact**:
+- **Critical**: PDF-to-HTML solution cannot function without it
+- **Mitigation**: Retry logic, timeout handling
+
+**Required Permissions**:
+```json
+{
+ "Effect": "Allow",
+ "Action": [
+ "bedrock:CreateDataAutomationProject",
+ "bedrock:InvokeDataAutomationAsync",
+ "bedrock:GetDataAutomationStatus"
+ ],
+ "Resource": "*"
+}
+```
+
+---
+
+## AWS Service Dependencies
+
+### 4. Amazon S3
+
+**Purpose**: Object storage for PDFs and outputs
+
+**Pricing**:
+- Storage: $0.023 per GB/month (Standard)
+- PUT requests: $0.005 per 1,000 requests
+- GET requests: $0.0004 per 1,000 requests
+
+**Features Used**:
+- Event notifications
+- Versioning
+- Server-side encryption (SSE-S3)
+- Object tagging
+- Lifecycle policies (optional)
+
+**Required Permissions**:
+```json
+{
+ "Effect": "Allow",
+ "Action": [
+ "s3:GetObject",
+ "s3:PutObject",
+ "s3:DeleteObject",
+ "s3:ListBucket",
+ "s3:PutObjectTagging"
+ ],
+ "Resource": [
+ "arn:aws:s3:::bucket-name",
+ "arn:aws:s3:::bucket-name/*"
+ ]
+}
+```
+
+---
+
+### 5. AWS Lambda
+
+**Purpose**: Serverless compute for lightweight operations
+
+**Runtimes Used**:
+- Python 3.12
+- Java 11
+- Node.js 18 (via container)
+
+**Pricing**:
+- Requests: $0.20 per 1M requests
+- Duration: $0.0000166667 per GB-second
+
+**Limits**:
+- Timeout: 15 minutes (max)
+- Memory: 10 GB (max)
+- Deployment package: 250 MB (unzipped)
+- Container image: 10 GB
+
+**Functions Deployed**:
+- PDF Splitter (Python)
+- PDF Merger (Java)
+- Title Generator (Python)
+- Pre/Post Accessibility Checkers (Python)
+- S3 Object Tagger (Python)
+- PDF2HTML Pipeline (Python container)
+
+---
+
+### 6. Amazon ECS Fargate
+
+**Purpose**: Containerized compute for heavy processing
+
+**Pricing**:
+- vCPU: $0.04048 per vCPU per hour
+- Memory: $0.004445 per GB per hour
+
+**Configuration**:
+- CPU: 2 vCPU
+- Memory: 4 GB
+- Platform: Linux/AMD64
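+
+Combining the pricing and configuration above gives a per-task compute estimate. The following is a minimal sketch using only the two listed rates; `fargate_task_cost` is a hypothetical helper, and storage and data transfer charges are left out.
+
+```python
+VCPU_RATE_PER_HOUR = 0.04048  # USD per vCPU-hour, from the pricing list
+MEMORY_RATE_PER_GB_HOUR = 0.004445  # USD per GB-hour, from the pricing list
+
+
+def fargate_task_cost(vcpus, memory_gb, minutes):
+    """Estimated compute cost in USD for one task run (hypothetical helper)."""
+    hourly = vcpus * VCPU_RATE_PER_HOUR + memory_gb * MEMORY_RATE_PER_GB_HOUR
+    return hourly * minutes / 60
+
+
+hourly_cost = fargate_task_cost(vcpus=2, memory_gb=4, minutes=60)
+```
+
+With 2 vCPU and 4 GB this works out to about $0.099 per task-hour, so a 10-minute task costs under two cents in compute.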
+
+**Containers Deployed**:
+- Adobe Autotag Processor (Python)
+- Alt Text Generator (Node.js)
+
+**Cold Start Optimization**:
+- VPC endpoints for ECR (reduce cold start by 10-15 seconds)
+- zstd-compressed images (2-3x faster decompression than gzip)
+
+---
+
+### 7. AWS Step Functions
+
+**Purpose**: Workflow orchestration
+
+**Pricing**:
+- State transitions: $0.025 per 1,000 transitions
+
+**Features Used**:
+- Map state (parallel execution)
+- Error handling and retries
+- CloudWatch integration
+
+**Workflow**: PDF-to-PDF chunk processing
+
+---
+
+### 8. Amazon ECR
+
+**Purpose**: Container image registry
+
+**Pricing**:
+- Storage: $0.10 per GB/month
+- Data transfer: Standard AWS rates
+
+**Images Stored**:
+- Adobe Autotag container
+- Alt Text Generator container
+- PDF2HTML Lambda container
+
+---
+
+### 9. AWS Secrets Manager
+
+**Purpose**: Secure credential storage
+
+**Pricing**:
+- Secret: $0.40 per secret per month
+- API calls: $0.05 per 10,000 calls
+
+**Secrets Stored**:
+- Adobe PDF Services credentials
+
+---
+
+### 10. Amazon CloudWatch
+
+**Purpose**: Monitoring, logging, and metrics
+
+**Pricing**:
+- Logs ingestion: $0.50 per GB
+- Logs storage: $0.03 per GB/month
+- Custom metrics: $0.30 per metric per month
+- Dashboard: $3.00 per dashboard per month
+
+**Features Used**:
+- Log groups for all Lambda/ECS
+- Custom metrics namespace: `PDFAccessibility`
+- Usage metrics dashboard
+
+---
+
+### 11. Amazon VPC
+
+**Purpose**: Network isolation for ECS tasks
+
+**Pricing**:
+- NAT Gateway: $0.045 per hour + $0.045 per GB processed
+- VPC Endpoints: $0.01 per hour per AZ
+
+**Configuration**:
+- 2 Availability Zones
+- Public and private subnets
+- NAT Gateway for egress
+- VPC endpoints for ECR and S3
+
+---
+
+### 12. AWS IAM
+
+**Purpose**: Access control and permissions
+
+**Pricing**: Free
+
+**Roles Created**:
+- Lambda execution roles
+- ECS task roles
+- ECS task execution roles
+- Step Functions execution role
+
+---
+
+### 13. AWS CodeBuild
+
+**Purpose**: CI/CD pipeline for deployment
+
+**Pricing**:
+- Build minutes: $0.005 per minute (general1.small)
+
+**Usage**: Automated deployment via `deploy.sh`
+
+---
+
+## Python Dependencies
+
+### Core Libraries
+
+#### boto3
+- **Version**: Latest
+- **Purpose**: AWS SDK for Python
+- **Used By**: All Python components
+- **License**: Apache 2.0
+
+#### aws-cdk-lib
+- **Version**: 2.147.2
+- **Purpose**: AWS CDK framework
+- **Used By**: Infrastructure code
+- **License**: Apache 2.0
+
+### PDF Processing
+
+#### pypdf
+- **Version**: 4.3.1
+- **Purpose**: PDF manipulation
+- **Used By**: PDF Splitter, Title Generator
+- **License**: BSD
+
+#### PyMuPDF (fitz)
+- **Version**: 1.24.14
+- **Purpose**: PDF text extraction
+- **Used By**: Title Generator
+- **License**: AGPL
+
+### HTML Processing
+
+#### beautifulsoup4
+- **Version**: Latest
+- **Purpose**: HTML parsing
+- **Used By**: PDF2HTML, Auditor, Remediator
+- **License**: MIT
+
+#### lxml
+- **Version**: Latest
+- **Purpose**: XML/HTML processing
+- **Used By**: PDF2HTML, Auditor
+- **License**: BSD
+
+### Image Processing
+
+#### Pillow
+- **Version**: Latest
+- **Purpose**: Image manipulation
+- **Used By**: PDF2HTML, Alt Text Generator
+- **License**: HPND
+
+### Adobe SDK
+
+#### pdfservices-sdk
+- **Version**: 4.1.0
+- **Purpose**: Adobe PDF Services API client
+- **Used By**: Adobe Autotag container
+- **License**: Proprietary (Adobe)
+
+### Utilities
+
+#### openpyxl
+- **Version**: Latest
+- **Purpose**: Excel file parsing
+- **Used By**: Adobe Autotag container
+- **License**: MIT
+
+#### requests
+- **Version**: 2.31.0
+- **Purpose**: HTTP client
+- **Used By**: Adobe SDK, BDA client
+- **License**: Apache 2.0
+
+---
+
+## JavaScript Dependencies
+
+### AWS SDK
+
+#### @aws-sdk/client-bedrock-runtime
+- **Version**: Latest
+- **Purpose**: Bedrock API client
+- **Used By**: Alt Text Generator
+- **License**: Apache 2.0
+
+#### @aws-sdk/client-s3
+- **Version**: Latest
+- **Purpose**: S3 API client
+- **Used By**: Alt Text Generator, PDF2HTML CDK
+- **License**: Apache 2.0
+
+### PDF Processing
+
+#### pdf-lib
+- **Version**: Latest
+- **Purpose**: PDF manipulation
+- **Used By**: Alt Text Generator
+- **License**: MIT
+
+### CDK
+
+#### aws-cdk-lib
+- **Version**: Latest
+- **Purpose**: AWS CDK framework
+- **Used By**: PDF2HTML CDK stack
+- **License**: Apache 2.0
+
+#### @aws-cdk/aws-lambda-python-alpha
+- **Version**: Latest
+- **Purpose**: Python Lambda constructs
+- **Used By**: PDF2HTML CDK stack
+- **License**: Apache 2.0
+
+---
+
+## Java Dependencies
+
+### PDF Processing
+
+#### org.apache.pdfbox:pdfbox
+- **Version**: Latest
+- **Purpose**: PDF merging
+- **Used By**: PDF Merger Lambda
+- **License**: Apache 2.0
+
+### AWS SDK
+
+#### software.amazon.awssdk:s3
+- **Version**: Latest
+- **Purpose**: S3 operations
+- **Used By**: PDF Merger Lambda
+- **License**: Apache 2.0
+
+#### com.amazonaws:aws-lambda-java-core
+- **Version**: Latest
+- **Purpose**: Lambda runtime
+- **Used By**: PDF Merger Lambda
+- **License**: Apache 2.0
+
+---
+
+## Development Dependencies
+
+### Python
+
+#### pytest
+- **Purpose**: Testing framework
+- **License**: MIT
+
+#### black
+- **Purpose**: Code formatting
+- **License**: MIT
+
+#### mypy
+- **Purpose**: Type checking
+- **License**: MIT
+
+### Node.js
+
+#### eslint
+- **Purpose**: Linting
+- **License**: MIT
+
+#### prettier
+- **Purpose**: Code formatting
+- **License**: MIT
+
+---
+
+## Dependency Management
+
+### Python
+- **File**: `requirements.txt`
+- **Tool**: pip
+- **Virtual Environment**: venv
+
+### JavaScript
+- **File**: `package.json`, `package-lock.json`
+- **Tool**: npm
+
+### Java
+- **File**: `pom.xml`
+- **Tool**: Maven
+
+---
+
+## Security Considerations
+
+### Dependency Scanning
+- Regular updates for security patches
+- Vulnerability scanning with Amazon Inspector
+- Dependabot alerts (GitHub)
+
+### License Compliance
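+- Most dependencies use permissive licenses (Apache 2.0, MIT, BSD, HPND)
+- Adobe PDF Services SDK is proprietary; API access requires an enterprise contract or trial account
+- PyMuPDF is AGPL-licensed; consider alternatives or a commercial license for commercial use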
+
+### Supply Chain Security
+- Pin dependency versions
+- Use official package repositories
+- Verify package signatures
+
+---
+
+## Version Compatibility
+
+### Python
+- **Minimum**: 3.9
+- **Recommended**: 3.12
+- **Lambda Runtime**: 3.12
+
+### Node.js
+- **Minimum**: 18
+- **Recommended**: 18 LTS
+- **Lambda Runtime**: 18
+
+### Java
+- **Minimum**: 11
+- **Recommended**: 11
+- **Lambda Runtime**: 11
+
+### AWS CDK
+- **Version**: 2.147.2
+- **Compatibility**: AWS CDK v2
+
+---
+
+## Dependency Update Strategy
+
+### Regular Updates
+- Monthly security patch review
+- Quarterly minor version updates
+- Annual major version updates
+
+### Testing
+- Unit tests after updates
+- Integration tests with AWS services
+- End-to-end workflow validation
+
+### Rollback Plan
+- Version pinning in requirements files
+- CDK snapshot testing
+- Blue/green deployment for major changes
diff --git a/.agents/summary/index.md b/.agents/summary/index.md
new file mode 100644
index 00000000..33eeef08
--- /dev/null
+++ b/.agents/summary/index.md
@@ -0,0 +1,454 @@
+# PDF Accessibility Solutions - Knowledge Base Index
+
+## 🤖 Instructions for AI Assistants
+
+This index serves as your primary entry point for understanding the PDF Accessibility Solutions codebase. Each document below contains rich metadata and detailed information about specific aspects of the system.
+
+**How to Use This Index**:
+1. **Start Here**: Read the summaries below to understand which documents contain relevant information
+2. **Navigate Efficiently**: Use the metadata tags to quickly find specific topics
+3. **Deep Dive**: Reference the full documents only when you need detailed implementation information
+4. **Cross-Reference**: Documents are interconnected - follow references between them
+
+**Key Principle**: This index contains sufficient metadata for you to answer most questions without reading full documents. Only access detailed documents when you need specific implementation details, code examples, or technical specifications.
+
+---
+
+## 📚 Document Catalog
+
+### 1. Codebase Information
+**File**: `codebase_info.md`
+**Purpose**: High-level overview of the codebase structure, statistics, and technology stack
+**When to Use**: Understanding project scope, technology choices, repository organization
+
+**Key Topics**:
+- Project statistics (140 files, 27,949 LOC)
+- Language distribution (Python 95 files, JavaScript 3, Java 2, Shell 2)
+- Technology stack (AWS CDK, Lambda, ECS, Bedrock, S3)
+- Repository structure and organization
+- Development environment requirements
+- Supported standards (WCAG 2.1 Level AA, PDF/UA)
+
+**Metadata Tags**: `#overview` `#statistics` `#technology-stack` `#repository-structure`
+
+**Quick Facts**:
+- Two main solutions: PDF-to-PDF and PDF-to-HTML
+- Built by Arizona State University's AI Cloud Innovation Center
+- Serverless architecture on AWS
+- Python 3.12, Node.js 18, Java 11 runtimes
+
+---
+
+### 2. Architecture
+**File**: `architecture.md`
+**Purpose**: System architecture, component interactions, and design patterns
+**When to Use**: Understanding system design, data flow, infrastructure, scalability
+
+**Key Topics**:
+- High-level architecture diagrams (Mermaid)
+- PDF-to-PDF solution workflow (S3 → Lambda → Step Functions → ECS → Merger)
+- PDF-to-HTML solution workflow (S3 → Lambda → BDA → Bedrock → Remediation)
+- VPC configuration and networking
+- ECS Fargate setup
+- CloudWatch monitoring architecture
+- Deployment architecture
+- Scalability and cost optimization strategies
+
+**Metadata Tags**: `#architecture` `#design-patterns` `#infrastructure` `#workflows` `#scalability`
+
+**Key Diagrams**:
+- Overall system architecture
+- PDF-to-PDF sequence diagram
+- PDF-to-HTML sequence diagram
+- Monitoring architecture
+- Deployment flow
+
+**Design Patterns**:
+- Event-driven architecture
+- Serverless-first
+- Infrastructure as Code (CDK)
+- Observability-first
+
+---
+
+### 3. Components
+**File**: `components.md`
+**Purpose**: Detailed descriptions of all system components, their responsibilities, and interactions
+**When to Use**: Understanding specific components, debugging, extending functionality
+
+**Key Topics**:
+- **PDF-to-PDF Components**:
+ - PDF Splitter Lambda (splits PDFs into pages)
+ - Adobe Autotag Container (adds accessibility tags)
+ - Alt Text Generator Container (generates image descriptions)
+ - Title Generator Lambda (creates document titles)
+ - PDF Merger Lambda (Java, merges processed chunks)
+ - Accessibility Checkers (pre/post validation)
+ - Step Functions Orchestrator (workflow coordination)
+
+- **PDF-to-HTML Components**:
+ - PDF2HTML Lambda Function (main pipeline)
+ - Bedrock Data Automation Client (PDF parsing)
+ - Accessibility Auditor (WCAG compliance checking)
+ - Remediation Manager (fixes accessibility issues)
+ - Report Generator (creates detailed reports)
+ - Usage Tracker (cost and metrics tracking)
+
+- **Shared Components**:
+ - Metrics Helper (CloudWatch metrics)
+ - S3 Object Tagger (user attribution)
+ - CloudWatch Dashboard (visualization)
+
+**Metadata Tags**: `#components` `#lambda` `#ecs` `#step-functions` `#auditing` `#remediation`
+
+**Component Dependencies**: Includes dependency graph showing relationships between components
+
+---
+
+### 4. Interfaces and APIs
+**File**: `interfaces.md`
+**Purpose**: API specifications, data contracts, and integration points
+**When to Use**: Integrating with the system, understanding API contracts, debugging API calls
+
+**Key Topics**:
+- **External APIs**:
+ - Adobe PDF Services API (Autotag, Extract)
+ - Amazon Bedrock API (Nova Pro model)
+ - Amazon Bedrock Data Automation API
+
+- **Internal APIs**:
+ - Content Accessibility Utility API
+ - Audit API
+ - Remediation API
+
+- **AWS Service Interfaces**:
+ - S3 operations
+ - CloudWatch metrics and logs
+ - Secrets Manager
+ - Step Functions
+
+- **Data Models**: AuditReport, RemediationReport, AuditIssue, RemediationFix
+- **Event Schemas**: S3 events, Step Functions input/output
+- **Error Responses**: Standard error format and codes
+
+**Metadata Tags**: `#apis` `#interfaces` `#data-contracts` `#integration` `#events`
+
+**API Examples**: Includes request/response examples for all major APIs
+
+---
+
+### 5. Data Models
+**File**: `data_models.md`
+**Purpose**: Data structures, schemas, and type definitions
+**When to Use**: Understanding data formats, implementing new features, parsing outputs
+
+**Key Topics**:
+- **Audit Models**: AuditReport, AuditSummary, AuditIssue, Location
+- **Remediation Models**: RemediationReport, RemediationSummary, RemediationFix, RemediationDetails, ManualReviewItem
+- **Configuration Models**: Config with WCAG levels and options
+- **Usage Tracking Models**: UsageData with cost estimates
+- **BDA Models**: BDAElement, BDAPage
+- **Metrics Models**: MetricData with dimensions
+- **WCAG Criteria Mapping**: Mapping of selected WCAG 2.1 Level A and AA criteria to issue types
+- **File Formats**: JSON schemas for reports and usage data
+- **Database Schemas**: SQLite schema for image metadata
+- **Enumerations**: Severity, IssueStatus, RemediationStatus, RemediationMethod
+
+**Metadata Tags**: `#data-models` `#schemas` `#types` `#wcag` `#reports`
+
+**Issue Types**: Complete list of 20+ accessibility issue types with WCAG mappings
+
+---
+
+### 6. Workflows
+**File**: `workflows.md`
+**Purpose**: End-to-end process flows and operational procedures
+**When to Use**: Understanding process flows, troubleshooting, optimizing performance
+
+**Key Topics**:
+- **PDF-to-PDF Workflow**: 8-step process from upload to compliant PDF
+ - Upload → Split → Pre-check → Parallel Processing → Title → Post-check → Merge → Output
+ - Processing time: 3-60 minutes depending on size
+
+- **PDF-to-HTML Workflow**: 7-step process from upload to remediated HTML
+ - Upload → BDA Conversion → Audit → Remediation → Report → Package → Output
+ - Processing time: 1-20 minutes depending on size
+
+- **Deployment Workflow**: One-click and manual deployment processes
+- **Error Handling Workflows**: Retry logic and recovery procedures
+- **Monitoring Workflow**: Metrics collection and log aggregation
+- **Cost Tracking Workflow**: Per-user cost attribution
+
+**Metadata Tags**: `#workflows` `#processes` `#deployment` `#error-handling` `#monitoring`
+
+**Timing Information**: Detailed timing for each workflow step
+
+---
+
+### 7. Dependencies
+**File**: `dependencies.md`
+**Purpose**: External services, libraries, and version requirements
+**When to Use**: Setting up development environment, troubleshooting dependency issues, updating versions
+
+**Key Topics**:
+- **External Services**:
+ - Adobe PDF Services API (enterprise contract required)
+ - AWS Bedrock (IAM-based, Nova Pro model)
+ - AWS Bedrock Data Automation (per-page pricing)
+
+- **AWS Services**: S3, Lambda, ECS, Step Functions, ECR, Secrets Manager, CloudWatch, VPC, IAM, CodeBuild
+
+- **Python Dependencies**: boto3, aws-cdk-lib, pypdf, PyMuPDF, beautifulsoup4, lxml, Pillow, pdfservices-sdk
+
+- **JavaScript Dependencies**: AWS SDK, pdf-lib, CDK libraries
+
+- **Java Dependencies**: Apache PDFBox, AWS SDK
+
+- **Version Compatibility**: Python 3.9+, Node.js 18+, Java 11+
+
+- **Security Considerations**: Dependency scanning, license compliance, supply chain security
+
+**Metadata Tags**: `#dependencies` `#libraries` `#versions` `#external-services` `#security`
+
+**Pricing Information**: Detailed pricing for all AWS services and external APIs
+
+---
+
+### 8. Review Notes
+**File**: `review_notes.md`
+**Purpose**: Documentation quality assessment, identified gaps, and recommendations
+**When to Use**: Understanding documentation completeness, planning improvements
+
+**Key Topics**:
+- **Consistency Check**: Identified inconsistencies (language diversity, metrics duplication)
+- **Completeness Check**: Well-documented areas and gaps
+- **Language Support**: All languages fully supported
+- **Documentation Quality**: Strengths and areas for improvement
+- **Recommendations**: Short, medium, and long-term improvements
+- **Validation Checklist**: Coverage assessment
+- **Priority Gaps**: Testing strategy, security best practices, troubleshooting
+
+**Metadata Tags**: `#review` `#quality` `#gaps` `#recommendations` `#maintenance`
+
+**Action Items**: Prioritized list of documentation improvements needed
+
+---
+
+## 🔍 Quick Reference Guide
+
+### For Understanding the System
+1. Start with **Codebase Information** for overview
+2. Read **Architecture** for system design
+3. Review **Components** for detailed component information
+
+### For Development
+1. Check **Dependencies** for setup requirements
+2. Review **Components** for implementation details
+3. Reference **Data Models** for data structures
+4. Follow **Workflows** for process understanding
+
+### For Integration
+1. Read **Interfaces and APIs** for API contracts
+2. Review **Data Models** for data formats
+3. Check **Dependencies** for external service requirements
+
+### For Operations
+1. Review **Workflows** for operational procedures
+2. Check **Architecture** for infrastructure details
+3. Reference **Components** for troubleshooting
+
+### For Troubleshooting
+1. Check **Review Notes** for known issues
+2. Review **Workflows** for error handling
+3. Reference **Components** for component-specific issues
+4. Check **Dependencies** for version compatibility
+
+---
+
+## 🏷️ Metadata Tag Index
+
+### By Topic
+- **Architecture**: `architecture.md`
+- **Components**: `components.md`
+- **APIs**: `interfaces.md`
+- **Data**: `data_models.md`
+- **Processes**: `workflows.md`
+- **Dependencies**: `dependencies.md`
+- **Quality**: `review_notes.md`
+
+### By Technology
+- **AWS Services**: `architecture.md`, `components.md`, `dependencies.md`
+- **Python**: `codebase_info.md`, `components.md`, `dependencies.md`
+- **JavaScript**: `codebase_info.md`, `components.md`, `dependencies.md`
+- **Java**: `codebase_info.md`, `components.md`, `dependencies.md`
+
+### By Use Case
+- **Development**: `codebase_info.md`, `components.md`, `data_models.md`, `dependencies.md`
+- **Operations**: `architecture.md`, `workflows.md`, `components.md`
+- **Integration**: `interfaces.md`, `data_models.md`, `dependencies.md`
+- **Troubleshooting**: `review_notes.md`, `workflows.md`, `components.md`
+
+---
+
+## 📊 Key Statistics
+
+- **Total Files**: 140
+- **Lines of Code**: 27,949
+- **Components**: 17 major components
+- **AWS Services**: 13 services used
+- **External APIs**: 3 (Adobe, Bedrock, BDA)
+- **Supported Languages**: Python, JavaScript, Java, Shell
+- **WCAG Criteria**: 10+ Level AA criteria supported
+- **Issue Types**: 20+ accessibility issue types
+
+---
+
+## 🔗 Cross-References
+
+### Architecture ↔ Components
+- Architecture describes high-level design
+- Components provide implementation details
+- Both reference same component names
+
+### Components ↔ Interfaces
+- Components describe functionality
+- Interfaces define API contracts
+- Both use same data models
+
+### Interfaces ↔ Data Models
+- Interfaces reference data models
+- Data Models define structures used in APIs
+- Both include JSON examples
+
+### Workflows ↔ Components
+- Workflows describe process flows
+- Components implement workflow steps
+- Both reference same operations
+
+### Dependencies ↔ All Documents
+- Dependencies lists all external requirements
+- All documents reference dependencies
+- Version compatibility documented
+
+---
+
+## 💡 Tips for AI Assistants
+
+### Answering Architecture Questions
+→ Start with `architecture.md` for system design
+→ Reference `components.md` for specific component details
+→ Check `workflows.md` for process flows
+
+### Answering API Questions
+→ Start with `interfaces.md` for API specifications
+→ Reference `data_models.md` for data structures
+→ Check `components.md` for implementation details
+
+### Answering Development Questions
+→ Start with `codebase_info.md` for overview
+→ Reference `dependencies.md` for setup requirements
+→ Check `components.md` for implementation guidance
+
+### Answering Operational Questions
+→ Start with `workflows.md` for procedures
+→ Reference `architecture.md` for infrastructure
+→ Check `review_notes.md` for known issues
+
+### Answering Troubleshooting Questions
+→ Start with `review_notes.md` for known issues
+→ Reference `workflows.md` for error handling
+→ Check `components.md` for component-specific details
+
+---
+
+## 📝 Document Relationships
+
+```mermaid
+graph TD
+ Index[index.md<br/>YOU ARE HERE] --> Info[codebase_info.md<br/>Overview]
+ Index --> Arch[architecture.md<br/>System Design]
+ Index --> Comp[components.md<br/>Implementation]
+ Index --> API[interfaces.md<br/>APIs]
+ Index --> Data[data_models.md<br/>Data Structures]
+ Index --> Work[workflows.md<br/>Processes]
+ Index --> Deps[dependencies.md<br/>Requirements]
+ Index --> Review[review_notes.md<br/>Quality]
+
+ Arch --> Comp
+ Comp --> API
+ API --> Data
+ Work --> Comp
+ Deps --> Comp
+ Review --> Index
+```
+
+---
+
+## 🎯 Common Questions and Where to Find Answers
+
+**Q: How does the PDF-to-PDF solution work?**
+→ `architecture.md` (high-level) → `workflows.md` (detailed process) → `components.md` (implementation)
+
+**Q: What accessibility checks are performed?**
+→ `components.md` (Auditor section) → `data_models.md` (issue types) → `interfaces.md` (API)
+
+**Q: How do I deploy the system?**
+→ `workflows.md` (deployment workflow) → `dependencies.md` (requirements) → `codebase_info.md` (overview)
+
+**Q: What AWS services are used?**
+→ `dependencies.md` (complete list) → `architecture.md` (how they're used) → `components.md` (specific usage)
+
+**Q: How is cost tracked?**
+→ `workflows.md` (cost tracking workflow) → `components.md` (Usage Tracker) → `data_models.md` (UsageData)
+
+**Q: What are the data models?**
+→ `data_models.md` (complete definitions) → `interfaces.md` (API usage) → `components.md` (implementation)
+
+**Q: How do I troubleshoot errors?**
+→ `review_notes.md` (known issues) → `workflows.md` (error handling) → `components.md` (component details)
+
+**Q: What external APIs are used?**
+→ `dependencies.md` (service list) → `interfaces.md` (API specs) → `components.md` (usage)
+
+---
+
+## 📅 Documentation Metadata
+
+**Generated**: 2026-03-02
+**Generator**: AI Documentation System
+**Codebase Version**: Git commit `8d6102bc644641c94f5a695a32ea50c19b3c8d68`
+**Documentation Version**: 1.0
+**Last Updated**: 2026-03-02
+**Next Review**: Recommended within 30 days
+
+---
+
+## 🚀 Getting Started Paths
+
+### Path 1: New Developer
+1. Read `codebase_info.md` - Understand the project
+2. Read `architecture.md` - Learn the system design
+3. Review `dependencies.md` - Set up your environment
+4. Explore `components.md` - Understand the code
+
+### Path 2: Integration Developer
+1. Read `interfaces.md` - Understand the APIs
+2. Review `data_models.md` - Learn the data formats
+3. Check `dependencies.md` - Understand requirements
+4. Reference `workflows.md` - Understand processes
+
+### Path 3: Operations Engineer
+1. Read `architecture.md` - Understand infrastructure
+2. Review `workflows.md` - Learn operational procedures
+3. Check `components.md` - Understand components
+4. Reference `review_notes.md` - Know the issues
+
+### Path 4: AI Assistant
+1. Read this index completely
+2. Use metadata tags to navigate
+3. Reference specific documents only when needed
+4. Cross-reference between documents for complete answers
+
+---
+
+**Remember**: This index is designed to minimize the need to read full documents. Use the summaries and metadata to answer questions efficiently, and only dive into detailed documents when you need specific implementation information.
diff --git a/.agents/summary/interfaces.md b/.agents/summary/interfaces.md
new file mode 100644
index 00000000..a059982b
--- /dev/null
+++ b/.agents/summary/interfaces.md
@@ -0,0 +1,661 @@
+# Interfaces and APIs
+
+## External APIs
+
+### 1. Adobe PDF Services API
+
+**Purpose**: PDF structure tagging and content extraction
+
+**Authentication**: OAuth 2.0 with client credentials
+
+**Credentials Storage**: AWS Secrets Manager (`adobe-pdf-services-credentials`)
+
+**Operations Used**:
+
+#### Autotag API
+- **Endpoint**: Adobe PDF Services REST API
+- **Method**: POST
+- **Function**: Adds accessibility tags to PDF
+- **Input**: PDF file
+- **Output**: Tagged PDF with structure tree
+- **Tags Added**:
+ - Headings (H1-H6)
+ - Paragraphs (P)
+ - Lists (L, LI)
+ - Tables (Table, TR, TH, TD)
+ - Figures (Figure)
+ - Links (Link)
+
+**Options**:
+```python
+{
+ "generate_report": True,
+ "shift_headings": False
+}
+```
+
+#### Extract API
+- **Endpoint**: Adobe PDF Services REST API
+- **Method**: POST
+- **Function**: Extracts content and structure
+- **Input**: PDF file
+- **Output**: ZIP file containing:
+ - `structuredData.json`: Document structure
+ - `images/`: Extracted images
+ - Excel file with image metadata
+
+**Rate Limits**: Enterprise contract dependent
+
+**Error Handling**:
+- Exponential backoff retry
+- CloudWatch error logging
+- Fallback to basic processing
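
The retry-then-fallback behavior listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `call_with_backoff`, its parameters, and the jitter formula are hypothetical, and real code would catch the specific Adobe SDK exceptions rather than `Exception`.

```python
import logging
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def call_with_backoff(
    operation: Callable[[], T],
    fallback: Optional[Callable[[], T]] = None,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying with exponentially growing delays.

    If every attempt fails, return `fallback()` when one is given,
    otherwise re-raise the last error.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # hypothetical: narrow this to SDK errors
            # In Lambda or ECS, logger output is typically shipped to CloudWatch Logs.
            logger.warning("attempt %d failed: %s", attempt + 1, exc)
            if attempt == max_attempts - 1:
                if fallback is not None:
                    return fallback()
                raise
            # Delays grow as base, 2*base, 4*base, ... plus a little jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 10)
            sleep(delay)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; the fallback callable stands in for whatever "basic processing" means for a given operation.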
+
+---
+
+### 2. AWS Bedrock API
+
+**Purpose**: AI-powered content generation and image analysis
+
+**Authentication**: IAM role-based
+
+**Models Used**:
+
+#### Amazon Nova Pro
+- **Model ID**: `amazon.nova-pro-v1:0`
+- **Capabilities**:
+ - Text generation
+ - Image analysis (multimodal)
+ - Context understanding
+- **Use Cases**:
+ - Alt text generation
+ - Title generation
+ - Remediation suggestions
+ - Table caption generation
+
+**API Operations**:
+
+#### InvokeModel
+```python
+{
+ "modelId": "amazon.nova-pro-v1:0",
+ "contentType": "application/json",
+ "accept": "application/json",
+ "body": {
+ "messages": [
+ {
+ "role": "user",
+ "content": [
+ {"text": "prompt"},
+ {"image": {"format": "png", "source": {"bytes": image_bytes}}}  # "format" names the image type, e.g. "png" or "jpeg"
+ ]
+ }
+ ],
+ "inferenceConfig": {
+ "max_new_tokens": 512,
+ "temperature": 0.7
+ }
+ }
+}
+```
+
+**Response**:
+```json
+{
+ "output": {
+ "message": {
+ "content": [{"text": "generated text"}]
+ }
+ },
+ "usage": {
+ "inputTokens": 100,
+ "outputTokens": 50
+ }
+}
+```
+
+**Pricing**:
+- Input: $0.0008 per 1K tokens
+- Output: $0.0032 per 1K tokens
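
A worked example ties the `usage` block of the response to the prices above. This is a sketch: the helper names are hypothetical, and the constants are copied from this section and should be checked against current Bedrock pricing before being relied on.

```python
from typing import Any, Dict, Tuple

# Prices copied from the Pricing list above (USD per 1,000 tokens).
INPUT_PRICE_PER_1K = 0.0008
OUTPUT_PRICE_PER_1K = 0.0032


def extract_text_and_usage(response_body: Dict[str, Any]) -> Tuple[str, int, int]:
    """Return (generated text, input tokens, output tokens) from a parsed response."""
    blocks = response_body["output"]["message"]["content"]
    text = "".join(block.get("text", "") for block in blocks)
    usage = response_body.get("usage", {})
    return text, usage.get("inputTokens", 0), usage.get("outputTokens", 0)


def estimate_invocation_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one invocation in USD."""
    return (
        input_tokens / 1000 * INPUT_PRICE_PER_1K
        + output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    )
```

For the sample response (100 input tokens, 50 output tokens) the estimate is 100/1000 × 0.0008 + 50/1000 × 0.0032 = 0.00008 + 0.00016 = $0.00024.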
+
+**Rate Limits**:
+- Requests per minute: Model-dependent
+- Tokens per minute: Model-dependent
+
+---
+
+### 3. AWS Bedrock Data Automation API
+
+**Purpose**: PDF parsing and structure extraction
+
+**Authentication**: IAM role-based
+
+**Operations**:
+
+#### CreateDataAutomationProject
+```python
+{
+ "projectName": "pdf-accessibility-project",
+ "projectStage": "LIVE"
+}
+```
+
+#### InvokeDataAutomationAsync
+```python
+{
+ "projectArn": "arn:aws:bedrock:region:account:data-automation-project/name",
+ "inputConfiguration": {
+ "s3Uri": "s3://bucket/input.pdf"
+ },
+ "outputConfiguration": {
+ "s3Uri": "s3://bucket/output/"
+ }
+}
+```
+
+**Output Structure**:
+```json
+{
+ "pages": [
+ {
+ "pageNumber": 1,
+ "elements": [
+ {
+ "type": "text",
+ "content": "...",
+ "boundingBox": {...},
+ "confidence": 0.95
+ },
+ {
+ "type": "image",
+ "s3Path": "s3://...",
+ "boundingBox": {...}
+ }
+ ]
+ }
+ ]
+}
+```
+
+**Capabilities**:
+- Text extraction with layout
+- Image extraction
+- Table detection
+- Element positioning
+- Confidence scores
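
As an illustration of consuming the output structure shown above, the sketch below collects image locations and flags low-confidence text. The function name and the 0.8 threshold are assumptions made for the example; only the field names come from the sample output.

```python
from typing import Any, Dict, List, Tuple


def summarize_bda_output(
    result: Dict[str, Any], min_confidence: float = 0.8
) -> Dict[str, List[Any]]:
    """Collect image S3 paths and low-confidence text elements per page."""
    image_paths: List[str] = []
    low_confidence_text: List[Tuple[int, str]] = []
    for page in result.get("pages", []):
        for element in page.get("elements", []):
            kind = element.get("type")
            if kind == "image":
                image_paths.append(element.get("s3Path"))
            elif kind == "text" and element.get("confidence", 1.0) < min_confidence:
                # Keep the page number so a reviewer can find the passage.
                low_confidence_text.append(
                    (page.get("pageNumber"), element.get("content", ""))
                )
    return {"image_paths": image_paths, "low_confidence_text": low_confidence_text}
```

Elements without a `confidence` field are treated as trusted here; a stricter pipeline might flag them instead.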
+
+---
+
+## Internal APIs
+
+### 4. Content Accessibility Utility API
+
+**Location**: `pdf2html/content_accessibility_utility_on_aws/api.py`
+
+**Purpose**: Main entry point for PDF accessibility processing
+
+#### process_pdf_accessibility()
+```python
+def process_pdf_accessibility(
+ pdf_path: str,
+ output_dir: str,
+ config: Optional[Dict] = None
+) -> Dict[str, Any]
+```
+
+**Parameters**:
+- `pdf_path`: Path to input PDF
+- `output_dir`: Directory for outputs
+- `config`: Configuration options
+
+**Returns**:
+```python
+{
+ "html_path": "path/to/remediated.html",
+ "report_path": "path/to/report.html",
+ "audit_results": {...},
+ "remediation_results": {...},
+ "usage_data": {...}
+}
+```
+
+**Process Flow**:
+1. Convert PDF to HTML
+2. Audit HTML for accessibility
+3. Remediate issues
+4. Generate reports
+5. Package outputs
+
+---
+
+#### convert_pdf_to_html()
+```python
+def convert_pdf_to_html(
+ pdf_path: str,
+ output_dir: str,
+ bda_project_arn: Optional[str] = None
+) -> str
+```
+
+**Purpose**: Converts PDF to HTML using BDA
+
+**Returns**: Path to generated HTML file
+
+---
+
+#### audit_html_accessibility()
+```python
+def audit_html_accessibility(
+ html_path: str,
+ output_dir: Optional[str] = None
+) -> AuditReport
+```
+
+**Purpose**: Audits HTML for WCAG compliance
+
+**Returns**: `AuditReport` object with issues
+
+---
+
+#### remediate_html_accessibility()
+```python
+def remediate_html_accessibility(
+ html_path: str,
+ audit_report: AuditReport,
+ output_dir: str,
+ config: Optional[Dict] = None
+) -> RemediationReport
+```
+
+**Purpose**: Fixes accessibility issues
+
+**Returns**: `RemediationReport` object with fixes applied
+
+---
+
+#### generate_remediation_report()
+```python
+def generate_remediation_report(
+ audit_report: AuditReport,
+ remediation_report: RemediationReport,
+ output_path: str,
+ format: str = "html"
+) -> str
+```
+
+**Purpose**: Generates accessibility report
+
+**Formats**: `html`, `json`, `csv`, `txt`
+
+**Returns**: Path to generated report
+
+---
+
+### 5. Audit API
+
+**Location**: `pdf2html/content_accessibility_utility_on_aws/audit/api.py`
+
+#### audit_html_accessibility()
+```python
+def audit_html_accessibility(
+ html_path: str,
+ config: Optional[Dict] = None
+) -> AuditReport
+```
+
+**Configuration Options**:
+```python
+{
+ "wcag_level": "AA", # AA or AAA
+ "include_warnings": True,
+ "check_color_contrast": True,
+ "context_lines": 3
+}
+```
+
+**AuditReport Structure**:
+```python
+{
+ "summary": {
+ "total_issues": 42,
+ "critical": 5,
+ "serious": 15,
+ "moderate": 18,
+ "minor": 4
+ },
+ "issues": [
+ {
+ "id": "img-001",
+ "type": "missing_alt_text",
+ "severity": "critical",
+ "wcag_criteria": ["1.1.1"],
+ "element": "<img src='...'>",
+ "selector": "body > div > img:nth-child(2)",
+ "location": {"page": 1, "line": 45},
+ "message": "Image missing alt attribute",
+ "suggestion": "Add descriptive alt text"
+ }
+ ],
+ "wcag_summary": {
+ "1.1.1": {"count": 5, "description": "Non-text Content"},
+ "1.3.1": {"count": 8, "description": "Info and Relationships"}
+ }
+}
+```
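
To show how the report dictionary above might be consumed, here is a small sketch that recomputes the per-severity counts from the `issues` list and filters issues by WCAG criterion. The helper names are hypothetical; the field names follow the structure shown.

```python
from collections import Counter
from typing import Any, Dict, List

SEVERITY_ORDER = ["critical", "serious", "moderate", "minor"]


def count_by_severity(issues: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count issues per severity, always returning all four levels."""
    counts = Counter(issue["severity"] for issue in issues)
    return {level: counts.get(level, 0) for level in SEVERITY_ORDER}


def issues_for_criterion(
    issues: List[Dict[str, Any]], criterion: str
) -> List[Dict[str, Any]]:
    """Return the issues that cite a given WCAG success criterion."""
    return [issue for issue in issues if criterion in issue.get("wcag_criteria", [])]
```

Comparing `count_by_severity(report["issues"])` with `report["summary"]` is one way to sanity-check a report before rendering it.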
+
+---
+
+### 6. Remediation API
+
+**Location**: `pdf2html/content_accessibility_utility_on_aws/remediate/api.py`
+
+#### remediate_html_accessibility()
+```python
+def remediate_html_accessibility(
+ html_path: str,
+ audit_report: AuditReport,
+ output_path: str,
+ config: Optional[Dict] = None
+) -> RemediationReport
+```
+
+**Configuration Options**:
+```python
+{
+ "auto_remediate": True,
+ "use_ai": True,
+ "bedrock_model": "amazon.nova-pro-v1:0",
+ "max_retries": 3,
+ "skip_manual_review": False
+}
+```
+
+**RemediationReport Structure**:
+```python
+{
+ "summary": {
+ "total_issues": 42,
+ "fixed_automatically": 35,
+ "requires_manual_review": 7,
+ "failed": 0
+ },
+ "fixes": [
+ {
+ "issue_id": "img-001",
+ "status": "fixed",
+ "method": "ai_generated",
+ "original": "<img src='...'>",
+ "fixed": "<img src='...' alt='...'>",
+ "ai_prompt": "...",
+ "ai_response": "..."
+ }
+ ],
+ "manual_review_items": [
+ {
+ "issue_id": "table-005",
+ "reason": "Complex table structure",
+ "suggestion": "Manually verify header associations"
+ }
+ ]
+}
+```
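
The `summary` block above lends itself to a quick rate calculation. The sketch below is illustrative only; `remediation_rates` is a hypothetical helper, not part of the documented API.

```python
from typing import Dict


def remediation_rates(summary: Dict[str, int]) -> Dict[str, float]:
    """Express each remediation outcome as a fraction of all issues."""
    total = summary["total_issues"]
    if total == 0:
        # Nothing to remediate, so every rate is zero by convention.
        return {"fixed": 0.0, "manual_review": 0.0, "failed": 0.0}
    return {
        "fixed": summary["fixed_automatically"] / total,
        "manual_review": summary["requires_manual_review"] / total,
        "failed": summary["failed"] / total,
    }
```

With the sample values, 35 of 42 issues (about 83%) are fixed automatically and 7 of 42 (about 17%) are left for manual review.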
+
+---
+
+## AWS Service Interfaces
+
+### 7. S3 Interface
+
+**Operations Used**:
+
+#### GetObject
+```python
+s3_client.get_object(
+ Bucket='bucket-name',
+ Key='path/to/file.pdf'
+)
+```
+
+#### PutObject
+```python
+s3_client.put_object(
+ Bucket='bucket-name',
+ Key='path/to/output.pdf',
+ Body=file_content,
+ ServerSideEncryption='AES256',
+ Metadata={'user-id': 'user123'}
+)
+```
+
+#### PutObjectTagging
+```python
+s3_client.put_object_tagging(
+ Bucket='bucket-name',
+ Key='path/to/file.pdf',
+ Tagging={
+ 'TagSet': [
+ {'Key': 'user-id', 'Value': 'user123'},
+ {'Key': 'upload-timestamp', 'Value': '2026-03-02T15:00:00Z'}
+ ]
+ }
+)
+```
+
+---
+
+### 8. CloudWatch Interface
+
+**Metrics**:
+
+#### PutMetricData
+```python
+cloudwatch_client.put_metric_data(
+ Namespace='PDFAccessibility',
+ MetricData=[
+ {
+ 'MetricName': 'PagesProcessed',
+ 'Value': 10,
+ 'Unit': 'Count',
+ 'Timestamp': datetime.now(timezone.utc),  # timezone-aware; datetime.utcnow() is deprecated since Python 3.12
+ 'Dimensions': [
+ {'Name': 'Solution', 'Value': 'PDF2PDF'},
+ {'Name': 'UserId', 'Value': 'user123'}
+ ]
+ }
+ ]
+)
+```
+
+**Logs**:
+
+#### PutLogEvents
+```python
+logs_client.put_log_events(
+ logGroupName='/aws/lambda/function-name',
+ logStreamName='stream-name',
+ logEvents=[
+ {
+ 'timestamp': int(time.time() * 1000),
+ 'message': 'Processing PDF: file.pdf'
+ }
+ ]
+)
+```
+
+---
+
+### 9. Secrets Manager Interface
+
+#### GetSecretValue
+```python
+secrets_client.get_secret_value(
+ SecretId='adobe-pdf-services-credentials'
+)
+```
+
+**Response**:
+```json
+{
+ "SecretString": "{\"client_id\":\"...\",\"client_secret\":\"...\"}"
+}
+```
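
Because `SecretString` is itself a JSON document, callers need a second decoding step. A minimal sketch, assuming the secret holds exactly the two keys shown above (`parse_adobe_credentials` is a hypothetical name):

```python
import json
from typing import Any, Dict, Tuple


def parse_adobe_credentials(response: Dict[str, Any]) -> Tuple[str, str]:
    """Decode the JSON stored in SecretString and return (client_id, client_secret)."""
    secret = json.loads(response["SecretString"])
    missing = {"client_id", "client_secret"} - secret.keys()
    if missing:
        # Name the missing keys, but never log or echo the secret values.
        raise KeyError(f"secret is missing keys: {sorted(missing)}")
    return secret["client_id"], secret["client_secret"]
```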
+
+---
+
+### 10. Step Functions Interface
+
+#### StartExecution
+```python
+sfn_client.start_execution(
+ stateMachineArn='arn:aws:states:...',
+ input=json.dumps({
+ 'bucket': 'bucket-name',
+ 'original_key': 'pdf/document.pdf',
+ 'chunks': ['temp/document_page_1.pdf', 'temp/document_page_2.pdf']
+ })
+)
+```
+
+---
+
+## Data Models
+
+### AuditReport
+```python
+@dataclass
+class AuditReport:
+ summary: AuditSummary
+ issues: List[AuditIssue]
+ wcag_summary: Dict[str, WCAGCriterion]
+ timestamp: datetime
+ html_path: str
+```
+
+### AuditIssue
+```python
+@dataclass
+class AuditIssue:
+ id: str
+ type: str
+ severity: Severity # CRITICAL, SERIOUS, MODERATE, MINOR
+ wcag_criteria: List[str]
+ element: str
+ selector: str
+ location: Location
+ message: str
+ suggestion: str
+ context: Optional[str]
+```
+
+### RemediationReport
+```python
+@dataclass
+class RemediationReport:
+ summary: RemediationSummary
+ fixes: List[RemediationFix]
+ manual_review_items: List[ManualReviewItem]
+ timestamp: datetime
+ html_path: str
+```
+
+### RemediationFix
+```python
+@dataclass
+class RemediationFix:
+ issue_id: str
+ status: RemediationStatus # FIXED, FAILED, MANUAL_REVIEW
+ method: str # ai_generated, rule_based, manual
+ original: str
+ fixed: str
+ ai_prompt: Optional[str]
+ ai_response: Optional[str]
+ error: Optional[str]
+```
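
The dataclasses above refer to `Severity` and `RemediationStatus` without defining them. One plausible shape is sketched below; the lowercase values mirror the JSON examples earlier in this document, while the `manual_review` spelling and the ranking helper are assumptions.

```python
from enum import Enum
from typing import List


class Severity(str, Enum):
    CRITICAL = "critical"
    SERIOUS = "serious"
    MODERATE = "moderate"
    MINOR = "minor"


class RemediationStatus(str, Enum):
    FIXED = "fixed"
    FAILED = "failed"
    MANUAL_REVIEW = "manual_review"  # assumed spelling, not confirmed by the source


# Definition order doubles as a ranking: most severe first.
_RANK = {level: index for index, level in enumerate(Severity)}


def sort_by_severity(severities: List[Severity]) -> List[Severity]:
    """Order severities from most to least severe."""
    return sorted(severities, key=_RANK.__getitem__)
```

Mixing in `str` lets the members round-trip through JSON values such as `"critical"` without custom encoders.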
+
+---
+
+## Event Schemas
+
+### S3 Event (Lambda Trigger)
+```json
+{
+ "Records": [
+ {
+ "eventVersion": "2.1",
+ "eventSource": "aws:s3",
+ "eventName": "ObjectCreated:Put",
+ "s3": {
+ "bucket": {
+ "name": "bucket-name"
+ },
+ "object": {
+ "key": "pdf/document.pdf",
+ "size": 1024000
+ }
+ }
+ }
+ ]
+}
+```
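
A handler receiving this event usually starts by pulling out the bucket and key. The sketch below uses a hypothetical helper name; the one non-obvious detail is that S3 URL-encodes object keys in event notifications, so a key containing spaces arrives with `+` characters and must be decoded.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def extract_s3_objects(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for every S3 record in the event."""
    objects: List[Tuple[str, str]] = []
    for record in event.get("Records", []):
        if record.get("eventSource") != "aws:s3":
            continue  # ignore records from other sources
        s3 = record["s3"]
        # Keys are URL-encoded in notifications: "my+file.pdf" means "my file.pdf".
        objects.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return objects
```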
+
+### Step Functions Input
+```json
+{
+ "bucket": "pdfaccessibility-bucket",
+ "original_key": "pdf/document.pdf",
+ "chunks": [
+ "temp/document_page_1.pdf",
+ "temp/document_page_2.pdf",
+ "temp/document_page_3.pdf"
+ ],
+ "user_id": "user123",
+ "timestamp": "2026-03-02T15:00:00Z"
+}
+```
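
The input document above could be assembled along these lines. This is a sketch under assumptions: the helper name is hypothetical, and the `temp/<name>_page_<n>.pdf` chunk pattern is inferred from the example values rather than confirmed by the source.

```python
import json
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional


def build_execution_input(
    bucket: str,
    original_key: str,
    page_count: int,
    user_id: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize a Step Functions input payload; `now` must be timezone-aware."""
    stem = PurePosixPath(original_key).stem  # "pdf/document.pdf" -> "document"
    chunks = [f"temp/{stem}_page_{n}.pdf" for n in range(1, page_count + 1)]
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return json.dumps(
        {
            "bucket": bucket,
            "original_key": original_key,
            "chunks": chunks,
            "user_id": user_id,
            "timestamp": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
    )
```

The returned string is what would be passed as the `input` argument of `start_execution`.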
+
+### Step Functions Output
+```json
+{
+ "status": "SUCCESS",
+ "result_key": "result/COMPLIANT_document.pdf",
+ "pages_processed": 3,
+ "audit_results": {
+ "pre_remediation": {...},
+ "post_remediation": {...}
+ },
+ "metrics": {
+ "adobe_api_calls": 3,
+ "bedrock_invocations": 15,
+ "processing_duration_seconds": 120
+ }
+}
+```
+
+---
+
+## Error Responses
+
+### Standard Error Format
+```json
+{
+ "error": {
+ "code": "ERROR_CODE",
+ "message": "Human-readable error message",
+ "details": {
+ "file": "document.pdf",
+ "operation": "adobe_autotag",
+ "timestamp": "2026-03-02T15:00:00Z"
+ },
+ "retry_after": 60
+ }
+}
+```
+
+### Common Error Codes
+- `INVALID_PDF`: PDF file is corrupted or invalid
+- `ADOBE_API_ERROR`: Adobe API call failed
+- `BEDROCK_THROTTLING`: Bedrock rate limit exceeded
+- `BDA_TIMEOUT`: BDA processing timeout
+- `INSUFFICIENT_PERMISSIONS`: IAM permissions issue
+- `S3_ACCESS_DENIED`: S3 access error
+- `PROCESSING_TIMEOUT`: Overall timeout exceeded
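
A sketch of producing the standard error format and deciding whether to retry. Both helper names are hypothetical, and the set of retryable codes is an assumption made for illustration; this document does not say which codes are transient.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Assumption: throttling, timeouts, and upstream API errors are treated as transient.
RETRYABLE_CODES = {"ADOBE_API_ERROR", "BEDROCK_THROTTLING", "BDA_TIMEOUT"}


def build_error(
    code: str,
    message: str,
    file: str,
    operation: str,
    retry_after: Optional[int] = None,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build an error payload in the standard format; `now` must be timezone-aware."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    error: Dict[str, Any] = {
        "code": code,
        "message": message,
        "details": {
            "file": file,
            "operation": operation,
            "timestamp": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    }
    if retry_after is not None:
        error["retry_after"] = retry_after  # seconds the caller should wait
    return {"error": error}


def is_retryable(payload: Dict[str, Any]) -> bool:
    """Return True when the error code is in the assumed transient set."""
    return payload["error"]["code"] in RETRYABLE_CODES
```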
diff --git a/.agents/summary/review_notes.md b/.agents/summary/review_notes.md
new file mode 100644
index 00000000..287420f1
--- /dev/null
+++ b/.agents/summary/review_notes.md
@@ -0,0 +1,330 @@
+# Documentation Review Notes
+
+## Consistency Check Results
+
+### ✅ Consistent Areas
+
+1. **Architecture Patterns**
+ - Event-driven architecture consistently applied
+ - Serverless-first approach throughout
+ - IAM role-based security model
+
+2. **Naming Conventions**
+ - S3 bucket naming: `{solution}-{resource}-{id}`
+ - Lambda functions: Descriptive names with hyphens
+ - Metrics namespace: `PDFAccessibility`
+
+3. **Error Handling**
+ - Exponential backoff retry logic
+ - CloudWatch error logging
+ - Metrics publishing for failures
+
+4. **Monitoring**
+ - CloudWatch Logs for all components
+ - Custom metrics with consistent dimensions
+ - Usage tracking across both solutions
+
+### ⚠️ Inconsistencies Found
+
+1. **Language Diversity**
+ - **Issue**: PDF Merger uses Java while other Lambdas use Python
+ - **Impact**: Different deployment processes, dependencies
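+ - **Recommendation**: Consider migrating the merger to Python for consistency, provided a Python library can match its merge behavior
+ - **Justification for current design**: Apache PDFBox (Java) may offer better PDF merging capabilities than the Python alternatives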
+
+2. **Container Base Images**
+ - **Issue**: Adobe container uses `python:3.9-slim`, Alt Text uses `node:18-alpine`
+ - **Impact**: Different security patching schedules
+ - **Recommendation**: Standardize on specific base image versions
+
+3. **Metrics Helper Duplication**
+ - **Issue**: `metrics_helper.py` exists in multiple locations:
+ - `lambda/shared/metrics_helper.py`
+ - `lambda/shared/python/metrics_helper.py`
+ - `adobe-autotag-container/metrics_helper.py`
+ - `pdf2html/metrics_helper.py`
+ - **Impact**: Maintenance burden, potential version drift
+ - **Recommendation**: Consolidate into single shared module
+
+4. **Configuration Management**
+ - **Issue**: PDF-to-PDF uses environment variables, PDF-to-HTML uses config files
+ - **Impact**: Two configuration mechanisms to document, deploy, and debug
+ - **Recommendation**: Standardize on a single configuration method
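
For the `metrics_helper.py` duplication noted above, version drift can be checked mechanically before consolidating. The sketch below is a hypothetical helper, not an existing repository script: it groups every copy of a file name by content hash, so more than one group means the copies have diverged.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Union


def group_copies_by_content(
    root: Union[str, Path], filename: str = "metrics_helper.py"
) -> Dict[str, List[str]]:
    """Map SHA-256 digest -> relative paths of the copies with that content."""
    root = Path(root)
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(root.rglob(filename)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(str(path.relative_to(root)))
    return dict(groups)
```

A single group means all copies are byte-identical and can be replaced by one shared module without behavior changes; several groups call for a manual diff first.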
+
+---
+
+## Completeness Check Results
+
+### ✅ Well-Documented Areas
+
+1. **Architecture**: Comprehensive diagrams and explanations
+2. **Components**: Detailed component descriptions
+3. **Workflows**: Clear process flows
+4. **APIs**: Well-defined interfaces
+5. **Data Models**: Complete structure definitions
+
+### 📝 Areas Needing More Detail
+
+#### 1. Testing Strategy
+- **Gap**: No documentation on testing approach
+- **Missing**:
+ - Unit test structure
+ - Integration test scenarios
+ - End-to-end test procedures
+ - Test data requirements
+- **Recommendation**: Add `testing.md` with:
+ - Test framework setup
+ - Sample test cases
+ - Mocking strategies for AWS services
+ - CI/CD test integration
+
+#### 2. Security Best Practices
+- **Gap**: Limited security documentation
+- **Missing**:
+ - IAM policy details
+ - Encryption at rest/in transit
+ - Secret rotation procedures
+ - Security audit procedures
+- **Recommendation**: Add `security.md` with:
+ - Least privilege IAM policies
+ - Encryption configuration
+ - Secret management best practices
+ - Security checklist
+
+#### 3. Performance Optimization
+- **Gap**: Limited performance tuning guidance
+- **Missing**:
+ - Lambda memory optimization
+ - ECS task sizing guidelines
+ - Bedrock prompt optimization
+ - Cost optimization strategies
+- **Recommendation**: Add `performance.md` with:
+ - Benchmarking results
+ - Tuning recommendations
+ - Cost vs. performance tradeoffs
+
+#### 4. Disaster Recovery
+- **Gap**: Basic DR mentioned but not detailed
+- **Missing**:
+ - Backup procedures
+ - Recovery testing
+ - Failover scenarios
+ - Data retention policies
+- **Recommendation**: Add `disaster_recovery.md` with:
+ - Backup schedules
+ - Recovery procedures
+ - RTO/RPO definitions
+ - DR testing plan
+
+#### 5. Troubleshooting Guide
+- **Gap**: README has basic troubleshooting, needs expansion
+- **Missing**:
+ - Common error messages and solutions
+ - Debug logging procedures
+ - Performance issue diagnosis
+ - Support escalation paths
+- **Recommendation**: Expand existing troubleshooting docs
+
+#### 6. API Rate Limiting
+- **Gap**: Rate limits mentioned but not detailed
+- **Missing**:
+ - Adobe API rate limit specifics
+ - Bedrock throttling handling
+ - BDA quota management
+ - Backpressure strategies
+- **Recommendation**: Add rate limiting section to interfaces.md
+
+#### 7. Multi-Region Deployment
+- **Gap**: No documentation on multi-region setup
+- **Missing**:
+ - Cross-region replication
+ - Regional failover
+ - Latency optimization
+- **Recommendation**: Add if multi-region support is planned
+
+#### 8. Monitoring and Alerting
+- **Gap**: Metrics documented but alerting not detailed
+- **Missing**:
+ - Alert thresholds
+ - Notification channels
+ - On-call procedures
+ - Runbook for common alerts
+- **Recommendation**: Add `monitoring.md` with:
+ - Alert definitions
+ - Response procedures
+ - Dashboard usage guide
+
+---
+
+## Language Support Limitations
+
+### Supported Languages
+- **Python**: Fully supported (95 files)
+ - Comprehensive analysis
+ - All functions and classes documented
+- **JavaScript**: Fully supported (3 files)
+ - Complete coverage
+- **Java**: Fully supported (2 files)
+ - Complete coverage
+- **Shell**: Fully supported (2 files)
+ - All functions documented
+
+### No Gaps Identified
+All languages in the codebase are well-supported and documented.
+
+---
+
+## Documentation Quality Assessment
+
+### Strengths
+1. **Comprehensive Coverage**: All major components documented
+2. **Visual Aids**: Mermaid diagrams for architecture and workflows
+3. **Structured Organization**: Clear hierarchy and navigation
+4. **Practical Examples**: Code snippets and data structures
+5. **WCAG Compliance**: Detailed accessibility standards mapping
+
+### Areas for Improvement
+
+#### 1. Code Examples
+- **Current**: Limited inline code examples
+- **Recommendation**: Add more code snippets showing:
+ - Lambda handler patterns
+ - Bedrock API calls
+ - Error handling examples
+ - Configuration examples
+
+#### 2. Deployment Variations
+- **Current**: Focuses on one-click deployment
+- **Recommendation**: Document:
+ - Local development setup
+ - CI/CD pipeline configuration
+ - Multi-account deployment
+ - Environment-specific configurations
+
+#### 3. Migration Guide
+- **Current**: No migration documentation
+- **Recommendation**: Add guide for:
+ - Upgrading between versions
+ - Migrating from other solutions
+ - Data migration procedures
+
+#### 4. API Versioning
+- **Current**: No versioning strategy documented
+- **Recommendation**: Define:
+ - API version scheme
+ - Backward compatibility policy
+ - Deprecation process
+
+#### 5. Contribution Guidelines
+- **Current**: Basic "Contributing" section in README
+- **Recommendation**: Expand with:
+ - Code style guide
+ - PR review process
+ - Development workflow
+ - Testing requirements
+
+---
+
+## Recommendations for Documentation Maintenance
+
+### Short-Term (1-3 months)
+1. Add testing documentation
+2. Expand security best practices
+3. Create troubleshooting runbook
+4. Add code examples to existing docs
+
+### Medium-Term (3-6 months)
+1. Create performance optimization guide
+2. Document disaster recovery procedures
+3. Add monitoring and alerting guide
+4. Create migration guide
+
+### Long-Term (6-12 months)
+1. Establish documentation review cycle
+2. Create video tutorials
+3. Build interactive documentation site
+4. Develop certification program
+
+---
+
+## Documentation Gaps by Priority
+
+### High Priority
+1. **Testing Strategy**: Critical for development workflow
+2. **Security Best Practices**: Essential for production deployment
+3. **Troubleshooting Guide**: Needed for operational support
+
+### Medium Priority
+1. **Performance Optimization**: Important for cost management
+2. **Monitoring and Alerting**: Needed for production operations
+3. **API Rate Limiting**: Important for reliability
+
+### Low Priority
+1. **Multi-Region Deployment**: Only if required
+2. **Migration Guide**: Needed when versions diverge
+3. **API Versioning**: Future consideration
+
+---
+
+## Validation Checklist
+
+### Architecture Documentation
+- [x] High-level overview
+- [x] Component diagrams
+- [x] Data flow diagrams
+- [x] Deployment architecture
+- [ ] Multi-region architecture (if applicable)
+
+### Component Documentation
+- [x] All major components described
+- [x] Dependencies documented
+- [x] Configuration options listed
+- [ ] Performance characteristics
+- [ ] Scaling considerations
+
+### API Documentation
+- [x] External APIs documented
+- [x] Internal APIs documented
+- [x] Data models defined
+- [x] Error responses documented
+- [ ] Rate limits detailed
+- [ ] API versioning strategy
+
+### Operational Documentation
+- [x] Deployment procedures
+- [x] Monitoring setup
+- [ ] Alerting configuration
+- [ ] Troubleshooting procedures
+- [ ] Disaster recovery plan
+- [ ] Security procedures
+
+### Development Documentation
+- [x] Repository structure
+- [x] Technology stack
+- [x] Dependencies
+- [ ] Development setup
+- [ ] Testing procedures
+- [ ] Contribution guidelines
+
+---
+
+## Next Steps
+
+1. **Review with Team**: Share documentation with development team for feedback
+2. **Prioritize Gaps**: Determine which gaps to address first
+3. **Assign Owners**: Assign documentation tasks to team members
+4. **Set Timeline**: Create schedule for documentation completion
+5. **Establish Process**: Define ongoing documentation maintenance process
+
+---
+
+## Feedback and Updates
+
+**Last Review**: 2026-03-02
+**Reviewer**: AI Documentation Generator
+**Next Review**: Recommended within 30 days
+
+**How to Provide Feedback**:
+- Create GitHub issue with label `documentation`
+- Email: ai-cic@amazon.com
+- Submit PR with documentation improvements
diff --git a/.agents/summary/workflows.md b/.agents/summary/workflows.md
new file mode 100644
index 00000000..92faf4f0
--- /dev/null
+++ b/.agents/summary/workflows.md
@@ -0,0 +1,481 @@
+# Key Workflows and Processes
+
+## PDF-to-PDF Remediation Workflow
+
+### End-to-End Process
+
+```mermaid
+flowchart TD
+ Start([User Uploads PDF]) --> S3Upload[PDF saved to S3 pdf/ folder]
+ S3Upload --> S3Event[S3 Event Notification]
+ S3Event --> Splitter[PDF Splitter Lambda]
+
+ Splitter --> Split{Split into<br/>pages}
+ Split --> Chunk1[Page 1 PDF]
+ Split --> Chunk2[Page 2 PDF]
+ Split --> ChunkN[Page N PDF]
+
+ Chunk1 & Chunk2 & ChunkN --> StepFn[Step Functions<br/>Orchestrator]
+
+ StepFn --> PreCheck[Pre-Remediation<br/>Accessibility Check]
+ PreCheck --> MapState[Map State:<br/>Parallel Processing]
+
+ MapState --> Adobe1[Adobe Autotag<br/>ECS Task 1]
+ MapState --> Adobe2[Adobe Autotag<br/>ECS Task 2]
+ MapState --> AdobeN[Adobe Autotag<br/>ECS Task N]
+
+ Adobe1 --> Alt1[Alt Text Generator<br/>ECS Task 1]
+ Adobe2 --> Alt2[Alt Text Generator<br/>ECS Task 2]
+ AdobeN --> AltN[Alt Text Generator<br/>ECS Task N]
+
+ Alt1 & Alt2 & AltN --> MapComplete[All Chunks<br/>Processed]
+
+ MapComplete --> TitleGen[Title Generator<br/>Lambda]
+ TitleGen --> PostCheck[Post-Remediation<br/>Accessibility Check]
+ PostCheck --> Merger[PDF Merger<br/>Lambda]
+ Merger --> Result[Save to S3<br/>result/ folder]
+ Result --> End([User Downloads<br/>Compliant PDF])
+```
+
+### Detailed Steps
+
+#### 1. Upload and Trigger (0-5 seconds)
+- User uploads PDF to S3 `pdf/` folder
+- S3 generates PUT event notification
+- Event triggers PDF Splitter Lambda
+- S3 Object Tagger adds user metadata
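+
+A minimal sketch of how a handler could read the uploaded object's location from the S3 event described in the steps above. The function name `extract_s3_objects` and the sample bucket and key are illustrative assumptions, not names taken from the Splitter Lambda. The one detail that is standard S3 behavior is that object keys arrive URL-encoded in event payloads.
+
+```python
+from typing import List, Tuple
+from urllib.parse import unquote_plus
+
+
+def extract_s3_objects(event: dict) -> List[Tuple[str, str]]:
+    """Return (bucket, key) pairs from an S3 event notification payload."""
+    objects = []
+    for record in event.get("Records", []):
+        bucket = record["s3"]["bucket"]["name"]
+        # Keys are URL-encoded in the event (a space arrives as '+').
+        key = unquote_plus(record["s3"]["object"]["key"])
+        objects.append((bucket, key))
+    return objects
+
+
+# Hypothetical event, trimmed to the fields the function reads.
+sample_event = {
+    "Records": [
+        {
+            "s3": {
+                "bucket": {"name": "example-bucket"},
+                "object": {"key": "pdf/annual+report+2024.pdf"},
+            }
+        }
+    ]
+}
+print(extract_s3_objects(sample_event))
+```
+
+Decoding the key before calling S3 avoids failed lookups for filenames that contain spaces or special characters.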
+
+#### 2. PDF Splitting (5-30 seconds)
+- Lambda downloads PDF from S3
+- Splits PDF into individual pages using pypdf
+- Uploads each page to `temp/` folder
+- Publishes metrics (pages processed, file size)
+- Triggers Step Functions with chunk list
+
+#### 3. Pre-Remediation Check (10-20 seconds)
+- Lambda downloads original PDF
+- Runs accessibility audit
+- Generates baseline report
+- Saves report to S3
+
+#### 4. Parallel Chunk Processing (2-10 minutes per chunk)
+
+**Map State Configuration**:
+- Max concurrency: 10
+- Retry attempts: 3
+- Timeout: 30 minutes per chunk
+
+**For Each Chunk**:
+
+##### 4a. Adobe Autotag (1-5 minutes)
+- ECS Fargate task starts
+- Downloads chunk from S3
+- Retrieves Adobe credentials from Secrets Manager
+- Calls Adobe Autotag API
+  - Adds structure tags (headings, lists, tables)
+  - Identifies reading order
+- Calls Adobe Extract API
+  - Extracts images
+  - Generates image metadata Excel file
+- Creates SQLite database with image info
+- Uploads tagged PDF to S3
+- Publishes metrics (API calls, duration)
+
+##### 4b. Alt Text Generation (1-5 minutes)
+- ECS Fargate task starts
+- Downloads tagged PDF and image metadata
+- For each image:
+  - Extracts surrounding text context
+  - Determines if decorative or informative
+  - If informative:
+    - Encodes image as base64
+    - Calls Bedrock Nova Pro with image + context
+    - Receives AI-generated alt text
+    - Embeds alt text in PDF structure
+- Uploads final PDF to S3
+- Publishes metrics (Bedrock calls, tokens)
+
+#### 5. Title Generation (30-60 seconds)
+- Lambda downloads first processed chunk
+- Extracts text from first few pages
+- Calls Bedrock Nova Pro with prompt
+- Receives generated title
+- Embeds title in PDF metadata
+- Saves updated PDF
+
+#### 6. Post-Remediation Check (10-20 seconds)
+- Lambda downloads processed PDF
+- Runs accessibility audit
+- Compares with pre-check results
+- Generates compliance report
+- Saves report to S3
+
+#### 7. PDF Merging (30-120 seconds)
+- Java Lambda starts
+- Downloads all processed chunks
+- Merges in correct page order using Apache PDFBox
+- Adds "COMPLIANT" prefix to filename
+- Uploads to `result/` folder
+- Publishes completion metrics
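+
+Merging "in correct page order" depends on sorting chunk keys numerically rather than alphabetically. The actual merger is a Java Lambda using Apache PDFBox; the Python sketch below only illustrates the ordering concern, and the `*_page_<n>.pdf` key pattern is a hypothetical naming scheme.
+
+```python
+import re
+from typing import List
+
+
+def sort_chunks_by_page(keys: List[str]) -> List[str]:
+    """Sort chunk keys by the page number that precedes the .pdf suffix."""
+
+    def page_number(key: str) -> int:
+        match = re.search(r"(\d+)\.pdf$", key)
+        if match is None:
+            raise ValueError(f"No page number in chunk key: {key}")
+        return int(match.group(1))
+
+    return sorted(keys, key=page_number)
+
+
+# Hypothetical chunk keys; a plain string sort would put page 10 before page 2.
+chunks = [
+    "temp/report_page_10.pdf",
+    "temp/report_page_2.pdf",
+    "temp/report_page_1.pdf",
+]
+print(sort_chunks_by_page(chunks))
+```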
+
+#### 8. Notification and Cleanup
+- User receives notification (if UI deployed)
+- Temporary files remain in `temp/` folder
+- Optional: S3 lifecycle policy cleans up temp files after 7 days
+
+### Total Processing Time
+- **Small PDF (1-10 pages)**: 3-8 minutes
+- **Medium PDF (11-50 pages)**: 8-20 minutes
+- **Large PDF (51-200 pages)**: 20-60 minutes
+
+---
+
+## PDF-to-HTML Remediation Workflow
+
+### End-to-End Process
+
+```mermaid
+flowchart TD
+ Start([User Uploads PDF]) --> S3Upload[PDF saved to S3<br/>uploads/ folder]
+ S3Upload --> S3Event[S3 Event Notification]
+ S3Event --> Lambda[PDF2HTML Lambda]
+
+ Lambda --> BDACreate[Create BDA Job]
+ BDACreate --> BDAProcess[BDA Parses PDF]
+ BDAProcess --> BDAWait{Wait for<br/>Completion}
+ BDAWait -->|Polling| BDACheck[Check Status]
+ BDACheck -->|Processing| BDAWait
+ BDACheck -->|Complete| BDAResult[Retrieve Results]
+
+ BDAResult --> Convert[Convert to HTML]
+ Convert --> Audit[Audit Accessibility]
+
+ Audit --> IssueLoop{For Each<br/>Issue}
+ IssueLoop --> CheckType{Issue Type}
+
+ CheckType -->|Simple| RuleBased[Rule-Based Fix]
+ CheckType -->|Complex| AIFix[AI-Generated Fix]
+
+ AIFix --> Bedrock[Call Bedrock<br/>Nova Pro]
+ Bedrock --> ApplyFix[Apply Fix to HTML]
+ RuleBased --> ApplyFix
+
+ ApplyFix --> MoreIssues{More<br/>Issues?}
+ MoreIssues -->|Yes| IssueLoop
+ MoreIssues -->|No| Report[Generate Reports]
+
+ Report --> Package[Package Outputs]
+ Package --> ZIP[Create ZIP File]
+ ZIP --> S3Save[Save to S3<br/>remediated/ folder]
+ S3Save --> End([User Downloads ZIP])
+```
+
+### Detailed Steps
+
+#### 1. Upload and Trigger (0-5 seconds)
+- User uploads PDF to S3 `uploads/` folder
+- S3 generates PUT event notification
+- Event triggers PDF2HTML Lambda (container)
+- S3 Object Tagger adds user metadata
+
+#### 2. PDF to HTML Conversion (30-120 seconds)
+
+##### 2a. BDA Job Creation
+- Lambda calls Bedrock Data Automation API
+- Creates async parsing job
+- Receives job ID
+
+##### 2b. BDA Processing
+- BDA parses PDF structure
+- Extracts text with layout information
+- Identifies images, tables, headings
+- Generates structured JSON output
+- Saves to S3 output location
+
+##### 2c. Status Polling
+- Lambda polls BDA job status every 5 seconds
+- Timeout: 5 minutes
+- On completion, retrieves results
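+
+The polling behavior above (check every 5 seconds, give up after 5 minutes) can be sketched as a small loop. This is an illustration under assumptions: `wait_for_job` and its parameters are hypothetical names, `get_status` stands in for whatever call fetches the BDA job status, and the status strings `"Processing"` and `"Complete"` are taken from the workflow diagram rather than from the BDA API.
+
+```python
+import time
+from typing import Callable
+
+
+def wait_for_job(
+    get_status: Callable[[], str],
+    interval_seconds: float = 5.0,
+    timeout_seconds: float = 300.0,
+    sleep: Callable[[float], None] = time.sleep,
+    clock: Callable[[], float] = time.monotonic,
+) -> str:
+    """Poll get_status until it stops reporting "Processing" or time runs out."""
+    deadline = clock() + timeout_seconds
+    while True:
+        status = get_status()
+        if status != "Processing":
+            return status  # e.g. "Complete", or a failure status
+        if clock() + interval_seconds > deadline:
+            raise TimeoutError(
+                f"Job still processing after {timeout_seconds} seconds"
+            )
+        sleep(interval_seconds)
+```
+
+Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; a caller in production would leave both at their defaults.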
+
+##### 2d. HTML Generation
+- Lambda processes BDA JSON output
+- Builds HTML structure from elements
+- Preserves layout and styling
+- Copies images to output directory
+- Saves initial HTML to `output/result.html`
+
+#### 3. Accessibility Audit (10-30 seconds)
+
+##### 3a. HTML Parsing
+- Loads HTML with BeautifulSoup
+- Builds DOM tree
+
+##### 3b. Check Execution
+- Runs all accessibility checks:
+  - Image checks (alt text)
+  - Heading checks (hierarchy)
+  - Table checks (headers, captions)
+  - Form checks (labels, fieldsets)
+  - Link checks (descriptive text)
+  - Structure checks (landmarks, language)
+  - Color contrast checks
+
+##### 3c. Issue Collection
+- Collects all issues with:
+  - Element selector
+  - WCAG criteria
+  - Severity level
+  - Suggested fix
+- Generates audit report
+
+#### 4. Remediation (1-5 minutes)
+
+##### 4a. Issue Prioritization
+- Groups issues by type
+- Prioritizes critical issues
+- Determines remediation strategy
+
+##### 4b. Rule-Based Fixes (Simple Issues)
+**Examples**:
+- Add missing `lang` attribute
+- Add `main` landmark
+- Fix heading hierarchy
+- Add table `scope` attributes
+- Associate form labels
+
+**Process**:
+- Apply predefined transformation
+- Update HTML DOM
+- Mark issue as fixed
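+
+As an illustration of a predefined transformation, the sketch below handles the "fix heading hierarchy" case on a plain list of heading levels. It is one possible approach under assumptions, not the library's implementation: `normalize_heading_levels` is a hypothetical name, the document is assumed to start at level 1, and a real fix would also rewrite the corresponding tags in the DOM.
+
+```python
+from typing import List, Tuple
+
+
+def normalize_heading_levels(levels: List[int]) -> List[int]:
+    """Remove skipped heading levels while keeping sibling headings aligned."""
+    fixed: List[int] = []
+    # Open ancestors as (original_level, new_level) pairs.
+    stack: List[Tuple[int, int]] = []
+    for level in levels:
+        # Close every heading that is deeper than the current one.
+        while stack and stack[-1][0] > level:
+            stack.pop()
+        if stack and stack[-1][0] == level:
+            # Same original level as the open heading: treat as a sibling.
+            new_level = stack[-1][1]
+        else:
+            # Child of the open heading, or the first heading overall.
+            new_level = stack[-1][1] + 1 if stack else 1
+            stack.append((level, new_level))
+        fixed.append(new_level)
+    return fixed
+
+
+# h1, h3, h3, h4, h2, h5 -> the skipped levels are closed up.
+print(normalize_heading_levels([1, 3, 3, 4, 2, 5]))  # [1, 2, 2, 3, 2, 3]
+```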
+
+##### 4c. AI-Generated Fixes (Complex Issues)
+**Examples**:
+- Generate alt text for images
+- Create table captions
+- Improve link text
+- Generate document title
+
+**Process**:
+1. Extract element and context
+2. Build AI prompt with:
+   - Issue description
+   - Element HTML
+   - Surrounding context
+   - WCAG guidance
+3. Call Bedrock Nova Pro
+4. Parse AI response
+5. Apply fix to HTML
+6. Validate fix
+7. Mark issue as fixed or manual review
+
+##### 4d. Manual Review Items
+**Flagged for Manual Review**:
+- Complex table structures
+- Ambiguous image context
+- Color contrast requiring design changes
+- Structural changes affecting layout
+
+#### 5. Report Generation (5-15 seconds)
+
+##### 5a. HTML Report
+- Interactive report with:
+  - Summary statistics
+  - Issue breakdown by severity
+  - WCAG criteria mapping
+  - Before/after comparisons
+  - Manual review items
+- Styled with CSS
+- JavaScript for filtering
+
+##### 5b. JSON Report
+- Machine-readable format
+- Complete issue details
+- Remediation actions
+- Usage statistics
+
+##### 5c. Usage Data
+- Bedrock invocations and tokens
+- BDA processing time
+- Cost estimates
+- Processing metrics
+
+#### 6. Packaging and Output (5-10 seconds)
+
+##### 6a. File Collection
+- `remediated.html`: Final accessible HTML
+- `result.html`: Original conversion (before remediation)
+- `images/`: Extracted images with alt text
+- `remediation_report.html`: Detailed report
+- `usage_data.json`: Usage statistics
+
+##### 6b. ZIP Creation
+- Creates `final_{filename}.zip`
+- Includes all output files
+- Preserves directory structure
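+
+A minimal sketch of ZIP creation that preserves directory structure, using only the standard library. The function name `create_output_zip` is hypothetical, and the sketch assumes the archive is written outside the directory being zipped so it does not try to include itself.
+
+```python
+import zipfile
+from pathlib import Path
+from typing import List
+
+
+def create_output_zip(output_dir: Path, zip_path: Path) -> List[str]:
+    """Zip every file under output_dir, keeping paths relative to it."""
+    archived: List[str] = []
+    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
+        for path in sorted(output_dir.rglob("*")):
+            if path.is_file():
+                # Relative names keep subfolders such as images/ intact.
+                arcname = path.relative_to(output_dir).as_posix()
+                archive.write(path, arcname)
+                archived.append(arcname)
+    return archived
+```
+
+Storing names relative to the output directory means that `images/` entries unpack next to `remediated.html`, so relative image links in the HTML keep working after extraction.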
+
+##### 6c. S3 Upload
+- Uploads ZIP to `remediated/` folder
+- Sets appropriate metadata
+- Publishes completion metrics
+
+#### 7. Cleanup
+- Removes temporary files
+- Logs completion
+- Returns success response
+
+### Total Processing Time
+- **Small PDF (1-10 pages)**: 1-3 minutes
+- **Medium PDF (11-50 pages)**: 3-8 minutes
+- **Large PDF (51-200 pages)**: 8-20 minutes
+
+---
+
+## Deployment Workflow
+
+### One-Click Deployment (deploy.sh)
+
+```mermaid
+flowchart TD
+ Start([Run deploy.sh]) --> Check[Check Prerequisites]
+ Check --> Region[Select AWS Region]
+ Region --> Solution{Select Solution}
+
+ Solution -->|PDF-to-PDF| Adobe[Enter Adobe Credentials]
+ Solution -->|PDF-to-HTML| BDA[Check BDA Access]
+ Solution -->|Both| Adobe
+
+ Adobe --> Secrets[Store in Secrets Manager]
+ BDA --> Project[Create BDA Project]
+ Secrets & Project --> CodeBuild[Create CodeBuild Project]
+
+ CodeBuild --> Build[Start Build]
+ Build --> CDKSynth[CDK Synth]
+ CDKSynth --> CDKDeploy[CDK Deploy]
+
+ CDKDeploy --> Stack1[Deploy PDF-to-PDF Stack]
+ CDKDeploy --> Stack2[Deploy PDF-to-HTML Stack]
+ CDKDeploy --> Stack3[Deploy Metrics Stack]
+
+ Stack1 & Stack2 & Stack3 --> Verify[Verify Deployment]
+ Verify --> UI{Deploy UI?}
+
+ UI -->|Yes| UIBuild[Build UI Stack]
+ UI -->|No| Complete
+ UIBuild --> Complete[Deployment Complete]
+ Complete --> End([Show Testing Instructions])
+```
+
+### Manual Deployment
+
+```mermaid
+flowchart TD
+ Start([Developer]) --> Clone[Clone Repository]
+ Clone --> Install[Install Dependencies]
+ Install --> Config[Configure AWS CLI]
+ Config --> Secrets[Create Secrets]
+ Secrets --> Synth[cdk synth]
+ Synth --> Deploy[cdk deploy --all]
+ Deploy --> Verify[Verify Resources]
+ Verify --> Test[Run Tests]
+ Test --> End([Deployment Complete])
+```
+
+---
+
+## Error Handling Workflows
+
+### Retry Logic
+
+```mermaid
+flowchart TD
+ Start[Operation Starts] --> Try[Attempt Operation]
+ Try --> Success{Success?}
+ Success -->|Yes| End([Complete])
+ Success -->|No| CheckRetries{Retries<br/>Remaining?}
+ CheckRetries -->|Yes| Wait[Exponential Backoff]
+ Wait --> Retry[Retry Attempt]
+ Retry --> Try
+ CheckRetries -->|No| Error[Log Error]
+ Error --> Metric[Publish Error Metric]
+ Metric --> Fail([Fail])
+```
+
+**Retry Configuration**:
+- Max attempts: 3
+- Backoff rate: 2.0
+- Initial delay: 1 second
+- Max delay: 60 seconds
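+
+The configuration above can be sketched as a delay schedule plus a retry wrapper. This is an illustration under assumptions: the function names are hypothetical, "max attempts" is read here as the total number of tries (Step Functions counts `MaxAttempts` as retries after the first try), and every exception is treated as retryable.
+
+```python
+import time
+from typing import Callable, List, TypeVar
+
+T = TypeVar("T")
+
+
+def backoff_delays(
+    max_attempts: int = 3,
+    initial_delay: float = 1.0,
+    backoff_rate: float = 2.0,
+    max_delay: float = 60.0,
+) -> List[float]:
+    """Return the wait before each retry (one fewer than max_attempts)."""
+    return [
+        min(initial_delay * backoff_rate ** attempt, max_delay)
+        for attempt in range(max_attempts - 1)
+    ]
+
+
+def call_with_retries(
+    operation: Callable[[], T],
+    max_attempts: int = 3,
+    sleep: Callable[[float], None] = time.sleep,
+) -> T:
+    """Run operation, retrying with exponential backoff on any exception."""
+    delays = backoff_delays(max_attempts)
+    for attempt in range(max_attempts):
+        try:
+            return operation()
+        except Exception:
+            if attempt == max_attempts - 1:
+                raise  # retries exhausted: surface the last error
+            sleep(delays[attempt])
+    raise ValueError("max_attempts must be at least 1")
+
+
+print(backoff_delays())  # [1.0, 2.0]
+```
+
+With the documented values, the 60-second cap only takes effect from the seventh retry onward (1, 2, 4, 8, 16, 32, then 60 seconds), so it matters only if the attempt limit is raised.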
+
+### Error Recovery
+
+#### Adobe API Failure
+1. Log error to CloudWatch
+2. Publish error metric
+3. Retry with exponential backoff
+4. If all retries fail:
+   - Mark chunk as failed
+   - Continue with other chunks
+   - Generate partial result
+
+#### Bedrock Throttling
+1. Detect throttling error
+2. Implement exponential backoff
+3. Reduce request rate
+4. Retry operation
+5. If persistent:
+   - Fall back to rule-based fixes
+   - Flag for manual review
+
+#### BDA Timeout
+1. Cancel BDA job
+2. Retry with smaller page range
+3. If timeout persists:
+   - Process pages individually
+   - Combine results
+
+---
+
+## Monitoring Workflow
+
+### Metrics Collection
+
+```mermaid
+flowchart LR
+ Lambda[Lambda/ECS] --> Emit[Emit Metrics]
+ Emit --> CW[CloudWatch Metrics]
+ CW --> Dashboard[Dashboard]
+ CW --> Alarms[CloudWatch Alarms]
+ Alarms --> SNS[SNS Notifications]
+ SNS --> Email[Email/SMS]
+```
+
+### Log Aggregation
+
+```mermaid
+flowchart LR
+ Components[All Components] --> Logs[CloudWatch Logs]
+ Logs --> Insights[CloudWatch Insights]
+ Insights --> Queries[Custom Queries]
+ Queries --> Analysis[Analysis & Debugging]
+```
+
+---
+
+## Cost Tracking Workflow
+
+```mermaid
+flowchart TD
+ Upload[User Uploads PDF] --> Tag[S3 Object Tagger]
+ Tag --> Process[Processing Pipeline]
+ Process --> Track[Usage Tracker]
+ Track --> Metrics[Publish Cost Metrics]
+ Metrics --> Dashboard[Cost Dashboard]
+ Dashboard --> Report[Per-User Cost Report]
+```
+
+**Cost Attribution**:
+1. S3 object tagged with user ID
+2. All operations track user ID
+3. Metrics published with user dimension
+4. Dashboard aggregates by user
+5. Monthly cost reports generated
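+
+The aggregation step in the list above amounts to grouping usage records by user. The sketch below assumes a hypothetical record shape with `user_id` and `estimated_cost_usd` fields; the sample values are made up for illustration and do not reflect real pricing.
+
+```python
+from collections import defaultdict
+from typing import Any, Dict, Iterable, Mapping
+
+
+def aggregate_cost_by_user(records: Iterable[Mapping[str, Any]]) -> Dict[str, float]:
+    """Sum the estimated cost of all usage records for each user."""
+    totals: Dict[str, float] = defaultdict(float)
+    for record in records:
+        totals[str(record["user_id"])] += float(record["estimated_cost_usd"])
+    return dict(totals)
+
+
+# Hypothetical usage records.
+usage_records = [
+    {"user_id": "user-a", "estimated_cost_usd": 0.25},
+    {"user_id": "user-b", "estimated_cost_usd": 0.5},
+    {"user_id": "user-a", "estimated_cost_usd": 0.125},
+]
+print(aggregate_cost_by_user(usage_records))
+```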
diff --git a/.gitignore b/.gitignore
index a140b69d..cba911f9 100644
--- a/.gitignore
+++ b/.gitignore
@@ -15,6 +15,7 @@ cdk.out/
__pycache__/
*.pyc
javascript_docker/node_modules
+lambda/title-generator-lambda/venv
lambda/add_title/venv
PDF_accessability_UI
# IDE and editor files
@@ -25,3 +26,9 @@ PDF_accessability_UI
# PDF UI (separate repo)
PDF_accessability_UI/
+
+# Stack export files
+existing-stack.json
+# Pipeline config files (may contain credentials)
+pipeline.conf
+*.conf
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 00000000..889ad4d1
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,731 @@
+# PDF Accessibility Solutions - AI Assistant Guide
+
+**Version**: 1.0
+**Last Updated**: 2026-03-02
+**Codebase Commit**: `8d6102bc644641c94f5a695a32ea50c19b3c8d68`
+
+## Purpose
+
+This document provides AI coding assistants with essential context about the PDF Accessibility Solutions codebase. It focuses on information not typically found in README.md or CONTRIBUTING.md, including file organization, coding patterns, testing procedures, and package-specific guidance.
+
+---
+
+## Table of Contents
+
+1. [Project Overview](#project-overview)
+2. [Directory Structure](#directory-structure)
+3. [Coding Patterns and Conventions](#coding-patterns-and-conventions)
+4. [Development Workflow](#development-workflow)
+5. [Testing Guidelines](#testing-guidelines)
+6. [Package-Specific Guidance](#package-specific-guidance)
+7. [Common Tasks](#common-tasks)
+8. [Troubleshooting](#troubleshooting)
+
+---
+
+## Project Overview
+
+### What This Project Does
+
+PDF Accessibility Solutions provides two complementary approaches to making PDF documents accessible according to WCAG 2.1 Level AA standards:
+
+1. **PDF-to-PDF Remediation**: Maintains PDF format while adding accessibility features (tags, alt text, structure)
+2. **PDF-to-HTML Remediation**: Converts PDFs to accessible HTML with full WCAG compliance
+
+### Key Technologies
+
+- **Infrastructure**: AWS CDK (Python & JavaScript)
+- **Compute**: AWS Lambda (Python, Java, Node.js), ECS Fargate
+- **AI/ML**: Amazon Bedrock (Nova Pro), Bedrock Data Automation
+- **Storage**: Amazon S3
+- **Orchestration**: AWS Step Functions
+- **Monitoring**: CloudWatch Logs & Metrics
+
+### Architecture Pattern
+
+Event-driven, serverless architecture:
+- S3 events trigger processing pipelines
+- Step Functions orchestrate parallel processing
+- ECS Fargate handles heavy compute tasks
+- Lambda handles lightweight operations
+
+---
+
+## Directory Structure
+
+```
+PDF_Accessibility/
+├── .agents/summary/  # AI assistant documentation (this guide's source)
+├── cdk/  # CDK infrastructure (Python)
+│   ├── usage_metrics_stack.py
+│   └── cdk_stack.py
+├── lambda/  # Lambda functions
+│   ├── pdf-splitter-lambda/  # Python: Splits PDFs into pages
+│   ├── pdf-merger-lambda/  # Java: Merges processed PDFs
+│   ├── title-generator-lambda/  # Python: Generates titles
+│   ├── pre-remediation-accessibility-checker/  # Python
+│   ├── post-remediation-accessibility-checker/  # Python
+│   ├── s3_object_tagger/  # Python: Tags S3 objects
+│   └── shared/  # Shared utilities (metrics_helper.py)
+├── pdf2html/  # PDF-to-HTML solution
+│   ├── cdk/  # CDK infrastructure (JavaScript)
+│   ├── content_accessibility_utility_on_aws/  # Core library
+│   │   ├── audit/  # Accessibility auditing
+│   │   ├── remediate/  # Accessibility remediation
+│   │   ├── pdf2html/  # PDF to HTML conversion
+│   │   ├── batch/  # Batch processing
+│   │   └── utils/  # Utilities
+│   ├── lambda_function.py  # Lambda entry point
+│   ├── metrics_helper.py  # Metrics tracking
+│   └── Dockerfile  # Lambda container image
+├── adobe-autotag-container/  # ECS: Adobe API integration (Python)
+├── alt-text-generator-container/  # ECS: Alt text generation (Node.js)
+├── docs/  # Documentation
+├── app.py  # Main CDK app (PDF-to-PDF)
+├── deploy.sh  # Unified deployment script
+└── deploy-local.sh  # Local deployment script
+```
+
+### Key File Locations
+
+**Infrastructure**:
+- PDF-to-PDF CDK: `app.py`, `cdk/usage_metrics_stack.py`
+- PDF-to-HTML CDK: `pdf2html/cdk/lib/pdf2html-stack.js`
+
+**Core Logic**:
+- PDF-to-PDF: Lambda functions in `lambda/` + ECS containers
+- PDF-to-HTML: `pdf2html/content_accessibility_utility_on_aws/`
+
+**Shared Code**:
+- Metrics: `lambda/shared/metrics_helper.py` (duplicated in containers)
+- Configuration: `pdf2html/content_accessibility_utility_on_aws/utils/config.py`
+
+**Deployment**:
+- One-click: `deploy.sh`
+- Local: `deploy-local.sh`
+- CI/CD: `buildspec-unified.yml`
+
+---
+
+## Coding Patterns and Conventions
+
+### Python Code Style
+
+**Formatting**:
+- Follow PEP 8
+- Use 4 spaces for indentation
+- Max line length: 100 characters (flexible)
+- Use type hints where practical
+
+**Naming Conventions**:
+- Functions: `snake_case`
+- Classes: `PascalCase`
+- Constants: `UPPER_SNAKE_CASE`
+- Private methods: `_leading_underscore`
+
+**Example Pattern**:
+```python
+from typing import Any, Dict, Optional
+
+import boto3
+from metrics_helper import MetricsContext
+
+
+def process_pdf_document(
+    bucket: str,
+    key: str,
+    user_id: Optional[str] = None
+) -> Dict[str, Any]:
+    """Process a PDF document for accessibility.
+
+    Args:
+        bucket: S3 bucket name
+        key: S3 object key
+        user_id: Optional user identifier for metrics
+
+    Returns:
+        Dictionary with processing results
+    """
+    with MetricsContext(user_id=user_id, solution="PDF2PDF") as metrics:
+        try:
+            # Processing logic goes here; it is expected to set page_count.
+            page_count = 0  # Placeholder so the example is complete
+            metrics.track_pages_processed(page_count)
+            return {"status": "success"}
+        except Exception as e:
+            metrics.track_error(str(e))
+            raise
+
+### JavaScript Code Style
+
+**Formatting**:
+- Use 2 spaces for indentation
+- Semicolons required
+- Use `const` by default, `let` when needed
+- Async/await for asynchronous code
+
+**Example Pattern**:
+```javascript
+const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');
+const { BedrockRuntimeClient, InvokeModelCommand } = require('@aws-sdk/client-bedrock-runtime');
+
+// imageBuffer is a Buffer; format is the image type (for example 'png' or 'jpeg').
+async function generateAltText(imageBuffer, context, format = 'png') {
+  const client = new BedrockRuntimeClient({ region: process.env.AWS_REGION });
+
+  const payload = {
+    messages: [{
+      role: 'user',
+      content: [
+        { text: `Generate alt text for this image. Context: ${context}` },
+        // InvokeModel takes a JSON body, so the image bytes are sent base64-encoded.
+        { image: { format, source: { bytes: imageBuffer.toString('base64') } } }
+      ]
+    }],
+    inferenceConfig: { maxTokens: 512, temperature: 0.7 }
+  };
+
+  const response = await client.send(new InvokeModelCommand({
+    modelId: 'amazon.nova-pro-v1:0',
+    contentType: 'application/json',
+    body: JSON.stringify(payload)
+  }));
+
+  // response.body is a Uint8Array and must be decoded before parsing.
+  const result = JSON.parse(new TextDecoder().decode(response.body));
+  return result.output.message.content[0].text;
+}
+```
+
+### Java Code Style
+
+**Formatting**:
+- Follow Google Java Style Guide
+- Use 4 spaces for indentation
+- Braces on same line
+
+**Example Pattern** (PDF Merger):
+```java
+public class App implements RequestHandler