feat(observability): Add usage metrics, per-user tracking, dashboard, and local deploy script #39

Open
AWS-fpenland wants to merge 52 commits into ASUCICREPO:main from AWS-fpenland:dev

Conversation

@AWS-fpenland

Summary

Adds comprehensive observability to both PDF-to-PDF and PDF-to-HTML pipelines, including custom CloudWatch metrics, per-user usage attribution, a dedicated monitoring dashboard, and a local deployment script.

What's included

Observability (metrics_helper.py, usage_metrics_stack.py)

  • Shared Python metrics library emitting to the PDFAccessibility CloudWatch namespace
  • Tracks: pages processed, file size, Adobe API calls/Document Transactions, Bedrock invocations/tokens, processing duration, errors, and estimated cost
  • MetricsContext context manager for automatic duration and error tracking
  • Graceful degradation — metrics failures are caught and logged without interrupting processing
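
The MetricsContext behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in metrics_helper.py: the `emit` callable, its signature, and the metric names `ProcessingDuration` and `Errors` are assumptions made for the example. The real helper presumably wraps CloudWatch PutMetricData.

```python
import time
from typing import Callable, Optional

# Hypothetical emitter signature: (metric_name, value, unit, dimensions) -> None.
Emitter = Callable[[str, float, str, dict], None]


class MetricsContext:
    """Times a block of work and counts errors without raising from metrics code."""

    def __init__(self, service: str, user_id: str, emit: Emitter):
        # Service + UserId are the two dimensions named in this PR.
        self.dimensions = {"Service": service, "UserId": user_id}
        self._emit = emit
        self._start: Optional[float] = None

    def _safe_emit(self, name: str, value: float, unit: str) -> None:
        # Graceful degradation: a metrics failure is logged, never propagated.
        try:
            self._emit(name, value, unit, self.dimensions)
        except Exception as exc:
            print(f"metrics emit failed for {name}: {exc}")

    def __enter__(self) -> "MetricsContext":
        self._start = time.monotonic()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        elapsed_ms = (time.monotonic() - self._start) * 1000.0
        # Metric names below are assumed for illustration.
        self._safe_emit("ProcessingDuration", elapsed_ms, "Milliseconds")
        if exc_type is not None:
            self._safe_emit("Errors", 1, "Count")
        return False  # never swallow the caller's exception
```

Passing the emitter in as a parameter keeps the sketch testable without AWS credentials; a production version would more likely call the CloudWatch client directly.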

Per-user attribution (s3_object_tagger/, splitter + pdf2html Lambdas)

  • S3 object tagging with UserId from Cognito metadata on upload
  • All metrics include Service + UserId dimensions for per-user breakdown
  • Falls back to anonymous for direct uploads without Cognito
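
As a rough sketch of the attribution step above: the function below resolves a user ID from an S3 object's metadata and builds the tag payload. It assumes the metadata arrives as a plain dict (as in the `Metadata` field of an S3 head-object response) and that the Cognito upload stores the ID under the `user-sub` key mentioned in the commit notes. The function names are hypothetical.

```python
ANONYMOUS = "anonymous"


def resolve_user_id(metadata) -> str:
    """Return the uploader's ID from S3 object metadata, or 'anonymous'."""
    # Assumption: the UI writes the Cognito subject under the "user-sub" key.
    value = (metadata or {}).get("user-sub", "")
    return value.strip() or ANONYMOUS


def build_tag_set(user_id: str) -> dict:
    """Build a dict in the shape of the Tagging argument of put_object_tagging."""
    return {"TagSet": [{"Key": "UserId", "Value": user_id}]}


# Direct uploads carry no Cognito metadata and fall back to the default.
print(build_tag_set(resolve_user_id({})))
```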

CloudWatch dashboard (PDFAccessibilityUsageMetrics stack)

  • Aggregate pages/files processed via SUM(SEARCH(...))
  • Per-user usage tables via Log Insights structured queries
  • Bedrock invocation and token usage graphs
  • Adobe Document Transaction tracking (quota visibility)
  • Processing performance and error monitoring
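
To make the SUM(SEARCH(...)) pattern concrete, here is a small helper that builds the expression string for a dashboard widget. It is a sketch under assumptions: the helper name and its defaults (statistic `Sum`, period 300 seconds) are illustrative, and the dimension list must match the dimensions the metric was actually published with.

```python
def search_expression(namespace, dimension_names, metric_name,
                      stat="Sum", period=300, aggregate=True):
    """Build a CloudWatch metric-math SEARCH expression as a string."""
    # The schema lists the namespace followed by the exact dimension names.
    schema = ",".join([namespace, *dimension_names])
    inner = (
        f"SEARCH('{{{schema}}} MetricName=\"{metric_name}\"', "
        f"'{stat}', {period})"
    )
    # Totals wrap the search in SUM(); per-user widgets use the bare SEARCH.
    return f"SUM({inner})" if aggregate else inner


print(search_expression("PDFAccessibility", ["Service", "UserId"], "PagesProcessed"))
```

With `aggregate=False` the same call yields one time series per matching Service/UserId combination, which is the form the per-user widgets rely on.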

Instrumented components

| Component | Metrics added |
| --- | --- |
| PDF Splitter Lambda | PagesProcessed, FileSize, structured logs |
| Adobe AutoTag (ECS) | AdobeAPICalls, AdobeDocTransactions |
| Alt Text Generator (ECS) | BedrockInvocations, token usage |
| Title Generator Lambda | BedrockInvocations, token usage |
| PDF-to-HTML Lambda | PagesProcessed, Bedrock usage, EstimatedCost |

Deployment tooling

  • deploy-local.sh — deploy both pipelines from local repo without CodeBuild/GitHub
  • CORS policy added to buildspec-unified.yml for S3 bucket
  • cdk.json updated to use python3 explicitly

CDK changes (app.py)

  • Lambda Layer for metrics helper attached to all Python Lambdas
  • cloudwatch:PutMetricData added to ECS task role
  • s3:GetObjectTagging/s3:PutObjectTagging for splitter Lambda
  • USER_ID env var passed to ECS tasks via Step Functions
  • Bucket and log group names exposed for dashboard stack

Documentation

  • docs/OBSERVABILITY.md — full metrics reference, dimensions, per-user tracking flow, cost estimation, dashboard guide
  • README updated with observability section

What's NOT changed

  • No changes to deploy.sh IAM policies
  • No changes to pre/post remediation accessibility checkers
  • No changes to alt-text error handling or retry logic
  • All existing Bedrock model configurations preserved

- Add detailed observability analysis document
- Create UsageMetricsDashboard CDK stack with:
  * Pages processed tracking
  * Bedrock token usage metrics
  * Adobe API call tracking
  * Processing duration monitoring
  * Error tracking
  * Cost estimation per file/user
- Add metrics_helper.py utility for CloudWatch metrics emission
- Add integration guide for implementing metrics in existing code
- Support per-user usage tracking via S3 object tagging
- Include cost calculation formulas for both solutions
- Add metrics layer to all Lambda functions
- Update split_pdf Lambda with page tracking and file size metrics
- Update Adobe autotag script with API call tracking
- Update PDF2HTML Lambda with Bedrock token tracking and cost estimation
- Add user attribution via S3 object tags throughout pipeline
- Deploy UsageMetricsDashboard as part of main stack
- All metrics automatically tracked with MetricsContext for duration/errors
- Add self.bucket = bucket to make bucket accessible to UsageMetricsDashboard
- Fixes AttributeError when deploying observability stack
…ics helper

- Move metrics_helper.py to python/ subdirectory for proper Lambda layer structure
- Lambda layers require python/ directory for Python packages to be importable
…gregation

- Add metrics_helper.py to Docker container
- Update dashboard to use SEARCH expression for aggregating metrics across dimensions
- Fix import path in autotag.py for Docker environment
- Resolves Adobe API metrics not being tracked in ECS tasks
…gging

- Create S3 tagger Lambda to convert metadata to tags
- Extract user-sub from Cognito UI uploads (metadata)
- Apply UserId tag for consistent metrics tracking
- Support both authenticated UI and direct S3 uploads
- Fallback to 'anonymous' for untagged uploads
- Add tagger to both PDF-to-PDF and PDF-to-HTML stacks
- Document per-user tracking architecture and usage
- Docker build context cannot access parent directories
- Copy metrics_helper.py directly into docker_autotag/
- Simplify Dockerfile to use COPY . /app
- Remove separate S3 tagger Lambda (caused S3 notification conflicts)
- Add tagging logic directly to split_pdf and pdf2html Lambdas
- Convert metadata to tags at start of processing
- Avoids overlapping S3 event notification rules
- Maintains same user attribution functionality
- Remove FileName from all metric dimensions
- Aggregate at Service/UserId level only
- Reduces metric streams from per-file to per-user
- Fix undefined files_metric in dashboard
- Simplifies dashboard queries
- Direct metric queries with Service dimension
- Sum for pages, SampleCount for files
- Avoids SEARCH expression array division errors
- Metrics have Service+UserId dimensions
- Use SUM() to aggregate SEARCH results into single value
- Fixes empty dashboard widgets
- ECS tasks were getting AccessDenied when emitting metrics
- Add cloudwatch:PutMetricData to ecs_task_role policy
- Enables Adobe API call metrics from ECS containers
- Lambda function names have generated suffixes
- Expose log group names from main stack
- Pass actual names to dashboard
- Fixes 'log group does not exist' error in per-user widgets
- Log Insights queries failed (logs are plain text, not JSON)
- Use CloudWatch Metrics with UserId dimension instead
- Add Adobe API calls widget
- Metrics already working and more efficient
- Shows per-user breakdown with legend
- See docs/METRICS_STATUS.md for full analysis
- Rewrite dashboard: remove broken/duplicate widgets, keep 6 focused sections
- Per-user widgets use same SEARCH as totals (without SUM wrapper)
- Remove FileName from all metric dimensions (adobe, bedrock, cost)
- Pass USER_ID env var from Step Function Map state to ECS tasks
- Add user_id to chunk metadata so Map state can access it
- Sync metrics_helper.py to docker_autotag
- Emit JSON log line {event, userId, fileName, pageCount, service} from both Lambdas
- Log Insights table: files & pages aggregated by userId
- Log Insights table: recent processing activity with details
- Replaces graph widgets with table format for per-user section
- SEARCH('{PDFAccessibility,Service,Operation}') matched ZERO metrics
  because all AdobeAPICalls have 3 dims (FileName or UserId added)
- Use SEARCH('{PDFAccessibility} MetricName=...') to match any dim set
- Add AdobeDocTransactions metric per Adobe licensing model:
  AutoTag = 10 doc transactions/page, ExtractPDF = ceil(pages/5)
- Pass page_count from PdfReader to track_adobe_api_call
- Add Document Transactions widget + quota info to dashboard
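
The Document Transaction arithmetic in this commit can be written out as a short function. The rates are taken from the commit note above and have not been checked against Adobe's licensing terms; the function name is hypothetical.

```python
import math


def adobe_doc_transactions(operation: str, page_count: int) -> int:
    """Estimate Adobe Document Transactions for one API call."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if operation == "AutoTag":
        # Per the commit note: 10 document transactions per page.
        return 10 * page_count
    if operation == "ExtractPDF":
        # Per the commit note: one transaction per started block of 5 pages.
        return math.ceil(page_count / 5)
    raise ValueError(f"unknown operation: {operation}")


# A 12-page chunk: 120 transactions for AutoTag, 3 for ExtractPDF.
print(adobe_doc_transactions("AutoTag", 12), adobe_doc_transactions("ExtractPDF", 12))
```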
- pdf2html is a DockerImageFunction - Lambda layers don't work
- Copy metrics_helper.py into pdf2html Docker build context
- Add COPY to Dockerfile so it's included in the image
- Remove /opt/python path hack (file is now in /var/task)
- Add cloudwatch:PutMetricData permission to Lambda role
- Always include pdf2html log group in dashboard queries
- Separate --init (first-time) from update (default) flows
- --init creates secrets, BDA project, S3 bucket, ECR repo
- Updates reuse existing BDA project from CloudFormation params
- Always sync metrics_helper.py to Docker build contexts
- pdf2html: build/push Docker + force Lambda image update
- pdf2pdf: CDK handles Docker via DockerImageAsset automatically
- Support --pdf2pdf, --pdf2html, --all, --profile, --region flags
- No more duplicate BDA projects on every run
- 'with MetricsContext(...):' had no indented body (line 190)
- All code after it was at same indent level, not inside the with
- Python raised SyntaxError on import, Lambda couldn't start at all
- Replace with explicit __enter__/__exit__ to avoid re-indenting 250 lines
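
For readers unfamiliar with the explicit form, the sketch below shows one way to call `__enter__` and `__exit__` by hand while keeping the semantics of a `with` statement. `Recorder` is a stand-in context manager invented for this example, not the PR's MetricsContext, and `handler` is a hypothetical function body.

```python
import sys


class Recorder:
    """Stand-in context manager that records how the block ended."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.events.append("error" if exc_type else "ok")
        return False  # do not suppress exceptions


def handler(ctx, fail=False):
    ctx.__enter__()
    try:
        if fail:
            raise ValueError("boom")
        result = "done"
    except BaseException:
        # Forward the exception details, as a with statement would,
        # and re-raise unless the context manager asks to suppress it.
        if not ctx.__exit__(*sys.exc_info()):
            raise
        return None
    else:
        ctx.__exit__(None, None, None)
    return result
```

The long function body stays at its original indentation, which is the point of the change; the cost is that the enter/exit pairing is now maintained by hand.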
- Profile region (us-west-2) differs from deployment region (us-east-1)
- Script now checks if Pdf2HtmlStack exists and uses its region
- Prevents pushing to wrong ECR region
…pace

- SEARCH('{PDFAccessibility}') matches only metrics with ZERO dimensions
- Must specify dimension names: '{PDFAccessibility,Service,Operation,UserId}'
- Verified with get-metric-data: exact dims returns data, namespace-only returns empty
The lambda/add_title/venv/ directory (558 files including pip,
pymupdf, and binary .so files) was accidentally committed despite
being listed in .gitignore. Remove from git tracking while
preserving the .gitignore entry to prevent recurrence.
File contained a real AWS account ID and is not needed
for upstream contribution. Added to .gitignore to prevent
accidental re-commit.
lambda/shared/metrics_helper.py was missing page_count
param in track_adobe_api_call and still included FileName
dimension. Now matches the other three copies.
Bare except catches KeyboardInterrupt and SystemExit
which makes debugging harder. Both clauses are in tag
retrieval fallback paths added by the dev branch.
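
A minimal illustration of the difference, assuming a hypothetical `fetch_tags` callable that returns a dict of tags: catching `Exception` still provides the fallback for ordinary failures, while `SystemExit` and `KeyboardInterrupt` propagate.

```python
def read_user_tag(fetch_tags) -> str:
    """Return the UserId tag, falling back to 'anonymous' on ordinary errors."""
    try:
        tags = fetch_tags()
    except Exception as exc:
        # SystemExit and KeyboardInterrupt derive from BaseException,
        # so they are not caught here and still reach the caller.
        print(f"tag lookup failed, using fallback: {exc}")
        return "anonymous"
    return tags.get("UserId", "anonymous")
```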
Rename directories back to match main branch naming
to minimize diff for upstream PR. All path references
in app.py, deploy-local.sh, .gitignore, and docs
updated accordingly. Observability features preserved.
Replace 7 separate observability docs with single
OBSERVABILITY.md. Restore IAM_PERMISSIONS.md,
MANUAL_DEPLOYMENT.md, and CONFIGURING_LIMITS.md
from main branch. Remove hardcoded WSL paths.
Rename autotag.py, alt-text.js, and myapp.py to match
main branch names. Update Dockerfiles, app.py handler,
and docs references accordingly.
Use main's optimized multi-stage build for the
alt-text-generator container (node:22-slim, separate
builder/production stages, smaller final image).
Start from main's app.py and surgically add only
observability features: metrics layer, CloudWatch
PutMetricData permission, USER_ID env var, S3 tagging
permissions, log group exports, and UsageMetrics stack.
Preserves main's VPC endpoints, zstd compression,
scoped IAM policies, and naming conventions.
Start from main's .gitignore, add only the
existing-stack.json exclusion.
Start from main's Dockerfile and adobe_autotag_processor.py,
add only metrics imports and tracking calls to autotag
and extract_api functions.
- Add interactive solution selection when no flags given
- Prompt for Adobe credentials when secret missing
- Create ECR repo before Docker push (was --init only)
- Create BDA project automatically if none exists
- Create S3 bucket and CORS for pdf2html if missing
- Use pip3 to match python3 interpreter
- Add CDK bootstrap before every deploy
- Add retry logic for CDK deploy
- Add --app flag for pdf2html CDK deploy
- Add Docker push retry with ECR login refresh
- Remove --init flag (resources created on demand)
- Pass user_id to Adobe API tracking calls in autotag container
- Add structured JSON logs to autotag container for log query widgets
- Add Bedrock metrics tracking to title-generator and alt-text-generator
- Add @aws-sdk/client-cloudwatch dependency to alt-text container
- Fix dashboard Bedrock widgets to query PDFAccessibility namespace
- Include all log groups (JS container, pdf2html) in dashboard queries
Remove structured log from autotag container since it processes
chunks not files, causing duplicate entries with _chunk_1 suffix.
The splitter already emits the correct file-level event.

Move pdf2html structured log outside usage_data.json dependency
with pypdf fallback so it always emits even without usage data.
Separate the structured log into its own try/except block so it
fires even if metrics tracking (estimate_cost etc) throws. The
previous code had both in the same try block, so any exception
in cost estimation would skip the dashboard log entirely.
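
The separation described here might look roughly like the following. The function name, the `track_metrics` callable, and the event value `file_processed` are assumptions for illustration; the JSON field names follow the structured log line listed in an earlier commit.

```python
import json


def report_usage(user_id, file_name, page_count, service, track_metrics):
    """Run metrics tracking and the dashboard log line independently."""
    # Block 1: metrics tracking (cost estimation etc.) may fail on its own.
    try:
        track_metrics(page_count)
    except Exception as exc:
        print(f"metrics tracking failed: {exc}")

    # Block 2: the structured log is emitted regardless of block 1's outcome.
    try:
        line = json.dumps({
            "event": "file_processed",  # hypothetical event name
            "userId": user_id,
            "fileName": file_name,
            "pageCount": page_count,
            "service": service,
        })
        print(line)
        return line
    except Exception as exc:
        print(f"structured log failed: {exc}")
        return None
```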
Revert deploy.sh IAM policies to main's scoped versions.
The wildcard Resource:* on all policy statements was a
security regression. Main's policies already include
cloudwatch:PutMetricData and PutDashboard which is all
our observability features need.
Restore main's scoped Bedrock model ARNs, BDA project
permissions, and log group ARN. Keep new observability
additions: s3:GetObjectTagging, s3:PutObjectTagging,
and cloudwatch:PutMetricData.
These files had typo regressions (remidiation,
accessability) from rebasing. Our observability work
does not modify these Lambdas.
Restore MODEL_ID_ALT_TEXT/MODEL_ID_LINK_ALT_TEXT constants,
modifyPDF throw-on-error, success/failure counting with
all-failed exit guard, progress logging, and sleep(2000).
Keep observability additions: CloudWatch metrics tracking
for Bedrock invocations and token usage.
AWS-fpenland marked this pull request as ready for review on March 2, 2026, 18:15