feat(observability): Add usage metrics, per-user tracking, dashboard, and local deploy script #39
Open
AWS-fpenland wants to merge 52 commits into ASUCICREPO:main from
- Add detailed observability analysis document
- Create UsageMetricsDashboard CDK stack with:
  * Pages processed tracking
  * Bedrock token usage metrics
  * Adobe API call tracking
  * Processing duration monitoring
  * Error tracking
  * Cost estimation per file/user
- Add metrics_helper.py utility for CloudWatch metrics emission
- Add integration guide for implementing metrics in existing code
- Support per-user usage tracking via S3 object tagging
- Include cost calculation formulas for both solutions

- Add metrics layer to all Lambda functions
- Update split_pdf Lambda with page tracking and file size metrics
- Update Adobe autotag script with API call tracking
- Update PDF2HTML Lambda with Bedrock token tracking and cost estimation
- Add user attribution via S3 object tags throughout pipeline
- Deploy UsageMetricsDashboard as part of main stack
- All metrics automatically tracked with MetricsContext for duration/errors
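The duration and error tracking that MetricsContext provides can be pictured with a minimal sketch. This is not the code in metrics_helper.py, which is not shown on this page: the constructor arguments, the injected `emit` callable, and the metric names `ProcessingDuration` and `ErrorCount` are all assumptions made for illustration.

```python
import time


class MetricsContext:
    """Hypothetical sketch: time a block of work and count failures."""

    def __init__(self, service, user_id="anonymous", emit=None):
        # Dimension names follow the commit messages (Service, UserId).
        self.dimensions = {"Service": service, "UserId": user_id}
        # `emit` stands in for a CloudWatch PutMetricData wrapper (assumed).
        self.emit = emit or (lambda name, value, unit, dims: None)

    def __enter__(self):
        self._start = time.monotonic()
        return self

    def __exit__(self, exc_type, exc, tb):
        elapsed_ms = (time.monotonic() - self._start) * 1000.0
        # Metric names below are illustrative, not taken from the PR.
        self.emit("ProcessingDuration", elapsed_ms, "Milliseconds", self.dimensions)
        if exc_type is not None:
            self.emit("ErrorCount", 1, "Count", self.dimensions)
        return False  # never swallow the caller's exception
```

Injecting `emit` keeps the sketch testable without AWS credentials; a real helper would presumably call CloudWatch directly.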
- Add self.bucket = bucket to make bucket accessible to UsageMetricsDashboard
- Fixes AttributeError when deploying observability stack

…ics helper
- Move metrics_helper.py to python/ subdirectory for proper Lambda layer structure
- Lambda layers require python/ directory for Python packages to be importable

…gregation
- Add metrics_helper.py to Docker container
- Update dashboard to use SEARCH expression for aggregating metrics across dimensions
- Fix import path in autotag.py for Docker environment
- Resolves Adobe API metrics not being tracked in ECS tasks

…gging
- Create S3 tagger Lambda to convert metadata to tags
- Extract user-sub from Cognito UI uploads (metadata)
- Apply UserId tag for consistent metrics tracking
- Support both authenticated UI and direct S3 uploads
- Fallback to 'anonymous' for untagged uploads
- Add tagger to both PDF-to-PDF and PDF-to-HTML stacks
- Document per-user tracking architecture and usage

- Docker build context cannot access parent directories
- Copy metrics_helper.py directly into docker_autotag/
- Simplify Dockerfile to use COPY . /app

- Remove separate S3 tagger Lambda (caused S3 notification conflicts)
- Add tagging logic directly to split_pdf and pdf2html Lambdas
- Convert metadata to tags at start of processing
- Avoids overlapping S3 event notification rules
- Maintains same user attribution functionality
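The metadata-to-tag conversion described above might look roughly like the following. The `user-sub` metadata key, the `UserId` tag key, and the `anonymous` fallback come from the commit messages; the function names and the rule that an existing tag wins over metadata are assumptions. The returned dictionary uses the `TagSet` shape that boto3's `put_object_tagging` accepts for its `Tagging` argument.

```python
def resolve_user_id(metadata, existing_tags):
    """Pick the user for attribution (precedence order is an assumption)."""
    if existing_tags.get("UserId"):
        return existing_tags["UserId"]            # already tagged upstream
    return metadata.get("user-sub") or "anonymous"  # Cognito UI upload, else fallback


def build_tagging(existing_tags, user_id):
    """Merge the UserId tag into the object's tags without dropping others."""
    merged = dict(existing_tags)
    merged["UserId"] = user_id
    return {"TagSet": [{"Key": k, "Value": v} for k, v in sorted(merged.items())]}
```

Doing this at the start of processing inside the existing Lambdas, rather than in a separate tagger function, avoids a second S3 event notification on the same prefix.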
- Remove FileName from all metric dimensions
- Aggregate at Service/UserId level only
- Reduces metric streams from per-file to per-user
- Fix undefined files_metric in dashboard
- Simplifies dashboard queries

- Direct metric queries with Service dimension
- Sum for pages, SampleCount for files
- Avoids SEARCH expression array division errors

- Metrics have Service+UserId dimensions
- Use SUM() to aggregate SEARCH results into single value
- Fixes empty dashboard widgets

- ECS tasks were getting AccessDenied when emitting metrics
- Add cloudwatch:PutMetricData to ecs_task_role policy
- Enables Adobe API call metrics from ECS containers

- Lambda function names have generated suffixes
- Expose log group names from main stack
- Pass actual names to dashboard
- Fixes 'log group does not exist' error in per-user widgets

- Log Insights queries failed (logs are plain text, not JSON)
- Use CloudWatch Metrics with UserId dimension instead
- Add Adobe API calls widget
- Metrics already working and more efficient
- Shows per-user breakdown with legend
- See docs/METRICS_STATUS.md for full analysis

- Rewrite dashboard: remove broken/duplicate widgets, keep 6 focused sections
- Per-user widgets use same SEARCH as totals (without SUM wrapper)
- Remove FileName from all metric dimensions (adobe, bedrock, cost)
- Pass USER_ID env var from Step Function Map state to ECS tasks
- Add user_id to chunk metadata so Map state can access it
- Sync metrics_helper.py to docker_autotag
- Emit JSON log line {event, userId, fileName, pageCount, service} from both Lambdas
- Log Insights table: files & pages aggregated by userId
- Log Insights table: recent processing activity with details
- Replaces graph widgets with table format for per-user section
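A structured log line with the fields listed above could be produced as in this sketch. The field names come from the commit message; the function name and the example event value are hypothetical.

```python
import json


def usage_log_line(event, user_id, file_name, page_count, service):
    """Serialize one usage event as a single JSON line for Logs Insights."""
    return json.dumps(
        {
            "event": event,
            "userId": user_id,
            "fileName": file_name,
            "pageCount": page_count,
            "service": service,
        }
    )


# Printing one JSON object per line lets Logs Insights discover the fields.
print(usage_log_line("file_processed", "anonymous", "report.pdf", 12, "pdf2pdf"))
```

Because each line is valid JSON, a Logs Insights query can aggregate files and pages by userId without parsing free text.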
- SEARCH('{PDFAccessibility,Service,Operation}') matched ZERO metrics
  because all AdobeAPICalls have 3 dims (FileName or UserId added)
- Use SEARCH('{PDFAccessibility} MetricName=...') to match any dim set
- Add AdobeDocTransactions metric per Adobe licensing model:
  AutoTag = 10 doc transactions/page, ExtractPDF = ceil(pages/5)
- Pass page_count from PdfReader to track_adobe_api_call
- Add Document Transactions widget + quota info to dashboard
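The transaction arithmetic stated in this commit can be written out as a small function. The two rates are taken from the commit message; the function name and the use of plain strings for the operation are assumptions.

```python
import math


def adobe_doc_transactions(operation, page_count):
    """Document transactions per the rates quoted in the commit message."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if operation == "AutoTag":
        return 10 * page_count             # 10 doc transactions per page
    if operation == "ExtractPDF":
        return math.ceil(page_count / 5)   # one transaction per 5 pages, rounded up
    raise ValueError(f"unknown operation: {operation}")
```

For a 12-page chunk this gives 120 transactions for AutoTag and 3 for ExtractPDF.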
- pdf2html is a DockerImageFunction - Lambda layers don't work
- Copy metrics_helper.py into pdf2html Docker build context
- Add COPY to Dockerfile so it's included in the image
- Remove /opt/python path hack (file is now in /var/task)
- Add cloudwatch:PutMetricData permission to Lambda role
- Always include pdf2html log group in dashboard queries

- Separate --init (first-time) from update (default) flows
- --init creates secrets, BDA project, S3 bucket, ECR repo
- Updates reuse existing BDA project from CloudFormation params
- Always sync metrics_helper.py to Docker build contexts
- pdf2html: build/push Docker + force Lambda image update
- pdf2pdf: CDK handles Docker via DockerImageAsset automatically
- Support --pdf2pdf, --pdf2html, --all, --profile, --region flags
- No more duplicate BDA projects on every run

- 'with MetricsContext(...):' had no indented body (line 190)
- All code after it was at same indent level, not inside the with
- Python raised SyntaxError on import, Lambda couldn't start at all
- Replace with explicit __enter__/__exit__ to avoid re-indenting 250 lines
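Calling __enter__ and __exit__ by hand only matches a with block if __exit__ still receives the exception details on failure. The sketch below shows one way to keep that behavior; `Recorder` and `run_with_explicit_context` are hypothetical stand-ins, not names from the PR.

```python
import sys


class Recorder:
    """Toy context manager that records how it was exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.events.append("exit-error" if exc_type else "exit-ok")
        return False


def run_with_explicit_context(ctx, work):
    """Equivalent of `with ctx: return work()` without an indented block."""
    ctx.__enter__()
    try:
        result = work()
    except BaseException:
        # Hand the live exception to __exit__, re-raise unless it is suppressed.
        if not ctx.__exit__(*sys.exc_info()):
            raise
        return None
    ctx.__exit__(None, None, None)
    return result
```

The trade-off is a smaller diff now against code that is easier to get wrong later; re-indenting under a real with statement remains the more conventional form.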
- Profile region (us-west-2) differs from deployment region (us-east-1)
- Script now checks if Pdf2HtmlStack exists and uses its region
- Prevents pushing to wrong ECR region
…pace
- SEARCH('{PDFAccessibility}') matches only metrics with ZERO dimensions
- Must specify dimension names: '{PDFAccessibility,Service,Operation,UserId}'
- Verified with get-metric-data: exact dims returns data, namespace-only returns empty
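Since the schema inside the braces has to list the exact dimension names, building the expression from a list makes the requirement explicit. This is a sketch of the string construction only: the helper name, the default statistic, and the 300-second period are assumptions, while the namespace, the dimension names, and the SUM wrapper come from the commit messages.

```python
def search_expression(namespace, dimensions, metric_name,
                      stat="Sum", period=300, aggregate=True):
    """Build a CloudWatch metric-math SEARCH string for a dashboard widget."""
    # The schema must name every dimension the metric was published with.
    schema = ",".join([namespace, *dimensions])
    search = f"SEARCH('{{{schema}}} MetricName=\"{metric_name}\"', '{stat}', {period})"
    # SUM(...) collapses the matched series into one total;
    # without it the widget shows one series per dimension combination.
    return f"SUM({search})" if aggregate else search


print(search_expression("PDFAccessibility",
                        ["Service", "Operation", "UserId"],
                        "AdobeAPICalls"))
```

Passing `aggregate=False` yields the unwrapped form that the per-user widgets are described as using.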
The lambda/add_title/venv/ directory (558 files including pip, pymupdf, and binary .so files) was accidentally committed despite being listed in .gitignore. Remove from git tracking while preserving the .gitignore entry to prevent recurrence.
File contained a real AWS account ID and is not needed for upstream contribution. Added to .gitignore to prevent accidental re-commit.
lambda/shared/metrics_helper.py was missing page_count param in track_adobe_api_call and still included FileName dimension. Now matches the other three copies.
A bare except also catches KeyboardInterrupt and SystemExit, which makes debugging harder. Both clauses are in tag retrieval fallback paths added by the dev branch.
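The difference can be seen in a small fallback helper. `user_id_from_tags` and its `fetch_tags` argument are hypothetical; only the idea of a tag-retrieval fallback comes from the commit message.

```python
def user_id_from_tags(fetch_tags, default="anonymous"):
    """Return the UserId tag, falling back to a default if lookup fails."""
    try:
        tags = fetch_tags()
    except Exception as exc:
        # `except Exception` lets KeyboardInterrupt and SystemExit propagate,
        # unlike a bare `except:` which would swallow them too.
        print(f"Tag lookup failed, using {default!r}: {exc}")
        return default
    return tags.get("UserId", default)
```

Ordinary failures such as a permissions error still fall back to the default, while an interrupt or interpreter exit is no longer masked.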
Rename directories back to match main branch naming to minimize diff for upstream PR. All path references in app.py, deploy-local.sh, .gitignore, and docs updated accordingly. Observability features preserved.
Replace 7 separate observability docs with single OBSERVABILITY.md. Restore IAM_PERMISSIONS.md, MANUAL_DEPLOYMENT.md, and CONFIGURING_LIMITS.md from main branch. Remove hardcoded WSL paths.
Rename autotag.py, alt-text.js, and myapp.py to match main branch names. Update Dockerfiles, app.py handler, and docs references accordingly.
Use main's optimized multi-stage build for the alt-text-generator container (node:22-slim, separate builder/production stages, smaller final image).
Start from main's app.py and surgically add only observability features: metrics layer, CloudWatch PutMetricData permission, USER_ID env var, S3 tagging permissions, log group exports, and UsageMetrics stack. Preserves main's VPC endpoints, zstd compression, scoped IAM policies, and naming conventions.
Start from main's .gitignore, add only the existing-stack.json exclusion.
Start from main's Dockerfile and adobe_autotag_processor.py, add only metrics imports and tracking calls to autotag and extract_api functions.
- Add interactive solution selection when no flags given
- Prompt for Adobe credentials when secret missing
- Create ECR repo before Docker push (was --init only)
- Create BDA project automatically if none exists
- Create S3 bucket and CORS for pdf2html if missing
- Use pip3 to match python3 interpreter
- Add CDK bootstrap before every deploy
- Add retry logic for CDK deploy
- Add --app flag for pdf2html CDK deploy
- Add Docker push retry with ECR login refresh
- Remove --init flag (resources created on demand)

- Pass user_id to Adobe API tracking calls in autotag container
- Add structured JSON logs to autotag container for log query widgets
- Add Bedrock metrics tracking to title-generator and alt-text-generator
- Add @aws-sdk/client-cloudwatch dependency to alt-text container
- Fix dashboard Bedrock widgets to query PDFAccessibility namespace
- Include all log groups (JS container, pdf2html) in dashboard queries
Remove structured log from autotag container since it processes chunks not files, causing duplicate entries with _chunk_1 suffix. The splitter already emits the correct file-level event. Move pdf2html structured log outside usage_data.json dependency with pypdf fallback so it always emits even without usage data.
Separate the structured log into its own try/except block so it fires even if metrics tracking (estimate_cost, etc.) throws. The previous code had both in the same try block, so any exception in cost estimation would skip the dashboard log entirely.
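The isolation described here amounts to giving each step its own try/except. In this sketch `track_metrics` and `emit_usage_log` are hypothetical callables standing in for the Lambda's metrics and logging code.

```python
def finish_processing(track_metrics, emit_usage_log):
    """Run both steps independently so one failure cannot skip the other."""
    outcome = {"metrics": True, "log": True}
    try:
        track_metrics()
    except Exception as exc:
        print(f"Metrics tracking failed: {exc}")
        outcome["metrics"] = False
    # Separate block: reached even when the metrics step raised above.
    try:
        emit_usage_log()
    except Exception as exc:
        print(f"Usage log failed: {exc}")
        outcome["log"] = False
    return outcome
```

With a single shared try block, an exception in the first call would jump straight to the handler and the log line would never be written.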
Revert deploy.sh IAM policies to main's scoped versions. The wildcard Resource:* on all policy statements was a security regression. Main's policies already include cloudwatch:PutMetricData and PutDashboard which is all our observability features need.
Restore main's scoped Bedrock model ARNs, BDA project permissions, and log group ARN. Keep new observability additions: s3:GetObjectTagging, s3:PutObjectTagging, and cloudwatch:PutMetricData.
These files had typo regressions (remidiation, accessability) from rebasing. Our observability work does not modify these Lambdas.
Restore MODEL_ID_ALT_TEXT/MODEL_ID_LINK_ALT_TEXT constants, modifyPDF throw-on-error, success/failure counting with all-failed exit guard, progress logging, and sleep(2000). Keep observability additions: CloudWatch metrics tracking for Bedrock invocations and token usage.
Summary
Adds comprehensive observability to both PDF-to-PDF and PDF-to-HTML pipelines, including custom CloudWatch metrics, per-user usage attribution, a dedicated monitoring dashboard, and a local deployment script.
What's included
Observability (metrics_helper.py, usage_metrics_stack.py)
- PDFAccessibility CloudWatch namespace
- MetricsContext context manager for automatic duration and error tracking

Per-user attribution (s3_object_tagger/, splitter + pdf2html Lambdas)
- UserId from Cognito metadata on upload
- Service + UserId dimensions for per-user breakdown
- anonymous for direct uploads without Cognito

CloudWatch dashboard (PDFAccessibilityUsageMetrics stack)
- SUM(SEARCH(...))

Instrumented components

Deployment tooling
- deploy-local.sh: deploy both pipelines from local repo without CodeBuild/GitHub
- buildspec-unified.yml for S3 bucket
- cdk.json updated to use python3 explicitly

CDK changes (app.py)
- cloudwatch:PutMetricData added to ECS task role
- s3:GetObjectTagging / s3:PutObjectTagging for splitter Lambda
- USER_ID env var passed to ECS tasks via Step Functions

Documentation
- docs/OBSERVABILITY.md: full metrics reference, dimensions, per-user tracking flow, cost estimation, dashboard guide

What's NOT changed
- deploy.sh IAM policies