All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- LocalStack Community deployment (deploy/localstack/): Build script (auto-detects host arch arm64/amd64), Python deploy script using boto3 to match production Terraform resource shape, and Makefile targets for full local E2E testing. Creates DynamoDB tables with streams, EventBridge custom bus + rules, SQS alert queue, 6 Lambda functions, Step Functions state machine, IAM dummy roles, and event source mappings. Enables running real interlock Lambdas locally against LocalStack for integration testing without AWS costs.
- SKIP_SCHEDULER env var: sla-monitor Lambda no-ops EventBridge Scheduler calls (CreateSchedule, DeleteSchedule) when SKIP_SCHEDULER=true. Allows deployment to LocalStack Community (Scheduler is Pro-only) while preserving production behavior. Guards applied in internal/lambda/sla_monitor.go, internal/lambda/sla/, and internal/lambda/watchdog/.
- Dead-letter queue subsystem (internal/dlq/): Typed error classification (transient vs permanent), SQS routing with slog fallback when SQS is unreachable, ULID-based record IDs, and per-record metrics counter interface. Includes no-op router for testing and dry-run modes.
- Stream batch handler (internal/handler/): Implements AWS ReportBatchItemFailures for partial batch processing. Uses SequenceNumber (not EventID) per AWS contract. Enforces accounting invariant: processed + dlq_routed + batch_failures == total.
- Lambda context middleware (internal/aws/lambda/): Derives context.WithTimeout from Lambda's remaining execution time minus a configurable safety buffer (default 500ms). Floors at 50ms to prevent zero/negative timeouts.
- OpenTelemetry initialization (internal/telemetry/): OTLP gRPC trace and metric exporters with graceful no-op fallback when OTEL_EXPORTER_OTLP_ENDPOINT is unset. Defines 6 application metrics: records processed, stage duration, rules evaluated, DLQ routed, worker pool active, circuit breaker state.
- Structured logging with correlation IDs (internal/telemetry/): slog handler wrapper that injects correlation_id from context into every log record. JSON output with source location.
- Circuit breaker for HTTP evaluators (internal/client/): Wraps external HTTP calls with sony/gobreaker. Configurable trip thresholds, nil-safe defaults, and state introspection.
- Exponential backoff retry (internal/resilience/): Context-aware retry with jitter clamping to [0, 1], proper time.NewTimer cleanup (no timer leaks), and configurable max retries/delay.
- Bounded worker pool (internal/concurrency/): Thin wrapper around errgroup + semaphore.Weighted for bounded concurrent processing within Lambda executions.
- CI quality gates: Makefile audit target using golangci-lint (reads .golangci.yml) and go test -race. GitHub Actions workflow updated to use make audit as a blocking gate.
- DLQ audit tracker (internal/audit/): Record lifecycle tracking with RWMutex-protected state map, valid transition enforcement (PENDING→ACKED/REJECTED), duplicate detection, and reconciliation reporting for data loss detection.
- Hardening config (internal/config/): Centralized env-var-based configuration for timeouts, worker pools, DLQ, and circuit breaker thresholds with validation at startup.
- Pipeline stage decorators (internal/pipeline/): Composable WithTimeout and Compose decorators for pipeline stage wrapping. Context pre-cancellation check avoids unnecessary goroutine allocation.
- Serverless health checks (internal/handler/): EventBridge __ping__ handler with pluggable HealthChecker interface returning provider connectivity status.
- CPU profiler (internal/handler/): Captures pprof CPU profiles on __profile__ payloads and uploads to S3 with collision-resistant timestamped keys.
- Integration and fault injection tests (tests/integration/): Mixed-batch stream processing, DLQ router failures, circuit breaker state transitions, retry exhaustion, and context cancellation under fault injection.
- HTTP trigger retry: ExecuteHTTP retries transient failures (5xx, network errors) with exponential backoff. Request body resets between attempts. Permanent errors (4xx) skip retry.
- Alert dispatcher circuit breaker: Slack HTTP client wrapped with gobreaker to prevent cascade during Slack outages.
- Stream router correlation IDs: Per-record correlation IDs injected into context for structured log tracing across services.
- Telemetry flush per invocation: OTel providers flush (not shutdown) per Lambda invocation to survive environment reuse across warm starts.
- go.opentelemetry.io/otel v1.43.0 (traces + metrics)
- go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc (OTLP gRPC trace export)
- go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc (OTLP gRPC metric export)
- github.com/sony/gobreaker (circuit breaker)
- github.com/oklog/ulid/v2 (DLQ record IDs)
- golang.org/x/sync (errgroup + semaphore for worker pool)
- Split internal/lambda/ into handler-aligned sub-packages: Monolithic package replaced with focused sub-packages: orchestrator/, stream/, watchdog/, sla/, alert/, sink/.
- Extracted shared utilities into focused root files: Common logic moved to dedicated files: publish, date, exclusion, sensor, schedule, config, terminal.
- Trigger config registry: Replaced buildTriggerConfig switch statement with generic registry map (trigger_registry.go).
- SLA deadline calculations wired through pkg/sla/: Pure functions for SLA deadline resolution, decoupled from Lambda handler context.
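As a rough illustration of the trigger config registry entry above, a switch statement can be replaced by a map from trigger type to constructor. The types and entries below are invented for the example; the real contents of trigger_registry.go are not shown in this changelog.

```go
package main

import (
	"fmt"
	"sort"
)

// TriggerConfig is a stand-in interface for this sketch only.
type TriggerConfig interface{ Kind() string }

type httpConfig struct{}

func (httpConfig) Kind() string { return "http" }

type commandConfig struct{}

func (commandConfig) Kind() string { return "command" }

// triggerRegistry maps a trigger type to a constructor. Adding a new trigger
// type means adding one map entry instead of another switch case.
var triggerRegistry = map[string]func() TriggerConfig{
	"http":    func() TriggerConfig { return httpConfig{} },
	"command": func() TriggerConfig { return commandConfig{} },
}

// buildTriggerConfig looks up the constructor and reports unknown types.
func buildTriggerConfig(triggerType string) (TriggerConfig, error) {
	factory, ok := triggerRegistry[triggerType]
	if !ok {
		return nil, fmt.Errorf("unknown trigger type %q", triggerType)
	}
	return factory(), nil
}

// registeredTypes returns the known trigger types in sorted order.
func registeredTypes() []string {
	types := make([]string, 0, len(triggerRegistry))
	for t := range triggerRegistry {
		types = append(types, t)
	}
	sort.Strings(types)
	return types
}

func main() {
	cfg, err := buildTriggerConfig("http")
	fmt.Println(cfg.Kind(), err)
	fmt.Println(registeredTypes())
}
```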
- pkg/sla/ package: Pure SLA deadline calculation functions usable across packages.
- PipelineConfig.DeepCopy() method: Safe config cache isolation without JSON marshal/unmarshal roundtrip.
- EventWatchdogDegraded event type: Watchdog health observability event for degraded-state detection.
- Smoke tests for all 6 cmd/lambda/ packages: ValidateEnv coverage for every Lambda entry point.
- HandleWatchdog silent error suppression: Now returns aggregate errors via errors.Join instead of silently returning nil.
- HandleWatchdog degraded-state signaling: Publishes WATCHDOG_DEGRADED event when checks fail.
- Config cache isolation: Uses typed DeepCopy() instead of JSON marshal/unmarshal roundtrip, eliminating silent data loss on unexported fields.
- Shared HTTP client construction (DRY-2): Extracted resolveHTTPClient() replacing identical 7-line blocks in ExecuteHTTP and ExecuteAirflow.
- Shared SLA schedule creation loop (DRY-3): Extracted createSLASchedules() replacing duplicated warning/breach schedule loops in watchdog and sla-monitor.
- Split watchdog.go into focused files: 1079-line monolith split into 5 files by domain: stale triggers, missed schedules, SLA alerting, and post-run monitoring (~200 lines each).
- Encryption at rest for DynamoDB and SQS: All DynamoDB tables now explicitly enable server-side encryption. SQS queues (alert queue, alert DLQ, stream-router DLQs) now use KMS encryption. New optional kms_key_arn variable for custom CMK when full key control and rotation are needed; defaults to AWS-managed keys (free, no configuration required).
- SSRF protection on trigger HTTP clients: Custom http.Transport with dial-time IP validation rejects connections to private, loopback, link-local, and multicast addresses. Protects HTTP, Airflow, and Databricks triggers against targeting internal endpoints.
- EventBridge PutEvents partial failure detection: publishEvent now checks FailedEntryCount on the response. Previously, partial failures were silently discarded.
- Command trigger shell injection eliminated: Replaced sh -c with direct exec.CommandContext + strings.Fields argument splitting. No shell interpretation of pipes, redirects, or variable expansion.
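One way to implement the dial-time IP validation described in the SSRF entry is the Control hook on net.Dialer, which runs after DNS resolution with the resolved address. The sketch below is an assumption about the approach, not the project's transport; isBlockedIP, ssrfControl, and newSafeClient are hypothetical names.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"syscall"
)

// isBlockedIP reports whether ip falls in a range the entry lists as
// rejected: private, loopback, link-local, or multicast.
func isBlockedIP(ip net.IP) bool {
	return ip.IsPrivate() || ip.IsLoopback() ||
		ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() || ip.IsMulticast()
}

// ssrfControl is called by the dialer just before connecting. The address
// is already a resolved IP:port, so DNS tricks cannot bypass the check.
func ssrfControl(network, address string, _ syscall.RawConn) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return err
	}
	ip := net.ParseIP(host)
	if ip == nil || isBlockedIP(ip) {
		return fmt.Errorf("connection to %s blocked", address)
	}
	return nil
}

// newSafeClient builds an HTTP client whose transport validates every dial.
func newSafeClient() *http.Client {
	dialer := &net.Dialer{Control: ssrfControl}
	return &http.Client{Transport: &http.Transport{DialContext: dialer.DialContext}}
}

func main() {
	_ = newSafeClient()
	for _, addr := range []string{"127.0.0.1:80", "10.0.0.5:443", "169.254.169.254:80", "8.8.8.8:443"} {
		fmt.Println(addr, ssrfControl("tcp", addr, nil))
	}
}
```

Validating at dial time rather than on the URL means redirects and hostnames that resolve to internal addresses are also covered.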
- Drift detection silently skipped zero values (BUG-1): ExtractFloat returned 0 for both missing keys and actual zero values, causing the prevCount > 0 guard to silently skip legitimate transitions like 5000→0 or 0→5000. New ExtractFloatOk distinguishes absent from zero. Shared DetectDrift function consolidates 3 duplicated drift comparison sites.
- RemapPerPeriodSensors map mutation during range (BUG-2): Adding keys during range iteration over a Go map is nondeterministic per the spec. Staging map now collects additions, merged after iteration.
- Orphaned rerun burns retry budget (BUG-3): handleRerunRequest wrote the rerun record before acquiring the trigger lock. If lock acquisition failed, the rerun record was left orphaned and permanently consumed retry budget. Reordered to lock-first, then write.
- Stream router discarded partial batch failures (BUG-4): HandleStreamEvent returned a single error, causing Lambda to retry the entire batch. Now returns DynamoDBEventResponse with per-record BatchItemFailures for partial retry via ReportBatchItemFailures.
- SLA_MET published when pipeline never ran (BUG-5): handleSLACancel published SLA_MET regardless of whether a trigger existed. Now checks for trigger existence first.
- Trigger deadline used SLA timezone instead of schedule timezone (BUG-6): closeSensorTriggerWindow read timezone from cfg.SLA.Timezone instead of cfg.Schedule.Timezone. Falls back to SLA timezone if schedule timezone is not set.
- Validation mode case-sensitive (BUG-8): EvaluateRules matched mode with switch mode so "any" fell through to the default ALL branch. Now uses strings.ToUpper(mode).
- Epoch timestamp unit mismatch in rerun freshness (BUG-9): checkSensorFreshness compared raw epoch values without normalizing units. Timestamps below 1e12 (seconds) are now converted to milliseconds.
- Post-run baseline field collision (BUG-10): Baseline was stored as a flat map, so two rules with the same field name overwrote each other. Now namespaced by rule key. Clean break: existing flat baselines self-heal on next pipeline completion.
- publishEvent errors silently discarded in SLA reconcile (CQ-5): Replaced _ = publishEvent(...) with error-logged calls.
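The BUG-1 fix above hinges on telling "key absent" apart from "value is zero". A minimal sketch of that idea follows, using lowercase stand-ins for ExtractFloatOk and DetectDrift; the signatures and the threshold semantics (drift when the absolute change exceeds the threshold) are assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// extractFloatOk returns the numeric value stored under key and whether the
// key was present with a numeric type. A present zero yields (0, true).
func extractFloatOk(data map[string]any, key string) (float64, bool) {
	v, ok := data[key]
	if !ok {
		return 0, false
	}
	switch n := v.(type) {
	case float64:
		return n, true
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	}
	return 0, false
}

// detectDrift compares the field in prev and curr. It reports drift only
// when both sides are present, so 5000 -> 0 counts but a missing key does not.
func detectDrift(prev, curr map[string]any, field string, threshold float64) (float64, bool) {
	p, okPrev := extractFloatOk(prev, field)
	c, okCurr := extractFloatOk(curr, field)
	if !okPrev || !okCurr {
		return 0, false
	}
	delta := c - p
	return delta, math.Abs(delta) > threshold
}

func main() {
	prev := map[string]any{"count": 5000.0}
	curr := map[string]any{"count": 0.0}
	fmt.Println(detectDrift(prev, curr, "count", 0))               // -5000 true
	fmt.Println(detectDrift(map[string]any{}, curr, "count", 0)) // 0 false
}
```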
- lambda_trigger_arns default changed to [] with precondition (SEC-1): Wildcard default removed; explicit ARN list required when triggers are enabled.
- Slack plaintext token deprecation warning (SEC-2): Terraform check block warns at plan time when plaintext token is used without Secrets Manager.
- Trigger IAM policy scoping (SEC-4): New variables glue_job_arns, emr_cluster_arns, emr_serverless_app_arns, sfn_trigger_arns (all default []) with preconditions requiring non-empty values when the corresponding trigger is enabled.
- EventBridge bus resource policy (SEC-5): Restricts PutEvents to Lambda execution roles only.
- DRY_RUN_COMPLETED event: terminal event that closes the dry-run observation loop for each evaluation period. Published after DRY_RUN_WOULD_TRIGGER and DRY_RUN_SLA_PROJECTION, carrying the SLA verdict (met, breach, or n/a) so operators can see each period resolve.
- Dry-run pipelines could start Step Function executions via rerun and job-failure paths: handleRerunRequest and handleJobFailure did not check cfg.DryRun before calling startSFNWithName, allowing rerun requests and job failure retries to start real SFN executions for dry-run pipelines. Added dry-run guards in both handlers and defense-in-depth in startSFNWithName to suppress execution unconditionally. Watchdog reconciliation loop now skips dry-run pipelines to prevent orphaned trigger locks.
- Watchdog scheduled real SLA alerts for dry-run pipelines: scheduleSLAAlerts, detectMissedSchedules, detectMissedInclusionSchedules, checkTriggerDeadlines, detectMissingPostRunSensors, and detectRelativeSLABreaches all iterated over dry-run pipelines without checking cfg.DryRun. This caused EventBridge Scheduler entries for SLA_WARNING/SLA_BREACH, SCHEDULE_MISSED events, and RELATIVE_SLA_BREACH alerts to fire for observation-only pipelines. Added cfg.DryRun guard to all six watchdog functions.
- Duplicate JOB_COMPLETED alerts for polled jobs: handleCheckJob in the orchestrator published JOB_COMPLETED when polling detected success, but the stream-router's handleJobSuccess also published the same event when the JOB# record arrived via DynamoDB stream. Removed the orchestrator emission; the stream-router is now the single canonical source for JOB_COMPLETED across all job types.
0.9.0 - 2026-03-12
- Dry-run / shadow mode (dryRun: true): observation-only mode for evaluating Interlock against running pipelines without executing jobs. The stream-router evaluates trigger conditions, validation rules, and SLA projections inline, publishing all observations as EventBridge events. No Step Function executions, no job triggers, no rerun requests. New events: DRY_RUN_WOULD_TRIGGER, DRY_RUN_LATE_DATA, DRY_RUN_SLA_PROJECTION, DRY_RUN_DRIFT. DRY_RUN# markers stored with 7-day TTL for dedup and late-data detection. Post-run drift detection captures baseline at trigger time and compares when sensors update. SLA projection reuses production handleSLACalculate for consistent deadline resolution. Requires schedule.trigger and job.type. Calendar exclusions honored.
- DryRunSK key helper for DynamoDB DRY_RUN# sort keys.
- WriteDryRunMarker / GetDryRunMarker store methods with conditional write (idempotent dedup) and consistent read.
- Config validation for dry-run: requires both job.type and schedule.trigger.
0.8.0 - 2026-03-10
- Inclusion calendar scheduling (schedule.include.dates): explicit YYYY-MM-DD date lists for pipelines that run on known irregular dates (monthly close, quarterly filing, specific business dates). Mutually exclusive with cron. Watchdog detects missed inclusion dates and publishes IRREGULAR_SCHEDULE_MISSED events.
- Relative SLA (sla.maxDuration): duration-based SLA for ad-hoc pipelines with no predictable schedule. Clock starts at first sensor arrival and covers the entire lifecycle: evaluation → trigger → job → post-run → completion. Warning at 75% of maxDuration (or breachAt - expectedDuration when set). New events: RELATIVE_SLA_WARNING, RELATIVE_SLA_BREACH.
- First-sensor-arrival tracking: stream-router records first-sensor-arrival#<date> on lock acquisition (idempotent conditional write). Used as T=0 for relative SLA calculation.
- Watchdog defense-in-depth for relative SLA: detectRelativeSLABreaches scans pipelines with maxDuration config and fires RELATIVE_SLA_BREACH if the EventBridge scheduler failed to fire.
- WriteSensorIfAbsent store method: conditional PutItem that only writes if the key doesn't exist, used for first-sensor-arrival idempotency.
- Config validation for new fields: cron/include mutual exclusion, inclusion date format (YYYY-MM-DD), maxDuration format and 24h cap, maxDuration requires trigger.
- Glue false-success detection: verifyGlueRCA now checks both the RCA insight stream (Check 1) and the driver output stream for ERROR/FATAL log4j severity markers (Check 2). Catches Spark failures that Glue reports as SUCCEEDED when the application framework swallows exceptions.
- SLAConfig.Deadline and SLAConfig.ExpectedDuration are now omitempty: relative SLA configs may omit the wall-clock deadline entirely.
- SFN ASL passes maxDuration and sensorArrivalAt to CancelSLASchedules and CancelSLAOnCompleteTriggerFailure states.
- sla-monitor handleSLACalculate routes to relative path when MaxDuration + SensorArrivalAt are present.
0.7.4 - 2026-03-10
- False SLA warnings/breaches for sensor-triggered daily pipelines: scheduleSLAAlerts resolved the SLA deadline against today's date, but sensor-triggered daily pipelines run T+1 (data for today completes tomorrow). Between 00:00 UTC and the deadline hour, the breach time landed on the same day instead of the next day, causing premature SLA alerts. The SLA calculation now shifts the deadline date by +1 day for sensor-triggered daily pipelines.
0.7.3 - 2026-03-10
- Configurable drift detection field (PostRunConfig.DriftField): specifies which sensor field to compare for post-run drift detection. Defaults to "sensor_count" for backward compatibility. Fixes broken drift detection when sensors use a different field name (e.g., "count").
- Post-run drift detection never fired when sensor field was count: drift comparison was hardcoded to "sensor_count" but bronze consumers write "count" in hourly-status sensors, causing ExtractFloat to return 0 for both baseline and current values. The prevCount > 0 && currCount > 0 guard silently suppressed all drift detection.
- Two flaky time-sensitive tests: TestSLAMonitor_Reconcile_PastWarningFutureBreach and TestWatchdog_MissedSchedule_DetailFields used real wall-clock time instead of injected NowFunc, causing failures depending on time of day.
0.7.2 - 2026-03-08
- Configurable sensor trigger deadline (trigger.deadline): closes auto-trigger window after expiry, publishes SENSOR_DEADLINE_EXPIRED.
- TOCTOU-safe CreateTriggerIfAbsent store method using DynamoDB conditional writes.
- CloudWatch alarms: Per-function Lambda error alarms, Step Functions failure alarm, DLQ depth alarms (control, joblog, alert queues), and DynamoDB Stream iterator age alarms. All alarm actions conditionally route to an SNS topic via sns_alarm_topic_arn.
- EventBridge input transformers for alarm routing: CloudWatch alarm state changes are reshaped into INFRA_ALARM InterlockEvent format and routed to both event-sink and alert-dispatcher; zero Go code changes required.
- Lambda concurrency limits: Per-function reserved concurrent executions via lambda_concurrency object variable (defaults: stream-router=10, orchestrator=10, sla-monitor=5, watchdog=2, event-sink=5, alert-dispatcher=3).
- Secrets Manager Slack token: slack_secret_arn variable enables alert-dispatcher to read the Slack bot token from Secrets Manager instead of an environment variable. Falls back to SLACK_BOT_TOKEN env var if not configured.
- Lambda trigger IAM scoping: enable_lambda_trigger and lambda_trigger_arns variables grant orchestrator lambda:InvokeFunction permission scoped to specific function ARNs.
- Sensor-triggered pipelines now receive proactive SLA scheduling (removed cron-only guard).
- Trigger deadline check extracted into independent checkTriggerDeadlines watchdog scan.
- Env var expansion restricted to INTERLOCK_ prefix: os.ExpandEnv in trigger config (Airflow, Databricks, Lambda) now only expands variables prefixed with INTERLOCK_, preventing unintended system variable substitution.
- time.Now() → d.now() across all handlers: All Lambda handlers use dependency-injected time for consistent testability.
- Config cache deep copy via JSON round-trip: GetAll() returns a deep copy preventing callers from mutating shared cache state.
- Single-instant rule evaluation: All validation rules within an evaluation cycle use the same timestamp for temporal consistency.
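Restricting expansion to a prefix, as the env var entry above describes, can be done with os.Expand and a filtering mapping function. This is a sketch only: expandInterlockEnv is a hypothetical name, and leaving non-matching references in place (rather than replacing them with an empty string) is an assumption about the intended behavior.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const allowedPrefix = "INTERLOCK_"

// expandInterlockEnv expands $NAME and ${NAME} references in s, but only
// for names starting with INTERLOCK_. Other references are written back
// in ${NAME} form so system variables never leak into trigger config.
func expandInterlockEnv(s string, lookup func(string) string) string {
	return os.Expand(s, func(name string) string {
		if strings.HasPrefix(name, allowedPrefix) {
			return lookup(name)
		}
		return "${" + name + "}" // assumption: leave other references untouched
	})
}

func main() {
	os.Setenv("INTERLOCK_TOKEN", "abc")
	fmt.Println(expandInterlockEnv("Bearer ${INTERLOCK_TOKEN} home=$HOME", os.Getenv))
}
```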
- Trigger lock release on SFN start failure: Both rerun and job-failure retry paths release the trigger lock if StartExecution fails, preventing permanently stuck pipelines (previously caused 4.5h deadlock).
- scheduleSLAAlerts skip-on-error: SLA alert scheduling now correctly skips on error instead of falling through to the next handler.
- 9 silent audit write error discards → WARN logging: All publishEvent call sites across stream-router and orchestrator now log errors at WARN level instead of silently discarding them.
- Missing EVENTS_TABLE / EVENTS_TTL_DAYS envcheck for alert-dispatcher: Startup validation now checks for required environment variables.
0.7.1 - 2026-03-08
- Glue RCA false-positive failure classification (Check 2 removed): The verifyGlueRCA Check 2 filter pattern (?Exception ?Error ?FATAL ...) matched benign JVM startup output in Glue's stderr (/aws-glue/jobs/error), causing every SUCCEEDED Glue job to be reclassified as FAILED. Classpath entries like -XX:OnOutOfMemoryError and Glue's internal AnalyzerLogHelper messages contain "Error" as substrings, producing a 100% false positive rate. Removed Check 2 entirely: Check 1 (GlueExceptionAnalysisJobFailed in the RCA log stream) is Glue's purpose-built mechanism for detecting false successes and is sufficient. Post-run validation provides the application-level safety net for data quality issues.
0.7.0 - 2026-03-08
- Event-driven post-run monitoring: Post-run drift detection moves from SFN poll loop to DynamoDB Stream processing. When a sensor arrives after job completion, stream-router compares the current sensor value against a date-scoped baseline captured at trigger completion. Drift above a configurable threshold triggers a rerun via the existing circuit breaker. Sensors arriving while a job is still running produce informational drift events only.
- Configurable drift threshold: PostRunConfig.DriftThreshold (*float64) sets the minimum sensor count change to trigger a drift rerun. Defaults to 0 (any change).
- Watchdog post-run sensor absence detection: Watchdog detects pipelines that completed without receiving post-run sensor data after a configurable SensorTimeout grace period. Publishes POST_RUN_SENSOR_MISSING event.
- Typed trigger errors: TriggerError with Category (PERMANENT/TRANSIENT) and Unwrap() support. ClassifyFailure uses errors.As for structured error handling. HTTP triggers return 4xx as PERMANENT, 5xx as TRANSIENT.
- ConfigCache deep copy: GetAll() returns a deep copy of cached configs, preventing callers from mutating shared cache state.
- Validation engine string parsing: toFloat64 now handles string-typed numeric fields via strconv.ParseFloat, enabling gte/lte rules on sensor data stored as strings.
- E2E test coverage: 8 new end-to-end test groups covering rerun budget separation, post-run inflight drift, calendar exclusion skip, hour boundary rollover, concurrent drift dedup, pre-baseline sensor arrival, rerun after TTL expiry, and SLA hourly deadlines.
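The validation engine string parsing entry above amounts to adding a string case to a numeric coercion helper. The sketch below shows one plausible shape; the set of handled types and the (value, ok) return form are assumptions, not the project's actual toFloat64.

```go
package main

import (
	"fmt"
	"strconv"
)

// toFloat64 coerces a sensor field to float64. Strings are parsed with
// strconv.ParseFloat so values like "42.5" can feed gte/lte comparisons.
func toFloat64(v any) (float64, bool) {
	switch n := v.(type) {
	case float64:
		return n, true
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	case string:
		f, err := strconv.ParseFloat(n, 64)
		if err != nil {
			return 0, false
		}
		return f, true
	}
	return 0, false
}

func main() {
	fmt.Println(toFloat64("42.5")) // 42.5 true
	fmt.Println(toFloat64(7))      // 7 true
	fmt.Println(toFloat64("abc"))  // 0 false
}
```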
- Post-run removed from SFN: 6 ASL states removed (CheckHasPostRun, InitPostRunLoop, PostRunEvaluate, IsPostRunDone, WaitForPostRun, IncrementPostRunElapsed). Job success routes directly to CompleteTrigger. State count: 30 → 24.
- Post-run removed from orchestrator: handlePostRun function and "post-run" dispatch case removed. Post-run logic is now entirely stream-based.
- Watchdog uses dependency-injected time: WatchdogNowFunc package variable replaced with Deps.NowFunc for consistent testability across all handlers.
- Trigger runner context threading: All AWS SDK client constructors (getGlueClient, getEMRClient, etc.) accept context.Context and pass it through to aws.LoadDefaultConfig.
- SFN execution name truncation: Execution names are truncated to 80 characters (AWS limit) at all 3 construction sites.
- Environment variable scoping: os.ExpandEnv in trigger config restricted to INTERLOCK_ prefixed variables only.
- Calendar exclusion uses execution date: isExcludedDate checks the job's execution date (not time.Now()), preventing incorrect exclusions on re-runs for previous days. Supports both YYYY-MM-DD and composite YYYY-MM-DDTHH date formats.
- Atomic lock reset: ResetTriggerLock uses single DynamoDB UpdateItem with attribute_exists(PK) condition, eliminating the race window between delete and create.
- Lock release on SFN start failure: Both rerun and job-failure retry paths release the trigger lock if StartExecution fails, preventing permanently stuck pipelines.
- Terminal trigger status on calendar exclusion: handleJobFailure sets FAILED_FINAL instead of leaving the lock in RUNNING state to silently expire via TTL.
- ASL CompleteTrigger failure path: New CheckSLAForCompleteTriggerFailure → CancelSLAOnCompleteTriggerFailure → CompleteTriggerFailed states ensure SLA schedules are cancelled before entering terminal Fail state.
- Event ordering: RERUN_ACCEPTED only publishes after ResetTriggerLock confirms lock atomicity.
- New events: BASELINE_CAPTURE_FAILED (baseline capture error), PIPELINE_EXCLUDED (calendar exclusion in sensor, rerun, job-failure, and post-run drift paths), RERUN_ACCEPTED (audit trail for accepted reruns).
- publishEvent error logging: All 17 publishEvent call sites across stream-router and orchestrator now log errors at WARN level instead of silently discarding them.
- SLA monitor error wrapping: createOneTimeSchedule wraps errors with schedule name context via fmt.Errorf.
- HTTP response body sanitization: Error response bodies truncated to 512 bytes with control characters stripped.
- DynamoDB table protection: All 4 tables (control, joblog, events, rerun) now have deletion_protection_enabled and point-in-time recovery enabled.
0.6.2 - 2026-03-08
- Lambda trigger type: New TriggerLambda type for direct AWS Lambda SDK invocation (RequestResponse). Includes LambdaTriggerConfig with functionName and optional payload (supports env-var expansion). Non-polling CheckStatus. Useful when the orchestrator can invoke Lambda directly instead of going through function URLs.
- Non-polling trigger synchronous completion: handleTrigger now writes a success joblog entry immediately for non-polling triggers (http, command, lambda) and returns a sentinel runId so the Step Functions CheckJob JSONPath resolves. Previously, non-polling triggers that succeeded would crash the SFN because $.triggerResult.runId was omitted via omitempty.
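The omitempty problem described above is easy to reproduce with encoding/json: an empty string field tagged omitempty disappears from the output, so a JSONPath such as $.triggerResult.runId has nothing to resolve. The struct and the sentinel value "sync" below are illustrative assumptions, not the project's types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// triggerResult mirrors the shape implied by the entry: runId is tagged
// omitempty, so an empty value is dropped from the JSON entirely.
type triggerResult struct {
	RunID string `json:"runId,omitempty"`
}

// encode marshals a result to its JSON string.
func encode(r triggerResult) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	empty, _ := encode(triggerResult{})
	fmt.Println(empty) // {}

	// A sentinel keeps the field present for non-polling triggers.
	withSentinel, _ := encode(triggerResult{RunID: "sync"})
	fmt.Println(withSentinel) // {"runId":"sync"}
}
```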
0.6.1 - 2026-03-08
- Glue RCA false-positive failure classification: verifyGlueRCA Check 2 queried /aws-glue/jobs/error without a FilterPattern, causing benign stderr messages (e.g., "Preparing ...") to be misclassified as failures. Added FilterPattern with error indicators (?Exception ?Error ?FATAL ?Traceback ?OutOfMemoryError ?StackOverflowError) so only genuine errors trigger failure classification. This prevented unnecessary re-runs that doubled Glue compute.
- Watchdog date-boundary test flake: TerminalTriggerRetainsRecord E2E test used sensor data without a date field, causing ResolveExecutionDate to fall back to time.Now(). When the calendar date advanced past the hardcoded trigger date, reconciliation resolved to a different date and fired a spurious TRIGGER_RECOVERED. Sensor data now includes an explicit date.
0.6.0 - 2026-03-07
- SFN global timeout: Step Function definition includes TimeoutSeconds (default 4h, configurable via sfn_timeout_seconds Terraform variable). Prevents unbounded execution if the orchestrator loop stalls.
- Configurable trigger retry count: Trigger state MaxAttempts is now driven by trigger_max_attempts Terraform variable (default 3, previously hardcoded 4). Reduces retry budget from 4 to 3 attempts.
- Trigger terminal status lifecycle: new CompleteTrigger ASL state sets trigger row to COMPLETED on job success and FAILED_FINAL on fail/timeout. New orchestrator complete-trigger mode with Event input field. Previously TriggerStatusCompleted was defined but never written, leaving all triggers stuck at RUNNING after SFN completion.
- Bounded job poll window: 5 new ASL states implement a time-bounded poll loop. Configurable via jobPollWindowSeconds in pipeline config (default: 3600s / 1h). handleJobPollExhausted writes a timeout joblog entry, sets trigger to FAILED_FINAL, and publishes JOB_POLL_EXHAUSTED event. Prevents unbounded check-job polling when external jobs hang.
- Per-source rerun limits: new maxDriftReruns, maxManualReruns, and maxCodeRetries fields on JobConfig with *int pointer semantics (nil = default, 0 = disabled). CountRerunsBySource filters rerun records by reason prefix, so drift/manual reruns no longer consume the job-failure retry budget.
- Pipeline config validation: ValidatePipelineConfig enforces bounds on all retry/rerun fields. Stream-router uses getValidatedConfig wrapper at all 3 config-read callsites; invalid configs are logged and skipped (fail-open).
- Failure classification: FailureCategory propagated from StatusChecker through handleCheckJob to joblog via variadic WithFailureCategory option. handleJobFailure reads the latest joblog category: PERMANENT failures use maxCodeRetries budget (default 1), TRANSIENT/empty use maxRetries. Separates infrastructure flakes from code bugs.
- Dynamic trigger lock TTL: ResolveTriggerLockTTL() reads SFN_TIMEOUT_SECONDS env var and adds a 30-minute buffer (default 4h30m). All 4 AcquireTriggerLock callsites use it. Terraform wires the variable to stream-router and watchdog Lambda environments.
- Dead FailureEvaluatorCrash constant (defined but never referenced)
- Glue false-success detection: checkGlueStatus now cross-checks the CloudWatch RCA log stream when Glue reports SUCCEEDED. Glue can return SUCCEEDED via GetJobRun when the Spark driver catches a SparkException and exits cleanly (exit code 0). The RCA log stream ({runId}-job-insights-rca-driver) records GlueExceptionAnalysisJobFailed events with the actual failure reason. When detected, the job is reported as failed with the real error (e.g., "No space left on device"). Degrades gracefully if CloudWatch Logs is unavailable.
- SLA alert suppression for completed pipelines: handleSLAFireAlert checks trigger status before publishing and suppresses warnings/breaches when the pipeline already completed or permanently failed. Watchdog scheduleSLAAlerts skips schedule creation for finished pipelines, preventing ghost schedules after CancelSLASchedules cleanup.
- Joblog fallback in SLA guards: handleSLAFireAlert and scheduleSLAAlerts check the joblog for terminal events (success/fail/timeout) when the trigger row is nil or RUNNING. Covers cron pipelines (no trigger rows), TTL-expired triggers, and the race window before CompleteTrigger runs.
- Watchdog forward-only alerting: detectMissedSchedules now skips cron schedules whose most recent expected fire time is before the Lambda's cold start. Prevents retroactive SCHEDULE_MISSED alerts after fresh deploys or redeployments. Uses a lastCronFire helper to compute expected fire times from hourly (MM * * * *) and daily (MM HH * * *) cron patterns.
- Watchdog reconcile mass-triggering: reconcileSensorTriggers checked only HasTriggerForDate before re-triggering. Trigger records have 24h DynamoDB TTL; after expiry, reconcile saw "satisfied sensor + no trigger" and re-triggered completed pipelines. Now checks joblog for terminal events (success/fail/timeout) before acquiring a new lock. Additionally, SetTriggerStatus removes TTL on terminal statuses (COMPLETED/FAILED_FINAL) so trigger records persist indefinitely.
0.5.2 - 2026-03-05
- Hourly SLA deadline offset: relative deadline :MM for hourly pipelines (date format 2026-03-05T12) now resolves to H+1:MM instead of H:MM. Data for hour H isn't generated until ~H+1:00, so the previous calculation set the breach deadline before data existed, guaranteeing a false breach every execution. Daily pipelines are unchanged.
0.5.1 - 2026-03-05
- Proactive SLA monitoring: watchdog Lambda creates EventBridge Scheduler entries for all pipelines with SLA configs, ensuring warnings and breaches fire even when pipelines never trigger (data never arrives, sensor fails, trigger missed). Idempotent via deterministic scheduler names: ConflictException means the schedule already exists and is skipped. (#45)
- SLA scheduling moved from SFN to watchdog: removed CheckSLAConfig and ScheduleSLAAlerts states from the Step Function (18 → 16 states). The SFN retains only CancelSLASchedules to clean up unfired timers on job completion. (#45)
- CancelSLASchedules accepts deadline/expectedDuration: instead of pre-computed warningAt/breachAt, the cancel handler now receives raw SLA config and recalculates internally. This decouples cancel from the removed scheduling state. (#45)
- Watchdog Lambda now requires SLA_MONITOR_ARN, SCHEDULER_ROLE_ARN, and SCHEDULER_GROUP_NAME environment variables and scheduler:CreateSchedule + iam:PassRole IAM permissions. (#45)
0.5.0 - 2026-03-04
- Centralized observability pipeline: new event-sink Lambda writes all 14 EventBridge event types to a DynamoDB events table with a GSI for querying by event type and timestamp. New alert-dispatcher Lambda reads from an SQS alert queue and delivers formatted Slack notifications. EventBridge rules route events to both targets automatically. (#41)
- Slack Bot API with message threading: alert-dispatcher uses chat.postMessage with Bot token authentication. Thread records (THREAD#{scheduleId}#{date}) stored in the events table group related alerts into Slack threads by pipeline, schedule, and date. First alert for a pipeline-day creates a new thread; subsequent alerts reply in-thread. (#41)
- SLA warning suppression: fire-alert mode checks BreachAt timestamp before publishing SLA_WARNING. If breach time has already passed, the warning is suppressed to prevent duplicate warning+breach notifications. handleSLASchedule now includes BreachAt in warning payloads. handleSLAReconcile fires breach only (not both warning and breach) when past breach time. (#41)
- Manual rerun system: external processes write RERUN_REQUEST# records to the control table. Stream-router validates requests with a circuit breaker that compares sensor updatedAt vs joblog completedAt and rejects reruns when no new data has arrived. (#40)
- Late data arrival detection: stream-router detects sensor updates after job completion and publishes LATE_DATA_ARRIVAL events to EventBridge. (#40)
- Prefix-match sensor keys: stream-router matches sensor keys by prefix for per-period pipelines, enabling a single trigger condition to match sensors keyed with date+hour suffixes. (#38)
- Watchdog trigger reconciliation: watchdog re-evaluates sensor trigger conditions every 5 minutes. If a sensor meets the trigger threshold but no trigger lock exists (due to a silent completion-write failure), the watchdog acquires the lock, starts the Step Function, and publishes a `TRIGGER_RECOVERED` event. (#44)
- New event types: `LATE_DATA_ARRIVAL`, `RERUN_REJECTED`, `RETRY_EXHAUSTED`, `TRIGGER_RECOVERED`
- New DynamoDB events table with GSI1 (`eventType` → `timestamp`) for centralized event logging
- SQS alert queue with dead-letter queue for reliable Slack delivery
- alert-dispatcher now requires `SLACK_BOT_TOKEN` and `SLACK_CHANNEL_ID` environment variables (replaces `SLACK_WEBHOOK_URL`)
- Terraform module exposes `slack_bot_token` (sensitive) and `slack_channel_id` variables instead of `slack_webhook_url`
- Lambda function count: 4 → 6 (added event-sink and alert-dispatcher)
- Orchestrator remaps per-period sensor keys during evaluation, preventing key mismatch when sensors use date+hour suffixes (#39)
- `ValidationExhausted` now ends the Step Functions execution as `FAILED` instead of `SUCCEEDED` (#41)
- Joblog entry written on validation exhaustion for audit trail completeness (#41)
- Watchdog prefix-matches trigger records for per-hour pipelines instead of requiring exact key match (#41)
- SLA deadline calculation for daily pipelines with next-day deadlines (e.g., `"02:00"`) now rolls forward 24 hours when the computed breach time is already past (#43)
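
The next-day deadline fix in the last entry above can be illustrated with a short sketch. It assumes the deadline is an `HH:MM` string resolved against the current time in a single time zone; the function name `resolveBreach` and the helper `mustUTC` are hypothetical and not taken from the codebase.

```go
package main

import (
	"fmt"
	"time"
)

// resolveBreach places an "HH:MM" deadline on now's calendar day and, if that
// instant is not after now, rolls it forward 24 hours.
func resolveBreach(now time.Time, deadline string) (time.Time, error) {
	clock, err := time.Parse("15:04", deadline)
	if err != nil {
		return time.Time{}, fmt.Errorf("parse deadline %q: %w", deadline, err)
	}
	y, m, d := now.Date()
	breach := time.Date(y, m, d, clock.Hour(), clock.Minute(), 0, 0, now.Location())
	if !breach.After(now) {
		// The same-day instant is already past, so the deadline refers to
		// the following day.
		breach = breach.Add(24 * time.Hour)
	}
	return breach, nil
}

// mustUTC parses an RFC 3339 timestamp and panics on malformed input.
// It exists only to keep the example short.
func mustUTC(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t.UTC()
}

func main() {
	// A daily pipeline evaluated at 23:30 with a "02:00" deadline.
	now := mustUTC("2026-03-04T23:30:00Z")
	breach, err := resolveBreach(now, "02:00")
	if err != nil {
		panic(err)
	}
	fmt.Println(breach.Format(time.RFC3339))
}
```

Without the roll-forward, a run evaluated at 23:30 against a `"02:00"` deadline would compute a breach time more than 21 hours in the past and report a breach immediately.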
0.4.0 - 2026-03-03
- SLA monitoring via EventBridge Scheduler: one-time Scheduler entries fire `SLA_WARNING` and `SLA_BREACH` events at exact timestamps, replacing the previous parallel-branch polling approach. Schedules auto-delete after firing. On job completion, unfired schedules are cancelled and `SLA_MET` is published.
- Sub-daily execution granularity: pipelines can run at hourly or daily cadence depending on sensor data. When sensors include both `date` and `hour` fields, the framework uses a composite execution date (`2026-03-03T10`). Glue triggers receive `--par_day` and `--par_hour` arguments automatically.
- Infrastructure trigger retry: trigger execution failures (e.g., Glue `ConcurrentRunsExceededException`) retry 4 times with exponential backoff (30s, 60s, 120s, 240s) via Step Functions native Retry. Each failure is logged to the joblog table for audit. This retry budget is separate from `maxRetries` for job failures.
- StatusChecker fallback in check-job: when no terminal joblog entry exists, the orchestrator polls the trigger API directly to determine job status.
- Declarative validation rules replace the archetype/trait/evaluator system. Pipeline configs define validation as YAML rules (`exists`, `equals`, `gt`, `gte`, `lt`, `lte`, `age_lt`, `age_gt`); no custom evaluator code needed.
- 3 DynamoDB tables (control, joblog, rerun) replace the single-table design for clearer access patterns and independent scaling.
- 4 Lambda functions (stream-router, orchestrator, sla-monitor, watchdog) replace the previous 7+ handlers. The orchestrator is a multi-mode handler covering evaluate, trigger, check-job, and post-run.
- 18-state sequential Step Functions workflow replaces the 47-state machine. SLA monitoring uses EventBridge Scheduler instead of a parallel branch.
- EventBridge events replace SNS for all alerting and lifecycle notifications.
- Reusable Terraform module — consumers deploy infrastructure without framework code in their repo.
- Framework reads DynamoDB only — external processes push sensor data into the control table.
- sla-monitor Lambda supports 5 modes: `schedule`, `cancel`, `fire-alert`, `calculate`, `reconcile`.
- Trigger state retries infrastructure failures independently of job failure retries (`maxRetries`). Exhausted trigger retries route to SLA cleanup and graceful termination instead of crashing.
- Redis and Postgres storage providers (AWS-first; GCP and Azure planned after AWS stabilizes)
- CLI binary (`cmd/interlock`) and HTTP server
- Archetype, trait, and evaluator subprocess system
- Local mode (Docker Compose + Redis)
- Cascade notifications, post-completion drift monitoring, replay support
- SNS alert sinks, S3 alert sinks
- cobra, chi, pgx, go-redis, color dependencies
- Pipeline config included in Step Functions execution input, eliminating redundant DynamoDB reads during orchestrator modes (#30)
- YAML configs converted to JSON before DynamoDB storage (#27)
- Missing trigger enable variables added to Terraform module (`enable_emr_serverless_trigger`, `enable_sfn_trigger`) (#26)
- Sensor data `data` map unwrapped in stream-router before trigger condition evaluation (#28)
- Orchestrator output uses `status` string instead of `passed` boolean (#29)
- SLA monitor handles relative `:MM` deadline format for hourly pipelines (#35)
- `check-job` skips non-terminal joblog events (e.g., `infra-trigger-failure`) to prevent infinite polling loops (#36)
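
The infrastructure trigger retry schedule listed for this release (4 attempts at 30s, 60s, 120s, 240s) follows from a 30-second initial interval and a backoff rate of 2. The sketch below only reproduces that arithmetic; in the framework the retry is performed by Step Functions itself, and the function names here are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// retryDelays returns the wait before each retry attempt: the first delay is
// initial, and every later delay is the previous one multiplied by rate.
func retryDelays(initial time.Duration, rate float64, attempts int) []time.Duration {
	if attempts <= 0 {
		return nil
	}
	delays := make([]time.Duration, 0, attempts)
	d := initial
	for i := 0; i < attempts; i++ {
		delays = append(delays, d)
		d = time.Duration(float64(d) * rate)
	}
	return delays
}

// seconds converts a whole number of seconds to a time.Duration.
func seconds(n int) time.Duration { return time.Duration(n) * time.Second }

func main() {
	// Values taken from the changelog entry: 30s initial interval, 4 retries.
	fmt.Println(retryDelays(seconds(30), 2.0, 4))
}
```

In the worst case the four waits add up to 450 seconds before the trigger state gives up, which is worth keeping in mind when sizing SLA windows.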
0.3.1 - 2026-02-28
- stream-router date accuracy: MARKER record processing now reads the `date` field from the DynamoDB NewImage when present, falling back to `time.Now().UTC()` for backward compatibility. This ensures correct date resolution at midnight rollover; for example, an h23 completion marker written at 00:01 carries the previous day's date rather than inheriting today from the wall clock.
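
A minimal sketch of the fallback described above, assuming the stream image has already been flattened to a `map[string]string` and that dates use the `YYYY-MM-DD` layout. Both are simplifications for illustration; `resolveDate` and `mustUTC` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// resolveDate prefers the record's own "date" attribute and falls back to the
// UTC wall clock only when the attribute is missing or empty.
func resolveDate(image map[string]string, now time.Time) string {
	if d, ok := image["date"]; ok && d != "" {
		return d
	}
	return now.UTC().Format("2006-01-02")
}

// mustUTC parses an RFC 3339 timestamp and panics on malformed input.
func mustUTC(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t.UTC()
}

func main() {
	// An h23 marker for 2026-02-27 that is written one minute after midnight.
	now := mustUTC("2026-02-28T00:01:00Z")
	marker := map[string]string{"date": "2026-02-27", "hour": "23"}

	fmt.Println(resolveDate(marker, now))              // date carried by the record
	fmt.Println(resolveDate(map[string]string{}, now)) // wall-clock fallback
}
```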
0.3.0 - 2026-02-27
- Post-Completion Monitoring Expiry: watchdog handles `COMPLETED_MONITORING → COMPLETED` transitions
  - `CheckCompletedMonitoring()` pure function scans for runs in `COMPLETED_MONITORING` whose monitoring window has elapsed
  - Uses `RunLogEntry.UpdatedAt` as the monitoring start time; no schema changes required
  - Transitions expired runs to `COMPLETED` via `PutRunLog`, appends `EventMonitoringCompleted` audit event
  - Monitoring window duration read from `Watch.Monitoring.Duration` per-pipeline config
  - Wired into both local-mode `Watchdog.scan()` and Lambda handler
  - Offloads the monitoring wait from Step Function executions to the watchdog's 5-minute scan, freeing SFN capacity immediately after the Glue job succeeds
  - `MonitoringResult` exported type for callers to inspect results
  - 3 unit tests: expired → transitions, still-in-window → skips, already-completed → ignores
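
The expiry check can be pictured as a pure function over run log entries. The sketch below is not the actual `CheckCompletedMonitoring()` signature: `runEntry`, `expiredMonitoring`, and the helpers are hypothetical, and the choice to treat a run as expired exactly at the end of its window is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

const (
	statusMonitoring = "COMPLETED_MONITORING"
	statusCompleted  = "COMPLETED"
)

// runEntry is a hypothetical, trimmed-down stand-in for a run log entry.
type runEntry struct {
	ID        string
	Status    string
	UpdatedAt time.Time
}

// expiredMonitoring returns the IDs of runs still in COMPLETED_MONITORING
// whose window (UpdatedAt + window) has elapsed at now. It has no side
// effects; the caller performs the transition and writes the audit event.
func expiredMonitoring(runs []runEntry, window time.Duration, now time.Time) []string {
	var ids []string
	for _, r := range runs {
		if r.Status != statusMonitoring {
			continue // already completed or in some other state
		}
		if !now.Before(r.UpdatedAt.Add(window)) {
			ids = append(ids, r.ID)
		}
	}
	return ids
}

// minutes converts a whole number of minutes to a time.Duration.
func minutes(n int) time.Duration { return time.Duration(n) * time.Minute }

// mustUTC parses an RFC 3339 timestamp and panics on malformed input.
func mustUTC(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t.UTC()
}

func main() {
	now := mustUTC("2026-02-27T12:00:00Z")
	runs := []runEntry{
		{ID: "expired", Status: statusMonitoring, UpdatedAt: mustUTC("2026-02-27T10:00:00Z")},
		{ID: "in-window", Status: statusMonitoring, UpdatedAt: mustUTC("2026-02-27T11:45:00Z")},
		{ID: "done", Status: statusCompleted, UpdatedAt: mustUTC("2026-02-27T09:00:00Z")},
	}
	fmt.Println(expiredMonitoring(runs, minutes(60), now))
}
```

The three entries mirror the three unit-test cases listed above: expired, still in window, and already completed.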
0.2.1 - 2026-02-26
- Evaluation retry loop: Step Function now retries trait evaluation every 60s when traits aren't ready, terminating via the existing validation timeout. Previously in AWS event-driven mode, the SFN wrote a PENDING RUNLOG and exited immediately — the RUNLOG stayed PENDING indefinitely unless a new MARKER arrived. The local watcher loop was unaffected.
- ResolvePipeline error logging: Step Function now writes a FAILED RUNLOG when `ResolvePipeline` returns an error (missing archetype, pipeline not found, archetype resolution failure). Previously these configuration errors caused a silent exit with no audit trail.
- `logResult` attempt counter: `attemptNumber` now only increments when retrying after a terminal state (COMPLETED/FAILED/CANCELLED). Transitioning within the same attempt (e.g., PENDING → COMPLETED) preserves the counter.
- Validation timeout test: Fixed midnight boundary edge case in `TestCheckValidationTimeout_NotBreached` where `now + 1h` wrapping past midnight caused a false positive.
- Step Function state count: 49 → 52 (added `LogResolveFailed`, `CheckIfFirstEvalAttempt`, `WaitForReadiness`)
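
The attempt-counter rule from this release can be stated as a small function. This is an illustrative sketch only: `nextAttempt` is a hypothetical name, and starting the counter at 1 when no prior entry exists is an assumption.

```go
package main

import "fmt"

// nextAttempt returns the attempt number to record for a new run log write,
// given the status and attempt number of the previous entry.
func nextAttempt(prevStatus string, prevAttempt int) int {
	if prevAttempt < 1 {
		return 1 // assumption: no prior entry means this is the first attempt
	}
	switch prevStatus {
	case "COMPLETED", "FAILED", "CANCELLED":
		// Retrying after a terminal state starts a new attempt.
		return prevAttempt + 1
	default:
		// Transitions within the same attempt keep the counter.
		return prevAttempt
	}
}

func main() {
	fmt.Println(nextAttempt("PENDING", 1)) // same attempt, e.g. PENDING -> COMPLETED
	fmt.Println(nextAttempt("FAILED", 1))  // retry after a terminal state
}
```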
0.2.0 - 2026-02-25
- Lifecycle Events: stream-router publishes SNS events on pipeline terminal status changes
  - `PIPELINE_COMPLETED` and `PIPELINE_FAILED` events emitted to a dedicated lifecycle SNS topic
  - Configurable via `LIFECYCLE_TOPIC_ARN` environment variable on stream-router Lambda
  - Best-effort publishing: errors are logged, not propagated
  - `LifecycleEventsPublished` expvar counter for observability
  - `LifecycleEvent` struct in `internal/lambda/types.go`
- Stuck-Run Detection: watchdog detects runs stuck in non-terminal states
  - `CheckStuckRuns()` pure function scans for runs in PENDING/TRIGGERING/RUNNING beyond a configurable threshold
  - Default threshold: 30 minutes, configurable via `WatchdogConfig.StuckRunThreshold`
  - Dedup via distributed lock (`watchdog:stuck:{pipeline}:{schedule}:{date}`, 24h TTL)
  - `EventRunStuck` event kind for audit trail
  - `RUN_STUCK` alert with Category `"stuck_run"`
  - `StuckRunsDetected` expvar counter
  - 265 new lines of tests in `watchdog_test.go`
- Alert Categorization: machine-readable `Category` field on all alerts
  - `Alert.Category` field (JSON: `alertType`) set by all alert producers
  - Values: `schedule_missed`, `stuck_run`, `evaluation_sla_breach`, `completion_sla_breach`, `validation_timeout`, `trait_drift`
  - Enables downstream consumers (alert-logger, dashboards) to filter and route by category without parsing message strings
- New `EventKind` values: `PIPELINE_COMPLETED`, `PIPELINE_FAILED`, `RUN_STUCK`
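
For the stuck-run detection in this release, the predicate and the dedup lock key might look like the sketch below. The lock key format, the three non-terminal states, and the 30-minute default come from the entries above; the function names are hypothetical, and measuring from the run's last update time is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// defaultStuckThreshold mirrors the 30-minute default from the changelog.
const defaultStuckThreshold = 30 * time.Minute

// stuckLockKey builds the dedup lock key in the documented format
// watchdog:stuck:{pipeline}:{schedule}:{date}.
func stuckLockKey(pipeline, schedule, date string) string {
	return fmt.Sprintf("watchdog:stuck:%s:%s:%s", pipeline, schedule, date)
}

// isStuck reports whether a run in a non-terminal state has gone longer than
// threshold without an update.
func isStuck(status string, lastUpdate, now time.Time, threshold time.Duration) bool {
	switch status {
	case "PENDING", "TRIGGERING", "RUNNING":
		return now.Sub(lastUpdate) > threshold
	default:
		return false // terminal states are never stuck
	}
}

// mustUTC parses an RFC 3339 timestamp and panics on malformed input.
func mustUTC(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t.UTC()
}

func main() {
	last := mustUTC("2026-02-25T10:00:00Z")
	now := mustUTC("2026-02-25T10:45:00Z")
	if isStuck("RUNNING", last, now, defaultStuckThreshold) {
		// A real watchdog would acquire this lock (24h TTL) before alerting,
		// so that each stuck run is reported only once.
		fmt.Println("alert once per", stuckLockKey("orders", "daily", "2026-02-25"))
	}
}
```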
0.1.1 - 2026-02-25
- SLA Watchdog: Framework-level absence detection for silently missed pipelines
  - `internal/watchdog/` package with `CheckMissedSchedules()` pure function and `Watchdog` polling struct
  - Detects when upstream ingestion fails silently and no evaluation starts by the configured deadline
  - Deadline resolution: schedule-level `Deadline` > `SLA.EvaluationDeadline`
  - Dedup via distributed lock (`watchdog:{pipeline}:{schedule}:{date}`, 24h TTL)
  - Respects calendar exclusions and per-pipeline watch enable/disable
  - `EventScheduleMissed` event kind for audit trail
  - `WatchdogConfig` with enable/interval on `ProjectConfig`
  - `SchedulesMissed` expvar counter
  - 11 unit tests covering all edge cases
- AWS Lambda handler for watchdog (`cmd/lambda/watchdog/`), invoked by EventBridge
- Terraform resources: watchdog Lambda, EventBridge rule, IAM role (DynamoDB RW + SNS publish)
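
The deadline precedence and dedup key used by the missed-schedule check can be sketched as follows. The precedence order and key format come from the entries above; the function names and the use of plain strings for deadlines are assumptions made for the example.

```go
package main

import "fmt"

// resolveDeadline applies the documented precedence: a schedule-level
// deadline wins over the SLA evaluation deadline. The boolean is false when
// neither is configured, in which case there is nothing to watch.
func resolveDeadline(scheduleDeadline, slaEvaluationDeadline string) (string, bool) {
	if scheduleDeadline != "" {
		return scheduleDeadline, true
	}
	if slaEvaluationDeadline != "" {
		return slaEvaluationDeadline, true
	}
	return "", false
}

// missedLockKey builds the dedup lock key in the documented format
// watchdog:{pipeline}:{schedule}:{date}.
func missedLockKey(pipeline, schedule, date string) string {
	return fmt.Sprintf("watchdog:%s:%s:%s", pipeline, schedule, date)
}

func main() {
	deadline, ok := resolveDeadline("", "06:00")
	fmt.Println(deadline, ok)
	fmt.Println(missedLockKey("orders", "daily", "2026-02-25"))
}
```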
0.1.0 - 2026-02-23
Initial release of the Interlock STAMP-based safety framework for data pipeline reliability.
- STAMP safety model with archetypes, traits, readiness rules, and run state machine
- Engine for trait evaluation and readiness checking (`internal/engine/`)
- Subprocess and HTTP evaluator runners with JSON protocol (`internal/evaluator/`)
- Reactive watcher loop with poll-based evaluation lifecycle (`internal/watcher/`)
- Structured logging, request ID middleware, and HTTP API server
- Redis provider with Lua CAS, Streams, TTL retention (`internal/provider/redis/`)
- DynamoDB provider with single-table design, conditional writes, GSI (`internal/provider/dynamodb/`)
- Postgres archival store with cursor-based incremental sync (`internal/provider/postgres/`)
- Shared provider conformance test suite: 15 contract tests both backends pass
- Per-pipeline multi-schedule support with independent locks, run logs, and SLA windows
- Calendar-based exclusions (named YAML calendars + inline days/dates)
- Timezone-aware schedule activation and deadline resolution
- 8 trigger types: HTTP, command, Airflow, Glue, EMR, EMR Serverless, Step Functions, Databricks
- Runner dispatch with injectable AWS SDK clients and lazy initialization
- Generic `CheckStatus` polling for all AWS and Databricks trigger types
- Retry with configurable exponential backoff and failure classification
- Cascade notifications to downstream pipelines
- Post-completion drift monitoring with configurable duration
- Replay support for re-running completed pipelines
- SLA tracking with evaluation and completion deadlines
- 5 Lambda handlers: stream-router, evaluator, orchestrator, trigger, run-checker
- Step Function state machine (47 states) orchestrating full pipeline lifecycle
- DynamoDB Streams integration for event-driven evaluation
- SNS and S3 alert sinks for serverless alerting
- Shared Lambda init with environment-based configuration
- Terraform modules for AWS deployment (DynamoDB, Lambda, Step Functions, SNS, IAM, EventBridge)
- S3 backend bootstrap for Terraform state
- Docker Compose local environment with Redis and Postgres
- Lambda build script for cross-compilation
- Hugo documentation site with Hextra theme (18 content pages)
- GitHub Pages deployment workflow
- Architecture, configuration, deployment, and API reference docs
- Go CI workflow (lint, format, test)
- PR labeler for automatic path-based labels
- GitHub Pages docs deployment on push to main
- Unit tests for evaluator, orchestrator, run-checker, archiver, Lambda init, commands, schedule
- Local E2E test suite (6 scenarios with sensor-backed evaluators)
- AWS E2E test suite (6 scenarios with DynamoDB and Step Functions)
Released under the Elastic License 2.0.