Skip to content

Fix DagProcessor crash: add missing name_is_otel_safe() guard to gauge() and timer()#68284

Open
ersalil wants to merge 3 commits into
apache:mainfrom
ersalil:fix/otel-gauge-timer-missing-name-is-otel-safe
Open

Fix DagProcessor crash: add missing name_is_otel_safe() guard to gauge() and timer()#68284
ersalil wants to merge 3 commits into
apache:mainfrom
ersalil:fix/otel-gauge-timer-missing-name-is-otel-safe

Conversation

@ersalil

@ersalil ersalil commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

SafeOtelLogger.gauge() and timer() were missing the name_is_otel_safe() guard that incr(), decr(), and timing() all have. A DAG filename containing a space (e.g. PBI_SKU_Performance copy.py) caused the DagProcessor to crash on every loop iteration when OTel metrics were enabled.

Why

The DagProcessor emits a gauge metric for every known DAG file on every parsing loop:

Stats.gauge(f"dag_processing.last_run.seconds_ago.{file_name}", seconds_ago)

gauge() only validated against the allow/block list (metrics_validator.test()), which does not check OTel naming rules. A filename with a space passed that check and reached meter.create_gauge() inside the OTel SDK, which raised a plain Exception that propagated uncaught and crashed the process:

Exception: Expected ASCII string of maximum length 63 characters but got airflow.dag_processing.last_run.seconds_ago.PBI_SKU_Performance copy

Note: the OTel SDK error message says "63 characters" but this is a stale message — the actual SDK validation regex allows up to 255 characters (matching OTEL_NAME_MAX_LENGTH). The real rejection reason is the space character, which is not a valid OTel instrument name character. See open-telemetry/opentelemetry-python#3442.

The same gap existed in timer(): it had no validation at all before being passed to _OtelTimer, which would then call record_histogram_value() with the invalid name at .stop() time.

Fix

Added metrics_validator.test(stat) and name_is_otel_safe(self.prefix, stat) to gauge() (both the stat and back_compat_name paths) and timer(), making all five recording methods consistent:

Method Guard before Guard after
incr() ✅ both checks ✅ unchanged
decr() ✅ both checks ✅ unchanged
timing() ✅ both checks ✅ unchanged
gauge() ❌ validator only ✅ both checks
timer() ❌ none ✅ both checks

Testing

  • test_gauge_with_invalid_stat_names_skipped_without_raising — covers the exact reported scenario (space in filename) and non-ASCII, both reproducing the DagProcessor crash path.
  • test_timer_with_invalid_stat_name_does_not_record — covers non-ASCII and space for the timer path.
$ pytest shared/observability/tests/observability/metrics/test_otel_logger.py \
    -k "invalid_stat_names or invalid_stat_name" -v

test_gauge_with_invalid_stat_names_skipped_without_raising[...PBI_SKU_Performance copy] PASSED
test_gauge_with_invalid_stat_names_skipped_without_raising[...mein_däg_file]            PASSED
test_timer_with_invalid_stat_name_does_not_record[...preço_task.duration]              PASSED
test_timer_with_invalid_stat_name_does_not_record[...task copy.duration]               PASSED

========================= 4 passed in 0.14s =========================

closes: #68282

Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Sonnet 4.6)

This PR was prepared with Gen-AI assistance (Claude). I reviewed all generated code.

@ersalil ersalil marked this pull request as ready for review June 9, 2026 14:50
Comment thread shared/observability/src/airflow_shared/observability/metrics/otel_logger.py Outdated
Comment thread shared/observability/src/airflow_shared/observability/metrics/otel_logger.py Outdated
@kaxil kaxil added this to the Airflow 3.3.0 milestone Jun 9, 2026
@ersalil ersalil requested a review from kaxil June 10, 2026 00:30

@kaxil kaxil left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static checks are failing:

    shared/observability/src/airflow_shared/observability/metrics/otel_logger.py:203: error: Argument 1 to "test" of "ListValidator" has incompatible type "str | None"; expected "str"  [arg-type]
                return bool(stat) and self.metrics_validator.test(stat) and name_is_otel_safe(self.prefix, stat)
                                                                  ^~~~
    shared/observability/src/airflow_shared/observability/metrics/otel_logger.py:203: error: Argument 2 to "name_is_otel_safe" has incompatible type "str | None"; expected "str"  [arg-type]
                return bool(stat) and self.metrics_validator.test(stat) and name_is_otel_safe(self.prefix, stat)
                                                                                                           ^~~~
    Found 2 errors in 1 file (checked 20 source files)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SafeOtelLogger.gauge() missing name_is_otel_safe() guard causes DagProcessor crash for DAG filenames with spaces or invalid OTel characters

2 participants