Skip to content

[v3-2-test] Fix Callback.handle_event crash on OTel metrics with dict tag values (#67527)#67528

Closed
vatsrahul1001 wants to merge 1 commit into
v3-2-testfrom
backport-67527-to-v3-2-test
Closed

[v3-2-test] Fix Callback.handle_event crash on OTel metrics with dict tag values (#67527)#67528
vatsrahul1001 wants to merge 1 commit into
v3-2-testfrom
backport-67527-to-v3-2-test

Conversation

@vatsrahul1001

Copy link
Copy Markdown
Contributor

Manual backport of #67527 to `v3-2-test` for 3.2.3.

The auto-backport bot can't cherry-pick this one — `main` and `v3-2-test` diverged in the imports block of `airflow-core/src/airflow/models/callback.py` (`main` has `from dataclasses import dataclass`, `v3-2-test` still has `import inspect; from collections.abc import Callable`), so the cherry-pick conflicts. Conflict resolved by keeping the v3-2-test imports and adding the new `import json` the fix needs. The `Stats.incr` call site (v3-2-test) vs `stats.incr` (main) needed no change — `handle_event` carries the v3-2-test form unmodified.

Why this needs backporting

`Callback.handle_event` crashes the triggerer when OpenTelemetry metrics are enabled and a deadline async callback completes with a dict-typed `result` (or dict-typed `kwargs`). OTel builds its aggregation key as `frozenset(attributes.items())`, which raises `TypeError: unhashable type: 'dict'`. Statsd accepts non-primitive tag values silently, so OSS deployments on default metrics don't see it — Astronomer Astro and any deployment with `[metrics] otel_*` configured hit it consistently. Reported by the Astro Runtime team while exercising 3.2.2rc2-based images.

What's in the cherry-pick

  • `Callback.get_metric_info` coerces non-primitive tag values (dict/list/set) to JSON strings before returning. Primitives (`str | int | float | bool | None`) pass through unchanged. `json.dumps(..., default=str, sort_keys=True)` so `datetime` values fall back cleanly and equivalent dicts in different insertion order don't split metric cardinality.
  • New regression test `test_get_metric_info_dict_values_are_stringified` asserts every tag value is a primitive and that `frozenset(tags.items())` does not raise.
  • Updates the existing `test_get_metric_info` to expect the JSON-stringified `kwargs` shape.

Review feedback already addressed in #67527

  • `sort_keys=True` per @uranusjr's suggestion
  • Cross-ref comment removed from the existing test per @amoghrajesh's suggestion

The triggerer crashes on the next deadline async callback when OpenTelemetry
metrics are enabled:

    File ".../airflow/jobs/triggerer_job_runner.py", line 659, in handle_events
        Trigger.submit_event(...)
    File ".../airflow/models/callback.py", line 234, in handle_event
        Stats.incr(**self.get_metric_info(status, self.output))
    File ".../airflow/_shared/observability/metrics/otel_logger.py", line 211, in incr
        counter.add(count, attributes=tags)
    File ".../opentelemetry/sdk/metrics/.../view_instrument_match.py", line 105
        aggr_key = frozenset(attributes.items())
    TypeError: unhashable type: 'dict'

`Callback.get_metric_info` builds the metric tags dict directly from the
callback's `result` and `self.data` (which includes `kwargs`). Both are
frequently dicts — for deadline async callbacks the `result` is the user
callback's return value, and `kwargs` is the captured callback kwargs. When
the metrics backend is OTel, the SDK builds the aggregation key as
`frozenset(attributes.items())`, which raises if any value is unhashable
(dict, list, set). The result is a triggerer crash and stalled triggers.

The bug is metrics-backend-dependent: statsd accepts non-primitive tag values
without complaint, so OSS users running default statsd never see it. OTel
backends (used in production by Astronomer Astro Cloud and any OSS deployment
that enables `[metrics] otel_*`) hit it consistently.

Reproduces against 3.2.1 and main; not a 3.2.x regression.

Sanitize tag values to primitives before returning from `get_metric_info`:
keep `str | int | float | bool | None` as-is, JSON-stringify anything else.
Using `default=str` in `json.dumps` so values like `datetime` fall back
cleanly instead of raising.

Adds a regression test that asserts every tag value is hashable and that
`frozenset(tags.items())` does not raise.

Reported by Astronomer Runtime team while testing 3.2.2rc2-based images.

(cherry picked from commit 9a34e73)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant