[v3-2-test] Fix Callback.handle_event crash on OTel metrics with dict tag values (#67527)#67528
Closed
vatsrahul1001 wants to merge 1 commit into
Closed
[v3-2-test] Fix Callback.handle_event crash on OTel metrics with dict tag values (#67527)#67528vatsrahul1001 wants to merge 1 commit into
vatsrahul1001 wants to merge 1 commit into
Conversation
The triggerer crashes on the next deadline async callback when OpenTelemetry
metrics are enabled:
File ".../airflow/jobs/triggerer_job_runner.py", line 659, in handle_events
Trigger.submit_event(...)
File ".../airflow/models/callback.py", line 234, in handle_event
Stats.incr(**self.get_metric_info(status, self.output))
File ".../airflow/_shared/observability/metrics/otel_logger.py", line 211, in incr
counter.add(count, attributes=tags)
File ".../opentelemetry/sdk/metrics/.../view_instrument_match.py", line 105
aggr_key = frozenset(attributes.items())
TypeError: unhashable type: 'dict'
`Callback.get_metric_info` builds the metric tags dict directly from the
callback's `result` and `self.data` (which includes `kwargs`). Both are
frequently dicts — for deadline async callbacks the `result` is the user
callback's return value, and `kwargs` is the captured callback kwargs. When
the metrics backend is OTel, the SDK builds the aggregation key as
`frozenset(attributes.items())`, which raises if any value is unhashable
(dict, list, set). The result is a triggerer crash and stalled triggers.
The bug is metrics-backend-dependent: statsd accepts non-primitive tag values
without complaint, so OSS users running default statsd never see it. OTel
backends (used in production by Astronomer Astro Cloud and any OSS deployment
that enables `[metrics] otel_*`) hit it consistently.
Reproduces against 3.2.1 and main; not a 3.2.x regression.
Sanitize tag values to primitives before returning from `get_metric_info`:
keep `str | int | float | bool | None` as-is, JSON-stringify anything else.
Using `default=str` in `json.dumps` so values like `datetime` fall back
cleanly instead of raising.
Adds a regression test that asserts every tag value is hashable and that
`frozenset(tags.items())` does not raise.
Reported by Astronomer Runtime team while testing 3.2.2rc2-based images.
(cherry picked from commit 9a34e73)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Manual backport of #67527 to `v3-2-test` for 3.2.3.
The auto-backport bot can't cherry-pick this one — `main` and `v3-2-test` diverged in the imports block of `airflow-core/src/airflow/models/callback.py` (`main` has `from dataclasses import dataclass`, `v3-2-test` still has `import inspect; from collections.abc import Callable`), so the cherry-pick conflicts. Conflict resolved by keeping the v3-2-test imports and adding the new `import json` the fix needs. The `Stats.incr` call site (v3-2-test) vs `stats.incr` (main) needed no change — `handle_event` carries the v3-2-test form unmodified.
Why this needs backporting
`Callback.handle_event` crashes the triggerer when OpenTelemetry metrics are enabled and a deadline async callback completes with a dict-typed `result` (or dict-typed `kwargs`). OTel builds its aggregation key as `frozenset(attributes.items())`, which raises `TypeError: unhashable type: 'dict'`. Statsd accepts non-primitive tag values silently, so OSS deployments on default metrics don't see it — Astronomer Astro and any deployment with `[metrics] otel_*` configured hit it consistently. Reported by the Astro Runtime team while exercising 3.2.2rc2-based images.
What's in the cherry-pick
Review feedback already addressed in #67527