Skip to content

Guard transport.send() in proc() so IAM/gRPC errors don't crash supervised components#68250

Closed
goingforstudying-ctrl wants to merge 2 commits into
apache:mainfrom
goingforstudying-ctrl:fix/stackdriver-guard-transport-send
Closed

Guard transport.send() in proc() so IAM/gRPC errors don't crash supervised components#68250
goingforstudying-ctrl wants to merge 2 commits into
apache:mainfrom
goingforstudying-ctrl:fix/stackdriver-guard-transport-send

Conversation

@goingforstudying-ctrl

Copy link
Copy Markdown

Fixes Bug 3 from #68240.

What

StackdriverRemoteLogIO.processorsproc() calls transport.send() without
any error handling. In Airflow 3's supervisor model, REMOTE_TASK_LOG applies
to ALL supervised components (scheduler, dag-processor, triggerer, workers).
A single IAM misconfiguration or gRPC error would crash the entire process.

Observed: dag-processor pod enters CrashLoopBackOff on every log emit when the
Kubernetes Service Account lacks the logging.logEntries.create IAM binding.

Fix

Wrap _transport.send() in try/except Exception and emit a
logging.warning instead of propagating. Log delivery is best-effort;
a Cloud Logging error should never kill a task-executing process.

Changes

  • Guarded _transport.send() call in proc() with try/except
  • New test test_processors_survives_transport_send_failure verifies:
    • proc() returns the event (doesn't crash)
    • warning is logged with the failure details

relates to #68240

@henry3260 henry3260 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify how the observed IAM, gRPC error propagates from _transport.send()? StackdriverRemoteLogIO uses BackgroundThreadTransport by default, whosesend()only enqueues the entry. The network request happens in the background worker, where exceptions frombatch.commit()are already caught.

Comment thread providers/google/tests/unit/google/cloud/log/test_stackdriver_task_handler.py Outdated
@goingforstudying-ctrl goingforstudying-ctrl force-pushed the fix/stackdriver-guard-transport-send branch 18 times, most recently from c55644c to 1e8ef8b Compare June 11, 2026 17:15
@goingforstudying-ctrl goingforstudying-ctrl force-pushed the fix/stackdriver-guard-transport-send branch 16 times, most recently from d7f36ad to b4f0658 Compare June 17, 2026 05:26
@shahar1 shahar1 changed the title fix(google): guard transport.send() in proc() so IAM/gRPC errors don't crash supervised components Guard transport.send() in proc() so IAM/gRPC errors don't crash supervised components Jun 17, 2026
@shahar1

shahar1 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Could you clarify how the observed IAM, gRPC error propagates from _transport.send()? StackdriverRemoteLogIO uses BackgroundThreadTransport by default, whosesend()only enqueues the entry. The network request happens in the background worker, where exceptions frombatch.commit()are already caught.

@goingforstudying-ctrl please address Henry's concern, thanks!

@goingforstudying-ctrl goingforstudying-ctrl force-pushed the fix/stackdriver-guard-transport-send branch 9 times, most recently from fdb836f to bf2e7d8 Compare June 18, 2026 00:15
goingforstudying-ctrl and others added 2 commits June 19, 2026 20:16
In Airflow 3 the supervisor process runs REMOTE_TASK_LOG handlers, but
record.task_instance is never set in supervisor context (it is a
task-subprocess concept).  When task_instance is missing the proc()
closure shipped log entries with empty labels, making Cloud Logging
entries unsearchable by dag_id / task_id.

Parse dag_id, task_id, and try_number from the structured AF3 log path
(dag_id=<x>/run_id=<x>/task_id=<x>/attempt=<N>.log) instead.  This
requires zero DB access and works regardless of whether the handler
runs in a task subprocess or the supervisor.

relates to apache#68240
- Rename _labels_from_path to _extract_labels_from_path
- Use path labels as fallback instead of overriding event-derived labels
- Actually store run_id instead of no-op pass branch
@goingforstudying-ctrl

Copy link
Copy Markdown
Author

Closing this one — the label extraction fix was already merged via #68292 by ashb. Henry's point about BackgroundThreadTransport already catching network errors is also correct: send() only enqueues so the described IAM crash scenario doesn't apply. Thanks for the reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:logging area:providers provider:google Google (including GCP) related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants