[OTLP Validation] OTLP Data Quality Validation — Run 26434253996 (2026-05-26) #34877

@github-actions

A. Executive Summary

Overall Status: WARN

The gh-aw.agent.setup span is correctly shaped, fully attributed, and successfully exported to Sentry. However, the Grafana Tempo endpoint — the second configured OTLP fan-out target — received zero connections during this run. This represents a silent fan-out failure: no export errors were written, yet no spans reached Grafana. The run is still in-flight (conclusion and agent spans are pending).

Main risks:

  • Grafana Tempo is receiving no telemetry from gh-aw workflows (silent data loss)
  • If the failure is systemic, historical span coverage in Grafana is zero for this workflow

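Most likely root cause: The Grafana endpoint URL is missing the /v1/traces path suffix (https://otlp-gateway-prod-eu-west-2.grafana.net/otlp should be .../otlp/v1/traces), which would produce a non-retried 404 that is not logged to the error file. Because a 404 would still appear as a TCP connection in the firewall log, the zero-connection evidence also points to the export being abandoned, and its error swallowed, before anything reaches /tmp/gh-aw/otlp-export-errors.jsonl (see section F).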


B. Trace Completeness

| Metric | Value |
| --- | --- |
| Validation window | 2026-05-26 05:32:44 UTC onward (run in-flight) |
| service.name | gh-aw.otlp-data-quality-validator |
| github.run_id | 26434253996 |
| Unique traceId in JSONL | 1 (1bc136b146e27e5d4f880a21b39aaca3) |
| Unique span identity (traceId+spanId) | 1 |
| Duplicate spans | 0 |
| Spans in JSONL mirror | 1 (gh-aw.agent.setup) |
| Spans exported to Sentry | 13 TCP connections observed (all status 200); covers activation.setup, activation.conclusion, agent.setup, and prior retries |
| Spans exported to Grafana | 0 connections observed |
| Confidence | High for Sentry (firewall evidence); High for Grafana gap (no *.grafana.net connections in 257-entry audit log) |

Note: The conclusion span (gh-aw.agent.conclusion) and the agent span (gh-aw.agent.agent) have not been emitted yet because the agent job is still running. They are expected to be absent at validation time and are not counted as missing.

JSONL mirror path discrepancy: The workflow prompt and spec reference /tmp/gh-aw/agent/otel.jsonl, but the actual mirror is at /tmp/gh-aw/otel.jsonl. The agent/ subdirectory is empty. This path mismatch would cause any JSONL-based tooling pointed at the documented path to report false negatives.
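The completeness counts in the table above (unique trace IDs, unique span identities, duplicates) can be derived from the JSONL mirror roughly as follows. This is a minimal sketch, assuming each line of the mirror is one OTLP/JSON payload with the resourceSpans / scopeSpans / spans nesting used by the jq queries in section H; the function names are hypothetical and the sample line reuses the IDs reported in this run.

```python
import json
from collections import Counter

def span_identities(jsonl_lines):
    """Yield (traceId, spanId, name) for every span in OTLP/JSON lines."""
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines in the mirror
        payload = json.loads(line)
        for rs in payload.get("resourceSpans", []):
            for ss in rs.get("scopeSpans", []):
                for span in ss.get("spans", []):
                    yield span["traceId"], span["spanId"], span.get("name")

def completeness_summary(jsonl_lines):
    """Count unique span identities and how many extra copies exist."""
    counts = Counter((t, s) for t, s, _ in span_identities(jsonl_lines))
    return {
        "trace_ids": sorted({t for t, _ in counts}),
        "unique_spans": len(counts),
        "duplicate_spans": sum(n - 1 for n in counts.values()),
    }

# Sample line built from the IDs reported for this run.
sample = [json.dumps({"resourceSpans": [{"scopeSpans": [{"spans": [
    {"traceId": "1bc136b146e27e5d4f880a21b39aaca3",
     "spanId": "c4daf0dc08e1d824",
     "parentSpanId": "8b496f47c4e6810f",
     "name": "gh-aw.agent.setup"}]}]}]})]
summary = completeness_summary(sample)
print(summary)
```

To run it against a real mirror, pass `open("/tmp/gh-aw/otel.jsonl")` instead of `sample`.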


C. Span Hierarchy Validation

Single-job agent workflow (activation + agent jobs).

| Check | Result |
| --- | --- |
| Setup spans share global parent (8b496f47c4e6810f) | ✅ PASS: gh-aw.agent.setup parents under global root span ID |
| Conclusion spans parent under their setup span | ⏳ PENDING: agent still running |
| Agent spans parent under conclusion span | ⏳ PENDING: agent still running |
| Span naming pattern gh-aw.<job>.<op> | ✅ PASS: gh-aw.agent.setup matches |
| parentSpanId of setup span ≠ empty | ✅ PASS: 8b496f47c4e6810f present |
| GITHUB_AW_OTEL_PARENT_SPAN_ID == setup spanId | ✅ PASS: both equal c4daf0dc08e1d824 (conclusion span will parent correctly) |

Per spec §9.3, the single observed setup span correctly parents under the global root span ID.
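The setup-span checks in this section could be automated along these lines. This is a sketch rather than the validator's actual logic: the function name is hypothetical, and the regular expression is only one assumed reading of the gh-aw.<job>.<op> pattern (letters, digits, underscores, and hyphens in each segment).

```python
import re

# Assumed interpretation of the gh-aw.<job>.<op> naming pattern.
SPAN_NAME = re.compile(r"^gh-aw\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$")

def check_setup_span(span, global_parent_id, env_parent_span_id):
    """Return one boolean per hierarchy check for a single setup span."""
    return {
        "name_matches_pattern": bool(SPAN_NAME.match(span.get("name", ""))),
        "has_parent": bool(span.get("parentSpanId")),
        "parents_under_global_root": span.get("parentSpanId") == global_parent_id,
        # The conclusion span will parent correctly only if the exported
        # parent span ID equals this setup span's own spanId.
        "env_parent_equals_span_id": env_parent_span_id == span.get("spanId"),
    }

setup_span = {
    "name": "gh-aw.agent.setup",
    "spanId": "c4daf0dc08e1d824",
    "parentSpanId": "8b496f47c4e6810f",
}
result = check_setup_span(setup_span, "8b496f47c4e6810f", "c4daf0dc08e1d824")
for check, passed in result.items():
    print(check, "PASS" if passed else "FAIL")
```

In a live run the third argument would come from the GITHUB_AW_OTEL_PARENT_SPAN_ID environment variable.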


D. Attribute Contract Validation

Setup span (gh-aw.agent.setup) — spec §10.1

All 7 required attributes present:

| Attribute | Value |
| --- | --- |
| gh-aw.job.name | agent |
| gh-aw.workflow.name | OTLP Data Quality Validator |
| gh-aw.run.id | 26434253996 |
| gh-aw.run.attempt | 1 |
| gh-aw.run.actor | mnkiefer |
| gh-aw.repository | github/gh-aw |
| gh-aw.staged | false (boolValue) |

Additional attributes present: gh-aw.cli.version, gen_ai.system (github_models), gh-aw.engine.id, gh-aw.event_name, gh-aw.episode.id, gh-aw.episode.kind, gh-aw.hop.id, gh-aw.workflow_call.id.

Conclusion span — spec §10.2

⏳ PENDING (agent still running). Cannot validate required attributes: gh-aw.run.status, gh-aw.error_count, gh-aw.warning_count, gh-aw.action_minutes, gh-aw.output.item_count, gh-aw.otlp.export_errors.

Agent span — spec §10.3

⏳ PENDING. Cannot validate GenAI semantic conventions: gen_ai.system, gen_ai.request.model, gen_ai.operation.name, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens.

Resource attributes — spec §11.1

All 6 required resource attributes present:

| Attribute | Value |
| --- | --- |
| service.name | gh-aw.otlp-data-quality-validator |
| service.version | 1.0.52 |
| github.repository | github/gh-aw |
| github.run_id | 26434253996 |
| github.run_attempt | 1 |
| github.actions.run_url | https://github.com/github/gh-aw/actions/runs/26434253996 |

Additional resource attributes present: github.event_name, github.ref, github.ref_name, github.sha, github.job, github.workflow_ref, github.actor_id, runner.*, gh-aw.awf.version, deployment.environment.

Instrumentation scope — spec §11.3

✅ PASS — scope.name = gh-aw, scope.version = 1.0.52 (matches service.version).
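The resource-attribute and instrumentation-scope checks above can be expressed as one small function. This is a sketch, assuming the OTLP/JSON layout where `resource.attributes` is a list of key/value pairs and each `scopeSpans` entry carries a `scope` object; the required list is copied from the §11.1 table, and the function names are hypothetical.

```python
REQUIRED_RESOURCE_ATTRS = [
    "service.name", "service.version", "github.repository",
    "github.run_id", "github.run_attempt", "github.actions.run_url",
]

def attr_map(attributes):
    """Flatten OTLP key/value pairs, taking the single typed value of each."""
    return {a["key"]: next(iter(a["value"].values())) for a in attributes}

def check_resource_and_scope(resource_span):
    """Report missing resource attributes and scopes whose version differs."""
    resource = attr_map(resource_span.get("resource", {}).get("attributes", []))
    missing = [k for k in REQUIRED_RESOURCE_ATTRS if k not in resource]
    mismatched = [
        ss.get("scope", {}).get("name")
        for ss in resource_span.get("scopeSpans", [])
        if ss.get("scope", {}).get("version") != resource.get("service.version")
    ]
    return {"missing_resource_attrs": missing, "scope_version_mismatches": mismatched}

sample_resource_span = {
    "resource": {"attributes": [
        {"key": "service.name", "value": {"stringValue": "gh-aw.otlp-data-quality-validator"}},
        {"key": "service.version", "value": {"stringValue": "1.0.52"}},
        {"key": "github.repository", "value": {"stringValue": "github/gh-aw"}},
        {"key": "github.run_id", "value": {"stringValue": "26434253996"}},
        {"key": "github.run_attempt", "value": {"stringValue": "1"}},
        {"key": "github.actions.run_url",
         "value": {"stringValue": "https://github.com/github/gh-aw/actions/runs/26434253996"}},
    ]},
    "scopeSpans": [{"scope": {"name": "gh-aw", "version": "1.0.52"}, "spans": []}],
}
report = check_resource_and_scope(sample_resource_span)
print(report)
```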


E. Export and Fan-out Health

| Endpoint | URL | Status | Firewall connections |
| --- | --- | --- | --- |
| Sentry | https://o205451.ingest.us.sentry.io/api/4511347087179777/integration/otlp | ✅ Reachable (TCP_TUNNEL 200) | 13 connections observed |
| Grafana Tempo | https://otlp-gateway-prod-eu-west-2.grafana.net/otlp | ⚠️ Zero connections | 0 in 257 audit entries |

Sentry header rewrite: ✅ Applied — Authorization header correctly rewritten to x-sentry-auth in GH_AW_OTLP_ENDPOINTS.

Export error log: No /tmp/gh-aw/agent/otlp-export-errors.jsonl or /tmp/gh-aw/agent/otlp-export-errors.count found. This means either Grafana errors are not being captured, or the export attempt never reaches the HTTP layer.

JSONL mirror write: /tmp/gh-aw/otel.jsonl contains the gh-aw.agent.setup span with correct content.

Multi-endpoint fan-out independence: Sentry receives spans regardless of the Grafana failure, so fan-out independence is structurally preserved on the Sentry side. The Grafana side fails silently without affecting Sentry delivery.
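The independence property described here, where one endpoint's failure never blocks another and every attempt leaves a visible result, can be illustrated with a short sketch. It is not the code in send_otlp_span.cjs: `fan_out` and `fake_send` are hypothetical names, and the HTTP call is injected as a callable so the example runs without network access.

```python
def fan_out(endpoints, payload, send):
    """Try every endpoint; one endpoint's failure never skips the others."""
    results = []
    for endpoint in endpoints:
        url = endpoint["url"]
        try:
            status = send(url, endpoint.get("headers", {}), payload)
            results.append({"url": url, "ok": 200 <= status < 300,
                            "detail": f"HTTP {status}"})
        except Exception as exc:  # also covers failures before any connection
            results.append({"url": url, "ok": False,
                            "detail": f"{type(exc).__name__}: {exc}"})
    return results

def fake_send(url, headers, payload):
    """Stand-in transport: fails for Grafana, succeeds for everything else."""
    if "grafana.net" in url:
        raise ConnectionError("simulated failure")
    return 200

endpoints = [
    {"url": "https://otlp-gateway-prod-eu-west-2.grafana.net/otlp"},
    {"url": "https://o205451.ingest.us.sentry.io/api/4511347087179777/integration/otlp"},
]
results = fan_out(endpoints, b"{}", fake_send)
for r in results:
    print(r["url"], r["ok"], r["detail"])
```

Returning a result for every endpoint, including the failed one, is what would make a gap like the one observed for Grafana visible instead of silent.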


F. Root-Cause Hypothesis

Primary hypothesis (confidence: MEDIUM-HIGH)

Grafana endpoint URL missing /v1/traces suffix.

  • Configured: https://otlp-gateway-prod-eu-west-2.grafana.net/otlp
  • Expected: https://otlp-gateway-prod-eu-west-2.grafana.net/otlp/v1/traces

Grafana Cloud's OTLP gateway requires the full path. Without /v1/traces, the server returns an HTTP 404. If the send_otlp_span.cjs error handler swallows non-retryable 4xx errors without writing to the error log, that would explain the missing error records, but not the zero firewall connections: a 404 response still requires a TCP connection, which the audit log would show. The absence of any TCP connection to *.grafana.net therefore suggests the export attempt is abandoned before the HTTP call is made.
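The suffix fix proposed in the primary hypothesis amounts to appending the trace signal path when it is absent. A minimal sketch follows; the helper name `with_traces_path` is hypothetical. Whether other targets such as the Sentry integration URL expect the same suffix is not established in this report, so the helper is shown only for the Grafana URL.

```python
from urllib.parse import urlsplit, urlunsplit

def with_traces_path(endpoint_url, signal_path="/v1/traces"):
    """Append the trace signal path unless the URL already ends with it."""
    parts = urlsplit(endpoint_url)
    path = parts.path.rstrip("/")
    if not path.endswith(signal_path):
        path += signal_path
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

configured = "https://otlp-gateway-prod-eu-west-2.grafana.net/otlp"
fixed = with_traces_path(configured)
print(fixed)
```

The function is idempotent, so applying it to a URL that already carries the suffix leaves it unchanged.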

Secondary hypothesis (confidence: MEDIUM)

The GH_AW_OTEL_GRAFANA_ENDPOINT secret resolves to an empty string at compile time.

If the secret was missing when the lock file was compiled, the URL in GH_AW_OTLP_ENDPOINTS would be an empty string. The parseOTLPEndpoints() function in send_otlp_span.cjs filters out entries with an empty .url (.filter(e => e && typeof e.url === "string" && e.url.trim() !== "")). However, GH_AW_OTLP_ENDPOINTS shows a non-empty Grafana URL, so this is less likely.
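The effect of that filter is easy to reproduce in isolation. The following Python analogue is only an illustration of the quoted predicate, not a port of parseOTLPEndpoints(); it assumes GH_AW_OTLP_ENDPOINTS holds a JSON array of objects with a `url` field, which this report does not confirm.

```python
import json

def parse_otlp_endpoints(raw):
    """Keep only entries whose url is a non-blank string."""
    try:
        entries = json.loads(raw) if raw else []
    except json.JSONDecodeError:
        return []
    if not isinstance(entries, list):
        return []
    return [
        e for e in entries
        if isinstance(e, dict)
        and isinstance(e.get("url"), str)
        and e["url"].strip() != ""
    ]

# An empty Grafana URL is dropped silently: no error, no connection attempt.
raw = json.dumps([
    {"url": "https://o205451.ingest.us.sentry.io/api/4511347087179777/integration/otlp"},
    {"url": ""},
])
kept = parse_otlp_endpoints(raw)
print([e["url"] for e in kept])
```

An entry dropped this way would leave no trace in the firewall log or the error file, which is why the hypothesis is worth ruling out explicitly.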

Tertiary hypothesis (confidence: LOW)

Exception in Grafana export branch swallowed before fetch() call.

A JavaScript error (e.g., header parsing failure) in the Grafana branch could abort the export before TCP connects. The lack of any error log makes this possible but unconfirmed.

Alternative explanations ruled out:

  • Network allowlist blocking: *.grafana.net is in shared/otlp.md network allowlist ✅
  • if-missing: error blocking: Would fail the job entirely, not silently skip Grafana ✅
  • Ingestion delay: Zero TCP connections eliminates this ✅

G. Recommended Fixes (Prioritized)

  1. Investigate and fix Grafana fan-out silence (P0 — data loss)

    • Check whether send_otlp_span.cjs actually iterates over every GH_AW_OTLP_ENDPOINTS entry, including the Grafana entry
    • Verify the Grafana URL has the correct /v1/traces suffix in the secret GH_AW_OTEL_GRAFANA_ENDPOINT
    • Add explicit per-endpoint error logging to /tmp/gh-aw/agent/otlp-export-errors.jsonl for all HTTP failures, including 4xx responses
  2. Fix JSONL mirror path discrepancy (P1 — observability tooling breakage)

    • The spec and workflow prompt reference /tmp/gh-aw/agent/otel.jsonl, but the actual file is at /tmp/gh-aw/otel.jsonl
    • Update specs/otel-observability-spec.md or the runtime to align the documented and actual path
  3. Ensure export errors are always written (P2 — silent failure prevention)

    • The absence of an export errors file despite apparent Grafana data loss indicates that error capture is incomplete
    • send_otlp_span.cjs should write a record to otlp-export-errors.jsonl for every failed endpoint, including pre-HTTP failures (DNS, URL parse errors, empty URL after filtering)
  4. Add backend visibility verification to validation workflow (P3 — diagnostic coverage)

    • Once Grafana export is restored, add a step that queries the Grafana Tempo backend via API to confirm span ingestion after the run completes
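Fix 3 asks for an error record on every failed endpoint, including failures that happen before any HTTP request. A sketch of what that could look like is below; the record fields (`time`, `endpoint`, `stage`, `detail`) and both function names are assumptions made for illustration, not the existing format of otlp-export-errors.jsonl.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from urllib.parse import urlsplit

def record_export_error(log_path, endpoint_url, stage, detail):
    """Append one JSON line describing a failed export attempt."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "endpoint": endpoint_url,
        "stage": stage,      # hypothetical values: "pre-http" or "http"
        "detail": detail,
    }
    os.makedirs(os.path.dirname(log_path) or ".", exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record

def preflight(log_path, endpoint_url):
    """Reject unusable URLs before connecting, and log the rejection."""
    parts = urlsplit(endpoint_url or "")
    if parts.scheme not in ("http", "https") or not parts.netloc:
        record_export_error(log_path, endpoint_url, "pre-http",
                            "invalid or empty endpoint URL")
        return False
    return True

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "otlp-export-errors.jsonl")
    ok_empty = preflight(log_path, "")
    ok_grafana = preflight(log_path, "https://otlp-gateway-prod-eu-west-2.grafana.net/otlp")
    with open(log_path, encoding="utf-8") as fh:
        logged = [json.loads(line) for line in fh]
print(ok_empty, ok_grafana, len(logged))
```

With a record written at the pre-HTTP stage, a run like this one would show a logged error for Grafana instead of an empty error file.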

H. Validation Queries and Commands Used

# JSONL mirror span summary
jq -c '.resourceSpans[].scopeSpans[].spans[] | {name, traceId, spanId, parentSpanId, kind}' /tmp/gh-aw/otel.jsonl

# Resource attributes
jq -c '.resourceSpans[].resource.attributes[] | {(.key): .value}' /tmp/gh-aw/otel.jsonl | sort -u

# Setup span required attribute check
python3 -c "
import json
# The mirror is JSONL: parse one OTLP payload per non-empty line
attrs={}
for line in open('/tmp/gh-aw/otel.jsonl'):
    if not line.strip(): continue
    data=json.loads(line)
    attrs.update({a['key']:list(a['value'].values())[0] for rs in data['resourceSpans'] for ss in rs['scopeSpans'] for s in ss['spans'] for a in s.get('attributes',[])})
required=['gh-aw.job.name','gh-aw.workflow.name','gh-aw.run.id','gh-aw.run.attempt','gh-aw.run.actor','gh-aw.repository','gh-aw.staged']
print('Missing:', [k for k in required if k not in attrs])
"

# Trace ID consistency check
echo "JSONL:"; jq -r '.resourceSpans[].scopeSpans[].spans[].traceId' /tmp/gh-aw/otel.jsonl | sort -u
echo "ENV: $GITHUB_AW_OTEL_TRACE_ID"

# Firewall OTLP endpoint connections
python3 -c "
import json
entries={}
for line in open('/tmp/gh-aw/sandbox/firewall/logs/audit.jsonl'):
    d=json.loads(line)
    h=d.get('host','').split(':')[0]
    entries[h]=entries.get(h,0)+1
for h,c in sorted(entries.items(),key=lambda x:-x[1])[:10]: print(h,c)
"

# Export errors
cat /tmp/gh-aw/agent/otlp-export-errors.jsonl 2>/dev/null || echo "No export errors file"
cat /tmp/gh-aw/agent/otlp-export-errors.count 2>/dev/null || echo "0"

Validated against: specs/otel-observability-spec.md §9.3, §10.1, §11.1, §11.3
Workflow run: 26434253996
Validation time: 2026-05-26T05:35 UTC (run in-flight)

Generated by 🧭 OTLP Data Quality Validator · sonnet46 2.2M ·

  • expires on Jun 2, 2026, 5:40 AM UTC
