OCM-19740 - local observability#571

Open
Alcamech wants to merge 1 commit into
openshift:master from
Alcamech:OCM-19740

@Alcamech Alcamech commented Jan 26, 2026

What type of PR is this?

Documentation

What this PR does / why we need it?

Adds an "Adding New Metrics" guide to docs/metrics.md with step-by-step instructions for defining, registering, and verifying Prometheus metrics locally.
Adds a Metrics Tracing guide to docs/metrics-tracing.md that provides a comprehensive mapping of all Prometheus metrics.

Which Jira/Github issue(s) this PR fixes?

OCM-19740

openshift-ci Bot commented Jan 26, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Alcamech
Once this PR has been reviewed and has the lgtm label, please assign clcollins for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread docs/metrics-tracing.md
- Invalid PDBs could block node drains
- Manual interventions detected

**Alert**: `UpgradeClusterCheckFailedSRE` (paging)

Contributor

Do we have this alert?
As far as I remember, we never implemented it.

Author

This is referenced in the pagingAlerts slice in pkg/metrics/metrics.go:74-81 but I do not see it in https://github.com/openshift/managed-cluster-config/blob/master/deploy/sre-prometheus/100-managed-upgrade-operator.PrometheusRule.yaml

Do you want me to remove this reference from the doc and pagingAlerts slice?

Comment thread docs/metrics-tracing.md

**Paging Alerts Tracked** (from `pkg/metrics/metrics.go:74-81`):
- `UpgradeConfigValidationFailedSRE`
- `UpgradeClusterCheckFailedSRE`

Contributor

I don't remember us having this alert.

Author

This is referenced in the pagingAlerts slice in pkg/metrics/metrics.go:74-81 but I do not see it in https://github.com/openshift/managed-cluster-config/blob/master/deploy/sre-prometheus/100-managed-upgrade-operator.PrometheusRule.yaml

Do you want me to remove this reference from the doc and pagingAlerts slice?

Contributor

This alert was removed in 2021. Maybe you can comment it out.
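
The mismatch discussed in this thread (an alert name tracked in the `pagingAlerts` slice but absent from the deployed PrometheusRule) is the kind of drift a small set comparison can surface. The following is a minimal sketch only: `diffAlerts` is a hypothetical helper that does not exist in the operator, and the `deployed` list in `main` is made-up sample data, not the actual contents of the rule file.

```go
package main

import (
	"fmt"
	"sort"
)

// diffAlerts is a hypothetical helper (not part of the operator).
// It returns the names present only in tracked (stale) and the names
// present only in deployed (missing), each sorted alphabetically.
func diffAlerts(tracked, deployed []string) (stale, missing []string) {
	inTracked := make(map[string]bool, len(tracked))
	for _, name := range tracked {
		inTracked[name] = true
	}
	inDeployed := make(map[string]bool, len(deployed))
	for _, name := range deployed {
		inDeployed[name] = true
	}
	for name := range inTracked {
		if !inDeployed[name] {
			stale = append(stale, name)
		}
	}
	for name := range inDeployed {
		if !inTracked[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(stale)
	sort.Strings(missing)
	return stale, missing
}

func main() {
	// Names quoted in this thread as tracked in code.
	tracked := []string{
		"UpgradeConfigValidationFailedSRE",
		"UpgradeClusterCheckFailedSRE",
	}
	// Assumption: illustrative stand-in for the names found in the rule file.
	deployed := []string{
		"UpgradeConfigValidationFailedSRE",
	}
	stale, missing := diffAlerts(tracked, deployed)
	fmt.Println("tracked but not deployed:", stale)
	fmt.Println("deployed but not tracked:", missing)
}
```

In practice the `deployed` names would have to be extracted from the PrometheusRule YAML, which this sketch leaves out.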

Comment thread docs/metrics-tracing.md
- `UpgradeControlPlaneUpgradeTimeoutSRE`
- `UpgradeNodeUpgradeTimeoutSRE`
- `UpgradeNodeDrainFailedSRE`

Contributor

The `UpgradeStateNotificationFailureSRE` alert is missing.

@Alcamech Alcamech Mar 4, 2026

Author

This is commented out in the pagingAlerts slice in pkg/metrics/metrics.go:74-81

//"UpgradeNotificationFailedSRE", TODO: OSD-26790 - Create an Alert in mcc repo

but I do see it in https://github.com/openshift/managed-cluster-config/blob/master/deploy/sre-prometheus/100-managed-upgrade-operator.PrometheusRule.yaml

Do you also want me to uncomment this in the pagingAlerts slice?

Contributor

The name is UpgradeStateNotificationFailureSRE now
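
For reference, here is a sketch of how the slice might look if the feedback in these threads were applied. It lists only the alert names quoted in this conversation, so the real slice in `pkg/metrics/metrics.go` may contain entries not shown here. Enabling the renamed alert is an assumption, since the thread confirms the new name but not the decision to uncomment it. `isPagingAlert` is a hypothetical helper added for illustration.

```go
package main

import "fmt"

// pagingAlerts is a sketch, not the operator's actual slice. It contains
// only the names quoted in this review conversation.
var pagingAlerts = []string{
	"UpgradeConfigValidationFailedSRE",
	// "UpgradeClusterCheckFailedSRE", // commented out: reviewer says the alert was removed in 2021
	"UpgradeControlPlaneUpgradeTimeoutSRE",
	"UpgradeNodeUpgradeTimeoutSRE",
	"UpgradeNodeDrainFailedSRE",
	// Assumption: enabled under the name the reviewer gave; it was
	// previously commented out as "UpgradeNotificationFailedSRE".
	"UpgradeStateNotificationFailureSRE",
}

// isPagingAlert is a hypothetical helper that reports whether name is
// one of the tracked paging alerts.
func isPagingAlert(name string) bool {
	for _, alert := range pagingAlerts {
		if alert == name {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{
		"UpgradeStateNotificationFailureSRE",
		"UpgradeClusterCheckFailedSRE",
	} {
		fmt.Printf("%s tracked: %v\n", name, isPagingAlert(name))
	}
}
```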

openshift-ci Bot commented Jul 3, 2026

Contributor

@Alcamech: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/validate | 0b178f6 | link | true | `/test validate` |

@codecov-commenter


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.33%. Comparing base (b43d6ec) to head (0b178f6).
⚠️ Report is 138 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #571      +/-   ##
==========================================
- Coverage   54.35%   54.33%   -0.03%     
==========================================
  Files         123      123              
  Lines        6123     6212      +89     
==========================================
+ Hits         3328     3375      +47     
- Misses       2592     2631      +39     
- Partials      203      206       +3     

see 20 files with indirect coverage changes

3 participants