RHINENG-26546: add job to backfill workspace data#2221

Merged
TenSt merged 7 commits into
RedHatInsights:master from
TenSt:stepan/RHINENG-26546-job-to-backfill-workspace-data
Jun 2, 2026
Conversation

@TenSt TenSt commented Jun 1, 2026

This PR:

  • Adds a batched workspace_backfill job to populate workspace_id and workspace_name from workspaces
  • Adds job tests and Prometheus metrics
  • Adds a Grafana panel
  • Adds a suspended CronJob in deploy/clowdapp.yaml (every 10 min, 50k rows/run in prod)
  • Runs the job as the admin DB user (for session_replication_role and updates)
  • Adds a local Docker Compose stack and an e2e script
  • Adds a slim test_generate_system_inventory.sql for local load testing
  • Adds dev/workspace_backfill.md with docs and how to test locally
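
For a rough sense of how long the backfill takes at the production settings listed above (one run every 10 minutes, at most 50k rows per run), the arithmetic can be sketched as follows. The helper name and the pending-row count are hypothetical and not part of the PR; the real duration also depends on batch errors and on the CronJob being unsuspended.

```go
package main

// estimateBackfill returns how many CronJob runs are needed to drain
// pendingRows when each run updates at most maxRowsPerRun rows, and how many
// minutes pass between the first run and the start of the last one.
// Illustrative helper only; it is not part of the PR.
func estimateBackfill(pendingRows, maxRowsPerRun, intervalMinutes int64) (runs, minutesUntilLastRun int64) {
	if pendingRows <= 0 || maxRowsPerRun <= 0 {
		return 0, 0
	}
	// Ceiling division: a partially filled final run still counts as a run.
	runs = (pendingRows + maxRowsPerRun - 1) / maxRowsPerRun
	return runs, (runs - 1) * intervalMinutes
}
```

With a hypothetical 1,000,000 pending rows, `estimateBackfill(1_000_000, 50_000, 10)` gives 20 runs, with the last one starting 190 minutes after the first.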

sourcery-ai Bot commented Jun 1, 2026

Reviewer's Guide

Introduces a new batched workspace backfill job that populates the denormalized workspace_id and workspace_name columns from the workspaces JSON field. The job runs as the admin DB user and is wired into the job runner and into ClowdApp as a suspended CronJob with tunable limits. The PR also exposes Prometheus metrics and a Grafana panel for monitoring, and adds local Docker/e2e tooling plus SQL helpers for realistic load generation and verification.

File-Level Changes

Change Details Files
Add a batched workspace backfill job that updates system_inventory workspace columns from the workspaces JSON field with safety limits and metrics.
  • Introduce tasks/workspace_backfill package implementing account-partitioned batching with per-run row limits, per-batch size, and optional sleep between batches.
  • Run the backfill using the admin DB configuration so it can set session_replication_role to replica and update partitioned tables in a transaction-safe way.
  • Query pending accounts and rows via read-replica transactions, and track pending/invalid row stats with structured logging before each run.
  • Expose Prometheus counters for rows updated, batches completed, and batch errors, and push them via a dedicated pushgateway job name.
  • Add a unit test that seeds systems with workspaces JSON, clears the denormalized columns under the replica role, runs backfillBatch, and asserts that the workspace columns are set while last_updated is preserved.
  • Wire the new job into main.runJob via the workspace_backfill case so it can be invoked through the existing job entrypoint.
Files: tasks/workspace_backfill/workspace_backfill.go, tasks/workspace_backfill/metrics.go, tasks/workspace_backfill/workspace_backfill_test.go, main.go
Expose configuration and deployment wiring for the workspace backfill job, including CronJob, POD_CONFIG, and tunable batch parameters.
  • Add WorkspaceBackfillMaxRowsPerRun, WorkspaceBackfillBatchSize, and WorkspaceBackfillBatchSleepMs to the tasks config, sourced from PodConfig with sensible defaults.
  • Add a workspace-backfill CronJob object in the ClowdApp manifest that runs the job command, uses the admin DB check initContainer, and passes PROMETHEUS_PUSHGATEWAY and WORKSPACE_BACKFILL_CONFIG env.
  • Introduce new ClowdApp parameters for schedule, suspend flag, and default WORKSPACE_BACKFILL_CONFIG string tuned for production (50k rows/run, 1k batch size).
Files: tasks/config.go, deploy/clowdapp.yaml
Add observability via Grafana and Prometheus for workspace backfill progress and errors.
  • Register three Prometheus counters (rows_updated, batches, batch_errors) under the patchman_engine_workspace_backfill namespace and expose them via a pushgateway pusher.
  • Add a Grafana timeseries panel that graphs increases of rows updated, batches, and batch errors over the standard $interval for the workspace_backfill job.
Files: tasks/workspace_backfill/metrics.go, dashboards/app-sre/grafana-dashboard-insights-patchman-engine-general.configmap.yaml
Provide local Docker, SQL generators, and scripts to run and validate workspace backfill end-to-end.
  • Add a dedicated docker-compose.workspace-backfill.yml stack with a DB container and a workspace-backfill runner that executes a new e2e script.
  • Introduce workspace_backfill.env to configure local POD_CONFIG (batch size, max rows per run, sleep) separate from other jobs.
  • Create dev/test_generate_system_inventory.sql to quickly generate rh_account, system_inventory, and system_patch data including realistic workspaces JSON and timestamps suited for backfill testing.
  • Add dev/prepare_workspace_backfill_test.sql to clear workspace_id and workspace_name under replica role without firing triggers, and dev/verify_workspace_backfill.sql to report pending and mismatched rows.
  • Implement scripts/workspace_backfill_e2e.sh which waits for DB, runs migrations, loads test data, clears workspace columns, runs the job once, and validates via verify_workspace_backfill.sql.
  • Document the local workflow, configuration, and production notes in dev/workspace_backfill.md, including manual batched runs and one-shot e2e usage.
Files: docker-compose.workspace-backfill.yml, conf/workspace_backfill.env, dev/test_generate_system_inventory.sql, dev/prepare_workspace_backfill_test.sql, dev/verify_workspace_backfill.sql, scripts/workspace_backfill_e2e.sh, dev/workspace_backfill.md, dev/test_generate_data.sql
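
The "tunable batch parameters with sensible defaults" idea from the table above can be illustrated with a small parser. This is a sketch under assumptions, not the PR's code: the semicolon-separated `key=value` format, the key names, and the zero sleep default are hypothetical, and only the 50k rows/run and 1k batch size defaults come from the guide.

```go
package main

import (
	"strconv"
	"strings"
)

// backfillConfig mirrors the three tunables named in the reviewer's guide.
type backfillConfig struct {
	MaxRowsPerRun int
	BatchSize     int
	BatchSleepMs  int
}

// parseBackfillConfig reads "key=value;key=value" pairs (an assumed format)
// and keeps the default for any entry that is missing or malformed.
// The key names are hypothetical, chosen only for this sketch.
func parseBackfillConfig(raw string) backfillConfig {
	cfg := backfillConfig{MaxRowsPerRun: 50000, BatchSize: 1000, BatchSleepMs: 0}
	for _, pair := range strings.Split(raw, ";") {
		key, val, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok {
			continue // not a key=value pair
		}
		n, err := strconv.Atoi(strings.TrimSpace(val))
		if err != nil || n < 0 {
			continue // keep the default on bad input
		}
		switch strings.TrimSpace(key) {
		case "workspace_backfill_max_rows_per_run":
			cfg.MaxRowsPerRun = n
		case "workspace_backfill_batch_size":
			cfg.BatchSize = n
		case "workspace_backfill_batch_sleep_ms":
			cfg.BatchSleepMs = n
		}
	}
	return cfg
}
```

Falling back to defaults on malformed values keeps a typo in the deployment parameter from turning into an unbounded run, at the cost of silently ignoring the typo; logging the rejected entry would be a reasonable addition.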

codecov-commenter commented Jun 1, 2026

Codecov Report

❌ Patch coverage is 6.73077% with 97 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.46%. Comparing base (52d4524) to head (b95d69b).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tasks/workspace_backfill/workspace_backfill.go | 7.14% | 89 Missing and 2 partials ⚠️ |
| tasks/workspace_backfill/metrics.go | 0.00% | 4 Missing ⚠️ |
| main.go | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2221      +/-   ##
==========================================
- Coverage   59.07%   58.46%   -0.61%     
==========================================
  Files         137      139       +2     
  Lines        8821     8925     +104     
==========================================
+ Hits         5211     5218       +7     
- Misses       3064     3159      +95     
- Partials      546      548       +2     
| Flag | Coverage Δ |
|---|---|
| unittests | 58.46% <6.73%> (-0.61%) ⬇️ |
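
The percentages in this report can be reproduced from the counts in the diff table. The sketch below assumes, based only on the numbers lining up (89 + 2 + 4 + 2 = 97), that partial lines count as not covered in the patch figure; the function names are hypothetical.

```go
package main

// percent returns part/whole as a percentage, or 0 when whole is 0.
func percent(part, whole int) float64 {
	if whole == 0 {
		return 0
	}
	return 100 * float64(part) / float64(whole)
}

// patchCoverage derives the covered share of the changed lines from the
// counts in the report: changed lines minus missing and partial lines.
func patchCoverage(changed, missing, partials int) float64 {
	return percent(changed-missing-partials, changed)
}
```

Here `patchCoverage(104, 95, 2)` is 7/104, about 6.73%, and `percent(5218, 8925)` is about 58.46%, matching the patch and project figures above.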

@TenSt TenSt marked this pull request as ready for review June 1, 2026 21:45
@TenSt TenSt requested a review from a team as a code owner June 1, 2026 21:45
@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 1 issue, and left some high-level feedback:

  • The workspace eligibility predicates are duplicated across backfillUpdateSQL, pendingAccountsSQL, pendingRowsSQL, invalidPendingRowsSQL, and the SQL in verify_workspace_backfill.sql; consider centralizing this condition (e.g., in a view or a single reusable SQL snippet) so future changes to the rules don’t get out of sync between the job and verification scripts.
  • loadPendingAccounts scans rh_account_id into a []int, which can be narrower than the DB type; if rh_account_id is bigint in the schema, it would be safer to use []int64 (and adjust the function signatures) to avoid potential truncation issues.
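
To make the truncation concern concrete: Go's `int` is 32 or 64 bits wide depending on the build target, and integer conversions silently drop high bits. The sketch below uses hypothetical helper names and does not assume anything about the actual column type of rh_account_id, which the comment above leaves open.

```go
package main

import "strconv"

// narrowTo32 shows what happens when a 64-bit ID is converted to a 32-bit
// integer: the conversion compiles and silently keeps only the low 32 bits.
func narrowTo32(id int64) int32 {
	return int32(id)
}

// fitsInInt reports whether id can be stored in the platform's int without
// loss. strconv.IntSize is 32 or 64 depending on the build target.
func fitsInInt(id int64) bool {
	if strconv.IntSize == 64 {
		return true
	}
	return id >= -1<<31 && id <= 1<<31-1
}
```

On a 64-bit build `[]int` holds any bigint value, so the risk is mostly about portability and intent; `[]int64` makes the width explicit regardless of target.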
## Individual Comments

### Comment 1
<location path="tasks/workspace_backfill/workspace_backfill.go" line_range="93" />
<code_context>
+	}
+}
+
+func runWorkspaceBackfill() (nUpdated int64, complete bool, err error) {
+	if err := logPendingStats(); err != nil {
+		return 0, false, err
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the backfill loop into a helper and centralizing the shared SQL predicate to simplify control flow and make predicate changes safer.

You can reduce complexity in two focused spots without changing behavior: the per-account loop and the SQL predicates.

---

### 1. Simplify `runWorkspaceBackfill` loop control

Right now `runWorkspaceBackfill` has:

- nested `for` loops
- two `total >= maxRows` checks
- `break` vs `return` scattered in the inner loop

You can make the control flow easier to follow by:

1. Extracting per-account processing into a helper.
2. Making the inner loop condition explicit.
3. Having a single place that decides whether the global limit was hit.

For example:

```go
func runWorkspaceBackfill() (nUpdated int64, complete bool, err error) {
	if err := logPendingStats(); err != nil {
		return 0, false, err
	}

	accounts, err := loadPendingAccounts()
	if err != nil {
		return 0, false, err
	}
	if len(accounts) == 0 {
		return 0, true, nil
	}

	utils.LogInfo("accounts", len(accounts), "Starting workspace backfill")

	maxRows := int64(tasks.WorkspaceBackfillMaxRowsPerRun)
	var total int64

	for i, rhAccountID := range accounts {
		rows, hitLimit, err := processAccountBatches(i, rhAccountID, maxRows, total)
		total += rows
		if err != nil {
			// keep existing behavior: on batch error we just skip that account
			continue
		}
		if hitLimit {
			return total, false, nil
		}
	}

	pending, err := countPending()
	if err != nil {
		return total, false, err
	}
	return total, pending == 0, nil
}

func processAccountBatches(idx, rhAccountID int, maxRows, totalSoFar int64) (rowsUpdated int64, hitLimit bool, err error) {
	for totalSoFar+rowsUpdated < maxRows {
		remaining := maxRows - (totalSoFar + rowsUpdated)
		batchLimit := tasks.WorkspaceBackfillBatchSize
		if int64(batchLimit) > remaining {
			batchLimit = int(remaining)
		}

		rows, batchErr := backfillBatch(rhAccountID, batchLimit)
		if batchErr != nil {
			utils.LogWarn("rhAccountID", rhAccountID, "err", batchErr.Error(), "Workspace backfill batch failed")
			backfillErrorsCnt.Inc()
			return rowsUpdated, false, batchErr
		}
		if rows == 0 {
			return rowsUpdated, false, nil
		}

		rowsUpdated += rows
		backfillRowsCnt.Add(float64(rows))
		backfillBatchesCnt.Inc()
		utils.LogInfo("i", idx, "rhAccountID", rhAccountID, "nRows", rows, "total", totalSoFar+rowsUpdated, "Workspace backfill batch")

		if tasks.WorkspaceBackfillBatchSleepMs > 0 {
			time.Sleep(time.Duration(tasks.WorkspaceBackfillBatchSleepMs) * time.Millisecond)
		}
	}

	return rowsUpdated, true, nil
}
```

This keeps the same behavior but:

- makes the limit condition explicit (`totalSoFar+rowsUpdated < maxRows`)
- centralizes the “did we hit the limit?” decision in a single boolean
- isolates per-account concerns in `processAccountBatches`
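
The limit handling can be exercised without a database by passing the batch function in. This is a sketch only: `batchFunc`, `processAccount`, and `fakeAccount` are hypothetical names, and metrics, logging, and the sleep are left out to keep the budget logic visible.

```go
package main

// batchFunc stands in for backfillBatch: it updates up to limit rows for one
// account and returns how many rows it actually touched.
type batchFunc func(accountID, limit int) (int64, error)

// processAccount mirrors the loop shape suggested above, with the batch
// function and sizes passed in so the limit logic can be tested in memory.
func processAccount(accountID, batchSize int, maxRows, totalSoFar int64, batch batchFunc) (updated int64, hitLimit bool, err error) {
	for totalSoFar+updated < maxRows {
		remaining := maxRows - (totalSoFar + updated)
		limit := batchSize
		if int64(limit) > remaining {
			limit = int(remaining) // clamp the final batch to the run budget
		}
		rows, batchErr := batch(accountID, limit)
		if batchErr != nil {
			return updated, false, batchErr
		}
		if rows == 0 {
			return updated, false, nil // account drained before the budget
		}
		updated += rows
	}
	return updated, true, nil
}

// fakeAccount returns a batchFunc over an in-memory pending-row count and
// records every limit it was asked for.
func fakeAccount(pending int64, limits *[]int) batchFunc {
	return func(_ int, limit int) (int64, error) {
		*limits = append(*limits, limit)
		n := int64(limit)
		if n > pending {
			n = pending
		}
		pending -= n
		return n, nil
	}
}
```

With a batch size of 1000 and a budget of 2500, an account holding more rows than the budget is asked for limits 1000, 1000, and 500. Note that in this shape an account that drains exactly as the budget is reached still reports `hitLimit` as true, so the run is treated as incomplete and the next run confirms there is nothing left.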

---

### 2. Centralize the “pending rows” predicate

The JSON predicate for “pending” rows is repeated in four places with a negation for invalid rows. You can factor out the core predicate once and build the other SQL strings from it, which will make future changes safer.

For example:

```go
const workspacePendingPredicate = `
workspace_id IS NULL
  AND workspaces IS NOT NULL
  AND jsonb_typeof(workspaces) = 'array'
  AND jsonb_array_length(workspaces) > 0
  AND workspaces->0->>'id' IS NOT NULL
  AND workspaces->0->>'name' IS NOT NULL
  AND NOT empty(workspaces->0->>'name')
`

const backfillUpdateSQL = `
UPDATE system_inventory si
SET workspace_id   = (si.workspaces->0->>'id')::uuid,
    workspace_name = si.workspaces->0->>'name'
FROM (
    SELECT rh_account_id, id
    FROM system_inventory
    WHERE rh_account_id = ?
      AND ` + workspacePendingPredicate + `
    ORDER BY id
    LIMIT ?
) batch
WHERE si.rh_account_id = batch.rh_account_id
  AND si.id = batch.id
`

const pendingAccountsSQL = `
SELECT rh_account_id
FROM system_inventory
WHERE ` + workspacePendingPredicate + `
GROUP BY rh_account_id
ORDER BY hash_partition_id(rh_account_id, 128), rh_account_id
`

const pendingRowsSQL = workspacePendingPredicate

const invalidPendingRowsSQL = `
workspace_id IS NULL
  AND workspaces IS NOT NULL
  AND NOT (
    ` + workspacePendingPredicate + `
  )
`
```

This keeps the SQL semantics identical, but:

- “what is a pending row” is defined exactly once
- adding/changing a condition only requires updating `workspacePendingPredicate`
- `countPending`, `backfillUpdateSQL`, `pendingAccountsSQL`, and `invalidPendingRowsSQL` all stay in sync automatically
</issue_to_address>

@TenSt TenSt force-pushed the stepan/RHINENG-26546-job-to-backfill-workspace-data branch 2 times, most recently from 954040f to 52ddab1 Compare June 1, 2026 22:28
@TenSt TenSt force-pushed the stepan/RHINENG-26546-job-to-backfill-workspace-data branch from 52ddab1 to a653a73 Compare June 2, 2026 00:00
@TenSt TenSt force-pushed the stepan/RHINENG-26546-job-to-backfill-workspace-data branch from a653a73 to b95d69b Compare June 2, 2026 00:02
@MichaelMraka MichaelMraka self-assigned this Jun 2, 2026
@TenSt TenSt merged commit 86e75ea into RedHatInsights:master Jun 2, 2026
8 checks passed
3 participants