RHINENG-26546: add job to backfill workspace data#2221
Conversation
Reviewer's GuideIntroduces a new batched workspace backfill job that runs as the admin DB user to populate denormalized workspace_id and workspace_name columns from the workspaces JSON field, wires it into the job runner and ClowdApp as a suspended CronJob with tunable limits, exposes Prometheus metrics and a Grafana panel for monitoring, and adds local Docker/e2e tooling plus SQL helpers for realistic load generation and verification. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #2221 +/- ##
==========================================
- Coverage 59.07% 58.46% -0.61%
==========================================
Files 137 139 +2
Lines 8821 8925 +104
==========================================
+ Hits 5211 5218 +7
- Misses 3064 3159 +95
- Partials 546 548 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The workspace eligibility predicates are duplicated across
backfillUpdateSQL,pendingAccountsSQL,pendingRowsSQL,invalidPendingRowsSQL, and the SQL inverify_workspace_backfill.sql; consider centralizing this condition (e.g., in a view or a single reusable SQL snippet) so future changes to the rules don’t get out of sync between the job and verification scripts. loadPendingAccountsscansrh_account_idinto a[]int, which can be narrower than the DB type; ifrh_account_idisbigintin the schema, it would be safer to use[]int64(and adjust the function signatures) to avoid potential truncation issues.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The workspace eligibility predicates are duplicated across `backfillUpdateSQL`, `pendingAccountsSQL`, `pendingRowsSQL`, `invalidPendingRowsSQL`, and the SQL in `verify_workspace_backfill.sql`; consider centralizing this condition (e.g., in a view or a single reusable SQL snippet) so future changes to the rules don’t get out of sync between the job and verification scripts.
- `loadPendingAccounts` scans `rh_account_id` into a `[]int`, which can be narrower than the DB type; if `rh_account_id` is `bigint` in the schema, it would be safer to use `[]int64` (and adjust the function signatures) to avoid potential truncation issues.
## Individual Comments
### Comment 1
<location path="tasks/workspace_backfill/workspace_backfill.go" line_range="93" />
<code_context>
+ }
+}
+
+func runWorkspaceBackfill() (nUpdated int64, complete bool, err error) {
+ if err := logPendingStats(); err != nil {
+ return 0, false, err
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the backfill loop into a helper and centralizing the shared SQL predicate to simplify control flow and make predicate changes safer.
You can reduce complexity in two focused spots without changing behavior: the per-account loop and the SQL predicates.
---
### 1. Simplify `runWorkspaceBackfill` loop control
Right now `runWorkspaceBackfill` has:
- nested `for` loops
- two `total >= maxRows` checks
- `break` vs `return` scattered in the inner loop
You can make the control flow easier to follow by:
1. Extracting per-account processing into a helper.
2. Making the inner loop condition explicit.
3. Having a single place that decides whether the global limit was hit.
For example:
```go
func runWorkspaceBackfill() (nUpdated int64, complete bool, err error) {
if err := logPendingStats(); err != nil {
return 0, false, err
}
accounts, err := loadPendingAccounts()
if err != nil {
return 0, false, err
}
if len(accounts) == 0 {
return 0, true, nil
}
utils.LogInfo("accounts", len(accounts), "Starting workspace backfill")
maxRows := int64(tasks.WorkspaceBackfillMaxRowsPerRun)
var total int64
for i, rhAccountID := range accounts {
rows, hitLimit, err := processAccountBatches(i, rhAccountID, maxRows, total)
total += rows
if err != nil {
// keep existing behavior: on batch error we just skip that account
continue
}
if hitLimit {
return total, false, nil
}
}
pending, err := countPending()
if err != nil {
return total, false, err
}
return total, pending == 0, nil
}
func processAccountBatches(idx, rhAccountID int, maxRows, totalSoFar int64) (rowsUpdated int64, hitLimit bool, err error) {
for totalSoFar+rowsUpdated < maxRows {
remaining := maxRows - (totalSoFar + rowsUpdated)
batchLimit := tasks.WorkspaceBackfillBatchSize
if int64(batchLimit) > remaining {
batchLimit = int(remaining)
}
rows, batchErr := backfillBatch(rhAccountID, batchLimit)
if batchErr != nil {
utils.LogWarn("rhAccountID", rhAccountID, "err", batchErr.Error(), "Workspace backfill batch failed")
backfillErrorsCnt.Inc()
return rowsUpdated, false, batchErr
}
if rows == 0 {
return rowsUpdated, false, nil
}
rowsUpdated += rows
backfillRowsCnt.Add(float64(rows))
backfillBatchesCnt.Inc()
utils.LogInfo("i", idx, "rhAccountID", rhAccountID, "nRows", rows, "total", totalSoFar+rowsUpdated, "Workspace backfill batch")
if tasks.WorkspaceBackfillBatchSleepMs > 0 {
time.Sleep(time.Duration(tasks.WorkspaceBackfillBatchSleepMs) * time.Millisecond)
}
}
return rowsUpdated, true, nil
}
```
This keeps the same behavior but:
- makes the limit condition explicit (`totalSoFar+rowsUpdated < maxRows`)
- centralizes the “did we hit the limit?” decision in a single boolean
- isolates per-account concerns in `processAccountBatches`
---
### 2. Centralize the “pending rows” predicate
The JSON predicate for “pending” rows is repeated in four places with a negation for invalid rows. You can factor out the core predicate once and build the other SQL strings from it, which will make future changes safer.
For example:
```go
const workspacePendingPredicate = `
workspace_id IS NULL
AND workspaces IS NOT NULL
AND jsonb_typeof(workspaces) = 'array'
AND jsonb_array_length(workspaces) > 0
AND workspaces->0->>'id' IS NOT NULL
AND workspaces->0->>'name' IS NOT NULL
AND NOT empty(workspaces->0->>'name')
`
const backfillUpdateSQL = `
UPDATE system_inventory si
SET workspace_id = (si.workspaces->0->>'id')::uuid,
workspace_name = si.workspaces->0->>'name'
FROM (
SELECT rh_account_id, id
FROM system_inventory
WHERE rh_account_id = ?
AND ` + workspacePendingPredicate + `
ORDER BY id
LIMIT ?
) batch
WHERE si.rh_account_id = batch.rh_account_id
AND si.id = batch.id
`
const pendingAccountsSQL = `
SELECT rh_account_id
FROM system_inventory
WHERE ` + workspacePendingPredicate + `
GROUP BY rh_account_id
ORDER BY hash_partition_id(rh_account_id, 128), rh_account_id
`
const pendingRowsSQL = workspacePendingPredicate
const invalidPendingRowsSQL = `
workspace_id IS NULL
AND workspaces IS NOT NULL
AND NOT (
` + workspacePendingPredicate + `
)
`
```
This keeps the SQL semantics identical, but:
- “what is a pending row” is defined exactly once
- adding/changing a condition only requires updating `workspacePendingPredicate`
- `countPending`, `backfillUpdateSQL`, `pendingAccountsSQL`, and `invalidPendingRowsSQL` all stay in sync automatically
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
954040f to
52ddab1
Compare
52ddab1 to
a653a73
Compare
a653a73 to
b95d69b
Compare
This PR:
workspace_backfilljob to populateworkspace_idandworkspace_namefrom workspacesdeploy/clowdapp.yaml(every 10 min, 50k rows/run in prod)test_generate_system_inventory.sqlfor local load testingdev/workspace_backfill.mdwith docs and how to test locally