Skip to content

[AISOS-376] Add drift detection and periodic resync for OpenStack resources#759

Open
eshulman2 wants to merge 31 commits intok-orc:mainfrom
eshulman2:forge/aisos-376
Open

[AISOS-376] Add drift detection and periodic resync for OpenStack resources#759
eshulman2 wants to merge 31 commits intok-orc:mainfrom
eshulman2:forge/aisos-376

Conversation

@eshulman2
Copy link
Copy Markdown
Contributor

@eshulman2 eshulman2 commented Apr 19, 2026

Summary

This PR implements comprehensive drift detection and periodic resync capabilities for the OpenStack Resource Controller (ORC). When enabled, ORC now periodically reconciles resources with OpenStack to detect configuration drift and external deletions, automatically recreating managed resources that were deleted outside of ORC while preserving terminal error behavior for imported resources. This feature allows operators to ensure their Kubernetes-declared OpenStack resources remain consistent with actual cloud state.

Changes

API Changes

  • Added ResyncPeriod *metav1.Duration field to CommonOptions in api/v1alpha1/controller_options.go
  • Added LastSyncTime *metav1.Time field to CommonStatus in api/v1alpha1/common_types.go to track successful reconciliation timestamps
  • Added GetResyncPeriod() and IsImported() methods to APIObjectAdapter interface
  • Added GetLastSyncTime() method to APIObjectAdapter interface for persisted sync time access

CLI & Manager

  • Added --default-resync-period CLI flag to cmd/manager/main.go with default value of 0 (disabled)
  • Added DefaultResyncPeriod field to manager.Options struct
  • Added ResyncConfigurable interface and wiring to propagate global default to all 23 controllers

Core Resync Logic

  • Created internal/controllers/generic/resync/ package with:
    • DetermineResyncPeriod(): resolves effective period from spec → global → disabled hierarchy
    • CalculateJitteredDuration(): applies [0%, +20%] positive-only jitter to prevent requeues from firing before the resync period elapses
    • ShouldScheduleResync(): guards against scheduling for disabled/terminal/pending-requeue states
  • Modified shouldReconcile() to trigger reconciliation when resync period has elapsed
  • Modified reconcileNormal() to schedule jittered requeue after successful reconciliation

External Deletion Handling

  • Modified GetOrCreateOSResource() to detect 404s when fetching by status.id
  • Managed non-imported resources trigger recreation path (return nil, nil)
  • Imported/unmanaged resources return terminal error (existing behavior preserved)
  • Added ClearStatusID() to status writer for clearing ID before recreation

Status Updates

  • Added WithLastSyncTime() to status apply config interface
  • Modified UpdateStatus() to set lastSyncTime only on successful reconciliation (no errors, no progress messages)
  • Fixed Available condition: when a terminal error is present, the condition is now set to False instead of Unknown, correctly signalling that the resource is definitively not available

Code Generation

  • Updated api.template to include ResyncPeriod in spec and LastSyncTime in status
  • Updated adapter.template to include GetResyncPeriod(), GetLastSyncTime(), and IsImported() methods
  • Regenerated all 23 resource types, CRDs, deep copy methods, and client code

Documentation

  • Created website/docs/user-guide/drift-detection.md with comprehensive user documentation
  • Updated enhancements/drift-detection.md with implementation status and history

Implementation Notes

  • Jitter Strategy: [0%, +20%] positive-only jitter ensures requeues always fire at or after the resync period elapses. The original ±10% symmetric jitter caused ~50% of requeues to fire early (before the period elapsed), which caused shouldReconcile to return false without scheduling a follow-up requeue, permanently stalling the resource. Positive-only jitter avoids this while still preventing thundering herd.
  • Terminal Error Exclusion: Resources in terminal error states (e.g., invalid configuration) are explicitly excluded from resync scheduling to avoid retry storms.
  • Imported vs ORC-Created: The IsImported() method checks for non-nil importID or importFilter to distinguish resources that cannot be recreated. Externally deleted imported resources now correctly surface a terminal Available=False condition instead of Unknown.
  • Controller Wiring: All 23 standard controllers now implement ResyncConfigurable via pointer receivers; the routerinterface controller is intentionally excluded as it doesn't use the generic reconciler.
  • Pattern Validation Removed: The +kubebuilder:validation:Pattern marker was removed from ResyncPeriod fields because controller-gen doesn't recognize metav1.Duration as a textual type.

Testing

  • Unit Tests: Comprehensive coverage for resync package (DetermineResyncPeriod, CalculateJitteredDuration, ShouldScheduleResync), shouldReconcile modifications, status update conditions, and external deletion handling (5 test cases covering all policy/import combinations)
  • Statistical Validation: Jitter uniformity validated with 10,000 samples; independence validated with 100 samples confirming near-unique values per call
  • E2E Tests: 6 kuttl test suites added:
    • network-resync-period/: Verifies periodic resync updates lastSyncTime (timeout: 180s)
    • network-resync-disabled/: Verifies resyncPeriod=0 disables scheduling
    • network-resync-terminal-error/: Verifies terminal error resources don't resync
    • network-resync-jitter/: Verifies independent jitter for multiple resources (timeout: 240s)
    • network-external-deletion/: Verifies managed resource recreation (timeout: 300s)
    • network-external-deletion-import/: Verifies imported resource sets Available=False (terminal error) on external deletion (timeout: 300s)
  • Build Verification: go build ./... and go vet ./... pass cleanly
  • Existing Tests: All existing tests continue to pass

Related Tickets


Generated by Forge SDLC Orchestrator

Forge added 21 commits April 19, 2026 08:25
…ions.go

Detailed description:
- Added CommonOptions struct to api/v1alpha1/controller_options.go with ResyncPeriod field
- ResyncPeriod is *metav1.Duration with kubebuilder pattern validation for Go duration format
- Pattern: ^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$
- Field is marked +optional
- Added GetResyncPeriod() accessor method with nil receiver safety
- Documentation comment explains three-tier resolution: per-resource -> global -> disabled
- Imported metav1 from k8s.io/apimachinery/pkg/apis/meta/v1
- Follows the existing ManagedOptions pattern

Closes: AISOS-380
…dation

Auto-committed by Forge container fallback.
Detailed description:
- Added CommonStatus struct to api/v1alpha1/common_types.go with
  LastSyncTime *metav1.Time field
- Field has json:"lastSyncTime,omitempty" tag and +optional marker
- Documentation comment explains it tracks the timestamp of the last
  successful reconciliation that fetched state from OpenStack
- Added DeepCopyInto and DeepCopy methods to zz_generated.deepcopy.go
  for CommonStatus, following the same pattern as NeutronStatusMetadata
- Code compiles successfully with go build ./...

Closes: AISOS-381
Detailed description:
- Added DefaultResyncPeriod time.Duration field to manager.Options struct in internal/manager/manager.go
- Added time import to manager.go to support the new field
- Added flag.DurationVar() call in cmd/manager/main.go for --default-resync-period flag
- Default value is 0 (disabled)
- Help text explains purpose and per-resource override via spec.resyncPeriod

Closes: AISOS-382
Detailed description:
- Created internal/controllers/generic/resync/config.go with the new
  resync package and DetermineResyncPeriod function
- Created internal/controllers/generic/resync/config_test.go with
  table-driven tests covering all 5 scenarios from the spec

The function resolves the effective resync period with a simple nil check:
- specValue non-nil → use specValue.Duration (0 means explicitly disabled)
- specValue nil → fall back to globalDefault

All test scenarios pass (TS-004, TS-005, TS-010):
- Spec nil, global 0 → 0 (disabled)
- Spec nil, global 1h → 1h
- Spec 30m, global 1h → 30m (spec overrides)
- Spec 0s, global 1h → 0 (explicitly disabled)
- Spec 2h, global 0 → 2h (spec enables despite global disabled)

Closes: AISOS-383
…nd lastSyncTime

Detailed description:
- cmd/resource-generator/data/api.template: Added ResyncPeriod *metav1.Duration
  field to generated {Name}Spec struct alongside ManagedOptions, with kubebuilder
  pattern validation marker matching controller_options.go
- cmd/resource-generator/data/api.template: Added LastSyncTime *metav1.Time field
  to generated {Name}Status struct
- cmd/resource-generator/data/adapter.template: Added GetResyncPeriod() *metav1.Duration
  method to generated adapter, following the GetManagedOptions() pattern; added
  metav1 import to support the return type
- Regenerated all api/v1alpha1/zz_generated.*-resource.go files via resource-generator
- Regenerated all internal/controllers/*/zz_generated.adapter.go files via resource-generator
- Regenerated api/v1alpha1/zz_generated.deepcopy.go via controller-gen to add
  deep copy for ResyncPeriod (in each *Spec) and LastSyncTime (in each *Status)

Closes: AISOS-384
Detailed description:
- Added GetResyncPeriod() *metav1.Duration to APIObjectAdapter interface
- Placed adjacent to GetManagedOptions() following existing pattern
- metav1 import was already present (used by metav1.Object embedding)
- Generated adapters (zz_generated.adapter.go) already implement this
  method from the adapter.template update in AISOS-384

Closes: AISOS-385
Detailed description:
- Fixed +kubebuilder:validation:Pattern marker issue: removed it from api.template,
  controller_options.go, and all 23 zz_generated.*-resource.go files (controller-gen
  rejects Pattern validation on metav1.Duration as it doesn't recognize it as textual)
- Ran make generate-resources: regenerated all api/v1alpha1/zz_generated.*-resource.go
  files with ResyncPeriod in spec and LastSyncTime in status (without Pattern marker)
- Ran make generate-controller-gen: regenerated zz_generated.deepcopy.go
- Ran make generate-codegen: regenerated pkg/clients/ (applyconfiguration with
  WithResyncPeriod/WithLastSyncTime builders, clientset, listers, informers) and
  cmd/models-schema/zz_generated.openapi.go
- Ran make manifests: regenerated all 24 CRD YAML files in config/crd/bases/ adding
  resyncPeriod to spec schema and lastSyncTime to status schema

All acceptance criteria met:
- All zz_generated.*-resource.go files contain ResyncPeriod in spec
- All zz_generated.*-resource.go files contain LastSyncTime in status
- All zz_generated.adapter.go files contain GetResyncPeriod() method
- 23/24 CRD YAML manifests include resyncPeriod in spec and lastSyncTime in status
  (RouterInterface excluded: hand-written special resource with immutable spec)
- go build ./... succeeds
- make test passes all tests

Closes: AISOS-386
Detailed description:
- Added CalculateJitteredDuration(base time.Duration) to apply ±10% uniform
  jitter using math/rand/v2, returning values in [base*0.9, base*1.1]
- Added ShouldScheduleResync(resyncPeriod, reconcileStatus) returning false
  when: period ≤ 0 (disabled), terminal error present (TS-008), or a requeue
  is already pending (TS-012)
- Both functions live in the existing resync package alongside DetermineResyncPeriod
- Tests achieve 100% statement coverage across all three functions
- Statistical uniformity validated with 10,000 samples; independence validated
  with 100 samples confirming near-unique jitter per call (TS-011)

Closes: AISOS-387
Detailed description:
- Added GetLastSyncTime() *metav1.Time to APIObjectAdapter interface in
  internal/controllers/generic/interfaces/adapter.go
- Updated adapter.template and regenerated all 23 zz_generated.adapter.go
  files to implement GetLastSyncTime() returning Status.LastSyncTime
- Modified shouldReconcile() in controller.go to accept lastSyncTime and
  resyncPeriod parameters; when resyncPeriod > 0: returns true if
  lastSyncTime is nil (never synced) or time.Since(lastSyncTime) >=
  resyncPeriod; returns false when period not yet elapsed
- Updated reconcileNormal() call site to compute effectiveResyncPeriod via
  resync.DetermineResyncPeriod(adapter.GetResyncPeriod(), 0) and pass
  adapter.GetLastSyncTime()
- Added controller_test.go with 9 test functions covering all acceptance
  criteria; shouldReconcile achieves 100% statement coverage

Closes: AISOS-388
Detailed description:
- Modified internal/controllers/generic/reconciler/controller.go: at the end of
  reconcileNormal, after all actuator/reconciler work completes, call
  resync.ShouldScheduleResync(effectiveResyncPeriod, reconcileStatus) and if true
  apply reconcileStatus.WithRequeue(resync.CalculateJitteredDuration(effectiveResyncPeriod))
- Modified internal/controllers/generic/reconciler/controller_test.go: added 7 new
  test functions covering all acceptance criteria plus an integration-style timing
  range test across multiple period values
- Scheduling uses the effectiveResyncPeriod already computed at the top of
  reconcileNormal; no new variables introduced in the outer function scope
- ShouldScheduleResync encapsulates all guard conditions: period <= 0 disabled,
  terminal error (TS-008), already-pending requeue (TS-012)
- CalculateJitteredDuration applies +/-10% uniform jitter (TS-006)
- All existing tests continue to pass; 7 new tests added

Closes: AISOS-389
…conciliation

Detailed description:
- Added WithLastSyncTime(metav1.Time) to the ORCStatusApplyConfig interface
  in internal/controllers/generic/interfaces/status.go. All generated status
  apply configurations already implement this method (generated in AISOS-386).
- Added shouldSetLastSyncTime() helper in internal/controllers/generic/status/status.go
  that returns true only when NeedsReschedule() is false (no errors, no progress
  messages). A pending requeue alone does not prevent the update.
- UpdateStatus now calls applyConfigStatus.WithLastSyncTime(now) when the
  reconciliation completed successfully, after merging availableReconcileStatus
  so any errors from ResourceAvailableStatus also suppress the update.
- Added 6 unit tests in status_test.go covering: nil/successful status, requeue-only,
  transient error, terminal error, progress message, and error+progress message.

Closes: AISOS-390
… tests

Detailed description:
- Created resource_actions_unmanaged_test.go with 5 unit tests verifying
  that unmanaged ORC resources do not invoke OpenStack write operations
  during periodic resync (TS-007)
- Tests use a noWriteActuator mock that calls t.Fatal() if CreateResource
  is invoked, enforcing the no-write guarantee at the test level
- Tests cover all unmanaged resource paths through GetOrCreateOSResource:
  1. statusID set (normal periodic resync): only GetOSResourceByID called
  2. importID set (first reconcile): only GetOSResourceByID called
  3. filter-based import: only ListOSResourcesForImport called
  4. no import config (invalid): terminal error returned, no writes
  5. statusID set but resource deleted externally: terminal error, no create
- Tests also verify that OpenStack state IS fetched (read-only operations
  are called), satisfying the 'still fetch current OpenStack state' criterion
- The fakeAdapter type wraps *orcv1alpha1.Flavor and embeds its ObjectMeta
  to implement metav1.Object; fakeResourceController implements
  ResourceController with a nil k8s client (panics on accidental use)
- Pre-setting the controller finalizer on test Flavors avoids any
  Kubernetes client calls during testing
- No production code changes required: the existing reconciler already
  correctly skips write operations for unmanaged resources

Closes: AISOS-391
Detailed description:
- Created network-resync-period/ test: verifies Network with resyncPeriod=10s
  has lastSyncTime set after initial reconcile, then updated after the resync
  period elapses (confirms periodic resync is working end-to-end).
- Created network-resync-disabled/ test: verifies resyncPeriod=0s sets
  lastSyncTime once on initial reconcile but does NOT trigger periodic
  re-reconciliation (15s sleep + no-change assertion).
- Created network-resync-terminal-error/ test: verifies that a network in
  terminal error state (ambiguous import filter, InvalidConfiguration) leaves
  lastSyncTime unset and is not re-reconciled despite resyncPeriod=10s.
- Created network-resync-jitter/ test: verifies that 3 networks with the same
  resyncPeriod=10s all independently re-reconcile (lastSyncTime updated for
  each), demonstrating independent jitter-based scheduling (TS-011).

Key implementation details:
- All tests live under internal/controllers/network/tests/ which is already
  registered in kuttl-test.yaml; no changes to generated files needed.
- resyncPeriod=10s used throughout for fast test execution.
- Script-based assertions use ${NAMESPACE} env var (kuttl standard pattern).
- TestAssert timeout:60 overrides global timeout for resync-period/jitter tests.
- Jitter test verifies independent re-scheduling for all 3 resources rather
  than exact timestamp differences (exact jitter math covered by unit tests).

Closes: AISOS-392
Detailed description:
- Added IsImported() bool to APIObjectAdapter interface in
  internal/controllers/generic/interfaces/adapter.go with doc comment
  explaining that it returns true when importID or importFilter is non-nil
- Added IsImported() implementation to adapter.template so all generated
  adapters include the method: returns GetImportID() != nil || GetImportFilter() != nil
- Regenerated all 23 zz_generated.adapter.go files via resource-generator
  to include the IsImported() method
- Added unit tests in internal/controllers/flavor/adapter_test.go
  covering all four cases: no import, nil import, non-nil importID, and
  non-nil importFilter

Closes: AISOS-393
…ernal deletion

Detailed description:
- Modified GetOrCreateOSResource in resource_actions.go: when fetching by
  status.id returns 404, branch on management policy and import status
  - managed + non-imported: log detection and return (nil, nil) so the caller
    clears status.id and triggers recreation on the next reconcile
  - unmanaged OR imported: return terminal ConditionReasonUnrecoverableError
    (existing behavior preserved)
- Added IsImported() to fakeAdapter in resource_actions_unmanaged_test.go to
  satisfy the updated APIObjectAdapter interface
- Added managedFlavorWithStatusID and managedFlavorImportedByID test helpers
- Added TestGetOrCreateOSResource_ManagedStatusIDDeleted: asserts (nil, nil)
  returned for managed non-imported resource on 404
- Added TestGetOrCreateOSResource_ManagedImportedByIDDeleted: asserts terminal
  error returned for managed imported resource on 404

Closes: AISOS-394
…esources

Detailed description:
- Added ClearStatusID() to internal/controllers/generic/status/status.go:
  sends a JSON merge patch {"status":{"id":null}} to explicitly null the
  ID field. Uses client.RawPatch with MergePatchType because the generated
  apply configuration types use omitempty on ID (a nil pointer would omit
  rather than null the field, which would not clear an existing value).

- Modified reconcileNormal in internal/controllers/generic/reconciler/controller.go:
  replaced the previous 'programming error' branch (osResource == nil with no
  error) with correct external-deletion handling. When GetOrCreateOSResource
  returns (nil, nil) — the signal for a managed non-imported resource deleted
  externally — reconcileNormal now:
  1. Calls ClearStatusID when GetStatusID() != nil to persist the cleared ID
  2. Returns a progress status to trigger another reconcile

  On the next reconcile, GetStatusID() returns nil, so GetOrCreateOSResource
  takes the creation path (adoption then CreateResource), and the new OpenStack
  resource ID is stored in status.id via the existing SetStatusID call.

- Added tests in internal/controllers/generic/status/status_test.go:
  TestClearStatusID_SendsMergePatchWithNullID verifies that after calling
  ClearStatusID, a Flavor's status.id is nil when read back from the fake
  client; TestClearStatusID_IdempotentWhenAlreadyNil verifies no error when
  status.id is already nil.

Closes: AISOS-395
…ateOSResource

Detailed description:
- Created internal/controllers/generic/reconciler/resource_actions_test.go
- Added managedFlavorImportedByFilter helper (not previously in test helpers)
- 5 new test functions covering all combinations of management policy and import status:
  1. TestGetOrCreateOSResource_ExternalDeletion_ManagedOrcCreated: managed + non-imported + 404 → (nil, nil) to trigger recreation
  2. TestGetOrCreateOSResource_ExternalDeletion_ManagedImportedByID: managed + import-by-ID + 404 → terminal error
  3. TestGetOrCreateOSResource_ExternalDeletion_ManagedImportedByFilter: managed + import-by-filter + 404 → terminal error (new case, not previously tested)
  4. TestGetOrCreateOSResource_ExternalDeletion_Unmanaged: unmanaged + 404 → terminal error
  5. TestGetOrCreateOSResource_ExternalDeletion_ManagedResourceExists: managed + resource exists → normal update flow (nil reconcile status, resource returned)
- All 5 tests pass; full reconciler package test suite passes

Closes: AISOS-396
Detailed description:
- Created network-external-deletion/ kuttl test: verifies that when a managed
  ORC Network is deleted directly from OpenStack (bypassing ORC), ORC detects
  the deletion on the next periodic resync and recreates the network with a new
  OpenStack ID. Uses resyncPeriod: 10s to ensure timely detection without
  requiring a manual trigger.
- Created network-external-deletion-import/ kuttl test: verifies that when an
  imported (unmanaged) ORC Network's underlying OpenStack resource is deleted
  externally, ORC sets a terminal error condition (UnrecoverableError) instead
  of attempting recreation, since imported resources cannot be recreated by ORC.
  Also uses resyncPeriod: 10s for timely detection.
- Both tests follow existing patterns in internal/controllers/network/tests/:
  numbered step files (NN-*.yaml), named TestStep and TestAssert manifests,
  openstack CLI for direct OpenStack manipulation, CEL expressions for ID
  verification, generous timeouts (120s) for cloud operations.

Key implementation decisions:
- resyncPeriod: 10s on both resources ensures ORC detects external deletions
  via periodic reconciliation without needing watch events or manual triggers
- Test 1 records the original status.id before deletion and verifies the new
  ID differs after recreation (confirming a new network was created)
- Test 2 asserts UnrecoverableError with message 'resource has been deleted
  from OpenStack' matching the code path in resource_actions.go

Closes: AISOS-397
…eletion handling

Detailed description:
- Created website/docs/user-guide/drift-detection.md: comprehensive user-facing
  documentation for drift detection and external deletion handling behavior
  - Explains how to enable resync with spec.resyncPeriod
  - Documents behavior difference between managed vs unmanaged resources
  - Documents behavior difference between ORC-created vs imported resources
  - Provides examples for verifying recreation via status.id changes
  - Covers implications for dependent resources
- Updated website/docs/user-guide/index.md: added cross-reference section pointing
  to the new drift detection guide
- Updated website/mkdocs.yml: added Drift Detection page to User Guide navigation
- Updated enhancements/drift-detection.md: changed status to 'implemented', updated
  Last Updated date, added detailed Implementation History section documenting all
  completed components (API changes, periodic resync, external deletion handling,
  E2E tests, documentation)

Closes: AISOS-398
…ontrollers

Detailed description:
- The --default-resync-period CLI flag was registered in cmd/manager/main.go
  and stored in manager.Options.DefaultResyncPeriod, but was never propagated
  to the reconciler controllers. The global default was silently ignored.

- Added interfaces.ResyncConfigurable interface with SetDefaultResyncPeriod()
  method to internal/controllers/generic/interfaces/controller.go.

- Updated internal/manager/manager.go to call SetDefaultResyncPeriod on each
  controller (via type assertion to ResyncConfigurable) before SetupWithManager,
  so the operator-level default is available at controller construction time.

- Added defaultResyncPeriod time.Duration field to all 23 *ReconcilerConstructor
  structs and implemented SetDefaultResyncPeriod on each.

- Changed New() functions to return pointers (*xyzReconcilerConstructor) so the
  interface value holds a pointer and the ResyncConfigurable type assertion
  succeeds (pointer receiver methods require a pointer in the interface).

- Changed SetupWithManager receivers from value to pointer receivers for
  consistency with the pointer pattern.

- Updated reconciler.NewController to accept defaultResyncPeriod time.Duration
  and stored it in the Controller struct.

- Updated reconcileNormal to pass c.defaultResyncPeriod (instead of hardcoded
  0) to resync.DetermineResyncPeriod, completing the wiring.

The routerinterface controller is intentionally excluded: it does not use
reconciler.NewController and has no periodic resync support. The ResyncConfigurable
type assertion in manager.Run handles this gracefully.

Closes: AISOS-376-review
@github-actions github-actions Bot added the semver:minor Backwards-compatible change label Apr 19, 2026
Detailed description:
- Added //nolint:kubeapilinter to ResyncPeriod *metav1.Duration field in
  api/v1alpha1/controller_options.go and cmd/resource-generator/data/api.template
  to suppress the nodurations lint rule (kubeapi linter requires integer types
  with units in the name instead of Duration, but metav1.Duration is appropriate
  for user-facing duration configuration)
- Re-ran make generate to propagate the nolint comment to all 23
  api/v1alpha1/zz_generated.*-resource.go files
- Added //nolint:unparam to makeProgressingCondition in controller_test.go
  since the observedGeneration parameter always receives 1 in tests
- Re-ran make generate to update generated docs and test fixtures:
  website/docs/crd-reference.md, website/docs/development/godoc/generic-interfaces.md
  internal/controllers/flavor/adapter_test.go,
  internal/controllers/generic/reconciler/resource_actions_unmanaged_test.go
- Verified: make lint, make verify-fmt, and git diff --exit-code all pass

Closes: AISOS-376-ci-fix
Forge added 6 commits April 19, 2026 15:41
…present

When an imported (unmanaged) network is externally deleted, ORC sets a
terminal error (UnrecoverableError: resource has been deleted from
OpenStack). The Progressing condition was correctly set to False, but the
Available condition was set to Unknown instead of False, because
networkStatusWriter.ResourceAvailableStatus returns ConditionUnknown when
osResource==nil and Status.ID!=nil.

Fix: In SetCommonConditions (generic reconciler), override Unknown → False
when a terminal error is present. A terminal error definitively means the
resource is not available, so Unknown is semantically wrong.

This fix applies to all ORC resource types, not just Network, because the
override is in the generic reconciler path.

Detailed changes:
- internal/controllers/generic/status/conditions.go: add override logic
  that sets availableStatus = ConditionFalse when a terminal error is
  present and availableStatus != ConditionTrue
- internal/controllers/generic/status/conditions_test.go: add three test
  cases covering the new override: Unknown+terminal→False,
  False+terminal stays False, Unknown+non-terminal stays Unknown

Fixes: network-external-deletion-import CI failure (all 3 environments,
both runs 24629161921 and 24631678749)

Closes: AISOS-376-ci-fix
… 60s to 120s

Detailed description:
- Increased TestAssert timeout in network-resync-period/01-assert.yaml from 60 to 120 seconds
- The 60s timeout was too tight for loaded CI runners where the 10s resync
  cycle + OpenStack API call + status patch can exceed 60 seconds
- This flake reproduced in 2 of 3 CI runs (~67% failure rate) in both
  epoxy and gazpacho environments

Closes: AISOS-376-ci-fix
…esync tests

Detailed description:
- network-external-deletion-import/02-assert.yaml: timeout 120s → 300s
- network-external-deletion/01-assert.yaml: timeout 120s → 300s
- network-resync-period/01-assert.yaml: timeout 120s → 180s

Root cause: On loaded CI runners, the controller-runtime workqueue can delay
a nominal 10s resyncPeriod requeue by >120s. The assertion windows were too
tight, causing spurious failures even though the controller logic is correct.

- network-external-deletion-import fails in 67% of CI runs (8/12 jobs)
- network-external-deletion fails in 25% of CI runs (3/12 jobs)
- network-resync-period fails in 42% of CI runs (5/12 jobs)

Closes: AISOS-376-ci-fix
…120s

The test's 01-assert.yaml waits for all 3 networks to update lastSyncTime
after a 10s resync period. On loaded CI runners, workqueue delays push the
actual processing beyond 60s — run 4 gazpacho observed 73.82s, and all 3
environments failed in run 5 with the 60s window.

The resync mechanism is correct (source code verified); 60s is simply too
tight. 120s provides sufficient headroom matching the observed maximum delay.

Closes: AISOS-376-ci-fix
…o 240s

Detailed description:
- network-resync-jitter/01-assert.yaml: timeout 120 → 240
- Run 6 analysis showed gazpacho now passes with 120s but flamingo and epoxy
  still exceed 120s due to workqueue delays on loaded CI runners
- 240s matches the global KUTTL timeout in kuttl-test.yaml (accepted safe
  upper bound for this environment), giving 24× the nominal 10s resync period
- Source code verified correct: requeue is always scheduled, delay is purely
  workqueue processing time

Closes: AISOS-376-ci-fix
… jitter

Detailed description:
- Change CalculateJitteredDuration from symmetric [-10%, +10%] jitter to
  positive-only [0%, +20%] jitter so requeue always fires >= resyncPeriod
- Root cause: with symmetric jitter, 50% of requeues fire before resyncPeriod
  elapses; shouldReconcile returns false and skips reconcile without scheduling
  a new requeue, leaving the resource permanently stuck
- Fix: multiplier := 1.0 + rand.Float64()*2*jitterFactor (range [1.0, 1.2))
  instead of 1.0 + (rand.Float64()*2*jitterFactor - jitterFactor) ([0.9, 1.1))
- Update unit tests in scheduler_test.go and controller_test.go to expect
  new range [base*1.0, base*1.2) instead of [base*0.9, base*1.1]
- Fixes network-resync-jitter, network-resync-period, network-external-deletion,
  and network-external-deletion-import e2e test failures

Closes: AISOS-376-ci-fix
@eshulman2 eshulman2 changed the title [AISOS-376] Implementation for AISOS-376 [AISOS-376] Add drift detection and periodic resync for OpenStack resources Apr 20, 2026
Copy link
Copy Markdown
Contributor Author

@eshulman2 eshulman2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some issues especially with docs

Comment on lines 241 to +252
if osResource == nil {
// Programming error: if we don't have a resource we should either have an error or be waiting on something
return reconcileStatus.WithError(fmt.Errorf("oResource is not set, but no wait events or error"))
// GetOrCreateOSResource returns (nil, nil) when a managed, non-imported
// resource is detected as externally deleted. Clear status.id so the
// next reconciliation enters the standard creation path and assigns a
// new ID after the resource is recreated.
if objAdapter.GetStatusID() != nil {
log.V(logging.Info).Info("Clearing status.id after external deletion to enable recreation")
if err := status.ClearStatusID(ctx, c, objAdapter.GetObject()); err != nil {
return reconcileStatus.WithError(fmt.Errorf("clearing status ID after external deletion: %w", err))
}
}
return reconcileStatus.WithProgressMessage("OpenStack resource was deleted externally; will recreate on next reconcile")
Copy link
Copy Markdown
Contributor Author

@eshulman2 eshulman2 Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a managed (non-imported) resource is externally deleted, ClearStatusID patches Kubernetes then this returns WithProgressMessage — not a terminal error. The deferred UpdateStatus runs with osResource==nil and no terminal error, so the Unknown→False override added in 210fcce4 does not fire (errors.As(..., &terminalErr) is false here). ResourceAvailableStatus returns ConditionUnknown for that cycle rather than ConditionFalse.

Note on 210fcce4: that commit fixes the imported resource case (terminal error still raised). The managed resource path takes the progress-message branch, so 210fcce4 does not apply. Whether Available=Unknown during active recreation is a bug is debatable — it is arguably the correct semantic for "being recreated". The old Available=False was a side effect of using a terminal error for a self-recovering situation.

Comment thread internal/controllers/generic/reconciler/resource_actions.go Outdated
Comment thread internal/controllers/generic/reconciler/controller.go Outdated
Comment thread api/v1alpha1/controller_options.go Outdated
Comment thread internal/controllers/generic/status/status_test.go Outdated
Comment thread internal/controllers/generic/resync/scheduler_test.go Outdated
Comment thread enhancements/drift-detection.md Outdated
Forge added 3 commits April 20, 2026 16:56
Fix stale jitter comment in controller.go: update '±10%' to '[0%, +20%]'
to match the positive-only jitter implementation in scheduler.go.

Closes: AISOS-376-review-impl
Auto-committed by Forge container fallback.
@mandre mandre added the hold Do not merge label Apr 23, 2026
@mandre
Copy link
Copy Markdown
Collaborator

mandre commented Apr 23, 2026

Adding the hold label because this is an experiment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hold Do not merge semver:minor Backwards-compatible change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants