Sample Platform Initial Assessment - Reliability Issues and Improvement Plan #942

@cfsmp3

Sample Platform Initial Assessment

This issue documents a comprehensive reliability assessment of the Sample Platform and tracks progress on improvements.

Executive Summary

The Sample Platform has fundamental architectural issues causing:

  • Test timeouts (4+ hours)
  • Tests stuck in "queued" state indefinitely
  • False "PR closed or updated" errors
  • Silent failures leaving tests in limbo

The root cause is the lack of a proper task queue system combined with missing timeout handling and no error recovery mechanisms.


Technology Stack

Component | Current | Issue
Backend | Python Flask | Synchronous, blocking request handlers
Task Queue | None | ❌ Critical gap - uses cron polling instead
Job Trigger | Cron jobs | Only runs every N minutes, can crash silently
VM Management | GCP Compute Engine | Blocking API calls with no timeouts
Database | MySQL/SQLAlchemy | No transaction management, no connection timeouts
GitHub Integration | PyGithub | API calls without timeouts

Critical Issues

1. No Task Queue System

The platform processes tests via cron jobs instead of a proper task queue (for example Celery or RQ):

GitHub Webhook → DB Insert → Wait for Cron → Process Test

Problems:

  • Tests accumulate while waiting for next cron cycle
  • If cron crashes, no tests run until manually restarted
  • No parallel processing capability
  • No retry mechanism for failed tasks

2. Infinite Polling Loop (No Timeout)

File: mod_ci/controllers.py lines 563-591

def wait_for_operation(compute, project, zone, operation):
    while True:  # NO TIMEOUT OR MAX ITERATIONS!
        result = compute.zoneOperations().get(
            project=project, zone=zone, operation=operation
        ).execute()
        
        if result['status'] == 'DONE':
            return result
        
        time.sleep(1)  # Loops forever if GCP operation stalls

Impact: A single stuck GCP operation blocks the entire cron job for hours.
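
A bounded version of the polling loop could look roughly like the sketch below. It assumes the same `compute` client shape as the excerpt above; the `OperationTimeout` exception, the `timeout_s`, `poll_interval_s`, `clock`, and `sleep` parameters are hypothetical additions for illustration, and the 30-minute default is taken from the Phase 1 recommendation rather than from existing code.

```python
import time


class OperationTimeout(Exception):
    """Hypothetical error raised when an operation misses its deadline."""


def wait_for_operation(compute, project, zone, operation,
                       timeout_s=1800, poll_interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll a zone operation until it is DONE or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        result = compute.zoneOperations().get(
            project=project, zone=zone, operation=operation
        ).execute()

        if result['status'] == 'DONE':
            return result

        # Give up instead of looping forever if the operation stalls.
        if clock() >= deadline:
            raise OperationTimeout(
                f"operation {operation} not DONE after {timeout_s}s"
            )

        sleep(poll_interval_s)
```

The caller would then catch `OperationTimeout` and mark the test as failed, so a stalled operation costs at most one deadline rather than the whole cron run. Injecting `clock` and `sleep` is only there to make the loop testable without real waiting.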


3. Missing Timeouts on External API Calls

File: mod_ci/controllers.py

Line | Call | Timeout
239 | repository.get_pull(test.pr_nr) | ❌ None
244 | repository.get_commit(test.commit) | ❌ None
386 | repository.get_artifacts() | ❌ None
396 | requests.get(artifact_url) | ❌ None
1669 | requests.get(api_url) | ✅ 10 seconds

Of the external calls identified so far, only one sets a timeout. On the others, a stalled connection can hang the cron run indefinitely.


4. Silent Failures

File: mod_ci/controllers.py lines 398-413

except Exception as e:
    log.critical("Could not fetch artifact, request timed out")
    return  # Silent return - test stays stuck in "preparation" forever!

if r.status_code != 200:
    log.critical(f"Could not fetch artifact, response code: {r.status_code}")
    return  # Silent return - no status update, no retry

Impact: Test remains in "preparation" state indefinitely. User sees test as "running" but nothing happens.
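
One possible shape for the fix is to record the failure before returning, so the test leaves "preparation" with a visible error. The sketch below is not the platform's code: `download` and `mark_failed` are hypothetical callables injected by the caller (a wrapper around `requests.get` and a function that updates the test record, respectively), and the function name is made up for illustration.

```python
import logging

log = logging.getLogger(__name__)


def fetch_artifact_or_fail(download, mark_failed, test_id, url, timeout_s=300):
    """Return the artifact bytes, or None after recording why it failed.

    download(url, timeout_s) -> (status_code, content)   [hypothetical]
    mark_failed(test_id, message) -> None                [hypothetical]
    """
    try:
        status_code, content = download(url, timeout_s)
    except Exception as exc:
        log.critical("Could not fetch artifact for test %s: %s", test_id, exc)
        # Update the test record instead of returning silently.
        mark_failed(test_id, f"Artifact download failed: {exc}")
        return None

    if status_code != 200:
        log.critical("Could not fetch artifact, response code: %s", status_code)
        mark_failed(test_id, f"Artifact download returned HTTP {status_code}")
        return None

    return content
```

With this shape every early exit goes through `mark_failed`, so a test cannot stay in "preparation" just because a download failed.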


5. Race Condition: "PR closed or updated"

File: mod_ci/controllers.py lines 238-251

test_pr = repository.get_pull(test.pr_nr)
if test.commit != test_pr.head.sha:
    deschedule_test(gh_commit, message="PR closed or updated", test=test)
    continue

Scenario:

  1. User opens PR with commit abc123
  2. Test queued with commit = abc123
  3. User pushes fix (new commit def456)
  4. Cron runs 5 minutes later
  5. Check fails: abc123 != def456
  6. Test cancelled with "PR closed or updated"

User experience: "I just opened a PR and it immediately shows an error"


6. No Concurrency Control

File: mod_ci/controllers.py lines 225-252

pending_tests = Test.query.filter(
    Test.id.notin_(finished_tests), 
    Test.id.notin_(running_tests),
    Test.platform == platform
)

for test in pending_tests:
    start_test(...)  # No lock! Two cron processes can start same test

Impact: Duplicate VM creation, wasted resources, inconsistent results.
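
A database-level claim is one way to close this gap: a worker only starts a test if a conditional UPDATE actually changed the row. The sketch below uses the standard-library `sqlite3` module as a stand-in for the MySQL/SQLAlchemy setup, and the `test` table with `status` and `claimed_by` columns is a hypothetical schema, not the platform's real one.

```python
import sqlite3


def claim_test(conn, test_id, worker):
    """Atomically move a test from 'queued' to 'preparing'.

    Returns True only for the one caller whose UPDATE matched the row.
    """
    cur = conn.execute(
        "UPDATE test SET status = 'preparing', claimed_by = ? "
        "WHERE id = ? AND status = 'queued'",
        (worker, test_id),
    )
    conn.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE test (id INTEGER PRIMARY KEY, status TEXT, claimed_by TEXT)"
    )
    conn.execute("INSERT INTO test (id, status) VALUES (1, 'queued')")
    print(claim_test(conn, 1, "cron-a"))  # first claim wins
    print(claim_test(conn, 1, "cron-b"))  # second claim is rejected
```

The loop over pending tests would call `start_test(...)` only when the claim returns True. The same conditional-update idea should carry over to MySQL, though the exact query would need to match the real models.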


7. No Database Transaction Management

Multiple places use g.db.commit() without:

  • Try-except blocks
  • Rollback on error
  • Connection timeout configuration

Example locations: lines 182-183, 735-736, and many others.
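
A small context manager is one way to get commit-or-rollback behaviour without repeating try-except at every call site. This is a sketch under the assumption that the session object exposes `commit()` and `rollback()`, as SQLAlchemy sessions do; the `transaction` helper name is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def transaction(session):
    """Commit if the block succeeds, roll back and re-raise if it fails."""
    try:
        yield session
        session.commit()
    except Exception:
        # Leave the session usable for the next operation.
        session.rollback()
        raise
```

Call sites that currently end in a bare `g.db.commit()` could then wrap their changes in `with transaction(g.db):`, so a failed commit no longer leaves the session in a half-applied state.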


8. Blocking Operations in Request Handlers

File: mod_ci/controllers.py lines 1149-1195 (progress_reporter endpoint)

The progress reporting endpoint performs synchronous database operations and file I/O, which can cause request timeouts when the VM reports progress.


9. Resource Leaks

File: mod_ci/controllers.py line 404

open(os.path.join(base_folder, 'ccextractor.zip'), 'wb').write(r.content)

  • File handle not explicitly closed
  • Entire artifact loaded into memory (no streaming)
  • Failed downloads leave temp files behind
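
The three points above could be addressed together by writing the download in chunks to a temporary file and renaming it only on success. The sketch below takes any iterable of byte chunks (for example `r.iter_content(chunk_size=...)` on a `requests` response opened with `stream=True`); the `save_stream` name, the `.part` suffix, and the overall `deadline_s` are assumptions for illustration. The overall deadline is included because the `requests` timeout applies to individual socket operations, not to the total download time.

```python
import os
import tempfile
import time


def save_stream(chunks, dest_path, deadline_s=300, clock=time.monotonic):
    """Write an iterable of byte chunks to dest_path without buffering it all.

    The data goes to a temporary file in the same directory and is moved
    into place only if every chunk arrived before the deadline.
    """
    deadline = clock() + deadline_s
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        # The with-block closes the file handle even if a chunk fails.
        with os.fdopen(fd, "wb") as out:
            for chunk in chunks:
                if clock() > deadline:
                    raise TimeoutError("artifact download exceeded deadline")
                out.write(chunk)
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Do not leave a partial file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path
```

Because the final rename happens last, a reader of `dest_path` sees either the complete artifact or nothing.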

Symptom-to-Cause Mapping

Symptom | Root Cause
4+ hour timeouts | wait_for_operation() infinite loop; GitHub/GCP API hangs without timeout
Tests stuck "queued" | Cron crashes on exception; silent failures in start_test(); no state recovery
"PR closed or updated" errors | Race condition between PR update and cron execution
Generic errors | Silent failures with no retry logic; exceptions logged but test status not updated
Inconsistent results | No concurrency control; duplicate test execution possible

Recommended Plan of Action

Phase 1: Critical Fixes (Quick Wins)

  • Add timeouts to all external API calls (GitHub, GCP, requests)

    • 30-60 second timeout for API calls
    • 5 minute timeout for artifact downloads
  • Add timeout/max iterations to wait_for_operation()

    • Maximum 30 minutes for GCP operations
    • Return error state instead of hanging
  • Update test status on failure

    • Change silent return statements to mark test as "failed"
    • Add error message to test record
  • Add basic locking

    • Database-level lock or Redis lock before starting a test
    • Prevent duplicate test execution

Phase 2: Improved Error Handling

  • Implement retry logic with exponential backoff

    • 3 retries for transient failures
    • Exponential backoff (1s, 2s, 4s)
  • Add database transaction management

    • Wrap operations in try-except
    • Rollback on failure
  • Fix race condition for PR updates

    • Option A: Queue test for specific commit, don't check for updates
    • Option B: Cancel old test when new commit pushed (via webhook)
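
The retry item in this phase could be sketched as a small helper like the one below. It reads "3 retries" as one initial attempt plus up to three more, with delays of 1s, 2s, and 4s between them; the `retry` name and its parameters are hypothetical, and deciding which exceptions count as transient is left to the caller.

```python
import time


def retry(func, retries=3, base_delay_s=1.0, retry_on=(Exception,),
          sleep=time.sleep):
    """Call func(), retrying with exponential backoff on selected errors."""
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # Out of retries: surface the last error.
            # Delays grow as base, 2*base, 4*base, ...
            sleep(base_delay_s * (2 ** attempt))
```

A call such as `retry(lambda: repository.get_pull(test.pr_nr), retry_on=(ConnectionError,))` would then retry only connection-level failures and let other errors propagate immediately.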

Phase 3: Architectural Improvements

  • Implement proper task queue

    • Add Celery or RQ with Redis backend
    • Replace cron-based polling with event-driven processing
    • Enable parallel test execution
  • Add test state machine

    • Clear states: queued → preparing → running → completed/failed
    • Timeout-based state transitions
    • Heartbeat mechanism for running tests
  • Add health monitoring

    • Health check endpoint for cron job status
    • Alerting for stuck tests (>1 hour in same state)
    • Dashboard for queue depth and processing rate
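
The state machine and the stuck-test alert in this phase could start from something as small as the sketch below. The state names come from the list above; the `RunState` enum, the allowed-transition table, and the `is_stuck` helper are hypothetical, and letting every non-terminal state move to "failed" is an assumption about the desired behaviour.

```python
from enum import Enum


class RunState(Enum):
    QUEUED = "queued"
    PREPARING = "preparing"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


TERMINAL = {RunState.COMPLETED, RunState.FAILED}

# Assumed transitions: forward one step, or straight to FAILED.
ALLOWED = {
    RunState.QUEUED: {RunState.PREPARING, RunState.FAILED},
    RunState.PREPARING: {RunState.RUNNING, RunState.FAILED},
    RunState.RUNNING: {RunState.COMPLETED, RunState.FAILED},
    RunState.COMPLETED: set(),
    RunState.FAILED: set(),
}


def transition(current, new):
    """Return the new state, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new


def is_stuck(state, entered_at_s, now_s, limit_s=3600):
    """True if a non-terminal state has lasted longer than limit_s seconds."""
    return state not in TERMINAL and (now_s - entered_at_s) > limit_s
```

A periodic job could run `is_stuck` over all open tests and either alert or move them to `FAILED`, which is where the timeout-based transitions and the one-hour alert threshold would meet.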

Phase 4: Infrastructure

  • Stream artifact downloads

    • Don't load entire file into memory
    • Resume interrupted downloads
  • Add cleanup job

    • Remove stale temp files
    • Clean up orphaned GCP instances
  • Improve logging

    • Structured logging with correlation IDs
    • Log test state transitions

Progress Tracking

This section will be updated as PRs are submitted.

PR | Description | Status
- | - | -


Assessment conducted: 2025-12-22
