DevOps: Add contract and backend monitoring with alerting #306

@CelestinaBeing

Summary

There is no monitoring or alerting configured for the Trivela system on mainnet. Contract pause events, backend downtime, Soroban RPC degradation, or abnormal error rates will go undetected until users report them. For a mainnet platform, proactive monitoring is non-negotiable.

Problem

  • Backend /health and /metrics endpoints exist, but nothing scrapes them or evaluates alerting rules against them
  • No alerting on: contract paused event, backend 5xx spike, RPC health check failure, campaign DB write errors
  • No uptime monitoring (e.g. UptimeRobot, BetterStack) configured
  • No runbook for common failure scenarios

Acceptance Criteria

Prometheus/Grafana (self-hosted option)

  • Add prometheus.yml scrape config targeting the backend /metrics endpoint
  • Add alerting_rules.yml with alerts for:
    • Backend error rate > 5% over 5 min
    • RPC health status degraded for > 2 min
    • Process uptime reset (restart detected)
  • Add Grafana dashboard JSON (monitoring/dashboards/trivela.json) with: request rate, error rate, uptime, route breakdown
  • Add monitoring/ directory with compose override (compose.monitoring.yml) for local Prometheus + Grafana
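
The three alerts listed above could be sketched roughly as follows. This is only an illustration, not the backend's confirmed metric set: the metric names `http_requests_total` (with a `status` label) and `trivela_rpc_healthy`, and the job name `trivela-backend`, are assumptions that would need to match whatever backend/src/index.js actually exports. `process_start_time_seconds` is a common default process metric, but its presence here is also assumed.

```yaml
# alerting_rules.yml (sketch; metric and job names are assumptions)
groups:
  - name: trivela-backend
    rules:
      - alert: BackendHighErrorRate
        # Share of 5xx responses among all requests over the last 5 minutes.
        # Assumes a counter http_requests_total with a "status" label.
        expr: |
          sum(rate(http_requests_total{job="trivela-backend", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="trivela-backend"}[5m]))
            > 0.05
        labels:
          severity: critical
        annotations:
          summary: "Backend 5xx error rate above 5% over 5 minutes"

      - alert: RpcHealthDegraded
        # Assumes a hypothetical gauge: 1 = healthy, 0 = degraded.
        expr: trivela_rpc_healthy{job="trivela-backend"} == 0
        # Fire only if the condition holds continuously for 2 minutes.
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Soroban RPC health degraded for more than 2 minutes"

      - alert: BackendRestarted
        # The process start time changes whenever the process restarts.
        expr: changes(process_start_time_seconds{job="trivela-backend"}[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Backend process restarted within the last 10 minutes"
```

A matching prometheus.yml might load these rules and scrape the backend like this; the target `backend:3001` is a placeholder for the real compose service name and port:

```yaml
# prometheus.yml (sketch; target host and port are assumptions)
global:
  scrape_interval: 15s
rule_files:
  - alerting_rules.yml
scrape_configs:
  - job_name: trivela-backend
    metrics_path: /metrics
    static_configs:
      - targets: ["backend:3001"]
```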

Soroban Event Monitoring

Runbook

  • Add docs/RUNBOOK.md with procedures for: backend restart, RPC failover, contract pause response, DB backup restore

References

  • backend/src/index.js: /metrics endpoint (Prometheus format)
  • compose.yaml
  • docs/ARCHITECTURE_OVERVIEW.md

Labels

Stellar Wave: Issues in the Stellar wave program
