Fleet Stability
Estimated date of completion: Ongoing
Resources Required:
- 1 Delivery engineer (intermittent)
- Infra team support
Ensure all Logos Delivery production fleets are stable, monitored, and recoverable. This covers:
- logos.dev: Logos testnet development fleet
- logos.test: Logos testnet staging fleet
- status.prod: Status production fleet
- waku.sandbox: Waku sandbox fleet
Note on monitoring: The Status and Waku fleets (status.prod, waku.sandbox) already have monitoring infrastructure in place (Grafana, Kibana, alerts). For the Logos fleets (logos.dev, logos.test), monitoring is only partially available. In particular, Grafana dashboards cannot be built yet because Logos Core does not expose module metrics. This blocker needs to be resolved with the Logos Core team.
Definition of Done
This milestone is considered complete when all of the following are true for every operated fleet:
- Monitoring: Dashboards exist showing node health, connectivity, message throughput, and resource usage. For Logos fleets, this requires Logos Core to expose module metrics.
- Alerting: Alerts fire for node crashes, database issues, connectivity drops, and resource exhaustion.
- Crash reporting: Sentry is integrated and reporting crashes for fleet builds.
- Stability: No known P0 or P1 stability issues. P0 = fleet-wide outage or data loss. P1 = recurring crashes or degraded service affecting users.
Deliverables
Integrate Sentry for crash reporting
Owner: Delivery Team
- Compile-time flag (-d:sentry) for public/open-source builds (no telemetry code present)
- Runtime dynamic loading for fleet builds
- Dual approach satisfies privacy narrative while enabling crash reporting for operated fleets
- Sentry project configured and receiving reports from all fleet nodes
Done when: All fleet nodes report crashes to Sentry. Crash reports include stack traces and node metadata.
Fleet monitoring dashboards for Logos fleets
Owner: Delivery Team + Infra + Logos Core Team
- Logos Core exposes module metrics (blocker — coordinate with Logos Core team)
- Dashboards for logos.dev and logos.test showing:
  - Node uptime and restart frequency
  - Message send/receive rates and latency
  - Peer connectivity (number of peers, connection stability)
  - Resource usage (CPU, memory, disk, bandwidth)
  - Database health (size, query latency, corruption indicators)
Done when: Dashboards are deployed and accessible for Logos fleets. Data is current and accurate. (Status and Waku fleets already have this.)
Fleet alerting for Logos fleets
Owner: Delivery Team + Infra
- Alerts configured for:
  - Node crash or unexpected restart
  - Database corruption or excessive size
  - Connectivity drops (below minimum peer threshold)
  - Memory or disk exhaustion
  - Message delivery failure rate exceeding threshold
Done when: Alerts fire within 5 minutes of an incident. (Status and Waku fleets already have alerting.)
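The alert conditions listed for this deliverable can be sketched as a simple threshold check. This is only an illustration: the `NodeSnapshot` fields, the `Thresholds` defaults, and the alert names are all hypothetical, and the real rules would presumably live in the alerting stack rather than in application code. Database corruption is left out because it needs a dedicated indicator rather than a numeric threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSnapshot:
    """Hypothetical per-node health sample; field names are illustrative."""
    node: str
    restarts_last_hour: int
    peer_count: int
    memory_used_pct: float
    disk_used_pct: float
    db_size_gb: float
    delivery_failure_rate: float  # fraction of failed messages, 0.0 to 1.0


@dataclass
class Thresholds:
    """Placeholder limits (assumptions); real values would be agreed with Infra."""
    min_peers: int = 5
    max_memory_pct: float = 90.0
    max_disk_pct: float = 85.0
    max_db_size_gb: float = 50.0
    max_delivery_failure_rate: float = 0.05


def evaluate(snapshot: NodeSnapshot, limits: Thresholds) -> List[str]:
    """Return the names of all alert conditions that the snapshot violates."""
    alerts: List[str] = []
    if snapshot.restarts_last_hour > 0:
        alerts.append("unexpected_restart")
    if snapshot.peer_count < limits.min_peers:
        alerts.append("low_peer_count")
    if snapshot.memory_used_pct > limits.max_memory_pct:
        alerts.append("memory_exhaustion")
    if snapshot.disk_used_pct > limits.max_disk_pct:
        alerts.append("disk_exhaustion")
    if snapshot.db_size_gb > limits.max_db_size_gb:
        alerts.append("db_size_excessive")
    if snapshot.delivery_failure_rate > limits.max_delivery_failure_rate:
        alerts.append("delivery_failure_rate_high")
    return alerts
```

Evaluating such a check at an interval well under 5 minutes would be one way to stay within the stated alerting deadline, though the actual interval is a deployment decision.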
Resolve known fleet stability issues
Owner: Delivery Team
- Address recurring crashes reported in DST testing
- Database corruption issues
- Memory leaks in long-running nodes
- Discovery issues blocking DST benchmarking
Done when: No known P0 or P1 stability issues across all fleets for 30 consecutive days.
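One possible reading of the 30-day criterion is sketched below: the window is met only if no P0 or P1 issue was open at any point during the last 30 days. The incident tuple layout and the function name `stability_window_met` are assumptions made for illustration, not part of any existing tooling.

```python
from datetime import date, timedelta
from typing import Iterable, Optional, Tuple

# Hypothetical incident record: (severity, opened, resolved).
# `resolved` is None while the issue is still open.
Incident = Tuple[str, date, Optional[date]]


def stability_window_met(
    incidents: Iterable[Incident],
    today: date,
    window_days: int = 30,
) -> bool:
    """Return True if no P0/P1 issue was open during the last `window_days` days."""
    window_start = today - timedelta(days=window_days)
    for severity, _opened, resolved in incidents:
        if severity not in ("P0", "P1"):
            continue  # lower severities do not block the milestone
        # Still open, or resolved after the window began: the window is broken.
        if resolved is None or resolved > window_start:
            return False
    return True
```

Under this reading, resolving a P1 issue restarts the 30-day count from the resolution date.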