Compare commits

1 Commit

Author SHA1 Message Date
lerko cf1565a508 fix(engine): six correctness fixes for the state machine
CI / test (pull_request) Failing after 2m2s
CI / lint (pull_request) Successful in 1m22s
CI / vulncheck (pull_request) Successful in 1m1s
1. Group auto-pause trap: remove the one-way Paused=true mutation
   from checkGroup — monitorRoutine skipped paused groups, so they
   could never re-evaluate or auto-unpause.

2. Retry logic: apply MaxRetries to all →DOWN transitions, not just
   UP→DOWN. New monitors (PENDING) no longer alert on the first
   transient failure when retries are configured.

3. Shutdown drain hole: track checker goroutines with checkerWG so
   Stop() waits for in-flight checks before draining the write queue.
   Final drainWrites() catches any writes enqueued after the writer's
   own drain.

4. Probe-ingest writer bypass: route SaveCheckFromNode through the
   engine's serialized dbWriter instead of writing directly to the
   store from the HTTP handler.

5. Dead-probe expiry: expire stale probe results (>3× site interval)
   before aggregation so a dead probe can't poison status forever.
   Also clean probeResults in RemoveSite.

6. Maintenance-cache N+1: replace the per-check DB query with a
   fully-resolved in-memory cache refreshed every poll cycle. One
   GetActiveMaintenanceWindows() call replaces N
   IsMonitorInMaintenance() calls.
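
The retry rule in item 2 can be sketched as a small state machine. This is only an illustration: the Monitor type, its fields, and the Observe method are hypothetical names, and it assumes MaxRetries means "failed checks tolerated before going DOWN", which the commit message does not spell out.

```go
package main

import "fmt"

// Status is a hypothetical monitor state; the real engine's type may differ.
type Status string

const (
	Pending Status = "PENDING"
	Up      Status = "UP"
	Down    Status = "DOWN"
)

// Monitor holds only the fields this sketch needs (all names are assumptions).
type Monitor struct {
	Status     Status
	Failures   int // consecutive failed checks since the last success
	MaxRetries int // failed checks tolerated before going DOWN (assumed meaning)
}

// Observe applies one check result and reports whether the status changed.
// The retry budget guards every transition into DOWN, whether the monitor
// was UP or still PENDING.
func (m *Monitor) Observe(ok bool) (changed bool) {
	prev := m.Status
	if ok {
		m.Failures = 0
		m.Status = Up
		return m.Status != prev
	}
	m.Failures++
	if m.Failures > m.MaxRetries {
		m.Status = Down
	}
	return m.Status != prev
}

func main() {
	m := &Monitor{Status: Pending, MaxRetries: 2}
	for i, ok := range []bool{false, false, false, true} {
		changed := m.Observe(ok)
		fmt.Printf("check %d ok=%v status=%s changed=%v\n", i+1, ok, m.Status, changed)
	}
}
```

An alert would fire only when Observe reports a change into DOWN, so a PENDING monitor with retries configured stays quiet through its first transient failures.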

ImportData now wipes check_history, state_changes, and alert_health
so re-inserted IDs don't inherit stale history from prior occupants.
2026-06-11 13:40:31 -04:00
+1 -7
@@ -406,9 +406,7 @@ func (e *Engine) Start(ctx context.Context) {
 	e.writerWG.Add(1)
 	go e.dbWriter(ctx)
-	e.checkerWG.Add(1)
 	go func() {
-		defer e.checkerWG.Done()
 		for {
 			select {
 			case <-ctx.Done():
@@ -464,11 +462,7 @@ func (e *Engine) Start(ctx context.Context) {
 		}
 	}()
-	e.checkerWG.Add(1)
-	go func() {
-		defer e.checkerWG.Done()
-		e.maintenancePruner(ctx)
-	}()
+	go e.maintenancePruner(ctx)
 }
 func (e *Engine) maintenancePruner(ctx context.Context) {