fix(engine): six correctness fixes for the state machine #105

Merged
lerko merged 1 commits from fix/engine-state-machine into main 2026-06-11 18:03:16 +00:00
Owner

Summary

Phase 3 of the review-findings backlog — engine correctness. Six fixes, all in internal/monitor/ + internal/store/ + internal/server/.

  1. Group auto-pause trap — removed the one-way Paused=true mutation from checkGroup. Once set, monitorRoutine skipped the group and it could never re-evaluate.

  2. Retry logic for all →DOWNMaxRetries now applies to PENDING→DOWN, LATE→DOWN, STALE→DOWN too. New monitors no longer alert on first transient failure.

  3. Shutdown drain hole — checker goroutines tracked with checkerWG. Stop() waits for in-flight checks, then drains the write queue, then does a final drain for stragglers.

  4. Probe-ingest writer bypassSaveCheckFromNode routed through the engine's serialized dbWriter instead of direct store writes from the HTTP handler.

  5. Dead-probe expiry — stale probe results (>3× site interval) expired before aggregation. Dead probe no longer poisons status forever. RemoveSite cleans probeResults.

  6. Maintenance-cache N+1 — replaced per-check DB query with a fully-resolved in-memory cache refreshed every poll cycle (one query instead of N).

Also: ImportData now wipes check_history, state_changes, and alert_health so re-inserted IDs don't inherit stale history.

Deferred: Transactional Apply → Phase 5 (needs Store interface changes).

Test plan

  • 8 new tests covering all 6 fixes
  • 4 existing maintenance tests updated for cache
  • go test -count=1 ./... — all pass
  • golangci-lint — 0 issues
  • No regressions in existing UP→DOWN retry path
## Summary Phase 3 of the review-findings backlog — engine correctness. Six fixes, all in `internal/monitor/` + `internal/store/` + `internal/server/`. 1. **Group auto-pause trap** — removed the one-way `Paused=true` mutation from `checkGroup`. Once set, `monitorRoutine` skipped the group and it could never re-evaluate. 2. **Retry logic for all →DOWN** — `MaxRetries` now applies to PENDING→DOWN, LATE→DOWN, STALE→DOWN too. New monitors no longer alert on first transient failure. 3. **Shutdown drain hole** — checker goroutines tracked with `checkerWG`. `Stop()` waits for in-flight checks, then drains the write queue, then does a final drain for stragglers. 4. **Probe-ingest writer bypass** — `SaveCheckFromNode` routed through the engine's serialized `dbWriter` instead of direct store writes from the HTTP handler. 5. **Dead-probe expiry** — stale probe results (>3× site interval) expired before aggregation. Dead probe no longer poisons status forever. `RemoveSite` cleans `probeResults`. 6. **Maintenance-cache N+1** — replaced per-check DB query with a fully-resolved in-memory cache refreshed every poll cycle (one query instead of N). Also: `ImportData` now wipes `check_history`, `state_changes`, and `alert_health` so re-inserted IDs don't inherit stale history. **Deferred:** Transactional `Apply` → Phase 5 (needs Store interface changes). ## Test plan - [x] 8 new tests covering all 6 fixes - [x] 4 existing maintenance tests updated for cache - [x] `go test -count=1 ./...` — all pass - [x] `golangci-lint` — 0 issues - [x] No regressions in existing UP→DOWN retry path
lerko added 1 commit 2026-06-11 17:57:08 +00:00
fix(engine): six correctness fixes for the state machine
CI / test (pull_request) Successful in 1m59s
CI / lint (pull_request) Successful in 1m17s
CI / vulncheck (pull_request) Successful in 1m1s
5d5153351e
1. Group auto-pause trap: remove the one-way Paused=true mutation
   from checkGroup — monitorRoutine skipped paused groups, so they
   could never re-evaluate or auto-unpause.

2. Retry logic: apply MaxRetries to all →DOWN transitions, not just
   UP→DOWN. New monitors (PENDING) no longer alert on first transient
   failure when retries are configured.

3. Shutdown drain hole: track checker goroutines with checkerWG so
   Stop() waits for in-flight checks before draining the write queue.
   Final drainWrites() catches any writes enqueued after the writer's
   own drain.

4. Probe-ingest writer bypass: route SaveCheckFromNode through the
   engine's serialized dbWriter instead of writing directly to the
   store from the HTTP handler.

5. Dead-probe expiry: expire stale probe results (>3× site interval)
   before aggregation so a dead probe can't poison status forever.
   Also clean probeResults in RemoveSite.

6. Maintenance-cache N+1: replace per-check DB query with a
   fully-resolved in-memory cache refreshed every poll cycle. One
   GetActiveMaintenanceWindows() call instead of N IsMonitorInMaintenance.

ImportData now wipes check_history, state_changes, and alert_health
so re-inserted IDs don't inherit stale history from prior occupants.
lerko force-pushed fix/engine-state-machine from cf1565a508 to 5d5153351e 2026-06-11 17:57:08 +00:00 Compare
lerko merged commit 5d5153351e into main 2026-06-11 18:03:16 +00:00
lerko deleted branch fix/engine-state-machine 2026-06-11 18:03:17 +00:00
Sign in to join this conversation.
No Reviewers
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: lerkolabs/uptop#105