fix(engine): six correctness fixes for the state machine #105
Summary
Phase 3 of the review-findings backlog: engine correctness. Six fixes, all in `internal/monitor/` + `internal/store/` + `internal/server/`.

- Group auto-pause trap: removed the one-way `Paused=true` mutation from `checkGroup`. Once set, `monitorRoutine` skipped the group and it could never re-evaluate.
- Retry logic for all →DOWN: `MaxRetries` now applies to PENDING→DOWN, LATE→DOWN, STALE→DOWN too. New monitors no longer alert on first transient failure.
- Shutdown drain hole: checker goroutines tracked with `checkerWG`. `Stop()` waits for in-flight checks, then drains the write queue, then does a final drain for stragglers.
- Probe-ingest writer bypass: `SaveCheckFromNode` routed through the engine's serialized `dbWriter` instead of direct store writes from the HTTP handler.
- Dead-probe expiry: stale probe results (>3× site interval) expired before aggregation. Dead probe no longer poisons status forever. `RemoveSite` cleans `probeResults`.
- Maintenance-cache N+1: replaced per-check DB query with a fully-resolved in-memory cache refreshed every poll cycle (one query instead of N).
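
For reviewers, the retry fix can be pictured as one gate in front of every transition into DOWN, whatever the previous state was. The sketch below is an illustration only, not the engine's code: the `Monitor` type, its `failures` counter, and the `Observe` method are hypothetical, and it assumes `MaxRetries` is the number of consecutive failures tolerated before DOWN is committed.

```go
package main

import "fmt"

// Status names follow the PR text; the type itself is hypothetical.
type Status string

const (
	Pending Status = "PENDING"
	Up      Status = "UP"
	Late    Status = "LATE"
	Stale   Status = "STALE"
	Down    Status = "DOWN"
)

// Monitor carries only the state needed to gate a transition into DOWN.
type Monitor struct {
	Status     Status
	MaxRetries int // assumption: failures tolerated before DOWN is committed
	failures   int // consecutive failed checks seen so far
}

// Observe records one check result. It returns true only on the check that
// moves the monitor into DOWN, which is when an alert would be sent.
func (m *Monitor) Observe(ok bool) bool {
	if ok {
		m.failures = 0
		m.Status = Up
		return false
	}
	if m.Status == Down {
		return false // already DOWN: nothing new to report
	}
	m.failures++
	if m.failures <= m.MaxRetries {
		return false // still inside the retry budget, keep the current status
	}
	// The same gate applies whether the previous status was
	// PENDING, UP, LATE, or STALE.
	m.Status = Down
	return true
}

func main() {
	m := &Monitor{Status: Pending, MaxRetries: 2}
	for i := 1; i <= 3; i++ {
		alerted := m.Observe(false)
		fmt.Printf("failure %d: status=%s alert=%v\n", i, m.Status, alerted)
	}
}
```

With `MaxRetries: 2`, a brand-new monitor stays PENDING through its first two failures and only alerts on the third.
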
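
The shutdown ordering (wait for checkers, then drain the write queue) is easier to review with a small model. The following is a minimal sketch under stated assumptions, not the actual implementation: `Engine`, `StartCheck`, `writes`, and `writerDone` are hypothetical names, and closing the channel once all producers have finished stands in for the PR's separate "final drain for stragglers" step. Only `checkerWG`, `dbWriter`, and `Stop()` are names taken from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// Engine is a hypothetical stand-in for the monitor engine.
type Engine struct {
	checkerWG  sync.WaitGroup // counts in-flight checker goroutines
	writes     chan string    // results queued for the single writer
	writerDone chan struct{}  // closed once the writer has drained the queue
	saved      []string       // stands in for the database
}

func NewEngine() *Engine {
	e := &Engine{
		writes:     make(chan string, 8),
		writerDone: make(chan struct{}),
	}
	go e.dbWriter()
	return e
}

// dbWriter is the only goroutine that appends to saved, so writes are serialized.
func (e *Engine) dbWriter() {
	defer close(e.writerDone)
	for r := range e.writes {
		e.saved = append(e.saved, r)
	}
}

// StartCheck registers the checker with checkerWG before launching it,
// so Stop cannot miss it.
func (e *Engine) StartCheck(site string) {
	e.checkerWG.Add(1)
	go func() {
		defer e.checkerWG.Done()
		e.writes <- site + ": ok" // result of a pretend check
	}()
}

// Stop shuts down in order: checkers first, then the write queue.
func (e *Engine) Stop() {
	e.checkerWG.Wait() // 1. every in-flight check has enqueued its result
	close(e.writes)    // 2. no producers remain, so closing is safe
	<-e.writerDone     // 3. the writer has consumed everything still queued
}

func main() {
	e := NewEngine()
	for _, s := range []string{"a", "b", "c"} {
		e.StartCheck(s)
	}
	e.Stop()
	fmt.Println("results persisted:", len(e.saved))
}
```

If the queue were closed or abandoned before `checkerWG.Wait()` returned, a checker that finished late would either lose its result or send on a closed channel, which is the kind of hole the fix describes.
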
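
The maintenance-cache change replaces N per-check queries with one query per poll cycle. A rough sketch of that shape is below. It assumes that "fully-resolved" means the cache already maps maintenance windows down to individual site IDs; `MaintenanceCache`, `Refresh`, `InMaintenance`, and the `load` callback are hypothetical names, not the project's API.

```go
package main

import (
	"fmt"
	"sync"
)

// MaintenanceCache holds the IDs of sites currently under maintenance.
type MaintenanceCache struct {
	mu     sync.RWMutex
	active map[int]bool
}

// Refresh replaces the whole snapshot with one call to load. In a real
// engine load would be a single store query; here it is a hypothetical callback.
func (c *MaintenanceCache) Refresh(load func() []int) {
	next := make(map[int]bool)
	for _, id := range load() {
		next[id] = true
	}
	c.mu.Lock()
	c.active = next
	c.mu.Unlock()
}

// InMaintenance answers from memory and never touches the store.
func (c *MaintenanceCache) InMaintenance(siteID int) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.active[siteID]
}

func main() {
	queries := 0
	load := func() []int {
		queries++ // count round trips to the pretend store
		return []int{2, 5}
	}

	cache := &MaintenanceCache{}
	cache.Refresh(load) // once per poll cycle

	skipped := 0
	for siteID := 1; siteID <= 100; siteID++ { // one lookup per check
		if cache.InMaintenance(siteID) {
			skipped++
		}
	}
	fmt.Printf("checks=100 skipped=%d queries=%d\n", skipped, queries)
}
```

Swapping in a fresh map on each refresh also means an ended maintenance window disappears on the next cycle without any per-entry invalidation.
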
Also:

- `ImportData` now wipes `check_history`, `state_changes`, and `alert_health` so re-inserted IDs don't inherit stale history.
- Deferred: Transactional `Apply` → Phase 5 (needs Store interface changes).

Test plan

- `go test -count=1 ./...`: all pass
- `golangci-lint`: 0 issues

cf1565a508 to 5d5153351e