1. Group auto-pause trap: remove the one-way Paused=true mutation
from checkGroup. monitorRoutine skips paused groups, so an
auto-paused group was never re-evaluated and could never unpause.
2. Retry logic: apply MaxRetries to all →DOWN transitions, not just
UP→DOWN. New monitors (PENDING) no longer alert on the first
transient failure when retries are configured.
3. Shutdown drain hole: track checker goroutines with checkerWG so
Stop() waits for in-flight checks before draining the write queue.
Final drainWrites() catches any writes enqueued after the writer's
own drain.
4. Probe-ingest writer bypass: route SaveCheckFromNode through the
engine's serialized dbWriter instead of writing directly to the
store from the HTTP handler.
5. Dead-probe expiry: expire stale probe results (>3× site interval)
before aggregation so a dead probe can't poison status forever.
Also clean probeResults in RemoveSite.
6. Maintenance-cache N+1: replace the per-check DB query with a
fully-resolved in-memory cache refreshed every poll cycle: one
GetActiveMaintenanceWindows() call instead of N
IsMonitorInMaintenance() calls.
Additionally, ImportData now wipes check_history, state_changes, and
alert_health so re-inserted IDs don't inherit stale history from
prior occupants.
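
For the shutdown ordering in item 3, a rough sketch of the intended sequence is shown below. Only checkerWG, Stop(), and drainWrites() are names from the description above; the engine struct, its other fields, and startCheck are assumptions made for illustration, and the real engine may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// engine is a hypothetical stand-in for the monitoring engine.
type engine struct {
	checkerWG sync.WaitGroup // tracks in-flight checker goroutines
	writerWG  sync.WaitGroup // tracks the single writer goroutine
	writes    chan string    // serialized write queue
	stop      chan struct{}
	saved     []string // stands in for the store
}

func newEngine() *engine {
	e := &engine{writes: make(chan string, 64), stop: make(chan struct{})}
	e.writerWG.Add(1)
	go e.writer()
	return e
}

// writer applies queued writes until told to stop, then drains once.
func (e *engine) writer() {
	defer e.writerWG.Done()
	for {
		select {
		case w := <-e.writes:
			e.saved = append(e.saved, w)
		case <-e.stop:
			e.drainWrites() // the writer's own drain
			return
		}
	}
}

// drainWrites applies whatever is queued right now without blocking.
func (e *engine) drainWrites() {
	for {
		select {
		case w := <-e.writes:
			e.saved = append(e.saved, w)
		default:
			return
		}
	}
}

// startCheck launches one tracked checker that enqueues its result.
func (e *engine) startCheck(id string) {
	e.checkerWG.Add(1)
	go func() {
		defer e.checkerWG.Done()
		e.writes <- "result:" + id
	}()
}

// Stop waits for in-flight checks, stops the writer, then drains again.
func (e *engine) Stop() {
	e.checkerWG.Wait() // no checker can enqueue after this returns
	close(e.stop)
	e.writerWG.Wait()
	e.drainWrites() // final drain: catches anything enqueued late
}

func main() {
	e := newEngine()
	for _, id := range []string{"a", "b", "c"} {
		e.startCheck(id)
	}
	e.Stop()
	fmt.Println("writes persisted:", len(e.saved))
}
```

In this sketch, waiting on checkerWG first already rules out late checker writes, so the final drainWrites() is a safety net for any other producer that might enqueue after the writer exits.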