fix(monitor): merge check results into live state, never overwrite #98

Merged
lerko merged 1 commits from fix/livestate-race into main 2026-06-10 20:50:53 +00:00
Owner

Problem

checkByID snapshotted a Site under RLock, ran a network check for seconds, then handleStatusChange wrote the entire stale struct back into liveState. Any concurrent mutation during the check — a user pause, a config edit, or a push heartbeat — was silently reverted.

Worst case: a heartbeat set UP and an in-flight checkPush overwrote it with a stale DOWN → false alert.

Fix

  • New applyState(id, mutate): single read-modify-write helper. Mutator runs against the current live entry under the write lock, so config + Paused are preserved automatically and status transitions read the true current status.
  • Routed handleStatusChange, RecordHeartbeat, ToggleSitePause, checkGroup through it — no more whole-struct overwrite.
  • Logs/alerts fire after the lock releases (tiny critical section).
  • Push false-DOWN guard: a non-UP result whose snapshot LastCheck predates the live LastCheck is dropped (a heartbeat/newer check superseded it). HTTP/probe stamp LastCheck=now, so unaffected — and serial per site anyway.
  • Bonus: fixes a latent bug where RecordHeartbeat read StatusChangedAt after overwriting it (always "was down 0s"); downSince is now captured before mutation.

Tests

4 new regression tests — pause / config-edit / heartbeat during in-flight check, and removed-site-dropped. Each fails against the old code. Full suite green under -race; build + vet clean.

Phase 1 of the fresh-eyes first-cut backlog.

## Problem `checkByID` snapshotted a `Site` under RLock, ran a network check for seconds, then `handleStatusChange` wrote the **entire stale struct** back into `liveState`. Any concurrent mutation during the check — a user **pause**, a **config edit**, or a push **heartbeat** — was silently reverted. Worst case: a heartbeat set UP and an in-flight `checkPush` overwrote it with a stale DOWN → **false alert**. ## Fix - New `applyState(id, mutate)`: single read-modify-write helper. Mutator runs against the **current** live entry under the write lock, so config + `Paused` are preserved automatically and status transitions read the true current status. - Routed `handleStatusChange`, `RecordHeartbeat`, `ToggleSitePause`, `checkGroup` through it — no more whole-struct overwrite. - Logs/alerts fire **after** the lock releases (tiny critical section). - **Push false-DOWN guard:** a non-UP result whose snapshot `LastCheck` predates the live `LastCheck` is dropped (a heartbeat/newer check superseded it). HTTP/probe stamp `LastCheck=now`, so unaffected — and serial per site anyway. - Bonus: fixes a latent bug where `RecordHeartbeat` read `StatusChangedAt` *after* overwriting it (always "was down 0s"); `downSince` is now captured before mutation. ## Tests 4 new regression tests — pause / config-edit / heartbeat during in-flight check, and removed-site-dropped. Each fails against the old code. Full suite green under `-race`; build + vet clean. Phase 1 of the fresh-eyes first-cut backlog.
lerko added 1 commit 2026-06-10 20:45:21 +00:00
fix(monitor): merge check results into live state, never overwrite
CI / test (pull_request) Successful in 2m51s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 51s
5e7faf9ea7
checkByID snapshotted a Site under RLock, ran a network check for
seconds, then handleStatusChange wrote the entire stale struct back into
liveState. Any concurrent mutation during the check — a user pause, a
config edit, or a push heartbeat — was silently reverted. Worst case: a
heartbeat set UP and an in-flight checkPush overwrote it with a stale
DOWN, firing a false alert.

Introduce applyState(id, mutate): a single read-modify-write helper that
runs the mutator against the CURRENT live entry under the write lock, so
config and Paused are preserved automatically and status transitions are
computed from the true current status. Route handleStatusChange,
RecordHeartbeat, ToggleSitePause and checkGroup through it. Logs and
alerts now fire after the lock is released, off the critical section.

Push false-DOWN is closed by a guard: a non-UP result whose snapshot
LastCheck predates the live LastCheck is dropped, since a heartbeat (or
newer check) superseded it. HTTP/probe stamp LastCheck=now before the
call, so they are unaffected (and serial per site anyway).

Also fixes a latent bug where RecordHeartbeat read StatusChangedAt after
overwriting it, always reporting "was down 0s"; downSince is now captured
before mutation.

Adds regression tests for pause/config-edit/heartbeat-during-check and
removed-site-dropped. Full suite green under -race.
lerko merged commit 5e7faf9ea7 into main 2026-06-10 20:50:53 +00:00
lerko deleted branch fix/livestate-race 2026-06-10 20:50:53 +00:00
Sign in to join this conversation.
No Reviewers
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: lerkolabs/uptop#98