fix(monitor): serialize DB writes through a single drained writer #99

Merged
lerko merged 1 commit from fix/db-writer into main 2026-06-10 22:28:31 +00:00

Problem

Every check spawned go e.db.Save*(...) with the error discarded — a fire-and-forget goroutine per log line, check, state change, and alert-health update:

  • SaveLog ran a full-table prune DELETE on every insert; SaveCheck a COUNT + conditional prune on every check — the hot path amplified each write into several statements.
  • Nothing tracked these goroutines, so at shutdown they raced the store's Close() → writes to a closing DB, silently swallowed.
  • state_changes was never pruned → unbounded growth.

Fix

  • Single writer goroutine drains a buffered channel of typed dbWrite values (log / check / state-change / alert-health). Writes enqueue non-blocking; a saturated queue drops and notes it in the in-memory log instead of blocking the check loop. Write errors are logged, not discarded.
  • Retention off the hot path: SaveLog/SaveCheck become plain INSERTs. PruneLogs / PruneCheckHistory / PruneStateChanges run on a 10-min timer inside the writer (keep-newest-N-per-site via a window function). state_changes is now bounded.
  • Engine.Stop() cancels the engine context, then waits for the writer to drain every buffered write before returning. main calls it before the deferred store Close(), so no write races a closed DB.
  • SQLite pragmas via DSN: busy_timeout=5000 + synchronous=NORMAL (+ WAL) applied through the DSN so every pooled connection inherits them (a post-open PRAGMA only touches one connection). :memory: test DBs left as-is.
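
The writer and shutdown path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the `Engine` fields, `NewEngine`, `enqueue`, and the `save` callback are hypothetical names, the in-memory log is stood in for by `fmt.Println`, and the 10-minute prune timer is left out.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// dbWrite is a hypothetical stand-in for the PR's typed write values.
type dbWrite struct {
	kind    string // e.g. "log", "check", "state-change", "alert-health"
	payload string
}

// Engine owns the queue and the single writer goroutine (field names assumed).
type Engine struct {
	writes   chan dbWrite
	cancel   context.CancelFunc
	done     chan struct{} // closed when the writer has exited
	stopOnce sync.Once
	save     func(dbWrite) error // stand-in for the store's Save* methods
}

func NewEngine(buf int, save func(dbWrite) error) *Engine {
	ctx, cancel := context.WithCancel(context.Background())
	e := &Engine{
		writes: make(chan dbWrite, buf),
		cancel: cancel,
		done:   make(chan struct{}),
		save:   save,
	}
	go e.writer(ctx)
	return e
}

// enqueue never blocks: if the buffer is full the write is dropped and noted.
func (e *Engine) enqueue(w dbWrite) bool {
	select {
	case e.writes <- w:
		return true
	default:
		fmt.Println("db write queue full, dropping", w.kind)
		return false
	}
}

// writer is the only goroutine that touches the store.
func (e *Engine) writer(ctx context.Context) {
	defer close(e.done)
	for {
		select {
		case w := <-e.writes:
			e.apply(w)
		case <-ctx.Done():
			// Drain everything still buffered, then exit.
			for {
				select {
				case w := <-e.writes:
					e.apply(w)
				default:
					return
				}
			}
		}
	}
}

// apply logs write errors instead of discarding them.
func (e *Engine) apply(w dbWrite) {
	if err := e.save(w); err != nil {
		fmt.Println("db write failed:", err)
	}
}

// Stop cancels the context and waits for the drain; repeated calls are no-ops.
// Producers are assumed to have stopped enqueueing before Stop is called.
func (e *Engine) Stop() {
	e.stopOnce.Do(func() {
		e.cancel()
		<-e.done
	})
}

func main() {
	e := NewEngine(2, func(w dbWrite) error {
		fmt.Println("saved", w.kind, w.payload)
		return nil
	})
	for i := 0; i < 5; i++ {
		e.enqueue(dbWrite{kind: "check", payload: fmt.Sprint(i)})
	}
	e.Stop() // returns only after buffered writes have been applied
}
```

In this sketch the caller would invoke `Stop()` before closing the store, which is the ordering the last-but-one bullet relies on. A write enqueued after `Stop()` returns would sit in the buffer unapplied, so the sketch depends on producers being cancelled first.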

Tests

Writer drains on Stop, Stop is idempotent, and the prune queries keep newest-N per site / N logs on real SQLite. Full suite green under -race; vet clean. Smoke-tested the real binary: WAL files created, clean SIGTERM shutdown (no hang).

Phase 2 of the fresh-eyes first-cut backlog (builds on #98).

lerko added 1 commit 2026-06-10 22:14:30 +00:00
fix(monitor): serialize DB writes through a single drained writer
CI / test (pull_request) Successful in 2m36s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 51s
8b39d4c1a1
Every check spawned `go e.db.Save*(...)` with the error discarded: a
fire-and-forget goroutine per log line, check, state change, and alert
health update. SaveLog ran a full-table prune DELETE on every insert and
SaveCheck a COUNT + conditional prune on every check, so the hot path
amplified each write into several statements. Nothing tracked these
goroutines, so at shutdown they raced the store's Close() — writes to a
closing DB, silently swallowed.

Introduce a single writer goroutine that drains a buffered channel of
typed dbWrite values (log/check/state-change/alert-health). Writes are
enqueued non-blocking; a saturated queue drops and notes it in the
in-memory log rather than blocking the check loop. Write errors are now
logged instead of discarded. Retention moves off the hot path: SaveLog
and SaveCheck become plain INSERTs, and PruneLogs/PruneCheckHistory/
PruneStateChanges run on a 10-minute timer inside the writer (single
keep-newest-N-per-site pass via a window function). state_changes was
previously never pruned — now bounded.
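
A keep-newest-N-per-site prune using a window function might look like the sketch below. The table name, the `id` and `site_id` columns, and the helper names are assumptions made for illustration; the actual schema and queries are not shown here. It also assumes `id` increases with insertion order, so ordering by `id DESC` means newest first.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// pruneSQL builds a DELETE that keeps the newest rows per site.
// ROW_NUMBER() numbers each site's rows from newest (1) to oldest;
// rows whose number exceeds the bound parameter are deleted.
// The table name is concatenated, so it must be a trusted identifier.
func pruneSQL(table string) string {
	return "DELETE FROM " + table + " WHERE id IN (" +
		"SELECT id FROM (" +
		"SELECT id, ROW_NUMBER() OVER (PARTITION BY site_id ORDER BY id DESC) AS rn FROM " + table +
		") WHERE rn > ?)"
}

// prune runs the query and reports how many rows were removed.
func prune(ctx context.Context, db *sql.DB, table string, keep int) (int64, error) {
	res, err := db.ExecContext(ctx, pruneSQL(table), keep)
	if err != nil {
		return 0, fmt.Errorf("prune %s: %w", table, err)
	}
	return res.RowsAffected()
}

func main() {
	// Printing the statement only; running it needs an open *sql.DB.
	fmt.Println(pruneSQL("check_history"))
}
```

Window functions require SQLite 3.25 or later. A global cap (N logs in total rather than N per site) would drop the `PARTITION BY` clause.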

Add Engine.Stop(): cancels the engine's context, then waits for the
writer to drain every buffered write before returning. main wires it in
before the deferred store Close() so no write races a closed DB.

SQLite gains busy_timeout=5000 and synchronous=NORMAL, applied via the
DSN so every pooled connection inherits them (a post-open PRAGMA only
touches one connection); WAL moves to the DSN too. :memory: test DBs are
left as-is.
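
As a rough sketch of the DSN approach: the parameter names below follow the `mattn/go-sqlite3` convention, which is an assumption, since the driver is not named here and other drivers spell these options differently. `sqliteDSN` and the file name are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// sqliteDSN appends the pragmas as DSN query parameters so that every
// connection the database/sql pool opens is configured the same way.
// In-memory test databases are returned unchanged.
func sqliteDSN(path string) string {
	if path == ":memory:" {
		return path
	}
	q := url.Values{}
	q.Set("_busy_timeout", "5000")  // wait up to 5 s on a locked database
	q.Set("_journal_mode", "WAL")   // write-ahead logging
	q.Set("_synchronous", "NORMAL") // fewer fsyncs than FULL
	return "file:" + path + "?" + q.Encode()
}

func main() {
	fmt.Println(sqliteDSN("uptop.db"))
	fmt.Println(sqliteDSN(":memory:"))
}
```

The reason for going through the DSN is that `database/sql` opens pooled connections lazily, and a `PRAGMA` executed after `sql.Open` runs on whichever single connection happens to serve that statement.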

Tests: writer drains on Stop, Stop is idempotent, and the prune queries
keep newest-N per site / N logs on real SQLite. Full suite green under
-race.
lerko force-pushed fix/db-writer from 8f479635f9 to 8b39d4c1a1 2026-06-10 22:14:30 +00:00 Compare
lerko merged commit 8b39d4c1a1 into main 2026-06-10 22:28:31 +00:00
lerko deleted branch fix/db-writer 2026-06-10 22:28:31 +00:00

Reference: lerkolabs/uptop#99