Commit Graph

15 Commits

Author SHA1 Message Date
lerko 52ccd7ad91 refactor(models): split Site into SiteConfig + SiteState
CI / test (pull_request) Successful in 1m58s
CI / lint (pull_request) Successful in 1m21s
CI / vulncheck (pull_request) Successful in 1m2s
Site now embeds SiteConfig (22 persistent fields) and SiteState
(11 ephemeral runtime fields). Field access unchanged via promotion
— site.Name and site.Status still work.

Store layer deals exclusively in SiteConfig — the DB never sees
runtime state. Engine's liveState keeps full Site composites.
UpdateSiteConfig reduced from 11-line field-by-field copy to
`existing.SiteConfig = cfg`.

RunCheck takes SiteConfig (only needs config fields). Checker is
now statically prevented from reading/writing runtime state.

Backup.Sites changed to []SiteConfig — exports no longer carry
zero-valued runtime fields. Import backward-compatible (json
ignores unknown fields).
2026-06-11 17:13:09 -04:00
lerko 2b357341c8 refactor(store): shared storetest.BaseMock replaces 5 duplicated mocks
CI / test (pull_request) Successful in 1m57s
CI / lint (pull_request) Successful in 1m16s
CI / vulncheck (pull_request) Successful in 1m1s
New internal/store/storetest/mock.go provides BaseMock implementing
the full Store interface with no-op defaults and optional Func field
overrides. Each test file embeds BaseMock and shadows only the methods
it needs.

Removes ~400 lines of duplicated stub methods across 6 test files.
Adding a Store method now requires one addition (BaseMock) instead
of editing 6 files.
2026-06-11 16:09:29 -04:00
lerko 70a83a1da9 refactor(store): propagate context.Context through all Store methods
Every Store interface method (except Close) now takes context.Context
as first parameter. All 54 db.Query/Exec/QueryRow calls in SQLStore
replaced with their *Context variants. DB operations now respect
cancellation and deadlines.

Context sources by caller:
- Engine dbWriter/poll/pruner: engine ctx from Start()
- HTTP handlers: r.Context()
- config.Apply/Export: caller-provided ctx
- TUI/main.go init: context.Background()

RunCheck and all sub-checks (HTTP/ping/port/DNS) accept parent ctx.
HTTP checks now inherit shutdown cancellation instead of rooting in
context.Background(). dbWrite.exec takes ctx so the writer goroutine
can cancel stuck DB operations.

DeleteSite/ImportData use BeginTx(ctx) instead of Begin().
2026-06-11 14:40:30 -04:00
lerko 5d5153351e fix(engine): six correctness fixes for the state machine
CI / test (pull_request) Successful in 1m59s
CI / lint (pull_request) Successful in 1m17s
CI / vulncheck (pull_request) Successful in 1m1s
1. Group auto-pause trap: remove the one-way Paused=true mutation
   from checkGroup — monitorRoutine skipped paused groups, so they
   could never re-evaluate or auto-unpause.

2. Retry logic: apply MaxRetries to all →DOWN transitions, not just
   UP→DOWN. New monitors (PENDING) no longer alert on first transient
   failure when retries are configured.

3. Shutdown drain hole: track checker goroutines with checkerWG so
   Stop() waits for in-flight checks before draining the write queue.
   Final drainWrites() catches any writes enqueued after the writer's
   own drain.

4. Probe-ingest writer bypass: route SaveCheckFromNode through the
   engine's serialized dbWriter instead of writing directly to the
   store from the HTTP handler.

5. Dead-probe expiry: expire stale probe results (>3× site interval)
   before aggregation so a dead probe can't poison status forever.
   Also clean probeResults in RemoveSite.

6. Maintenance-cache N+1: replace per-check DB query with a
   fully-resolved in-memory cache refreshed every poll cycle. One
   GetActiveMaintenanceWindows() call instead of N IsMonitorInMaintenance.

ImportData now wipes check_history, state_changes, and alert_health
so re-inserted IDs don't inherit stale history from prior occupants.
2026-06-11 13:57:03 -04:00
lerko 8b39d4c1a1 fix(monitor): serialize DB writes through a single drained writer
CI / test (pull_request) Successful in 2m36s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 51s
Every check spawned `go e.db.Save*(...)` with the error discarded: a
fire-and-forget goroutine per log line, check, state change, and alert
health update. SaveLog ran a full-table prune DELETE on every insert and
SaveCheck a COUNT + conditional prune on every check, so the hot path
amplified each write into several statements. Nothing tracked these
goroutines, so at shutdown they raced the store's Close() — writes to a
closing DB, silently swallowed.

Introduce a single writer goroutine that drains a buffered channel of
typed dbWrite values (log/check/state-change/alert-health). Writes are
enqueued non-blocking; a saturated queue drops and notes it in the
in-memory log rather than blocking the check loop. Write errors are now
logged instead of discarded. Retention moves off the hot path: SaveLog
and SaveCheck become plain INSERTs, and PruneLogs/PruneCheckHistory/
PruneStateChanges run on a 10-minute timer inside the writer (single
keep-newest-N-per-site pass via a window function). state_changes was
previously never pruned — now bounded.

Add Engine.Stop(): cancels the engine's context, then waits for the
writer to drain every buffered write before returning. main wires it in
before the deferred store Close() so no write races a closed DB.

SQLite gains busy_timeout=5000 and synchronous=NORMAL, applied via the
DSN so every pooled connection inherits them (a post-open PRAGMA only
touches one connection); WAL moves to the DSN too. :memory: test DBs are
left as-is.

Tests: writer drains on Stop, Stop is idempotent, and the prune queries
keep newest-N per site / N logs on real SQLite. Full suite green under
-race.
2026-06-10 18:14:28 -04:00
lerko 5e7faf9ea7 fix(monitor): merge check results into live state, never overwrite
CI / test (pull_request) Successful in 2m51s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 51s
checkByID snapshotted a Site under RLock, ran a network check for
seconds, then handleStatusChange wrote the entire stale struct back into
liveState. Any concurrent mutation during the check — a user pause, a
config edit, or a push heartbeat — was silently reverted. Worst case: a
heartbeat set UP and an in-flight checkPush overwrote it with a stale
DOWN, firing a false alert.

Introduce applyState(id, mutate): a single read-modify-write helper that
runs the mutator against the CURRENT live entry under the write lock, so
config and Paused are preserved automatically and status transitions are
computed from the true current status. Route handleStatusChange,
RecordHeartbeat, ToggleSitePause and checkGroup through it. Logs and
alerts now fire after the lock is released, off the critical section.

Push false-DOWN is closed by a guard: a non-UP result whose snapshot
LastCheck predates the live LastCheck is dropped, since a heartbeat (or
newer check) superseded it. HTTP/probe stamp LastCheck=now before the
call, so they are unaffected (and serial per site anyway).

Also fixes a latent bug where RecordHeartbeat read StatusChangedAt after
overwriting it, always reporting "was down 0s"; downSince is now captured
before mutation.

Adds regression tests for pause/config-edit/heartbeat-during-check and
removed-site-dropped. Full suite green under -race.
2026-06-10 16:04:00 -04:00
lerko 21a1563e53 feat(monitor): auto-prune expired maintenance windows
CI / test (pull_request) Successful in 2m33s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 50s
Background goroutine runs every 15 minutes, deletes maintenance windows
that expired beyond the retention period (default 7 days). Configurable
via UPTOP_MAINT_RETENTION env var (Go duration format).

Closes #72
2026-06-05 18:27:42 -04:00
lerko 60592ef810 feat(tui): add SLA reporting view
CI / test (pull_request) Successful in 2m35s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 41s
Full-screen SLA report accessible via [s] from detail panel.
Computes uptime%, downtime, outage count, longest outage, MTTR,
and MTBF from state_changes table. Includes daily breakdown with
bar chart, switchable time periods (24h/7d/30d/90d), and
scrollable viewport. LATE/STALE treated as UP for SLA purposes.
2026-06-04 14:24:39 -04:00
lerko ca43621c44 feat(monitor): add STALE state for push monitors
New intermediate state between LATE and DOWN at the midpoint of
the grace period. Gives operators earlier warning that a push
monitor has gone quiet. Includes dedicated orange theme color
across all 5 themes and proper styling in dashboard, detail
panel, and history view.
2026-06-04 18:23:57 +00:00
lerko 8f17deba67 chore: migrate module path to lerkolabs org
CI / test (pull_request) Successful in 2m39s
CI / lint (pull_request) Successful in 1m6s
CI / vulncheck (pull_request) Successful in 46s
Move Go module from gitea.lerkolabs.com/lerko/uptop to
gitea.lerkolabs.com/lerkolabs/uptop. Updates all imports,
go.mod, goreleaser owner, and README links.
2026-05-29 14:22:49 -04:00
lerko 026e969b74 chore: TUI screenshots, README polish, changelog rewrite (#32)
CI / test (push) Successful in 2m41s
CI / lint (push) Successful in 1m11s
CI / vulncheck (push) Successful in 56s
- Add 6 TUI screenshots to assets/ (monitors, alerts, logs, nodes, detail, theme)
- Rewrite README with hero image, badges, collapsible install sections
- Rewrite changelog to match actual CalVer tag history
- VHS tooling extracted to lerko/uptop-vhs

Reviewed-on: lerko/uptop#32
2026-05-29 17:45:31 +00:00
lerko 5dc31108f8 feat: proper push monitor lifecycle — PENDING, LATE, DOWN states
CI / test (pull_request) Successful in 2m41s
CI / lint (pull_request) Successful in 1m7s
CI / vulncheck (pull_request) Successful in 46s
Push monitors no longer lie about status:

- PENDING stays until first heartbeat (no auto-promote to UP)
- LATE state (amber) when overdue but within grace period
- DOWN only after grace period expires
- Grace period = interval/2, minimum 60s

RecordHeartbeat now handles all transitions:
- PENDING → UP (first heartbeat, logged)
- LATE → UP (late arrival, logged)
- DOWN → UP (recovery, alert + state change persisted)

TUI updates:
- LATE rendered in amber/warning color
- Status bar shows LATE count separately
- Tab badge shows ⚠ for late monitors
- Sort order: DOWN > LATE > UP > PENDING > PAUSED
- Detail panel shows error for LATE monitors

Inspired by Healthchecks.io state machine (new/up/grace/down).
2026-05-27 19:56:50 -04:00
lerko bc3a44beac feat: show error reason when monitors go DOWN
CI / test (pull_request) Successful in 2m42s
CI / lint (pull_request) Successful in 1m11s
CI / vulncheck (pull_request) Successful in 51s
Propagate check failure reasons through the entire stack:
- Checker captures specific errors (DNS, timeout, HTTP status, SSL, etc.)
- Engine tracks LastError, StatusChangedAt, LastSuccessAt per monitor
- State transitions persisted to new state_changes table
- Detail panel shows error reason, HTTP code, state duration, last
  success time, and last 5 state change events
- Monitor table shows inline error preview for DOWN services
- Alert messages include error reason
- Probe nodes forward error reasons to leader

15 files changed across models, checker, engine, store, TUI, and probes.
2026-05-27 19:32:30 -04:00
lerko 9d12e3ecf1 chore: complete rename from go-upkeep to uptop
CI / test (pull_request) Successful in 4m26s
CI / lint (pull_request) Successful in 1m11s
- Module path: gitea.lerkolabs.com/lerko/uptop
- Binary: cmd/uptop/
- All imports updated to full module path
- Env vars: UPKEEP_* → UPTOP_*
- Prometheus metrics: upkeep_* → uptop_*
- Default DB: uptop.db
- Docker image: lerko/uptop
- All docs, compose files, CI updated

Only remaining "go-upkeep" reference is the fork attribution in README.
2026-05-24 20:20:35 -04:00
lerko 94296e8286 test(monitor): add comprehensive test suite for engine and checkers
55 tests covering state machine transitions, heartbeat handling, push
deadline checks, group aggregation, history recording, probe aggregation,
log management, state management, and concurrency safety.

Checker tests cover HTTP (via httptest), port (via net.Listen),
isCodeAccepted ranges, and siteTimeout defaults. Ping and DNS
checkers skipped (need ICMP privileges and DNS server).

Coverage: 64.2% overall, 100% on handleStatusChange, triggerAlert,
checkPush, recordCheck, and AggregateStatus.
2026-05-23 21:06:28 -04:00