uptop

Author	SHA1	Message	Date
lerko	52ccd7ad91	refactor(models): split Site into SiteConfig + SiteState Site now embeds SiteConfig (22 persistent fields) and SiteState (11 ephemeral runtime fields). Field access unchanged via promotion — site.Name and site.Status still work. Store layer deals exclusively in SiteConfig — the DB never sees runtime state. Engine's liveState keeps full Site composites. UpdateSiteConfig reduced from 11-line field-by-field copy to `existing.SiteConfig = cfg`. RunCheck takes SiteConfig (only needs config fields). Checker is now statically prevented from reading/writing runtime state. Backup.Sites changed to []SiteConfig — exports no longer carry zero-valued runtime fields. Import backward-compatible (json ignores unknown fields).	2026-06-11 17:13:09 -04:00
lerko	2b357341c8	refactor(store): shared storetest.BaseMock replaces 5 duplicated mocks New internal/store/storetest/mock.go provides BaseMock implementing the full Store interface with no-op defaults and optional Func field overrides. Each test file embeds BaseMock and shadows only the methods it needs. Removes ~400 lines of duplicated stub methods across 6 test files. Adding a Store method now requires one addition (BaseMock) instead of editing 6 files.	2026-06-11 16:09:29 -04:00
lerko	70a83a1da9	refactor(store): propagate context.Context through all Store methods Every Store interface method (except Close) now takes context.Context as first parameter. All 54 db.Query/Exec/QueryRow calls in SQLStore replaced with their *Context variants. DB operations now respect cancellation and deadlines. Context sources by caller: - Engine dbWriter/poll/pruner: engine ctx from Start() - HTTP handlers: r.Context() - config.Apply/Export: caller-provided ctx - TUI/main.go init: context.Background() RunCheck and all sub-checks (HTTP/ping/port/DNS) accept parent ctx. HTTP checks now inherit shutdown cancellation instead of rooting in context.Background(). dbWrite.exec takes ctx so the writer goroutine can cancel stuck DB operations. DeleteSite/ImportData use BeginTx(ctx) instead of Begin().	2026-06-11 14:40:30 -04:00
lerko	5d5153351e	fix(engine): six correctness fixes for the state machine 1. Group auto-pause trap: remove the one-way Paused=true mutation from checkGroup — monitorRoutine skipped paused groups, so they could never re-evaluate or auto-unpause. 2. Retry logic: apply MaxRetries to all →DOWN transitions, not just UP→DOWN. New monitors (PENDING) no longer alert on first transient failure when retries are configured. 3. Shutdown drain hole: track checker goroutines with checkerWG so Stop() waits for in-flight checks before draining the write queue. Final drainWrites() catches any writes enqueued after the writer's own drain. 4. Probe-ingest writer bypass: route SaveCheckFromNode through the engine's serialized dbWriter instead of writing directly to the store from the HTTP handler. 5. Dead-probe expiry: expire stale probe results (>3× site interval) before aggregation so a dead probe can't poison status forever. Also clean probeResults in RemoveSite. 6. Maintenance-cache N+1: replace per-check DB query with a fully-resolved in-memory cache refreshed every poll cycle. One GetActiveMaintenanceWindows() call instead of N IsMonitorInMaintenance. ImportData now wipes check_history, state_changes, and alert_health so re-inserted IDs don't inherit stale history from prior occupants.	2026-06-11 13:57:03 -04:00
lerko	8b39d4c1a1	fix(monitor): serialize DB writes through a single drained writer Every check spawned `go e.db.Save*(...)` with the error discarded: a fire-and-forget goroutine per log line, check, state change, and alert health update. SaveLog ran a full-table prune DELETE on every insert and SaveCheck a COUNT + conditional prune on every check, so the hot path amplified each write into several statements. Nothing tracked these goroutines, so at shutdown they raced the store's Close() — writes to a closing DB, silently swallowed. Introduce a single writer goroutine that drains a buffered channel of typed dbWrite values (log/check/state-change/alert-health). Writes are enqueued non-blocking; a saturated queue drops and notes it in the in-memory log rather than blocking the check loop. Write errors are now logged instead of discarded. Retention moves off the hot path: SaveLog and SaveCheck become plain INSERTs, and PruneLogs/PruneCheckHistory/ PruneStateChanges run on a 10-minute timer inside the writer (single keep-newest-N-per-site pass via a window function). state_changes was previously never pruned — now bounded. Add Engine.Stop(): cancels the engine's context, then waits for the writer to drain every buffered write before returning. main wires it in before the deferred store Close() so no write races a closed DB. SQLite gains busy_timeout=5000 and synchronous=NORMAL, applied via the DSN so every pooled connection inherits them (a post-open PRAGMA only touches one connection); WAL moves to the DSN too. :memory: test DBs are left as-is. Tests: writer drains on Stop, Stop is idempotent, and the prune queries keep newest-N per site / N logs on real SQLite. Full suite green under -race.	2026-06-10 18:14:28 -04:00
lerko	5e7faf9ea7	fix(monitor): merge check results into live state, never overwrite checkByID snapshotted a Site under RLock, ran a network check for seconds, then handleStatusChange wrote the entire stale struct back into liveState. Any concurrent mutation during the check — a user pause, a config edit, or a push heartbeat — was silently reverted. Worst case: a heartbeat set UP and an in-flight checkPush overwrote it with a stale DOWN, firing a false alert. Introduce applyState(id, mutate): a single read-modify-write helper that runs the mutator against the CURRENT live entry under the write lock, so config and Paused are preserved automatically and status transitions are computed from the true current status. Route handleStatusChange, RecordHeartbeat, ToggleSitePause and checkGroup through it. Logs and alerts now fire after the lock is released, off the critical section. Push false-DOWN is closed by a guard: a non-UP result whose snapshot LastCheck predates the live LastCheck is dropped, since a heartbeat (or newer check) superseded it. HTTP/probe stamp LastCheck=now before the call, so they are unaffected (and serial per site anyway). Also fixes a latent bug where RecordHeartbeat read StatusChangedAt after overwriting it, always reporting "was down 0s"; downSince is now captured before mutation. Adds regression tests for pause/config-edit/heartbeat-during-check and removed-site-dropped. Full suite green under -race.	2026-06-10 16:04:00 -04:00
lerko	21a1563e53	feat(monitor): auto-prune expired maintenance windows Background goroutine runs every 15 minutes, deletes maintenance windows that expired beyond the retention period (default 7 days). Configurable via UPTOP_MAINT_RETENTION env var (Go duration format). Closes #72	2026-06-05 18:27:42 -04:00
lerko	60592ef810	feat(tui): add SLA reporting view Full-screen SLA report accessible via [s] from detail panel. Computes uptime%, downtime, outage count, longest outage, MTTR, and MTBF from state_changes table. Includes daily breakdown with bar chart, switchable time periods (24h/7d/30d/90d), and scrollable viewport. LATE/STALE treated as UP for SLA purposes.	2026-06-04 14:24:39 -04:00
lerko	ca43621c44	feat(monitor): add STALE state for push monitors New intermediate state between LATE and DOWN at the midpoint of the grace period. Gives operators earlier warning that a push monitor has gone quiet. Includes dedicated orange theme color across all 5 themes and proper styling in dashboard, detail panel, and history view.	2026-06-04 18:23:57 +00:00
lerko	8f17deba67	chore: migrate module path to lerkolabs org Move Go module from gitea.lerkolabs.com/lerko/uptop to gitea.lerkolabs.com/lerkolabs/uptop. Updates all imports, go.mod, goreleaser owner, and README links.	2026-05-29 14:22:49 -04:00
lerko	026e969b74	chore: TUI screenshots, README polish, changelog rewrite (#32) - Add 6 TUI screenshots to assets/ (monitors, alerts, logs, nodes, detail, theme) - Rewrite README with hero image, badges, collapsible install sections - Rewrite changelog to match actual CalVer tag history - VHS tooling extracted to lerko/uptop-vhs Reviewed-on: lerko/uptop#32	2026-05-29 17:45:31 +00:00
lerko	5dc31108f8	feat: proper push monitor lifecycle — PENDING, LATE, DOWN states Push monitors no longer lie about status: - PENDING stays until first heartbeat (no auto-promote to UP) - LATE state (amber) when overdue but within grace period - DOWN only after grace period expires - Grace period = interval/2, minimum 60s RecordHeartbeat now handles all transitions: - PENDING → UP (first heartbeat, logged) - LATE → UP (late arrival, logged) - DOWN → UP (recovery, alert + state change persisted) TUI updates: - LATE rendered in amber/warning color - Status bar shows LATE count separately - Tab badge shows ⚠ for late monitors - Sort order: DOWN > LATE > UP > PENDING > PAUSED - Detail panel shows error for LATE monitors Inspired by Healthchecks.io state machine (new/up/grace/down).	2026-05-27 19:56:50 -04:00
lerko	bc3a44beac	feat: show error reason when monitors go DOWN Propagate check failure reasons through the entire stack: - Checker captures specific errors (DNS, timeout, HTTP status, SSL, etc.) - Engine tracks LastError, StatusChangedAt, LastSuccessAt per monitor - State transitions persisted to new state_changes table - Detail panel shows error reason, HTTP code, state duration, last success time, and last 5 state change events - Monitor table shows inline error preview for DOWN services - Alert messages include error reason - Probe nodes forward error reasons to leader 15 files changed across models, checker, engine, store, TUI, and probes.	2026-05-27 19:32:30 -04:00
lerko	9d12e3ecf1	chore: complete rename from go-upkeep to uptop - Module path: gitea.lerkolabs.com/lerko/uptop - Binary: cmd/uptop/ - All imports updated to full module path - Env vars: UPKEEP_* → UPTOP_* - Prometheus metrics: upkeep_* → uptop_* - Default DB: uptop.db - Docker image: lerko/uptop - All docs, compose files, CI updated Only remaining "go-upkeep" reference is the fork attribution in README.	2026-05-24 20:20:35 -04:00
lerko	94296e8286	test(monitor): add comprehensive test suite for engine and checkers 55 tests covering state machine transitions, heartbeat handling, push deadline checks, group aggregation, history recording, probe aggregation, log management, state management, and concurrency safety. Checker tests cover HTTP (via httptest), port (via net.Listen), isCodeAccepted ranges, and siteTimeout defaults. Ping and DNS checkers skipped (need ICMP privileges and DNS server). Coverage: 64.2% overall, 100% on handleStatusChange, triggerAlert, checkPush, recordCheck, and AggregateStatus.	2026-05-23 21:06:28 -04:00
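
Commit 52ccd7ad91 describes splitting Site into an embedded SiteConfig and SiteState. The following is a minimal sketch of that pattern, not the project's actual code: the field names, JSON tags, and the helpers updateSiteConfig and decodeConfig are hypothetical stand-ins (the real structs are said to carry 22 and 11 fields). It shows field promotion, the whole-struct config assignment, and encoding/json skipping unknown keys on import.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SiteConfig holds persistent settings (hypothetical subset of fields).
type SiteConfig struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
	URL  string `json:"url"`
}

// SiteState holds ephemeral runtime data (hypothetical subset of fields).
type SiteState struct {
	Status    string `json:"status"`
	LastError string `json:"last_error"`
}

// Site embeds both structs, so site.Name and site.Status are promoted fields.
type Site struct {
	SiteConfig
	SiteState
}

// updateSiteConfig swaps the whole config in one assignment and leaves
// the runtime state untouched.
func updateSiteConfig(existing *Site, cfg SiteConfig) {
	existing.SiteConfig = cfg
}

// decodeConfig parses one exported site. json.Unmarshal ignores keys that
// have no matching field, so older exports with runtime fields still load.
func decodeConfig(data []byte) (SiteConfig, error) {
	var cfg SiteConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	site := Site{
		SiteConfig: SiteConfig{ID: 1, Name: "api", URL: "https://example.com"},
		SiteState:  SiteState{Status: "UP"},
	}
	updateSiteConfig(&site, SiteConfig{ID: 1, Name: "api-v2", URL: "https://example.com/v2"})
	fmt.Println(site.Name, site.Status) // prints "api-v2 UP"

	// An older export that still carries a runtime "status" key.
	old := []byte(`{"id":2,"name":"db","url":"tcp://db:5432","status":"DOWN"}`)
	cfg, err := decodeConfig(old)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(cfg.ID, cfg.Name) // prints "2 db"
}
```

A function that accepts only SiteConfig cannot touch SiteState at all, which is the static guarantee the commit message attributes to RunCheck.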

15 Commits
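
Commit 8b39d4c1a1 describes a single writer goroutine fed by a buffered channel, with non-blocking enqueue and a Stop that drains before returning. The sketch below illustrates that shape under stated assumptions: the names write, writer, newWriter, enqueue, and savedCount are hypothetical, an in-memory slice stands in for the store, and the prune timer and error logging from the commit are left out.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// write is a hypothetical typed queue item (log, check, state change, ...).
type write struct{ kind, payload string }

type writer struct {
	queue    chan write
	cancel   context.CancelFunc
	done     chan struct{}
	stopOnce sync.Once

	mu      sync.Mutex
	saved   []write // stand-in for the real store
	dropped int
}

func newWriter(size int) *writer {
	ctx, cancel := context.WithCancel(context.Background())
	w := &writer{queue: make(chan write, size), cancel: cancel, done: make(chan struct{})}
	go w.run(ctx)
	return w
}

func (w *writer) save(item write) {
	w.mu.Lock()
	w.saved = append(w.saved, item)
	w.mu.Unlock()
}

// run is the only goroutine that touches the store. On cancellation it
// drains whatever is still buffered, then exits.
func (w *writer) run(ctx context.Context) {
	defer close(w.done)
	for {
		select {
		case item := <-w.queue:
			w.save(item)
		case <-ctx.Done():
			for {
				select {
				case item := <-w.queue:
					w.save(item)
				default:
					return
				}
			}
		}
	}
}

// enqueue never blocks: a full queue drops the write and counts it.
func (w *writer) enqueue(item write) bool {
	select {
	case w.queue <- item:
		return true
	default:
		w.mu.Lock()
		w.dropped++
		w.mu.Unlock()
		return false
	}
}

// Stop cancels the writer and waits for the drain; later calls are no-ops.
func (w *writer) Stop() {
	w.stopOnce.Do(func() {
		w.cancel()
		<-w.done
	})
}

func (w *writer) savedCount() int {
	w.mu.Lock()
	defer w.mu.Unlock()
	return len(w.saved)
}

func main() {
	w := newWriter(8)
	for i := 0; i < 5; i++ {
		w.enqueue(write{kind: "check", payload: fmt.Sprint(i)})
	}
	w.Stop()
	w.Stop() // idempotent
	fmt.Println(w.savedCount()) // prints 5
}
```

In this sketch a write enqueued after Stop returns would sit in the buffer unsaved. Commit 5d5153351e describes handling that window in the real engine by waiting for in-flight checkers and running a final drain.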