uptop

Author	SHA1	Message	Date
lerko	70a83a1da9	refactor(store): propagate context.Context through all Store methods Every Store interface method (except Close) now takes context.Context as first parameter. All 54 db.Query/Exec/QueryRow calls in SQLStore replaced with their *Context variants. DB operations now respect cancellation and deadlines. Context sources by caller: - Engine dbWriter/poll/pruner: engine ctx from Start() - HTTP handlers: r.Context() - config.Apply/Export: caller-provided ctx - TUI/main.go init: context.Background() RunCheck and all sub-checks (HTTP/ping/port/DNS) accept parent ctx. HTTP checks now inherit shutdown cancellation instead of rooting in context.Background(). dbWrite.exec takes ctx so the writer goroutine can cancel stuck DB operations. DeleteSite/ImportData use BeginTx(ctx) instead of Begin().	2026-06-11 14:40:30 -04:00
lerko	5d5153351e	fix(engine): six correctness fixes for the state machine 1. Group auto-pause trap: remove the one-way Paused=true mutation from checkGroup — monitorRoutine skipped paused groups, so they could never re-evaluate or auto-unpause. 2. Retry logic: apply MaxRetries to all →DOWN transitions, not just UP→DOWN. New monitors (PENDING) no longer alert on first transient failure when retries are configured. 3. Shutdown drain hole: track checker goroutines with checkerWG so Stop() waits for in-flight checks before draining the write queue. Final drainWrites() catches any writes enqueued after the writer's own drain. 4. Probe-ingest writer bypass: route SaveCheckFromNode through the engine's serialized dbWriter instead of writing directly to the store from the HTTP handler. 5. Dead-probe expiry: expire stale probe results (>3× site interval) before aggregation so a dead probe can't poison status forever. Also clean probeResults in RemoveSite. 6. Maintenance-cache N+1: replace per-check DB query with a fully-resolved in-memory cache refreshed every poll cycle. One GetActiveMaintenanceWindows() call instead of N IsMonitorInMaintenance. ImportData now wipes check_history, state_changes, and alert_health so re-inserted IDs don't inherit stale history from prior occupants.	2026-06-11 13:57:03 -04:00
lerko	8b39d4c1a1	fix(monitor): serialize DB writes through a single drained writer Every check spawned `go e.db.Save*(...)` with the error discarded: a fire-and-forget goroutine per log line, check, state change, and alert health update. SaveLog ran a full-table prune DELETE on every insert and SaveCheck a COUNT + conditional prune on every check, so the hot path amplified each write into several statements. Nothing tracked these goroutines, so at shutdown they raced the store's Close() — writes to a closing DB, silently swallowed. Introduce a single writer goroutine that drains a buffered channel of typed dbWrite values (log/check/state-change/alert-health). Writes are enqueued non-blocking; a saturated queue drops and notes it in the in-memory log rather than blocking the check loop. Write errors are now logged instead of discarded. Retention moves off the hot path: SaveLog and SaveCheck become plain INSERTs, and PruneLogs/PruneCheckHistory/ PruneStateChanges run on a 10-minute timer inside the writer (single keep-newest-N-per-site pass via a window function). state_changes was previously never pruned — now bounded. Add Engine.Stop(): cancels the engine's context, then waits for the writer to drain every buffered write before returning. main wires it in before the deferred store Close() so no write races a closed DB. SQLite gains busy_timeout=5000 and synchronous=NORMAL, applied via the DSN so every pooled connection inherits them (a post-open PRAGMA only touches one connection); WAL moves to the DSN too. :memory: test DBs are left as-is. Tests: writer drains on Stop, Stop is idempotent, and the prune queries keep newest-N per site / N logs on real SQLite. Full suite green under -race.	2026-06-10 18:14:28 -04:00
lerko	5e7faf9ea7	fix(monitor): merge check results into live state, never overwrite checkByID snapshotted a Site under RLock, ran a network check for seconds, then handleStatusChange wrote the entire stale struct back into liveState. Any concurrent mutation during the check — a user pause, a config edit, or a push heartbeat — was silently reverted. Worst case: a heartbeat set UP and an in-flight checkPush overwrote it with a stale DOWN, firing a false alert. Introduce applyState(id, mutate): a single read-modify-write helper that runs the mutator against the CURRENT live entry under the write lock, so config and Paused are preserved automatically and status transitions are computed from the true current status. Route handleStatusChange, RecordHeartbeat, ToggleSitePause and checkGroup through it. Logs and alerts now fire after the lock is released, off the critical section. Push false-DOWN is closed by a guard: a non-UP result whose snapshot LastCheck predates the live LastCheck is dropped, since a heartbeat (or newer check) superseded it. HTTP/probe stamp LastCheck=now before the call, so they are unaffected (and serial per site anyway). Also fixes a latent bug where RecordHeartbeat read StatusChangedAt after overwriting it, always reporting "was down 0s"; downSince is now captured before mutation. Adds regression tests for pause/config-edit/heartbeat-during-check and removed-site-dropped. Full suite green under -race.	2026-06-10 16:04:00 -04:00
lerko	21a1563e53	feat(monitor): auto-prune expired maintenance windows Background goroutine runs every 15 minutes, deletes maintenance windows that expired beyond the retention period (default 7 days). Configurable via UPTOP_MAINT_RETENTION env var (Go duration format). Closes #72	2026-06-05 18:27:42 -04:00
lerko	d53a4e6ac3	refactor(monitor): extract magic numbers into named constants Closes #75	2026-06-05 16:32:12 -04:00
lerko	00fa381a7c	fix(monitor): log STALE recovery in push heartbeat handler	2026-06-04 17:41:39 -04:00
lerko	60592ef810	feat(tui): add SLA reporting view Full-screen SLA report accessible via [s] from detail panel. Computes uptime%, downtime, outage count, longest outage, MTTR, and MTBF from state_changes table. Includes daily breakdown with bar chart, switchable time periods (24h/7d/30d/90d), and scrollable viewport. LATE/STALE treated as UP for SLA purposes.	2026-06-04 14:24:39 -04:00
lerko	b2e92e8a2a	fix(monitor): propagate STALE/LATE child status to group checkGroup only checked for DOWN/SSL EXP and PENDING. Groups now reflect STALE and LATE children with proper priority: DOWN > STALE > LATE > PENDING > UP.	2026-06-04 18:23:57 +00:00
lerko	ca43621c44	feat(monitor): add STALE state for push monitors New intermediate state between LATE and DOWN at the midpoint of the grace period. Gives operators earlier warning that a push monitor has gone quiet. Includes dedicated orange theme color across all 5 themes and proper styling in dashboard, detail panel, and history view.	2026-06-04 18:23:57 +00:00
lerko	eb9546c97e	fix(monitor): trigger immediate recheck after site config edit Monitor goroutine slept for the full check interval after a config edit, so hostname/URL changes wouldn't take effect until the next scheduled check. Added per-site recheck channel that wakes the goroutine immediately when UpdateSiteConfig is called.	2026-06-04 12:23:04 -04:00
lerko	8f17deba67	chore: migrate module path to lerkolabs org Move Go module from gitea.lerkolabs.com/lerko/uptop to gitea.lerkolabs.com/lerkolabs/uptop. Updates all imports, go.mod, goreleaser owner, and README links.	2026-05-29 14:22:49 -04:00
lerko	026e969b74	chore: TUI screenshots, README polish, changelog rewrite (#32) - Add 6 TUI screenshots to assets/ (monitors, alerts, logs, nodes, detail, theme) - Rewrite README with hero image, badges, collapsible install sections - Rewrite changelog to match actual CalVer tag history - VHS tooling extracted to lerko/uptop-vhs Reviewed-on: lerko/uptop#32	2026-05-29 17:45:31 +00:00
lerko	0aa2f9cd8a	feat: alert channel health indicator + test alerts Track alert delivery health at runtime: - AlertHealth struct: LastSendAt, LastSendOK, LastError, SendCount, FailCount - triggerAlert records success/failure after each Send() - Health data exposed via GetAlertHealth() for TUI Alerts tab enriched: - Health dot column: green (OK), red (failed), gray (never sent) - LAST SENT column: relative time ("2m ago", "never") - [t] key sends test notification through selected channel Inspired by Grafana's contact point health columns.	2026-05-27 21:23:06 -04:00
lerko	b14d5e19db	feat: logs tab overhaul — severity tags, filtering, recovery durations Logs tab visual overhaul: - Severity-classified entries: DOWN (red), UP (green), WARN (amber), SYS (cyan), info (gray) — rendered as inline tags, not whole-line color - Column-aligned format: [timestamp] [severity tag] [message] - Filter toggle (f key): All vs Important only (hides retry noise) - Header shows entry count, filter state, hidden count Engine log improvements: - Recovery messages include downtime duration ("was down 14m") - LATE transition logged ("heartbeat overdue") - Push monitor recovery includes downtime duration	2026-05-27 20:14:43 -04:00
lerko	5dc31108f8	feat: proper push monitor lifecycle — PENDING, LATE, DOWN states Push monitors no longer lie about status: - PENDING stays until first heartbeat (no auto-promote to UP) - LATE state (amber) when overdue but within grace period - DOWN only after grace period expires - Grace period = interval/2, minimum 60s RecordHeartbeat now handles all transitions: - PENDING → UP (first heartbeat, logged) - LATE → UP (late arrival, logged) - DOWN → UP (recovery, alert + state change persisted) TUI updates: - LATE rendered in amber/warning color - Status bar shows LATE count separately - Tab badge shows ⚠ for late monitors - Sort order: DOWN > LATE > UP > PENDING > PAUSED - Detail panel shows error for LATE monitors Inspired by Healthchecks.io state machine (new/up/grace/down).	2026-05-27 19:56:50 -04:00
lerko	bc3a44beac	feat: show error reason when monitors go DOWN Propagate check failure reasons through the entire stack: - Checker captures specific errors (DNS, timeout, HTTP status, SSL, etc.) - Engine tracks LastError, StatusChangedAt, LastSuccessAt per monitor - State transitions persisted to new state_changes table - Detail panel shows error reason, HTTP code, state duration, last success time, and last 5 state change events - Monitor table shows inline error preview for DOWN services - Alert messages include error reason - Probe nodes forward error reasons to leader 15 files changed across models, checker, engine, store, TUI, and probes.	2026-05-27 19:32:30 -04:00
lerko	986f9f1d55	fix(security): phase 4 code quality and low-severity fixes - Fix limitStr to handle multi-byte UTF-8 characters correctly - Sanitize log messages: strip ANSI escape sequences and newlines - URL-encode probe node_id instead of string concatenation - Fix follower resp.Body leak on non-200 responses - Make SSH host key path configurable via UPTOP_SSH_HOST_KEY env var - Add HTTP method checks on GET-only endpoints (405 for wrong methods) - Extract magic numbers into named constants across monitor/store/server - Standardize error output to stderr for all startup errors	2026-05-26 17:25:47 -04:00
lerko	60b30935b3	fix(security): phase 1 critical fixes for public release - Redact PostgreSQL DSN password from stdout/logs - Harden .dockerignore to exclude .ssh/, .claude/, .db, .local files - SSRF protection: block private/loopback/link-local IPs by default (UPTOP_ALLOW_PRIVATE_TARGETS=true to override for homelab use) - Fix email header injection via CRLF in monitor names - AES-256-GCM encryption for alert credentials at rest (UPTOP_ENCRYPTION_KEY env var, migrate-secrets subcommand) - TLS support for HTTP server (UPTOP_TLS_CERT/UPTOP_TLS_KEY) with HSTS header when TLS enabled	2026-05-25 11:26:47 -04:00
lerko	9d12e3ecf1	chore: complete rename from go-upkeep to uptop - Module path: gitea.lerkolabs.com/lerko/uptop - Binary: cmd/uptop/ - All imports updated to full module path - Env vars: UPKEEP_* → UPTOP_* - Prometheus metrics: upkeep_* → uptop_* - Default DB: uptop.db - Docker image: lerko/uptop - All docs, compose files, CI updated Only remaining "go-upkeep" reference is the fork attribution in README.	2026-05-24 20:20:35 -04:00
lerko	359cff7292	chore: add golangci-lint config and fix all lint issues Add .golangci.yml enabling errcheck, staticcheck, govet, gosec, ineffassign, and unused linters. Fix 66 issues across 16 files: - Check all unchecked errors (errcheck) - Use HTTP status constants instead of numeric literals (staticcheck) - Replace deprecated LineUp/LineDown with ScrollUp/ScrollDown (staticcheck) - Convert sprintf+write patterns to fmt.Fprintf (staticcheck) - Add ReadHeaderTimeout to http.Server (gosec) - Remove unused types and functions (unused) - Add nolint comments for intentional patterns (InsecureSkipVerify, math/rand for jitter, dialect-only SQL formatting)	2026-05-23 22:02:06 -04:00
lerko	4b5495fb49	fix(monitor): add jitter to check intervals and stagger startup Monitors with the same interval no longer fire simultaneously. Each tick adds up to 10% random jitter. Initial checks stagger over 0-3s to avoid thundering herd on startup.	2026-05-23 20:05:30 -04:00
lerko	8e6d97710b	fix(alert): add context to Provider.Send, log alert failures Provider.Send now accepts context.Context for timeout/cancellation. HTTPProvider and NtfyProvider use NewRequestWithContext so HTTP alerts respect the 30s deadline. triggerAlert logs send failures and config load errors instead of silently swallowing them.	2026-05-23 13:18:04 -04:00
lerko	a89584dac1	fix(engine): skip children in maintenance when computing group status Group status now treats maintenance'd children like paused ones — they're excluded from the UP/DOWN calculation. Prevents group from showing DOWN when its only failing child is under maintenance.	2026-05-22 19:19:08 -04:00
lerko	b146f34d19	feat: add incident management and maintenance windows Maintenance windows suppress alerts during planned downtime while checks continue running. Incidents provide informational tracking. Supports targeting all monitors, single monitor, or group (applies to children). New Maint tab in TUI with create/end/delete. Status page, JSON API, and Prometheus metrics all reflect maintenance state.	2026-05-22 18:45:02 -04:00
lerko	ed082e4080	feat: persist logs to DB, load on startup	2026-05-16 15:25:08 -04:00
lerko	4d375cf874	fix: seed status and latency from DB history on startup	2026-05-16 15:05:28 -04:00
lerko	ca5a42314f	feat(cluster): add probe execution mode, check extraction, and result aggregation Phase 2 of distributed probing: - Extract check logic into standalone RunCheck() for use by probes - Add probe cluster mode: stateless nodes that fetch assignments, execute checks, and report results to the leader - Add multi-node result aggregation with configurable strategy (any-down, majority-down, all-down) - Leader ingests probe results into engine live state and triggers alerts - New env vars: UPKEEP_NODE_ID, UPKEEP_NODE_NAME, UPKEEP_NODE_REGION, UPKEEP_AGG_STRATEGY - Example docker-compose.probe.yml with leader + 2 regional probes	2026-05-16 11:19:57 -04:00
lerko	f023e38fdc	refactor(monitor): encapsulate engine state, add graceful shutdown and tests Replace all monitor package-level mutable state with Engine struct. All state (liveState, logStore, histories, tokenIndex, HTTP clients) is now encapsulated in Engine, created via NewEngine(store). Key changes: - Engine struct holds all monitor state with proper mutex protection - Engine.Start(ctx) and monitorRoutine respect context cancellation for graceful shutdown — no more leaked goroutines - cluster.runFollowerLoop also respects context for clean exit - Token index (map[string]int) for O(1) push heartbeat lookup, replacing O(n) linear scan through LiveState - UpdateSiteConfig preserves 8 runtime fields instead of copying 17 config fields individually - triggerAlert goroutines get 30s timeout context - All consumers (TUI, server, cluster, main) receive *Engine via constructor/parameter — no package-level state access - main.go creates context.WithCancel, passes to engine and cluster First test suite: 12 tests across store and alert packages - Store: CRUD for sites/alerts/users, push token generation, import/export round-trip, check history persistence - Alert: Discord/Slack/Webhook payload format, HTTP 4xx error propagation, Ntfy headers, unknown provider returns nil	2026-05-15 08:21:17 -04:00
lerko	a6bb9a7aff	refactor(core): remove store global singleton, thread store explicitly Remove store.Get()/SetGlobal()/Current. Store is now passed explicitly to all consumers via constructor parameters and function arguments. - TUI Model holds store field, set via InitialModel(isAdmin, store) - monitor.StartEngine(s) and InitHistoryFromStore(s) accept store - server.Start(cfg, s) closes over store in HTTP handlers - main.go threads store to SSH server, TUI, monitor, server - isKeyAllowed receives store as parameter No more hidden dependency on package-level mutable state in store pkg. Monitor package still uses package-level state (LiveState, etc.) — will be encapsulated into Engine struct in Phase 7.	2026-05-15 00:45:07 -04:00
lerko	d4f4012c8a	refactor(store): add error returns to all Store interface methods Every Store method now returns an error. Callers handle errors gracefully — TUI logs to event log, server returns HTTP 500, monitor engine logs and retries. All rows.Scan() errors are now checked in sqlstore.go instead of silently appending corrupt data. - GetSites, GetAllAlerts, GetAllUsers return ([]T, error) - GetAlert returns (AlertConfig, error) instead of (AlertConfig, bool) - AddSite, UpdateSite, DeleteSite, etc. all return error - SaveCheck, LoadAllHistory, ExportData return error - ~25 caller sites updated across tui, server, monitor, main	2026-05-15 00:37:20 -04:00
lerko	4d5116644f	fix(core): correctness and robustness fixes across all subsystems - Move status page template to package-level template.Must (panic on parse error at init instead of nil deref at runtime) - Fix XSS in import error responses (log detail server-side, return generic message to client) - Handle ListenAndServe errors in HTTP and SSH servers - Use defer resp.Body.Close() in all alert providers, check json.Marshal errors - Share HTTP clients across checks instead of creating per-request - Use http.NewRequestWithContext for per-site timeout control - Support HTTP method field (was always GET despite DB storing method) - Implement AcceptedCodes validation (was hardcoded >= 400 despite DB storing accepted code ranges) - Add defer tx.Rollback() to ImportData for transaction safety	2026-05-15 00:00:02 -04:00
lerko	c480f519c4	feat(tui): add monitor groups with collapse/expand and tree view Groups act as visual organizers in the sites table. Monitors can be assigned to a parent group via the form. Group rows show aggregated worst-child status, children render with tree chars (├/└), and Space toggles collapse/expand. Group form hides irrelevant connection and advanced sections.	2026-05-14 21:15:34 -04:00
lerko	d5ab3a18a4	feat(tui,status): add per-site pause, fix viewport, polish status page Per-site pause: [p] key toggles pause for selected monitor in TUI. Paused monitors skip checks, persist to DB, show on status page. Status page: replace full-page reload with fetch-based DOM updates to eliminate scroll-jump on refresh. Add summary bar (UP/DOWN/PAUSED counts), stale-data indicator, and fix SSL EXP CSS class bug. TUI: constrain tables to terminal width via lipgloss .Width() to prevent row wrapping that pushed header off-screen. Add MaxHeight safety net. Bump subtle style from #383838 to #565f89 for readability on dark terminals.	2026-05-14 18:46:17 -04:00
lerko	93fe372497	feat(monitor): add ping, port, and DNS check routines Implement checkPing (pro-bing ICMP), checkPort (TCP dial), and checkDNS (miekg/dns) with per-monitor timeout, configurable DNS record types, and fallback defaults. Groups skip checks entirely.	2026-05-14 17:22:57 -04:00
lerko	f06dd5702b	feat(models): widen Site struct and DB schema for ping, port, dns, group monitor types Add Hostname, Port, Timeout, Method, Description, ParentID, AcceptedCodes, DNSResolveType, DNSServer, and IgnoreTLS fields. Refactor AddSite/UpdateSite to accept models.Site instead of individual params. Includes DB migrations for existing databases, per-monitor timeout/TLS in the engine, new type options in TUI forms, and TYPE column in the sites table.	2026-05-14 17:10:56 -04:00
lerko	11848ce674	fix(security): harden TLS, timeouts, validation, logging, and token generation - Default TLS verification on, opt-in UPKEEP_INSECURE_SKIP_VERIFY - Alert webhooks use 10s timeout client, close response bodies - URL input validates http/https scheme for HTTP monitors - Stdlib logs route to stderr instead of discard - Panic on crypto/rand failure in token generation - Cluster startup warnings for non-HTTPS and missing secret - Replace demo SMTP creds with obvious placeholders - Color-coded log entries and scroll hints in logs tab	2026-05-14 15:28:04 -04:00
lerko	02f0a39d97	feat: initial commit — uptime monitor (forked from go-upkeep) Go-based uptime monitor with SQLite/Postgres storage, TUI dashboard, SSH server, alerting, and clustering support.	2026-05-14 11:05:10 -04:00
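
Commit 8b39d4c1a1 describes a single writer goroutine that drains a buffered channel, a non-blocking enqueue that drops when the queue is saturated, and an idempotent Stop that waits for the drain. The following is a minimal sketch of that pattern only, not the repository's code: the `writer` type, its fields, `newWriter`, and `enqueue` are hypothetical names, the in-memory `written` slice stands in for the real store, and a quit channel is used here where the commit says the engine cancels its context.

```go
package main

import (
	"fmt"
	"sync"
)

// dbWrite is a hypothetical stand-in for the typed write values in the commit.
type dbWrite struct {
	kind    string // e.g. "log", "check", "state-change", "alert-health"
	payload string
}

// writer funnels every write through one goroutine (illustrative type).
type writer struct {
	queue   chan dbWrite
	quit    chan struct{}
	done    chan struct{}
	once    sync.Once
	mu      sync.Mutex
	written []dbWrite // stands in for rows persisted to the store
	dropped int
}

func newWriter(capacity int) *writer {
	w := &writer{
		queue: make(chan dbWrite, capacity),
		quit:  make(chan struct{}),
		done:  make(chan struct{}),
	}
	go w.run()
	return w
}

func (w *writer) apply(op dbWrite) {
	w.mu.Lock()
	w.written = append(w.written, op)
	w.mu.Unlock()
}

// run applies writes one at a time; on quit it drains what is still buffered.
func (w *writer) run() {
	defer close(w.done)
	for {
		select {
		case op := <-w.queue:
			w.apply(op)
		case <-w.quit:
			for {
				select {
				case op := <-w.queue:
					w.apply(op)
				default:
					return
				}
			}
		}
	}
}

// enqueue never blocks the caller: a full queue counts a drop instead.
func (w *writer) enqueue(op dbWrite) bool {
	select {
	case w.queue <- op:
		return true
	default:
		w.mu.Lock()
		w.dropped++
		w.mu.Unlock()
		return false
	}
}

// Stop signals the writer once and waits for the drain; safe to call twice.
func (w *writer) Stop() {
	w.once.Do(func() { close(w.quit) })
	<-w.done
}

func main() {
	w := newWriter(8)
	for i := 0; i < 5; i++ {
		w.enqueue(dbWrite{kind: "check", payload: fmt.Sprintf("result %d", i)})
	}
	w.Stop()
	w.Stop() // idempotent
	fmt.Println("written:", len(w.written), "dropped:", w.dropped)
}
```

In this sketch the caller is expected to invoke Stop before closing the underlying store, which mirrors the ordering the commit describes for main.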

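Commit 5e7faf9ea7 introduces applyState(id, mutate), a read-modify-write helper that runs a mutator against the current live entry under the write lock. A rough sketch of the idea follows, under stated assumptions: the `site` and `engine` types are reduced to a few hypothetical fields, and the return values (previous status, new status, found flag) are an illustrative choice so that callers can log or alert after the lock is released.

```go
package main

import (
	"fmt"
	"sync"
)

// site is a trimmed, hypothetical version of the monitored-site record.
type site struct {
	Name   string
	Status string
	Paused bool
}

// engine holds live state behind a mutex (illustrative subset).
type engine struct {
	mu        sync.RWMutex
	liveState map[int]site
}

// applyState mutates the CURRENT entry under the write lock, so fields the
// mutator does not touch (such as Paused) keep whatever value is live now.
// It reports ok=false when the site was removed in the meantime.
func (e *engine) applyState(id int, mutate func(s *site)) (prev, next string, ok bool) {
	e.mu.Lock()
	defer e.mu.Unlock()
	cur, found := e.liveState[id]
	if !found {
		return "", "", false
	}
	prev = cur.Status
	mutate(&cur)
	e.liveState[id] = cur
	return prev, cur.Status, true
}

func main() {
	e := &engine{liveState: map[int]site{1: {Name: "api", Status: "UP"}}}

	// A user pauses the site while a check is still in flight.
	e.applyState(1, func(s *site) { s.Paused = true })

	// The check finishes and merges only its result into live state.
	prev, next, ok := e.applyState(1, func(s *site) { s.Status = "DOWN" })

	// Logging happens here, after the lock has been released.
	fmt.Println(prev, "->", next, "ok:", ok, "paused:", e.liveState[1].Paused)
}
```

Because the mutator sees the live entry rather than a snapshot taken before the network call, the pause in the example survives the later status update.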
38 Commits