uptop

Author	SHA1	Message	Date
lerko	4328d25f22	fix(security): API import no longer replaces user accounts Cluster-secret holder could POST a backup with their own admin key to /api/backup/import, replacing all users — privilege escalation from cluster-auth to admin. Also, Kuma imports produced zero users but ImportWipe unconditionally deleted the users table — locking out all accounts until restart reseeded UPTOP_ADMIN_KEY. - Server handlers strip data.Users (set nil) before calling ImportData - ImportData only wipes+replaces users when data.Users != nil - New ImportWipeUsers dialect method separates user wipe from data wipe - CLI restore (main.go) unchanged — full import still replaces users	2026-06-12 12:45:16 -04:00
lerko	c3eac80e14	fix(store): chmod SQLite DB files to 0600 on open Bare-metal installs created the DB with process umask (often 022), making uptop.db, -wal, and -shm world-readable. These files contain alert credentials and config. Now chmod 0600 after open. Missing WAL/SHM siblings (not yet created) are silently skipped. Docker installs were already mitigated by the non-root UID.	2026-06-12 09:51:11 -04:00
lerko	341d60d2fe	refactor: unify logging with log/slog Replace three uncoordinated logging systems (log.Printf, fmt.Fprintf to stderr, fmt.Println warnings) with structured slog calls. 68 log calls migrated: - log.Printf → slog.Error/Warn/Info (45 calls across 5 files) - fmt.Fprintf(os.Stderr) → slog.Error (23 calls in main.go) Kept unchanged: - fmt.Println/Printf for CLI user output (version, banners, import results) - engine.AddLog for TUI-visible ring buffer (monitoring events) Store migration diagnostics demoted to slog.Debug (silent at default info level). HTTP request logging now structured with method/path/status/duration/ip attributes.	2026-06-11 18:00:19 -04:00
lerko	0974ab2b4c	refactor(store): schema_version migration table + DeleteAlert FK fix Replace the error-string-matching migration runner with a proper schema_version table. Migrations are now numbered and recorded; only unapplied versions run. Fresh databases seed at baseline version (CREATE TABLE already includes all columns). CREATE TABLE statements updated to include regions (sites) and node_id (check_history) — previously only added via ALTER. DeleteAlert now nulls sites.alert_id before deleting, preventing dangling references that caused every incident to hit the error path instead of alerting.	2026-06-11 16:02:17 -04:00
lerko	5d5153351e	fix(engine): six correctness fixes for the state machine 1. Group auto-pause trap: remove the one-way Paused=true mutation from checkGroup — monitorRoutine skipped paused groups, so they could never re-evaluate or auto-unpause. 2. Retry logic: apply MaxRetries to all →DOWN transitions, not just UP→DOWN. New monitors (PENDING) no longer alert on first transient failure when retries are configured. 3. Shutdown drain hole: track checker goroutines with checkerWG so Stop() waits for in-flight checks before draining the write queue. Final drainWrites() catches any writes enqueued after the writer's own drain. 4. Probe-ingest writer bypass: route SaveCheckFromNode through the engine's serialized dbWriter instead of writing directly to the store from the HTTP handler. 5. Dead-probe expiry: expire stale probe results (>3× site interval) before aggregation so a dead probe can't poison status forever. Also clean probeResults in RemoveSite. 6. Maintenance-cache N+1: replace per-check DB query with a fully-resolved in-memory cache refreshed every poll cycle. One GetActiveMaintenanceWindows() call instead of N IsMonitorInMaintenance. ImportData now wipes check_history, state_changes, and alert_health so re-inserted IDs don't inherit stale history from prior occupants.	2026-06-11 13:57:03 -04:00
lerko	61c28fac62	refactor(store): swap mattn/go-sqlite3 for modernc.org/sqlite Pure-Go SQLite driver — no CGO, no C compiler. Enables: - static binaries (verified: `file` shows statically linked) - cross-compile to linux/arm64, darwin/amd64+arm64, windows/amd64+arm64 - goreleaser now builds 6 OS/arch targets + windows .zip - Dockerfile drops gcc/musl-dev, sets CGO_ENABLED=0 - release-binaries drops gcc/musl-dev Driver name changes sqlite3 → sqlite, DSN pragmas use _pragma=name(value) format. All tests pass CGO=0 and CGO=1 -race. Homebrew cask block removed (was skip_upload:true dead config).	2026-06-11 13:10:05 -04:00
lerko	8b39d4c1a1	fix(monitor): serialize DB writes through a single drained writer Every check spawned `go e.db.Save*(...)` with the error discarded: a fire-and-forget goroutine per log line, check, state change, and alert health update. SaveLog ran a full-table prune DELETE on every insert and SaveCheck a COUNT + conditional prune on every check, so the hot path amplified each write into several statements. Nothing tracked these goroutines, so at shutdown they raced the store's Close() — writes to a closing DB, silently swallowed. Introduce a single writer goroutine that drains a buffered channel of typed dbWrite values (log/check/state-change/alert-health). Writes are enqueued non-blocking; a saturated queue drops and notes it in the in-memory log rather than blocking the check loop. Write errors are now logged instead of discarded. Retention moves off the hot path: SaveLog and SaveCheck become plain INSERTs, and PruneLogs/PruneCheckHistory/PruneStateChanges run on a 10-minute timer inside the writer (single keep-newest-N-per-site pass via a window function). state_changes was previously never pruned — now bounded. Add Engine.Stop(): cancels the engine's context, then waits for the writer to drain every buffered write before returning. main wires it in before the deferred store Close() so no write races a closed DB. SQLite gains busy_timeout=5000 and synchronous=NORMAL, applied via the DSN so every pooled connection inherits them (a post-open PRAGMA only touches one connection); WAL moves to the DSN too. :memory: test DBs are left as-is. Tests: writer drains on Stop, Stop is idempotent, and the prune queries keep newest-N per site / N logs on real SQLite. Full suite green under -race.	2026-06-10 18:14:28 -04:00
lerko	026e969b74	chore: TUI screenshots, README polish, changelog rewrite (#32) - Add 6 TUI screenshots to assets/ (monitors, alerts, logs, nodes, detail, theme) - Rewrite README with hero image, badges, collapsible install sections - Rewrite changelog to match actual CalVer tag history - VHS tooling extracted to lerko/uptop-vhs Reviewed-on: lerko/uptop#32	2026-05-29 17:45:31 +00:00
lerko	bc3a44beac	feat: show error reason when monitors go DOWN Propagate check failure reasons through the entire stack: - Checker captures specific errors (DNS, timeout, HTTP status, SSL, etc.) - Engine tracks LastError, StatusChangedAt, LastSuccessAt per monitor - State transitions persisted to new state_changes table - Detail panel shows error reason, HTTP code, state duration, last success time, and last 5 state change events - Monitor table shows inline error preview for DOWN services - Alert messages include error reason - Probe nodes forward error reasons to leader 15 files changed across models, checker, engine, store, TUI, and probes.	2026-05-27 19:32:30 -04:00
lerko	bd561d9a5e	fix(security): phase 3 medium reliability and hardening - Fail hard on critical migration errors (ignore only "already exists") - Cache SSH user keys with 30s TTL (avoid DB query per auth attempt) - Configure DB connection pooling (25 open, 5 idle, 5m lifetime) - Enable SQLite WAL mode for concurrent read/write - Optimize check history pruning (only prune above 1100 rows) - Add security headers: X-Content-Type-Options, X-Frame-Options, CSP, Referrer-Policy - Add CORS policy on /status/json via UPTOP_CORS_ORIGIN env var - Add HTTP request logging middleware (method, path, status, duration, IP) - Fix config file permissions from 0644 to 0600 - Pin Docker images: golang:1.24-alpine3.21, alpine:3.21 - Fix Docker CI tag pattern for CalVer (was semver) - Pass build args (VERSION, COMMIT, BUILD_DATE) to Docker build	2026-05-26 16:57:03 -04:00
lerko	359cff7292	chore: add golangci-lint config and fix all lint issues Add .golangci.yml enabling errcheck, staticcheck, govet, gosec, ineffassign, and unused linters. Fix 66 issues across 16 files: - Check all unchecked errors (errcheck) - Use HTTP status constants instead of numeric literals (staticcheck) - Replace deprecated LineUp/LineDown with ScrollUp/ScrollDown (staticcheck) - Convert sprintf+write patterns to fmt.Fprintf (staticcheck) - Add ReadHeaderTimeout to http.Server (gosec) - Remove unused types and functions (unused) - Add nolint comments for intentional patterns (InsecureSkipVerify, math/rand for jitter, dialect-only SQL formatting)	2026-05-23 22:02:06 -04:00
lerko	e84b64f8ed	feat(tui): zebra striping, detail breadcrumb, sparkline stats, collapse persistence Add alternating row backgrounds for easier table scanning. Detail panel now shows breadcrumb path (Sites > Group > Name) and min/avg/max latency stats below the sparkline. Group collapse state persists across restarts via new preferences table in both SQLite and Postgres.	2026-05-22 20:53:23 -04:00
lerko	b146f34d19	feat: add incident management and maintenance windows Maintenance windows suppress alerts during planned downtime while checks continue running. Incidents provide informational tracking. Supports targeting all monitors, single monitor, or group (applies to children). New Maint tab in TUI with create/end/delete. Status page, JSON API, and Prometheus metrics all reflect maintenance state.	2026-05-22 18:45:02 -04:00
lerko	ed082e4080	feat: persist logs to DB, load on startup	2026-05-16 15:25:08 -04:00
lerko	0396acdc59	feat(cluster): add region affinity, Nodes TUI tab, and probe metrics Phase 3 of distributed probing: - Add regions column to sites table for per-monitor probe affinity - Region-filtered probe assignments (empty regions = all probes) - New Nodes TUI tab showing connected probes with status/region/last-seen - Regions input field in site form for configuring probe affinity - Config-as-code support for regions (export/import/diff) - Prometheus upkeep_probe_up metric with per-node labels - Reindex TUI tabs: Sites, Alerts, Logs, Nodes, Users	2026-05-16 11:50:16 -04:00
lerko	ca9faa0acd	feat(cluster): add distributed probing foundation — schema, models, and probe APIs Add node-aware check history and probe registration infrastructure: - ProbeNode model and nodes table (SQLite + Postgres) - node_id column on check_history for multi-source tracking - Store interface: RegisterNode, GetNode, GetAllNodes, DeleteNode, SaveCheckFromNode - Dialect: UpsertNodeSQL (INSERT OR REPLACE / ON CONFLICT) - API endpoints: POST /api/probe/register, GET /api/probe/assignments, POST /api/probe/results - Backward compatible: existing SaveCheck wraps SaveCheckFromNode with empty node_id	2026-05-16 11:05:06 -04:00
lerko	ab75f61c6b	refactor(store): unify SQLite and Postgres into dialect-based SQLStore Extract shared SQLStore with Dialect interface for the ~5% that differs between backends (DDL, placeholders, sequence resets). - New dialect.go: Dialect interface + placeholder rewriter (? → $N) - New sqlstore.go: single implementation of all 19 Store methods - sqlite.go: reduced from 286 to 83 lines (SQLiteDialect only) - postgres.go: reduced from 266 to 78 lines (PostgresDialect only) - main.go: use NewSQLiteStore/NewPostgresStore constructors Zero CRUD logic duplication. Every future schema change written once.	2026-05-15 00:31:44 -04:00
lerko	4d5116644f	fix(core): correctness and robustness fixes across all subsystems - Move status page template to package-level template.Must (panic on parse error at init instead of nil deref at runtime) - Fix XSS in import error responses (log detail server-side, return generic message to client) - Handle ListenAndServe errors in HTTP and SSH servers - Use defer resp.Body.Close() in all alert providers, check json.Marshal errors - Share HTTP clients across checks instead of creating per-request - Use http.NewRequestWithContext for per-site timeout control - Support HTTP method field (was always GET despite DB storing method) - Implement AcceptedCodes validation (was hardcoded >= 400 despite DB storing accepted code ranges) - Add defer tx.Rollback() to ImportData for transaction safety	2026-05-15 00:00:02 -04:00
lerko	e97780ad38	fix(tui,status,store): add delete confirm, input validation, XSS fix, history persistence Prevent accidental deletes with y/n confirmation dialog. Validate all numeric form inputs (interval, port, timeout, threshold, retries) with range checks instead of silently defaulting to zero. Escape user-supplied data in status page JavaScript to close XSS via monitor names. Persist check history to new check_history table so sparklines and uptime percentages survive restarts.	2026-05-14 20:51:06 -04:00
lerko	d5ab3a18a4	feat(tui,status): add per-site pause, fix viewport, polish status page Per-site pause: [p] key toggles pause for selected monitor in TUI. Paused monitors skip checks, persist to DB, show on status page. Status page: replace full-page reload with fetch-based DOM updates to eliminate scroll-jump on refresh. Add summary bar (UP/DOWN/PAUSED counts), stale-data indicator, and fix SSL EXP CSS class bug. TUI: constrain tables to terminal width via lipgloss .Width() to prevent row wrapping that pushed header off-screen. Add MaxHeight safety net. Bump subtle style from #383838 to #565f89 for readability on dark terminals.	2026-05-14 18:46:17 -04:00
lerko	f06dd5702b	feat(models): widen Site struct and DB schema for ping, port, dns, group monitor types Add Hostname, Port, Timeout, Method, Description, ParentID, AcceptedCodes, DNSResolveType, DNSServer, and IgnoreTLS fields. Refactor AddSite/UpdateSite to accept models.Site instead of individual params. Includes DB migrations for existing databases, per-monitor timeout/TLS in the engine, new type options in TUI forms, and TYPE column in the sites table.	2026-05-14 17:10:56 -04:00
lerko	b7592ee9e5	feat(tui): upgrade users tab with lipgloss table, edit support, role select Users tab now matches sites/alerts quality: lipgloss bordered table, click-to-select zones, edit form with role picker, and UpdateUser support across both store backends.	2026-05-14 15:28:04 -04:00
lerko	02f0a39d97	feat: initial commit — uptime monitor (forked from go-upkeep) Go-based uptime monitor with SQLite/Postgres storage, TUI dashboard, SSH server, alerting, and clustering support.	2026-05-14 11:05:10 -04:00

23 Commits
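
Note on 8b39d4c1a1: the single-writer pattern described in that commit message can be sketched roughly as below. This is a minimal illustration and not the repository's code. The names `writer`, `newWriter`, `enqueue`, and `stop`, and the fields of `dbWrite`, are hypothetical; an in-memory slice stands in for the store, and a `quit` channel stands in for the context cancellation that the commit says Engine.Stop() uses.

```go
package main

import (
	"fmt"
	"sync"
)

// dbWrite is a hypothetical stand-in for the typed write values
// (log/check/state-change/alert-health) named in the commit message.
type dbWrite struct {
	kind    string
	payload string
}

// writer owns the only goroutine that is allowed to touch the "store".
type writer struct {
	queue chan dbWrite  // buffered queue of pending writes
	quit  chan struct{} // closed by stop() to request shutdown
	done  chan struct{} // closed by run() once the queue is drained
	once  sync.Once     // makes stop() idempotent

	mu      sync.Mutex
	applied []dbWrite // assumption: stands in for real INSERTs
	dropped int       // writes rejected because the queue was full
}

func newWriter(capacity int) *writer {
	w := &writer{
		queue: make(chan dbWrite, capacity),
		quit:  make(chan struct{}),
		done:  make(chan struct{}),
	}
	go w.run()
	return w
}

func (w *writer) apply(op dbWrite) {
	w.mu.Lock()
	w.applied = append(w.applied, op)
	w.mu.Unlock()
}

// run applies writes one at a time; on shutdown it drains whatever is
// still buffered before signalling done.
func (w *writer) run() {
	defer close(w.done)
	for {
		select {
		case op := <-w.queue:
			w.apply(op)
		case <-w.quit:
			for {
				select {
				case op := <-w.queue:
					w.apply(op)
				default:
					return
				}
			}
		}
	}
}

// enqueue never blocks the caller: a saturated queue drops the write
// and counts it instead of stalling the check loop.
func (w *writer) enqueue(op dbWrite) bool {
	select {
	case w.queue <- op:
		return true
	default:
		w.mu.Lock()
		w.dropped++
		w.mu.Unlock()
		return false
	}
}

// stop requests shutdown and waits for the drain; safe to call twice.
func (w *writer) stop() {
	w.once.Do(func() { close(w.quit) })
	<-w.done
}

func main() {
	w := newWriter(8)
	for i := 0; i < 3; i++ {
		w.enqueue(dbWrite{kind: "check", payload: fmt.Sprintf("site-%d", i)})
	}
	w.stop() // returns only after all three buffered writes are applied
	fmt.Printf("applied=%d dropped=%d\n", len(w.applied), w.dropped)
}
```

In this sketch a caller would invoke `stop()` before closing the underlying database, mirroring the ordering the commit describes (Engine.Stop() before the deferred store Close()). Writes enqueued after `stop()` returns are never applied here; the later commit 5d5153351e describes how the real engine closes that gap by waiting for in-flight checkers first.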