uptop

Author	SHA1	Message	Date
lerko	f7da69f25f	fix(security): SSRF guard gaps + DNS port restriction + metrics auth 1. SSRF guard now blocks 0.0.0.0/8 (routes to localhost on Linux) and 100.64.0.0/10 (CGNAT). Also rejects unspecified, multicast, and loopback IPs via net.IP methods for defense in depth. 2. DNS monitor type no longer bypasses SSRF guard. The DNSServer address is resolved and validated against isPrivateIP before use. Port restricted to 53 — prevents arbitrary internal port probing via crafted DNSServer values. 3. /metrics now default-deny when MetricsPublic is false, regardless of whether UPTOP_CLUSTER_SECRET is set. Previously, no secret = no auth check = metrics exposed to everyone.	2026-06-11 18:57:37 -04:00
lerko	5d2b7a3e66	fix: seven quick-win bug fixes across engine, server, TUI, CLI 1. Alertless monitors no longer spam error logs — triggerAlert returns early when alertID <= 0. 2. HTTP response body drained before close — enables connection reuse via keep-alive instead of fresh TCP+TLS per check. 3. /api/backup/export enforces GET — was the only endpoint accepting any HTTP method. 4. limitStr guards against max < 3 — prevents negative slice index panic on very narrow terminals. 5. Filter input accepts multibyte characters — len(msg.Runes) instead of len(msg.String()) for proper Unicode support. 6. Startup warning corrected — with no UPTOP_CLUSTER_SECRET, endpoints reject (401), not accept. Warning now says so. 7. UPTOP_KEYS file open failure logged — was silently swallowed, leaving operators with no admin seeded and no message.	2026-06-11 18:28:32 -04:00
lerko	341d60d2fe	refactor: unify logging with log/slog Replace three uncoordinated logging systems (log.Printf, fmt.Fprintf to stderr, fmt.Println warnings) with structured slog calls. 68 log calls migrated: - log.Printf → slog.Error/Warn/Info (45 calls across 5 files) - fmt.Fprintf(os.Stderr) → slog.Error (23 calls in main.go) Kept unchanged: - fmt.Println/Printf for CLI user output (version, banners, import results) - engine.AddLog for TUI-visible ring buffer (monitoring events) Store migration diagnostics demoted to slog.Debug (silent at default info level). HTTP request logging now structured with method/path/status/duration/ip attributes.	2026-06-11 18:00:19 -04:00
lerko	ba4465daa2	refactor(server): extract Server type with named handler methods Replace the 328-line Start() god function with a Server struct + 11 named handler methods. Routes registered in routes(), middleware applied in one place. Start() kept as a convenience wrapper (NewServer + Start) so existing callers don't need to change unless they want the Server reference. Each handler is now independently readable and testable without parsing a 300-line closure nest.	2026-06-11 16:32:38 -04:00
lerko	f00acbc280	refactor(models): typed Status constants with IsBroken() predicate Replace ~150 bare status string comparisons with typed models.Status constants (StatusUp, StatusDown, StatusPending, StatusLate, StatusStale, StatusSSLExp). Single IsBroken() method replaces the duplicated isBroken lambda in monitor.go and isDown function in sla.go. Adding a new status value (e.g. DEGRADED) now requires one constant definition instead of grep-and-pray across 16 files. CheckResult.Status stays string — the checker is the boundary between raw protocol results and typed status. Cast happens at the edge in handleStatusChange.	2026-06-11 15:56:51 -04:00
lerko	70a83a1da9	refactor(store): propagate context.Context through all Store methods Every Store interface method (except Close) now takes context.Context as first parameter. All 54 db.Query/Exec/QueryRow calls in SQLStore replaced with their *Context variants. DB operations now respect cancellation and deadlines. Context sources by caller: - Engine dbWriter/poll/pruner: engine ctx from Start() - HTTP handlers: r.Context() - config.Apply/Export: caller-provided ctx - TUI/main.go init: context.Background() RunCheck and all sub-checks (HTTP/ping/port/DNS) accept parent ctx. HTTP checks now inherit shutdown cancellation instead of rooting in context.Background(). dbWrite.exec takes ctx so the writer goroutine can cancel stuck DB operations. DeleteSite/ImportData use BeginTx(ctx) instead of Begin().	2026-06-11 14:40:30 -04:00
lerko	5d5153351e	fix(engine): six correctness fixes for the state machine 1. Group auto-pause trap: remove the one-way Paused=true mutation from checkGroup — monitorRoutine skipped paused groups, so they could never re-evaluate or auto-unpause. 2. Retry logic: apply MaxRetries to all →DOWN transitions, not just UP→DOWN. New monitors (PENDING) no longer alert on first transient failure when retries are configured. 3. Shutdown drain hole: track checker goroutines with checkerWG so Stop() waits for in-flight checks before draining the write queue. Final drainWrites() catches any writes enqueued after the writer's own drain. 4. Probe-ingest writer bypass: route SaveCheckFromNode through the engine's serialized dbWriter instead of writing directly to the store from the HTTP handler. 5. Dead-probe expiry: expire stale probe results (>3× site interval) before aggregation so a dead probe can't poison status forever. Also clean probeResults in RemoveSite. 6. Maintenance-cache N+1: replace per-check DB query with a fully-resolved in-memory cache refreshed every poll cycle. One GetActiveMaintenanceWindows() call instead of N IsMonitorInMaintenance. ImportData now wipes check_history, state_changes, and alert_health so re-inserted IDs don't inherit stale history from prior occupants.	2026-06-11 13:57:03 -04:00
lerko	0b64d13bb6	fix(security): serve /status/json through a public DTO The handler serialized raw models.Site — LastError internals, Hostname, Port, DNSServer, AlertID, intervals all public, and every future Site field public the day it's added. statusSite now exposes exactly what the status page renders: Name, Type, URL, Status, Paused, LastCheck, Latency. Replaces the vacuous TestStatusJSON_TokensStripped, which injected via UpdateSiteConfig (a no-op for unknown IDs) and asserted over zero sites. The new test seeds the store, starts the engine, waits for live state, and asserts internal fields are absent from the raw JSON.	2026-06-11 12:26:40 -04:00
lerko	809620340e	fix(security): close XFF bypass and three secret-leak paths Four fixes hardening the secrets and rate-limit posture a prior audit left or that regressed: X-Forwarded-For rate-limit bypass + memory DoS (ratelimit.go): clientIP returned the raw XFF header, so an attacker rotating it minted unlimited distinct limiter keys — never tripping the limit and growing the visitors map without bound. XFF is now honored only when the immediate peer is a configured trusted proxy (UPTOP_TRUSTED_PROXIES, CIDRs or bare IPs), using the right-most non-trusted hop; otherwise the key is the real RemoteAddr. The visitors map is bounded with LRU eviction as defense in depth. Export redaction denylist -> per-provider allowlist (server.go): the old six-key denylist missed the actual credentials — the webhook URL for discord/slack/webhook/ntfy/gotify and api_key for opsgenie — exporting them in the clear. redactByProvider keeps only known-safe keys per provider type and redacts everything else, so unknown/new keys fail safe. ImportData plaintext secrets (sqlstore.go): import inserted raw json.Marshal(settings), bypassing the encryption AddAlert/UpdateAlert use. It now routes through marshalSettings, so a restore with UPTOP_ENCRYPTION_KEY set stores enc:-prefixed ciphertext, not plaintext. Alert error credential leak (alert.go): provider Send returned the raw *url.Error, whose URL carries the secret (Telegram bot token in the path, webhook secrets in the URL); it was persisted to AlertHealth.LastError and shown in the TUI. sanitizeError strips the URL, keeping the operation and underlying cause. Tests cover trusted/untrusted XFF + spoofed-bypass + map bound, the allowlist per provider, encrypted-on-import round-trip, and URL-stripped errors. README documents UPTOP_TRUSTED_PROXIES.
Full suite green under -race; golangci-lint clean.	2026-06-10 18:50:19 -04:00
lerko	8f17deba67	chore: migrate module path to lerkolabs org Move Go module from gitea.lerkolabs.com/lerko/uptop to gitea.lerkolabs.com/lerkolabs/uptop. Updates all imports, go.mod, goreleaser owner, and README links.	2026-05-29 14:22:49 -04:00
lerko	5dc31108f8	feat: proper push monitor lifecycle — PENDING, LATE, DOWN states Push monitors no longer lie about status: - PENDING stays until first heartbeat (no auto-promote to UP) - LATE state (amber) when overdue but within grace period - DOWN only after grace period expires - Grace period = interval/2, minimum 60s RecordHeartbeat now handles all transitions: - PENDING → UP (first heartbeat, logged) - LATE → UP (late arrival, logged) - DOWN → UP (recovery, alert + state change persisted) TUI updates: - LATE rendered in amber/warning color - Status bar shows LATE count separately - Tab badge shows ⚠ for late monitors - Sort order: DOWN > LATE > UP > PENDING > PAUSED - Detail panel shows error for LATE monitors Inspired by Healthchecks.io state machine (new/up/grace/down).	2026-05-27 19:56:50 -04:00
lerko	bc3a44beac	feat: show error reason when monitors go DOWN Propagate check failure reasons through the entire stack: - Checker captures specific errors (DNS, timeout, HTTP status, SSL, etc.) - Engine tracks LastError, StatusChangedAt, LastSuccessAt per monitor - State transitions persisted to new state_changes table - Detail panel shows error reason, HTTP code, state duration, last success time, and last 5 state change events - Monitor table shows inline error preview for DOWN services - Alert messages include error reason - Probe nodes forward error reasons to leader 15 files changed across models, checker, engine, store, TUI, and probes.	2026-05-27 19:32:30 -04:00
lerko	986f9f1d55	fix(security): phase 4 code quality and low-severity fixes - Fix limitStr to handle multi-byte UTF-8 characters correctly - Sanitize log messages: strip ANSI escape sequences and newlines - URL-encode probe node_id instead of string concatenation - Fix follower resp.Body leak on non-200 responses - Make SSH host key path configurable via UPTOP_SSH_HOST_KEY env var - Add HTTP method checks on GET-only endpoints (405 for wrong methods) - Extract magic numbers into named constants across monitor/store/server - Standardize error output to stderr for all startup errors	2026-05-26 17:25:47 -04:00
lerko	bd561d9a5e	fix(security): phase 3 medium reliability and hardening - Fail hard on critical migration errors (ignore only "already exists") - Cache SSH user keys with 30s TTL (avoid DB query per auth attempt) - Configure DB connection pooling (25 open, 5 idle, 5m lifetime) - Enable SQLite WAL mode for concurrent read/write - Optimize check history pruning (only prune above 1100 rows) - Add security headers: X-Content-Type-Options, X-Frame-Options, CSP, Referrer-Policy - Add CORS policy on /status/json via UPTOP_CORS_ORIGIN env var - Add HTTP request logging middleware (method, path, status, duration, IP) - Fix config file permissions from 0644 to 0600 - Pin Docker images: golang:1.24-alpine3.21, alpine:3.21 - Fix Docker CI tag pattern for CalVer (was semver) - Pass build args (VERSION, COMMIT, BUILD_DATE) to Docker build	2026-05-26 16:57:03 -04:00
lerko	d30d1460bd	fix(security): phase 2 high-severity hardening - Push heartbeat accepts Authorization: Bearer header (query string deprecated) - Gotify alerts use X-Gotify-Key header instead of token in URL - Per-IP rate limiting on all API endpoints (token-bucket) - /metrics gated behind cluster secret (UPTOP_METRICS_PUBLIC=true to opt out) - Config export redacts passwords/tokens by default (redact_secrets=false to override) - Fix rewritePlaceholders for 100+ SQL parameters - Fix AddSiteReturningID/AddAlertReturningID race with LastInsertId/RETURNING - HTTP server timeouts: read 30s, write 60s, idle 120s	2026-05-25 21:15:33 -04:00
lerko	60b30935b3	fix(security): phase 1 critical fixes for public release - Redact PostgreSQL DSN password from stdout/logs - Harden .dockerignore to exclude .ssh/, .claude/, .db, .local files - SSRF protection: block private/loopback/link-local IPs by default (UPTOP_ALLOW_PRIVATE_TARGETS=true to override for homelab use) - Fix email header injection via CRLF in monitor names - AES-256-GCM encryption for alert credentials at rest (UPTOP_ENCRYPTION_KEY env var, migrate-secrets subcommand) - TLS support for HTTP server (UPTOP_TLS_CERT/UPTOP_TLS_KEY) with HSTS header when TLS enabled	2026-05-25 11:26:47 -04:00
lerko	9d12e3ecf1	chore: complete rename from go-upkeep to uptop - Module path: gitea.lerkolabs.com/lerko/uptop - Binary: cmd/uptop/ - All imports updated to full module path - Env vars: UPKEEP_* → UPTOP_* - Prometheus metrics: upkeep_* → uptop_* - Default DB: uptop.db - Docker image: lerko/uptop - All docs, compose files, CI updated Only remaining "go-upkeep" reference is the fork attribution in README.	2026-05-24 20:20:35 -04:00
lerko	359cff7292	chore: add golangci-lint config and fix all lint issues Add .golangci.yml enabling errcheck, staticcheck, govet, gosec, ineffassign, and unused linters. Fix 66 issues across 16 files: - Check all unchecked errors (errcheck) - Use HTTP status constants instead of numeric literals (staticcheck) - Replace deprecated LineUp/LineDown with ScrollUp/ScrollDown (staticcheck) - Convert sprintf+write patterns to fmt.Fprintf (staticcheck) - Add ReadHeaderTimeout to http.Server (gosec) - Remove unused types and functions (unused) - Add nolint comments for intentional patterns (InsecureSkipVerify, math/rand for jitter, dialect-only SQL formatting)	2026-05-23 22:02:06 -04:00
lerko	93c5b638cf	fix(server): constant-time secret comparison, request size limits Replace string equality checks on cluster secret with crypto/subtle.ConstantTimeCompare to prevent timing attacks. Add http.MaxBytesReader (1MB) to all POST endpoints that decode JSON bodies. Change Start() to return *http.Server for graceful shutdown support. Replace log.Fatalf with log.Printf in HTTP server goroutine.	2026-05-23 13:20:28 -04:00
lerko	b146f34d19	feat: add incident management and maintenance windows Maintenance windows suppress alerts during planned downtime while checks continue running. Incidents provide informational tracking. Supports targeting all monitors, single monitor, or group (applies to children). New Maint tab in TUI with create/end/delete. Status page, JSON API, and Prometheus metrics all reflect maintenance state.	2026-05-22 18:45:02 -04:00
lerko	025b1b61d0	fix(security): strip push tokens from /status/json response The public status JSON endpoint was serializing full Site structs including heartbeat tokens. An attacker could extract tokens and forge heartbeats to suppress DOWN alerts. Now tokens are stripped before encoding. Backup/export endpoint is unaffected.	2026-05-16 15:45:09 -04:00
lerko	0396acdc59	feat(cluster): add region affinity, Nodes TUI tab, and probe metrics Phase 3 of distributed probing: - Add regions column to sites table for per-monitor probe affinity - Region-filtered probe assignments (empty regions = all probes) - New Nodes TUI tab showing connected probes with status/region/last-seen - Regions input field in site form for configuring probe affinity - Config-as-code support for regions (export/import/diff) - Prometheus upkeep_probe_up metric with per-node labels - Reindex TUI tabs: Sites, Alerts, Logs, Nodes, Users	2026-05-16 11:50:16 -04:00
lerko	ca5a42314f	feat(cluster): add probe execution mode, check extraction, and result aggregation Phase 2 of distributed probing: - Extract check logic into standalone RunCheck() for use by probes - Add probe cluster mode: stateless nodes that fetch assignments, execute checks, and report results to the leader - Add multi-node result aggregation with configurable strategy (any-down, majority-down, all-down) - Leader ingests probe results into engine live state and triggers alerts - New env vars: UPKEEP_NODE_ID, UPKEEP_NODE_NAME, UPKEEP_NODE_REGION, UPKEEP_AGG_STRATEGY - Example docker-compose.probe.yml with leader + 2 regional probes	2026-05-16 11:19:57 -04:00
lerko	ca9faa0acd	feat(cluster): add distributed probing foundation — schema, models, and probe APIs Add node-aware check history and probe registration infrastructure: - ProbeNode model and nodes table (SQLite + Postgres) - node_id column on check_history for multi-source tracking - Store interface: RegisterNode, GetNode, GetAllNodes, DeleteNode, SaveCheckFromNode - Dialect: UpsertNodeSQL (INSERT OR REPLACE / ON CONFLICT) - API endpoints: POST /api/probe/register, GET /api/probe/assignments, POST /api/probe/results - Backward compatible: existing SaveCheck wraps SaveCheckFromNode with empty node_id	2026-05-16 11:05:06 -04:00
lerko	b7b8aa6f03	feat(metrics): add Prometheus /metrics endpoint Zero-dependency Prometheus text exposition format. Exposes monitor up/down, latency, status code, check timestamps, pause state, SSL cert expiry, and check counters — all from in-memory state.	2026-05-15 11:26:21 -04:00
lerko	f023e38fdc	refactor(monitor): encapsulate engine state, add graceful shutdown and tests Replace all monitor package-level mutable state with Engine struct. All state (liveState, logStore, histories, tokenIndex, HTTP clients) is now encapsulated in Engine, created via NewEngine(store). Key changes: - Engine struct holds all monitor state with proper mutex protection - Engine.Start(ctx) and monitorRoutine respect context cancellation for graceful shutdown — no more leaked goroutines - cluster.runFollowerLoop also respects context for clean exit - Token index (map[string]int) for O(1) push heartbeat lookup, replacing O(n) linear scan through LiveState - UpdateSiteConfig preserves 8 runtime fields instead of copying 17 config fields individually - triggerAlert goroutines get 30s timeout context - All consumers (TUI, server, cluster, main) receive *Engine via constructor/parameter — no package-level state access - main.go creates context.WithCancel, passes to engine and cluster First test suite: 12 tests across store and alert packages - Store: CRUD for sites/alerts/users, push token generation, import/export round-trip, check history persistence - Alert: Discord/Slack/Webhook payload format, HTTP 4xx error propagation, Ntfy headers, unknown provider returns nil	2026-05-15 08:21:17 -04:00
lerko	a6bb9a7aff	refactor(core): remove store global singleton, thread store explicitly Remove store.Get()/SetGlobal()/Current. Store is now passed explicitly to all consumers via constructor parameters and function arguments. - TUI Model holds store field, set via InitialModel(isAdmin, store) - monitor.StartEngine(s) and InitHistoryFromStore(s) accept store - server.Start(cfg, s) closes over store in HTTP handlers - main.go threads store to SSH server, TUI, monitor, server - isKeyAllowed receives store as parameter No more hidden dependency on package-level mutable state in store pkg. Monitor package still uses package-level state (LiveState, etc.) — will be encapsulated into Engine struct in Phase 7.	2026-05-15 00:45:07 -04:00
lerko	d4f4012c8a	refactor(store): add error returns to all Store interface methods Every Store method now returns an error. Callers handle errors gracefully — TUI logs to event log, server returns HTTP 500, monitor engine logs and retries. All rows.Scan() errors are now checked in sqlstore.go instead of silently appending corrupt data. - GetSites, GetAllAlerts, GetAllUsers return ([]T, error) - GetAlert returns (AlertConfig, error) instead of (AlertConfig, bool) - AddSite, UpdateSite, DeleteSite, etc. all return error - SaveCheck, LoadAllHistory, ExportData return error - ~25 caller sites updated across tui, server, monitor, main	2026-05-15 00:37:20 -04:00
lerko	4d5116644f	fix(core): correctness and robustness fixes across all subsystems - Move status page template to package-level template.Must (panic on parse error at init instead of nil deref at runtime) - Fix XSS in import error responses (log detail server-side, return generic message to client) - Handle ListenAndServe errors in HTTP and SSH servers - Use defer resp.Body.Close() in all alert providers, check json.Marshal errors - Share HTTP clients across checks instead of creating per-request - Use http.NewRequestWithContext for per-site timeout control - Support HTTP method field (was always GET despite DB storing method) - Implement AcceptedCodes validation (was hardcoded >= 400 despite DB storing accepted code ranges) - Add defer tx.Rollback() to ImportData for transaction safety	2026-05-15 00:00:02 -04:00
lerko	e97780ad38	fix(tui,status,store): add delete confirm, input validation, XSS fix, history persistence Prevent accidental deletes with y/n confirmation dialog. Validate all numeric form inputs (interval, port, timeout, threshold, retries) with range checks instead of silently defaulting to zero. Escape user-supplied data in status page JavaScript to close XSS via monitor names. Persist check history to new check_history table so sparklines and uptime percentages survive restarts.	2026-05-14 20:51:06 -04:00
lerko	d5ab3a18a4	feat(tui,status): add per-site pause, fix viewport, polish status page Per-site pause: [p] key toggles pause for selected monitor in TUI. Paused monitors skip checks, persist to DB, show on status page. Status page: replace full-page reload with fetch-based DOM updates to eliminate scroll-jump on refresh. Add summary bar (UP/DOWN/PAUSED counts), stale-data indicator, and fix SSL EXP CSS class bug. TUI: constrain tables to terminal width via lipgloss .Width() to prevent row wrapping that pushed header off-screen. Add MaxHeight safety net. Bump subtle style from #383838 to #565f89 for readability on dark terminals.	2026-05-14 18:46:17 -04:00
lerko	6d92df4f46	feat(importer): add Uptime Kuma backup converter with CLI and API Convert Kuma monitorList/notificationList to go-upkeep Backup format. Maps all monitor types (http, ping, port, dns, group), ntfy notifications with auth, parent IDs, and alert assignments. Available via --import-kuma flag and POST /api/import/kuma endpoint.	2026-05-14 17:30:17 -04:00
lerko	11848ce674	fix(security): harden TLS, timeouts, validation, logging, and token generation - Default TLS verification on, opt-in UPKEEP_INSECURE_SKIP_VERIFY - Alert webhooks use 10s timeout client, close response bodies - URL input validates http/https scheme for HTTP monitors - Stdlib logs route to stderr instead of discard - Panic on crypto/rand failure in token generation - Cluster startup warnings for non-HTTPS and missing secret - Replace demo SMTP creds with obvious placeholders - Color-coded log entries and scroll hints in logs tab	2026-05-14 15:28:04 -04:00
lerko	02f0a39d97	feat: initial commit — uptime monitor (forked from go-upkeep) Go-based uptime monitor with SQLite/Postgres storage, TUI dashboard, SSH server, alerting, and clustering support.	2026-05-14 11:05:10 -04:00

34 Commits