Commit Graph

24 Commits

Author SHA1 Message Date
lerko 70a83a1da9 refactor(store): propagate context.Context through all Store methods
Every Store interface method (except Close) now takes context.Context
as first parameter. All 54 db.Query/Exec/QueryRow calls in SQLStore
replaced with their *Context variants. DB operations now respect
cancellation and deadlines.

Context sources by caller:
- Engine dbWriter/poll/pruner: engine ctx from Start()
- HTTP handlers: r.Context()
- config.Apply/Export: caller-provided ctx
- TUI/main.go init: context.Background()

RunCheck and all sub-checks (HTTP/ping/port/DNS) accept parent ctx.
HTTP checks now inherit shutdown cancellation instead of rooting in
context.Background(). dbWrite.exec takes ctx so the writer goroutine
can cancel stuck DB operations.

DeleteSite/ImportData use BeginTx(ctx) instead of Begin().
2026-06-11 14:40:30 -04:00
lerko 809620340e fix(security): close XFF bypass and three secret-leak paths
CI / test (pull_request) Successful in 2m36s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 46s
Four fixes hardening the secrets and rate-limit posture a prior audit
left or that regressed:

X-Forwarded-For rate-limit bypass + memory DoS (ratelimit.go): clientIP
returned the raw XFF header, so an attacker rotating it minted unlimited
distinct limiter keys — never tripping the limit and growing the visitors
map without bound. XFF is now honored only when the immediate peer is a
configured trusted proxy (UPTOP_TRUSTED_PROXIES, CIDRs or bare IPs), using
the right-most non-trusted hop; otherwise the key is the real RemoteAddr.
The visitors map is bounded with LRU eviction as defense in depth.

Export redaction denylist -> per-provider allowlist (server.go): the old
six-key denylist missed the actual credentials — the webhook URL for
discord/slack/webhook/ntfy/gotify and api_key for opsgenie — exporting
them in the clear. redactByProvider keeps only known-safe keys per
provider type and redacts everything else, so unknown/new keys fail safe.

ImportData plaintext secrets (sqlstore.go): import inserted raw
json.Marshal(settings), bypassing the encryption AddAlert/UpdateAlert
use. It now routes through marshalSettings, so a restore with
UPTOP_ENCRYPTION_KEY set stores enc:-prefixed ciphertext, not plaintext.

Alert error credential leak (alert.go): provider Send returned the raw
*url.Error, whose URL carries the secret (Telegram bot token in the path,
webhook secrets in the URL); it was persisted to AlertHealth.LastError
and shown in the TUI. sanitizeError strips the URL, keeping the operation
and underlying cause.

Tests cover trusted/untrusted XFF + spoofed-bypass + map bound, the
allowlist per provider, encrypted-on-import round-trip, and URL-stripped
errors. README documents UPTOP_TRUSTED_PROXIES. Full suite green under
-race; golangci-lint clean.
2026-06-10 18:50:19 -04:00
lerko 8b39d4c1a1 fix(monitor): serialize DB writes through a single drained writer
CI / test (pull_request) Successful in 2m36s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 51s
Every check spawned `go e.db.Save*(...)` with the error discarded: a
fire-and-forget goroutine per log line, check, state change, and alert
health update. SaveLog ran a full-table prune DELETE on every insert and
SaveCheck a COUNT + conditional prune on every check, so the hot path
amplified each write into several statements. Nothing tracked these
goroutines, so at shutdown they raced the store's Close() — writes to a
closing DB, silently swallowed.

Introduce a single writer goroutine that drains a buffered channel of
typed dbWrite values (log/check/state-change/alert-health). Writes are
enqueued non-blocking; a saturated queue drops and notes it in the
in-memory log rather than blocking the check loop. Write errors are now
logged instead of discarded. Retention moves off the hot path: SaveLog
and SaveCheck become plain INSERTs, and PruneLogs/PruneCheckHistory/
PruneStateChanges run on a 10-minute timer inside the writer (single
keep-newest-N-per-site pass via a window function). state_changes was
previously never pruned — now bounded.

Add Engine.Stop(): cancels the engine's context, then waits for the
writer to drain every buffered write before returning. main wires it in
before the deferred store Close() so no write races a closed DB.

SQLite gains busy_timeout=5000 and synchronous=NORMAL, applied via the
DSN so every pooled connection inherits them (a post-open PRAGMA only
touches one connection); WAL moves to the DSN too. :memory: test DBs are
left as-is.

Tests: writer drains on Stop, Stop is idempotent, and the prune queries
keep newest-N per site / N logs on real SQLite. Full suite green under
-race.
2026-06-10 18:14:28 -04:00
lerko 21a1563e53 feat(monitor): auto-prune expired maintenance windows
CI / test (pull_request) Successful in 2m33s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 50s
Background goroutine runs every 15 minutes, deletes maintenance windows
that expired beyond the retention period (default 7 days). Configurable
via UPTOP_MAINT_RETENTION env var (Go duration format).

Closes #72
2026-06-05 18:27:42 -04:00
lerko 65a83368bf fix(store): cascade delete related rows when removing a site
CI / test (pull_request) Successful in 2m32s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 51s
DeleteSite now removes maintenance_windows, check_history, and
state_changes for the site within a transaction before deleting
the site itself. Prevents orphaned rows.

Closes #71
2026-06-05 12:55:43 -04:00
lerko 60592ef810 feat(tui): add SLA reporting view
CI / test (pull_request) Successful in 2m35s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 41s
Full-screen SLA report accessible via [s] from detail panel.
Computes uptime%, downtime, outage count, longest outage, MTTR,
and MTBF from state_changes table. Includes daily breakdown with
bar chart, switchable time periods (24h/7d/30d/90d), and
scrollable viewport. LATE/STALE treated as UP for SLA purposes.
2026-06-04 14:24:39 -04:00
lerko 8f17deba67 chore: migrate module path to lerkolabs org
CI / test (pull_request) Successful in 2m39s
CI / lint (pull_request) Successful in 1m6s
CI / vulncheck (pull_request) Successful in 46s
Move Go module from gitea.lerkolabs.com/lerko/uptop to
gitea.lerkolabs.com/lerkolabs/uptop. Updates all imports,
go.mod, goreleaser owner, and README links.
2026-05-29 14:22:49 -04:00
lerko 026e969b74 chore: TUI screenshots, README polish, changelog rewrite (#32)
CI / test (push) Successful in 2m41s
CI / lint (push) Successful in 1m11s
CI / vulncheck (push) Successful in 56s
- Add 6 TUI screenshots to assets/ (monitors, alerts, logs, nodes, detail, theme)
- Rewrite README with hero image, badges, collapsible install sections
- Rewrite changelog to match actual CalVer tag history
- VHS tooling extracted to lerko/uptop-vhs

Reviewed-on: lerko/uptop#32
2026-05-29 17:45:31 +00:00
lerko bc3a44beac feat: show error reason when monitors go DOWN
CI / test (pull_request) Successful in 2m42s
CI / lint (pull_request) Successful in 1m11s
CI / vulncheck (pull_request) Successful in 51s
Propagate check failure reasons through the entire stack:
- Checker captures specific errors (DNS, timeout, HTTP status, SSL, etc.)
- Engine tracks LastError, StatusChangedAt, LastSuccessAt per monitor
- State transitions persisted to new state_changes table
- Detail panel shows error reason, HTTP code, state duration, last
  success time, and last 5 state change events
- Monitor table shows inline error preview for DOWN services
- Alert messages include error reason
- Probe nodes forward error reasons to leader

15 files changed across models, checker, engine, store, TUI, and probes.
2026-05-27 19:32:30 -04:00
lerko 986f9f1d55 fix(security): phase 4 code quality and low-severity fixes
CI / test (pull_request) Successful in 4m24s
CI / lint (pull_request) Successful in 1m1s
- Fix limitStr to handle multi-byte UTF-8 characters correctly
- Sanitize log messages: strip ANSI escape sequences and newlines
- URL-encode probe node_id instead of string concatenation
- Fix follower resp.Body leak on non-200 responses
- Make SSH host key path configurable via UPTOP_SSH_HOST_KEY env var
- Add HTTP method checks on GET-only endpoints (405 for wrong methods)
- Extract magic numbers into named constants across monitor/store/server
- Standardize error output to stderr for all startup errors
2026-05-26 17:25:47 -04:00
lerko bd561d9a5e fix(security): phase 3 medium reliability and hardening
CI / test (pull_request) Successful in 4m23s
CI / lint (pull_request) Successful in 1m11s
- Fail hard on critical migration errors (ignore only "already exists")
- Cache SSH user keys with 30s TTL (avoid DB query per auth attempt)
- Configure DB connection pooling (25 open, 5 idle, 5m lifetime)
- Enable SQLite WAL mode for concurrent read/write
- Optimize check history pruning (only prune above 1100 rows)
- Add security headers: X-Content-Type-Options, X-Frame-Options, CSP, Referrer-Policy
- Add CORS policy on /status/json via UPTOP_CORS_ORIGIN env var
- Add HTTP request logging middleware (method, path, status, duration, IP)
- Fix config file permissions from 0644 to 0600
- Pin Docker images: golang:1.24-alpine3.21, alpine:3.21
- Fix Docker CI tag pattern for CalVer (was semver)
- Pass build args (VERSION, COMMIT, BUILD_DATE) to Docker build
2026-05-26 16:57:03 -04:00
lerko d30d1460bd fix(security): phase 2 high-severity hardening
CI / test (pull_request) Successful in 4m31s
CI / lint (pull_request) Successful in 56s
- Push heartbeat accepts Authorization: Bearer header (query string deprecated)
- Gotify alerts use X-Gotify-Key header instead of token in URL
- Per-IP rate limiting on all API endpoints (token-bucket)
- /metrics gated behind cluster secret (UPTOP_METRICS_PUBLIC=true to opt out)
- Config export redacts passwords/tokens by default (redact_secrets=false to override)
- Fix rewritePlaceholders for 100+ SQL parameters
- Fix AddSiteReturningID/AddAlertReturningID race with LastInsertId/RETURNING
- HTTP server timeouts: read 30s, write 60s, idle 120s
2026-05-25 21:15:33 -04:00
lerko 60b30935b3 fix(security): phase 1 critical fixes for public release
CI / test (pull_request) Successful in 4m40s
CI / lint (pull_request) Successful in 1m2s
- Redact PostgreSQL DSN password from stdout/logs
- Harden .dockerignore to exclude .ssh/, .claude/, *.db, *.local files
- SSRF protection: block private/loopback/link-local IPs by default
  (UPTOP_ALLOW_PRIVATE_TARGETS=true to override for homelab use)
- Fix email header injection via CRLF in monitor names
- AES-256-GCM encryption for alert credentials at rest
  (UPTOP_ENCRYPTION_KEY env var, migrate-secrets subcommand)
- TLS support for HTTP server (UPTOP_TLS_CERT/UPTOP_TLS_KEY)
  with HSTS header when TLS enabled
2026-05-25 11:26:47 -04:00
lerko 9d12e3ecf1 chore: complete rename from go-upkeep to uptop
CI / test (pull_request) Successful in 4m26s
CI / lint (pull_request) Successful in 1m11s
- Module path: gitea.lerkolabs.com/lerko/uptop
- Binary: cmd/uptop/
- All imports updated to full module path
- Env vars: UPKEEP_* → UPTOP_*
- Prometheus metrics: upkeep_* → uptop_*
- Default DB: uptop.db
- Docker image: lerko/uptop
- All docs, compose files, CI updated

Only remaining "go-upkeep" reference is the fork attribution in README.
2026-05-24 20:20:35 -04:00
lerko 359cff7292 chore: add golangci-lint config and fix all lint issues
Add .golangci.yml enabling errcheck, staticcheck, govet, gosec,
ineffassign, and unused linters. Fix 66 issues across 16 files:
- Check all unchecked errors (errcheck)
- Use HTTP status constants instead of numeric literals (staticcheck)
- Replace deprecated LineUp/LineDown with ScrollUp/ScrollDown (staticcheck)
- Convert sprintf+write patterns to fmt.Fprintf (staticcheck)
- Add ReadHeaderTimeout to http.Server (gosec)
- Remove unused types and functions (unused)
- Add nolint comments for intentional patterns (InsecureSkipVerify,
  math/rand for jitter, dialect-only SQL formatting)
2026-05-23 22:02:06 -04:00
lerko ae141c62ba fix(store): replace panic with error return, handle unmarshal errors
generateToken() now returns (string, error) instead of panicking on
crypto/rand failure. All json.Unmarshal calls for alert settings now
check and propagate errors instead of silently ignoring them.

Adds Close() to Store interface for graceful shutdown support.
Skips malformed notification entries during Kuma import.
2026-05-23 13:15:39 -04:00
lerko e84b64f8ed feat(tui): zebra striping, detail breadcrumb, sparkline stats, collapse persistence
Add alternating row backgrounds for easier table scanning. Detail panel
now shows breadcrumb path (Sites > Group > Name) and min/avg/max latency
stats below the sparkline. Group collapse state persists across restarts
via new preferences table in both SQLite and Postgres.
2026-05-22 20:53:23 -04:00
lerko b146f34d19 feat: add incident management and maintenance windows
Maintenance windows suppress alerts during planned downtime while checks
continue running. Incidents provide informational tracking. Supports
targeting all monitors, single monitor, or group (applies to children).

New Maint tab in TUI with create/end/delete. Status page, JSON API, and
Prometheus metrics all reflect maintenance state.
2026-05-22 18:45:02 -04:00
lerko ed082e4080 feat: persist logs to DB, load on startup 2026-05-16 15:25:08 -04:00
lerko 0396acdc59 feat(cluster): add region affinity, Nodes TUI tab, and probe metrics
Phase 3 of distributed probing:
- Add regions column to sites table for per-monitor probe affinity
- Region-filtered probe assignments (empty regions = all probes)
- New Nodes TUI tab showing connected probes with status/region/last-seen
- Regions input field in site form for configuring probe affinity
- Config-as-code support for regions (export/import/diff)
- Prometheus upkeep_probe_up metric with per-node labels
- Reindex TUI tabs: Sites, Alerts, Logs, Nodes, Users
2026-05-16 11:50:16 -04:00
lerko ca9faa0acd feat(cluster): add distributed probing foundation — schema, models, and probe APIs
Add node-aware check history and probe registration infrastructure:
- ProbeNode model and nodes table (SQLite + Postgres)
- node_id column on check_history for multi-source tracking
- Store interface: RegisterNode, GetNode, GetAllNodes, DeleteNode, SaveCheckFromNode
- Dialect: UpsertNodeSQL (INSERT OR REPLACE / ON CONFLICT)
- API endpoints: POST /api/probe/register, GET /api/probe/assignments, POST /api/probe/results
- Backward compatible: existing SaveCheck wraps SaveCheckFromNode with empty node_id
2026-05-16 11:05:06 -04:00
lerko 5b01b9ee30 feat(config): add config-as-code YAML import/export
Add declarative config-as-code support via YAML files. Monitors and
alerts can be exported, version controlled, and applied across instances.

- goupkeep export [-o file.yaml] dumps current state
- goupkeep apply -f file.yaml creates/updates to match desired state
- --dry-run shows planned changes without applying
- --prune deletes monitors/alerts not in the YAML
- Matching by name, alert references by name, nested group children
- CLI refactored to subcommands (apply, export, serve) with backward compat
- 24 tests covering apply, export, validation, round-trip idempotency
2026-05-15 20:40:49 -04:00
lerko d4f4012c8a refactor(store): add error returns to all Store interface methods
Every Store method now returns an error. Callers handle errors
gracefully — TUI logs to event log, server returns HTTP 500,
monitor engine logs and retries. All rows.Scan() errors are now
checked in sqlstore.go instead of silently appending corrupt data.

- GetSites, GetAllAlerts, GetAllUsers return ([]T, error)
- GetAlert returns (AlertConfig, error) instead of (AlertConfig, bool)
- AddSite, UpdateSite, DeleteSite, etc. all return error
- SaveCheck, LoadAllHistory, ExportData return error
- ~25 caller sites updated across tui, server, monitor, main
2026-05-15 00:37:20 -04:00
lerko ab75f61c6b refactor(store): unify SQLite and Postgres into dialect-based SQLStore
Extract shared SQLStore with Dialect interface for the ~5% that
differs between backends (DDL, placeholders, sequence resets).

- New dialect.go: Dialect interface + placeholder rewriter (? → $N)
- New sqlstore.go: single implementation of all 19 Store methods
- sqlite.go: reduced from 286 to 83 lines (SQLiteDialect only)
- postgres.go: reduced from 266 to 78 lines (PostgresDialect only)
- main.go: use NewSQLiteStore/NewPostgresStore constructors

Zero CRUD logic duplication. Every future schema change written once.
2026-05-15 00:31:44 -04:00