Commit Graph

12 Commits

Author SHA1 Message Date
lerko 4328d25f22 fix(security): API import no longer replaces user accounts
Cluster-secret holder could POST a backup with their own admin key to
/api/backup/import, replacing all users — privilege escalation from
cluster-auth to admin. Also, Kuma imports produced zero users but
ImportWipe unconditionally deleted the users table — locking out all
accounts until restart reseeded UPTOP_ADMIN_KEY.

- Server handlers strip data.Users (set nil) before calling ImportData
- ImportData only wipes+replaces users when data.Users != nil
- New ImportWipeUsers dialect method separates user wipe from data wipe
- CLI restore (main.go) unchanged — full import still replaces users
2026-06-12 12:45:16 -04:00
lerko 52ccd7ad91 refactor(models): split Site into SiteConfig + SiteState
CI / test (pull_request) Successful in 1m58s
CI / lint (pull_request) Successful in 1m21s
CI / vulncheck (pull_request) Successful in 1m2s
Site now embeds SiteConfig (22 persistent fields) and SiteState
(11 ephemeral runtime fields). Field access unchanged via promotion
— site.Name and site.Status still work.

Store layer deals exclusively in SiteConfig — the DB never sees
runtime state. Engine's liveState keeps full Site composites.
UpdateSiteConfig reduced from 11-line field-by-field copy to
`existing.SiteConfig = cfg`.

RunCheck takes SiteConfig (only needs config fields). Checker is
now statically prevented from reading/writing runtime state.

Backup.Sites changed to []SiteConfig — exports no longer carry
zero-valued runtime fields. Import backward-compatible (json
ignores unknown fields).
2026-06-11 17:13:09 -04:00
lerko 70a83a1da9 refactor(store): propagate context.Context through all Store methods
Every Store interface method (except Close) now takes context.Context
as first parameter. All 54 db.Query/Exec/QueryRow calls in SQLStore
replaced with their *Context variants. DB operations now respect
cancellation and deadlines.

Context sources by caller:
- Engine dbWriter/poll/pruner: engine ctx from Start()
- HTTP handlers: r.Context()
- config.Apply/Export: caller-provided ctx
- TUI/main.go init: context.Background()

RunCheck and all sub-checks (HTTP/ping/port/DNS) accept parent ctx.
HTTP checks now inherit shutdown cancellation instead of rooting in
context.Background(). dbWrite.exec takes ctx so the writer goroutine
can cancel stuck DB operations.

DeleteSite/ImportData use BeginTx(ctx) instead of Begin().
2026-06-11 14:40:30 -04:00
lerko 5d5153351e fix(engine): six correctness fixes for the state machine
CI / test (pull_request) Successful in 1m59s
CI / lint (pull_request) Successful in 1m17s
CI / vulncheck (pull_request) Successful in 1m1s
1. Group auto-pause trap: remove the one-way Paused=true mutation
   from checkGroup — monitorRoutine skipped paused groups, so they
   could never re-evaluate or auto-unpause.

2. Retry logic: apply MaxRetries to all →DOWN transitions, not just
   UP→DOWN. New monitors (PENDING) no longer alert on first transient
   failure when retries are configured.

3. Shutdown drain hole: track checker goroutines with checkerWG so
   Stop() waits for in-flight checks before draining the write queue.
   Final drainWrites() catches any writes enqueued after the writer's
   own drain.

4. Probe-ingest writer bypass: route SaveCheckFromNode through the
   engine's serialized dbWriter instead of writing directly to the
   store from the HTTP handler.

5. Dead-probe expiry: expire stale probe results (>3× site interval)
   before aggregation so a dead probe can't poison status forever.
   Also clean probeResults in RemoveSite.

6. Maintenance-cache N+1: replace per-check DB query with a
   fully-resolved in-memory cache refreshed every poll cycle. One
   GetActiveMaintenanceWindows() call instead of N IsMonitorInMaintenance.

ImportData now wipes check_history, state_changes, and alert_health
so re-inserted IDs don't inherit stale history from prior occupants.
2026-06-11 13:57:03 -04:00
lerko 809620340e fix(security): close XFF bypass and three secret-leak paths
CI / test (pull_request) Successful in 2m36s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 46s
Four fixes hardening the secrets and rate-limit posture a prior audit
left or that regressed:

X-Forwarded-For rate-limit bypass + memory DoS (ratelimit.go): clientIP
returned the raw XFF header, so an attacker rotating it minted unlimited
distinct limiter keys — never tripping the limit and growing the visitors
map without bound. XFF is now honored only when the immediate peer is a
configured trusted proxy (UPTOP_TRUSTED_PROXIES, CIDRs or bare IPs), using
the right-most non-trusted hop; otherwise the key is the real RemoteAddr.
The visitors map is bounded with LRU eviction as defense in depth.

Export redaction denylist -> per-provider allowlist (server.go): the old
six-key denylist missed the actual credentials — the webhook URL for
discord/slack/webhook/ntfy/gotify and api_key for opsgenie — exporting
them in the clear. redactByProvider keeps only known-safe keys per
provider type and redacts everything else, so unknown/new keys fail safe.

ImportData plaintext secrets (sqlstore.go): import inserted raw
json.Marshal(settings), bypassing the encryption AddAlert/UpdateAlert
use. It now routes through marshalSettings, so a restore with
UPTOP_ENCRYPTION_KEY set stores enc:-prefixed ciphertext, not plaintext.

Alert error credential leak (alert.go): provider Send returned the raw
*url.Error, whose URL carries the secret (Telegram bot token in the path,
webhook secrets in the URL); it was persisted to AlertHealth.LastError
and shown in the TUI. sanitizeError strips the URL, keeping the operation
and underlying cause.

Tests cover trusted/untrusted XFF + spoofed-bypass + map bound, the
allowlist per provider, encrypted-on-import round-trip, and URL-stripped
errors. README documents UPTOP_TRUSTED_PROXIES. Full suite green under
-race; golangci-lint clean.
2026-06-10 18:50:19 -04:00
lerko 8b39d4c1a1 fix(monitor): serialize DB writes through a single drained writer
CI / test (pull_request) Successful in 2m36s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 51s
Every check spawned `go e.db.Save*(...)` with the error discarded: a
fire-and-forget goroutine per log line, check, state change, and alert
health update. SaveLog ran a full-table prune DELETE on every insert and
SaveCheck a COUNT + conditional prune on every check, so the hot path
amplified each write into several statements. Nothing tracked these
goroutines, so at shutdown they raced the store's Close() — writes to a
closing DB, silently swallowed.

Introduce a single writer goroutine that drains a buffered channel of
typed dbWrite values (log/check/state-change/alert-health). Writes are
enqueued non-blocking; a saturated queue drops and notes it in the
in-memory log rather than blocking the check loop. Write errors are now
logged instead of discarded. Retention moves off the hot path: SaveLog
and SaveCheck become plain INSERTs, and PruneLogs/PruneCheckHistory/
PruneStateChanges run on a 10-minute timer inside the writer (single
keep-newest-N-per-site pass via a window function). state_changes was
previously never pruned — now bounded.

Add Engine.Stop(): cancels the engine's context, then waits for the
writer to drain every buffered write before returning. main wires it in
before the deferred store Close() so no write races a closed DB.

SQLite gains busy_timeout=5000 and synchronous=NORMAL, applied via the
DSN so every pooled connection inherits them (a post-open PRAGMA only
touches one connection); WAL moves to the DSN too. :memory: test DBs are
left as-is.

Tests: writer drains on Stop, Stop is idempotent, and the prune queries
keep newest-N per site / N logs on real SQLite. Full suite green under
-race.
2026-06-10 18:14:28 -04:00
lerko 21a1563e53 feat(monitor): auto-prune expired maintenance windows
CI / test (pull_request) Successful in 2m33s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 50s
Background goroutine runs every 15 minutes, deletes maintenance windows
that expired beyond the retention period (default 7 days). Configurable
via UPTOP_MAINT_RETENTION env var (Go duration format).

Closes #72
2026-06-05 18:27:42 -04:00
lerko cc139bdb73 refactor(store): check all discarded errors in sqlstore_test.go
CI / test (pull_request) Successful in 2m37s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 46s
Closes #74
2026-06-05 16:32:35 -04:00
lerko 65a83368bf fix(store): cascade delete related rows when removing a site
CI / test (pull_request) Successful in 2m32s
CI / lint (pull_request) Successful in 56s
CI / vulncheck (pull_request) Successful in 51s
DeleteSite now removes maintenance_windows, check_history, and
state_changes for the site within a transaction before deleting
the site itself. Prevents orphaned rows.

Closes #71
2026-06-05 12:55:43 -04:00
lerko 8f17deba67 chore: migrate module path to lerkolabs org
CI / test (pull_request) Successful in 2m39s
CI / lint (pull_request) Successful in 1m6s
CI / vulncheck (pull_request) Successful in 46s
Move Go module from gitea.lerkolabs.com/lerko/uptop to
gitea.lerkolabs.com/lerkolabs/uptop. Updates all imports,
go.mod, goreleaser owner, and README links.
2026-05-29 14:22:49 -04:00
lerko 9d12e3ecf1 chore: complete rename from go-upkeep to uptop
CI / test (pull_request) Successful in 4m26s
CI / lint (pull_request) Successful in 1m11s
- Module path: gitea.lerkolabs.com/lerko/uptop
- Binary: cmd/uptop/
- All imports updated to full module path
- Env vars: UPKEEP_* → UPTOP_*
- Prometheus metrics: upkeep_* → uptop_*
- Default DB: uptop.db
- Docker image: lerko/uptop
- All docs, compose files, CI updated

Only remaining "go-upkeep" reference is the fork attribution in README.
2026-05-24 20:20:35 -04:00
lerko f023e38fdc refactor(monitor): encapsulate engine state, add graceful shutdown and tests
Replace all monitor package-level mutable state with Engine struct.
All state (liveState, logStore, histories, tokenIndex, HTTP clients)
is now encapsulated in Engine, created via NewEngine(store).

Key changes:
- Engine struct holds all monitor state with proper mutex protection
- Engine.Start(ctx) and monitorRoutine respect context cancellation
  for graceful shutdown — no more leaked goroutines
- cluster.runFollowerLoop also respects context for clean exit
- Token index (map[string]int) for O(1) push heartbeat lookup,
  replacing O(n) linear scan through LiveState
- UpdateSiteConfig preserves 8 runtime fields instead of copying
  17 config fields individually
- triggerAlert goroutines get 30s timeout context
- All consumers (TUI, server, cluster, main) receive *Engine via
  constructor/parameter — no package-level state access
- main.go creates context.WithCancel, passes to engine and cluster

First test suite: 12 tests across store and alert packages
- Store: CRUD for sites/alerts/users, push token generation,
  import/export round-trip, check history persistence
- Alert: Discord/Slack/Webhook payload format, HTTP 4xx error
  propagation, Ntfy headers, unknown provider returns nil
2026-05-15 08:21:17 -04:00