Cluster-secret holder could POST a backup with their own admin key to
/api/backup/import, replacing all users — privilege escalation from
cluster-auth to admin. Also, Kuma imports produced zero users but
ImportWipe unconditionally deleted the users table — locking out all
accounts until restart reseeded UPTOP_ADMIN_KEY.
- Server handlers strip data.Users (set nil) before calling ImportData
- ImportData only wipes+replaces users when data.Users != nil
- New ImportWipeUsers dialect method separates user wipe from data wipe
- CLI restore (main.go) unchanged — full import still replaces users
Bare-metal installs created the DB with process umask (often 022),
making uptop.db, -wal, and -shm world-readable. These files contain
alert credentials and config. Now chmod 0600 after open. Missing
WAL/SHM siblings (not yet created) are silently skipped. Docker
installs were already mitigated by the non-root UID.
1. UpdateSite handles token-read Scan error instead of ignoring it.
sql.ErrNoRows (nonexistent site) passes through; real DB errors
surface.
2. RunCheck allowPrivate changed from variadic to real bool param.
Dead maxRequestBody duplicate removed from sqlstore.go.
3. Footer help bar documents [Space] for group collapse.
4. adjustCursor unified with clampCursor — one clamping path
instead of two with different semantics.
5. Compose cluster/probe example files annotate hardcoded secrets
with "EXAMPLE ONLY — rotate before use".
6. huhForm.WithHeight moved from View() to handleResize — no longer
mutates form state during render.
7. maxTableRows recalculated on filter enter/exit via recalcLayout()
— was only recalculated on resize, causing off-by-one when the
filter bar appeared/disappeared.
Replace three uncoordinated logging systems (log.Printf, fmt.Fprintf
to stderr, fmt.Println warnings) with structured slog calls.
68 log calls migrated:
- log.Printf → slog.Error/Warn/Info (45 calls across 5 files)
- fmt.Fprintf(os.Stderr) → slog.Error (23 calls in main.go)
Kept unchanged:
- fmt.Println/Printf for CLI user output (version, banners, import results)
- engine.AddLog for TUI-visible ring buffer (monitoring events)
Store migration diagnostics demoted to slog.Debug (silent at default
info level). HTTP request logging now structured with method/path/
status/duration/ip attributes.
Site now embeds SiteConfig (22 persistent fields) and SiteState
(11 ephemeral runtime fields). Field access unchanged via promotion
— site.Name and site.Status still work.
Store layer deals exclusively in SiteConfig — the DB never sees
runtime state. Engine's liveState keeps full Site composites.
UpdateSiteConfig reduced from 11-line field-by-field copy to
`existing.SiteConfig = cfg`.
RunCheck takes SiteConfig (only needs config fields). Checker is
now statically prevented from reading/writing runtime state.
Backup.Sites changed to []SiteConfig — exports no longer carry
zero-valued runtime fields. Import backward-compatible (json
ignores unknown fields).
New internal/store/storetest/mock.go provides BaseMock implementing
the full Store interface with no-op defaults and optional Func field
overrides. Each test file embeds BaseMock and shadows only the methods
it needs.
Removes ~400 lines of duplicated stub methods across 6 test files.
Adding a Store method now requires one addition (BaseMock) instead
of editing 6 files.
Replace the error-string-matching migration runner with a proper
schema_version table. Migrations are now numbered and recorded;
only unapplied versions run. Fresh databases seed at baseline
version (CREATE TABLE already includes all columns).
CREATE TABLE statements updated to include regions (sites) and
node_id (check_history) — previously only added via ALTER.
DeleteAlert now nulls sites.alert_id before deleting, preventing
dangling references that caused every incident to hit the error
path instead of alerting.
All 8 TIMESTAMP columns in Postgres CREATE TABLE statements changed to
TIMESTAMPTZ. Migration ALTER TYPE statements added for existing databases
(converts assuming stored values are UTC).
Prevents timezone-shifted instants on non-UTC Postgres servers, which
would skew SLA math and maintenance-window checks. SQLite unaffected —
DATETIME is typeless.
Every Store interface method (except Close) now takes context.Context
as first parameter. All 54 db.Query/Exec/QueryRow calls in SQLStore
replaced with their *Context variants. DB operations now respect
cancellation and deadlines.
Context sources by caller:
- Engine dbWriter/poll/pruner: engine ctx from Start()
- HTTP handlers: r.Context()
- config.Apply/Export: caller-provided ctx
- TUI/main.go init: context.Background()
RunCheck and all sub-checks (HTTP/ping/port/DNS) accept parent ctx.
HTTP checks now inherit shutdown cancellation instead of rooting in
context.Background(). dbWrite.exec takes ctx so the writer goroutine
can cancel stuck DB operations.
DeleteSite/ImportData use BeginTx(ctx) instead of Begin().
1. Group auto-pause trap: remove the one-way Paused=true mutation
from checkGroup — monitorRoutine skipped paused groups, so they
could never re-evaluate or auto-unpause.
2. Retry logic: apply MaxRetries to all →DOWN transitions, not just
UP→DOWN. New monitors (PENDING) no longer alert on first transient
failure when retries are configured.
3. Shutdown drain hole: track checker goroutines with checkerWG so
Stop() waits for in-flight checks before draining the write queue.
Final drainWrites() catches any writes enqueued after the writer's
own drain.
4. Probe-ingest writer bypass: route SaveCheckFromNode through the
engine's serialized dbWriter instead of writing directly to the
store from the HTTP handler.
5. Dead-probe expiry: expire stale probe results (>3× site interval)
before aggregation so a dead probe can't poison status forever.
Also clean probeResults in RemoveSite.
6. Maintenance-cache N+1: replace per-check DB query with a
fully-resolved in-memory cache refreshed every poll cycle. One
GetActiveMaintenanceWindows() call instead of N IsMonitorInMaintenance.
ImportData now wipes check_history, state_changes, and alert_health
so re-inserted IDs don't inherit stale history from prior occupants.
Four fixes hardening the secrets and rate-limit posture a prior audit
left or that regressed:
X-Forwarded-For rate-limit bypass + memory DoS (ratelimit.go): clientIP
returned the raw XFF header, so an attacker rotating it minted unlimited
distinct limiter keys — never tripping the limit and growing the visitors
map without bound. XFF is now honored only when the immediate peer is a
configured trusted proxy (UPTOP_TRUSTED_PROXIES, CIDRs or bare IPs), using
the right-most non-trusted hop; otherwise the key is the real RemoteAddr.
The visitors map is bounded with LRU eviction as defense in depth.
Export redaction denylist -> per-provider allowlist (server.go): the old
six-key denylist missed the actual credentials — the webhook URL for
discord/slack/webhook/ntfy/gotify and api_key for opsgenie — exporting
them in the clear. redactByProvider keeps only known-safe keys per
provider type and redacts everything else, so unknown/new keys fail safe.
ImportData plaintext secrets (sqlstore.go): import inserted raw
json.Marshal(settings), bypassing the encryption AddAlert/UpdateAlert
use. It now routes through marshalSettings, so a restore with
UPTOP_ENCRYPTION_KEY set stores enc:-prefixed ciphertext, not plaintext.
Alert error credential leak (alert.go): provider Send returned the raw
*url.Error, whose URL carries the secret (Telegram bot token in the path,
webhook secrets in the URL); it was persisted to AlertHealth.LastError
and shown in the TUI. sanitizeError strips the URL, keeping the operation
and underlying cause.
Tests cover trusted/untrusted XFF + spoofed-bypass + map bound, the
allowlist per provider, encrypted-on-import round-trip, and URL-stripped
errors. README documents UPTOP_TRUSTED_PROXIES. Full suite green under
-race; golangci-lint clean.
Every check spawned `go e.db.Save*(...)` with the error discarded: a
fire-and-forget goroutine per log line, check, state change, and alert
health update. SaveLog ran a full-table prune DELETE on every insert and
SaveCheck a COUNT + conditional prune on every check, so the hot path
amplified each write into several statements. Nothing tracked these
goroutines, so at shutdown they raced the store's Close() — writes to a
closing DB, silently swallowed.
Introduce a single writer goroutine that drains a buffered channel of
typed dbWrite values (log/check/state-change/alert-health). Writes are
enqueued non-blocking; a saturated queue drops and notes it in the
in-memory log rather than blocking the check loop. Write errors are now
logged instead of discarded. Retention moves off the hot path: SaveLog
and SaveCheck become plain INSERTs, and PruneLogs/PruneCheckHistory/
PruneStateChanges run on a 10-minute timer inside the writer (single
keep-newest-N-per-site pass via a window function). state_changes was
previously never pruned — now bounded.
Add Engine.Stop(): cancels the engine's context, then waits for the
writer to drain every buffered write before returning. main wires it in
before the deferred store Close() so no write races a closed DB.
SQLite gains busy_timeout=5000 and synchronous=NORMAL, applied via the
DSN so every pooled connection inherits them (a post-open PRAGMA only
touches one connection); WAL moves to the DSN too. :memory: test DBs are
left as-is.
Tests: writer drains on Stop, Stop is idempotent, and the prune queries
keep newest-N per site / N logs on real SQLite. Full suite green under
-race.
Background goroutine runs every 15 minutes, deletes maintenance windows
that expired beyond the retention period (default 7 days). Configurable
via UPTOP_MAINT_RETENTION env var (Go duration format).
Closes#72
DeleteSite now removes maintenance_windows, check_history, and
state_changes for the site within a transaction before deleting
the site itself. Prevents orphaned rows.
Closes#71
Full-screen SLA report accessible via [s] from detail panel.
Computes uptime%, downtime, outage count, longest outage, MTTR,
and MTBF from state_changes table. Includes daily breakdown with
bar chart, switchable time periods (24h/7d/30d/90d), and
scrollable viewport. LATE/STALE treated as UP for SLA purposes.
Move Go module from gitea.lerkolabs.com/lerko/uptop to
gitea.lerkolabs.com/lerkolabs/uptop. Updates all imports,
go.mod, goreleaser owner, and README links.
- Add 6 TUI screenshots to assets/ (monitors, alerts, logs, nodes, detail, theme)
- Rewrite README with hero image, badges, collapsible install sections
- Rewrite changelog to match actual CalVer tag history
- VHS tooling extracted to lerko/uptop-vhs
Reviewed-on: lerko/uptop#32
Propagate check failure reasons through the entire stack:
- Checker captures specific errors (DNS, timeout, HTTP status, SSL, etc.)
- Engine tracks LastError, StatusChangedAt, LastSuccessAt per monitor
- State transitions persisted to new state_changes table
- Detail panel shows error reason, HTTP code, state duration, last
success time, and last 5 state change events
- Monitor table shows inline error preview for DOWN services
- Alert messages include error reason
- Probe nodes forward error reasons to leader
15 files changed across models, checker, engine, store, TUI, and probes.
- Redact PostgreSQL DSN password from stdout/logs
- Harden .dockerignore to exclude .ssh/, .claude/, *.db, *.local files
- SSRF protection: block private/loopback/link-local IPs by default
(UPTOP_ALLOW_PRIVATE_TARGETS=true to override for homelab use)
- Fix email header injection via CRLF in monitor names
- AES-256-GCM encryption for alert credentials at rest
(UPTOP_ENCRYPTION_KEY env var, migrate-secrets subcommand)
- TLS support for HTTP server (UPTOP_TLS_CERT/UPTOP_TLS_KEY)
with HSTS header when TLS enabled
generateToken() now returns (string, error) instead of panicking on
crypto/rand failure. All json.Unmarshal calls for alert settings now
check and propagate errors instead of silently ignoring them.
Adds Close() to Store interface for graceful shutdown support.
Skips malformed notification entries during Kuma import.
Add alternating row backgrounds for easier table scanning. Detail panel
now shows breadcrumb path (Sites > Group > Name) and min/avg/max latency
stats below the sparkline. Group collapse state persists across restarts
via new preferences table in both SQLite and Postgres.
Maintenance windows suppress alerts during planned downtime while checks
continue running. Incidents provide informational tracking. Supports
targeting all monitors, single monitor, or group (applies to children).
New Maint tab in TUI with create/end/delete. Status page, JSON API, and
Prometheus metrics all reflect maintenance state.
Phase 3 of distributed probing:
- Add regions column to sites table for per-monitor probe affinity
- Region-filtered probe assignments (empty regions = all probes)
- New Nodes TUI tab showing connected probes with status/region/last-seen
- Regions input field in site form for configuring probe affinity
- Config-as-code support for regions (export/import/diff)
- Prometheus upkeep_probe_up metric with per-node labels
- Reindex TUI tabs: Sites, Alerts, Logs, Nodes, Users
Add node-aware check history and probe registration infrastructure:
- ProbeNode model and nodes table (SQLite + Postgres)
- node_id column on check_history for multi-source tracking
- Store interface: RegisterNode, GetNode, GetAllNodes, DeleteNode, SaveCheckFromNode
- Dialect: UpsertNodeSQL (INSERT OR REPLACE / ON CONFLICT)
- API endpoints: POST /api/probe/register, GET /api/probe/assignments, POST /api/probe/results
- Backward compatible: existing SaveCheck wraps SaveCheckFromNode with empty node_id
Add declarative config-as-code support via YAML files. Monitors and
alerts can be exported, version controlled, and applied across instances.
- goupkeep export [-o file.yaml] dumps current state
- goupkeep apply -f file.yaml creates/updates to match desired state
- --dry-run shows planned changes without applying
- --prune deletes monitors/alerts not in the YAML
- Matching by name, alert references by name, nested group children
- CLI refactored to subcommands (apply, export, serve) with backward compat
- 24 tests covering apply, export, validation, round-trip idempotency
Replace all monitor package-level mutable state with Engine struct.
All state (liveState, logStore, histories, tokenIndex, HTTP clients)
is now encapsulated in Engine, created via NewEngine(store).
Key changes:
- Engine struct holds all monitor state with proper mutex protection
- Engine.Start(ctx) and monitorRoutine respect context cancellation
for graceful shutdown — no more leaked goroutines
- cluster.runFollowerLoop also respects context for clean exit
- Token index (map[string]int) for O(1) push heartbeat lookup,
replacing O(n) linear scan through LiveState
- UpdateSiteConfig preserves 8 runtime fields instead of copying
17 config fields individually
- triggerAlert goroutines get 30s timeout context
- All consumers (TUI, server, cluster, main) receive *Engine via
constructor/parameter — no package-level state access
- main.go creates context.WithCancel, passes to engine and cluster
First test suite: 12 tests across store and alert packages
- Store: CRUD for sites/alerts/users, push token generation,
import/export round-trip, check history persistence
- Alert: Discord/Slack/Webhook payload format, HTTP 4xx error
propagation, Ntfy headers, unknown provider returns nil
Remove store.Get()/SetGlobal()/Current. Store is now passed explicitly
to all consumers via constructor parameters and function arguments.
- TUI Model holds store field, set via InitialModel(isAdmin, store)
- monitor.StartEngine(s) and InitHistoryFromStore(s) accept store
- server.Start(cfg, s) closes over store in HTTP handlers
- main.go threads store to SSH server, TUI, monitor, server
- isKeyAllowed receives store as parameter
No more hidden dependency on package-level mutable state in store pkg.
Monitor package still uses package-level state (LiveState, etc.) — will
be encapsulated into Engine struct in Phase 7.
Every Store method now returns an error. Callers handle errors
gracefully — TUI logs to event log, server returns HTTP 500,
monitor engine logs and retries. All rows.Scan() errors are now
checked in sqlstore.go instead of silently appending corrupt data.
- GetSites, GetAllAlerts, GetAllUsers return ([]T, error)
- GetAlert returns (AlertConfig, error) instead of (AlertConfig, bool)
- AddSite, UpdateSite, DeleteSite, etc. all return error
- SaveCheck, LoadAllHistory, ExportData return error
- ~25 caller sites updated across tui, server, monitor, main
Extract shared SQLStore with Dialect interface for the ~5% that
differs between backends (DDL, placeholders, sequence resets).
- New dialect.go: Dialect interface + placeholder rewriter (? → $N)
- New sqlstore.go: single implementation of all 19 Store methods
- sqlite.go: reduced from 286 to 83 lines (SQLiteDialect only)
- postgres.go: reduced from 266 to 78 lines (PostgresDialect only)
- main.go: use NewSQLiteStore/NewPostgresStore constructors
Zero CRUD logic duplication. Every future schema change written once.
- Move status page template to package-level template.Must (panic on
parse error at init instead of nil deref at runtime)
- Fix XSS in import error responses (log detail server-side, return
generic message to client)
- Handle ListenAndServe errors in HTTP and SSH servers
- Use defer resp.Body.Close() in all alert providers, check
json.Marshal errors
- Share HTTP clients across checks instead of creating per-request
- Use http.NewRequestWithContext for per-site timeout control
- Support HTTP method field (was always GET despite DB storing method)
- Implement AcceptedCodes validation (was hardcoded >= 400 despite DB
storing accepted code ranges)
- Add defer tx.Rollback() to ImportData for transaction safety
Prevent accidental deletes with y/n confirmation dialog. Validate all
numeric form inputs (interval, port, timeout, threshold, retries) with
range checks instead of silently defaulting to zero. Escape user-supplied
data in status page JavaScript to close XSS via monitor names. Persist
check history to new check_history table so sparklines and uptime
percentages survive restarts.
Per-site pause: [p] key toggles pause for selected monitor in TUI.
Paused monitors skip checks, persist to DB, show on status page.
Status page: replace full-page reload with fetch-based DOM updates
to eliminate scroll-jump on refresh. Add summary bar (UP/DOWN/PAUSED
counts), stale-data indicator, and fix SSL EXP CSS class bug.
TUI: constrain tables to terminal width via lipgloss .Width() to
prevent row wrapping that pushed header off-screen. Add MaxHeight
safety net. Bump subtle style from #383838 to #565f89 for
readability on dark terminals.
Add Hostname, Port, Timeout, Method, Description, ParentID, AcceptedCodes,
DNSResolveType, DNSServer, and IgnoreTLS fields. Refactor AddSite/UpdateSite
to accept models.Site instead of individual params. Includes DB migrations
for existing databases, per-monitor timeout/TLS in the engine, new type
options in TUI forms, and TYPE column in the sites table.
Users tab now matches sites/alerts quality: lipgloss bordered table,
click-to-select zones, edit form with role picker, and UpdateUser
support across both store backends.