Replace the error-string-matching migration runner with a proper
schema_version table. Migrations are now numbered and recorded;
only unapplied versions run. Fresh databases seed at baseline
version (CREATE TABLE already includes all columns).
CREATE TABLE statements updated to include regions (sites) and
node_id (check_history) — previously only added via ALTER.
DeleteAlert now nulls sites.alert_id before deleting, preventing
dangling references that caused every incident to hit the error
path instead of alerting.
Replace ~150 bare status string comparisons with typed models.Status
constants (StatusUp, StatusDown, StatusPending, StatusLate, StatusStale,
StatusSSLExp). Single IsBroken() method replaces the duplicated
isBroken lambda in monitor.go and isDown function in sla.go.
Adding a new status value (e.g. DEGRADED) now requires one constant
definition instead of grep-and-pray across 16 files.
CheckResult.Status stays string — the checker is the boundary between
raw protocol results and typed status. Cast happens at the edge in
handleStatusChange.
All 8 TIMESTAMP columns in Postgres CREATE TABLE statements changed to
TIMESTAMPTZ. Migration ALTER TYPE statements added for existing databases
(converts assuming stored values are UTC).
Prevents timezone-shifted instants on non-UTC Postgres servers, which
would skew SLA math and maintenance-window checks. SQLite unaffected —
DATETIME is typeless.
Every Store interface method (except Close) now takes context.Context
as first parameter. All 54 db.Query/Exec/QueryRow calls in SQLStore
replaced with their *Context variants. DB operations now respect
cancellation and deadlines.
Context sources by caller:
- Engine dbWriter/poll/pruner: engine ctx from Start()
- HTTP handlers: r.Context()
- config.Apply/Export: caller-provided ctx
- TUI/main.go init: context.Background()
RunCheck and all sub-checks (HTTP/ping/port/DNS) accept parent ctx.
HTTP checks now inherit shutdown cancellation instead of rooting in
context.Background(). dbWrite.exec takes ctx so the writer goroutine
can cancel stuck DB operations.
DeleteSite/ImportData use BeginTx(ctx) instead of Begin().
1. Group auto-pause trap: remove the one-way Paused=true mutation
from checkGroup — monitorRoutine skipped paused groups, so they
could never re-evaluate or auto-unpause.
2. Retry logic: apply MaxRetries to all →DOWN transitions, not just
UP→DOWN. New monitors (PENDING) no longer alert on first transient
failure when retries are configured.
3. Shutdown drain hole: track checker goroutines with checkerWG so
Stop() waits for in-flight checks before draining the write queue.
Final drainWrites() catches any writes enqueued after the writer's
own drain.
4. Probe-ingest writer bypass: route SaveCheckFromNode through the
engine's serialized dbWriter instead of writing directly to the
store from the HTTP handler.
5. Dead-probe expiry: expire stale probe results (>3× site interval)
before aggregation so a dead probe can't poison status forever.
Also clean probeResults in RemoveSite.
6. Maintenance-cache N+1: replace per-check DB query with a
fully-resolved in-memory cache refreshed every poll cycle. One
GetActiveMaintenanceWindows() call instead of N IsMonitorInMaintenance.
ImportData now wipes check_history, state_changes, and alert_health
so re-inserted IDs don't inherit stale history from prior occupants.
The handler serialized raw models.Site — LastError internals,
Hostname, Port, DNSServer, AlertID, intervals all public, and every
future Site field public the day it's added. statusSite now exposes
exactly what the status page renders: Name, Type, URL, Status, Paused,
LastCheck, Latency.
Replaces the vacuous TestStatusJSON_TokensStripped, which injected via
UpdateSiteConfig (a no-op for unknown IDs) and asserted over zero
sites. The new test seeds the store, starts the engine, waits for live
state, and asserts internal fields are absent from the raw JSON.
The alert detail panel dumped a.Settings raw — SMTP passwords, bot
tokens, API keys on screen and into any recording or screen share. The
table view leaked the PagerDuty routing key, Pushover user key, and
full discord/slack/webhook URLs (the URL path is the credential).
The redaction allowlist moves from internal/server to
models.RedactAlertSettings so the backup export and the TUI render
through one policy. Panel keys are sorted so rows stop reshuffling
every tick; webhook URLs show scheme+host only; keys show
first4…last4.
Deletes, pause toggles, maintenance end, theme/collapse prefs, and all
four form submits wrote to the store synchronously on the UI goroutine;
with busy_timeout=5000 a contended DB froze input for up to 5s.
Writes now run through a writeCmd helper returning writeDoneMsg. The
in-memory engine/model mutations stay in Update so rows react
instantly; the reply logs failures and reloads tab data, so the UI
converges on what was actually written. Closures capture snapshotted
values only — never the model.
The #101 refactor stopped at the tick path; 'h' history and the SLA
view still queried state changes synchronously in Update, freezing the
UI for up to busy_timeout on a contended DB. Both now load through
Cmds with loading placeholders.
Also closes the remaining staleness holes in the async data flow:
- tabDataMsg carries a sequence number; out-of-order replies from
slower earlier loads are dropped instead of overwriting newer data
- history/SLA replies are dropped when the user has navigated to a
different site or period
- the open detail panel refreshes on the tab-data cadence instead of
loading once on entry and going stale
- initSiteHuhForm reads the m.alerts cache instead of hitting the store
applyTheme mutated ~18 package-global lipgloss styles while every SSH
session's tea.Program read them concurrently from its own goroutine.
Pressing T or opening a new connection raced other sessions' View and
bled themes across users.
Styles now live in an immutable per-Model struct built by newStyles;
free formatter helpers that consumed the globals became Model methods.
The TUI ran database queries on the UI goroutine: handleTick called
refreshData every second, which issued four blocking SQLite queries
(GetAllAlerts/GetAllUsers/GetAllNodes/GetAllMaintenanceWindows) and
swallowed their errors; viewDetailPanel ran GetStateChanges — a DB query
— inside View(), on every render (tick, keypress, mouse). A slow disk
stalled input and animation.
Split refreshData into refreshLive() (in-memory engine copies only —
sites + logs — safe every tick) and loadTabDataCmd(), a tea.Cmd that
loads the four DB-backed tables off the UI goroutine and returns a
tabDataMsg. handleTick now refreshes live state every tick but dispatches
the tab-data load only when older than tabRefreshTTL (5s), so tab-bar
counts stay fresh without a per-second query storm. Errors surface to the
log instead of being dropped, and a transient failure keeps the previous
data rather than blanking the view.
The detail panel's state-change history is loaded once on enter via
loadDetailCmd and cached on the model; viewDetailPanel reads the cache,
so View no longer touches the database. Init kicks an initial load so the
dashboard isn't empty on the first frame, and the bare time.Time tick
message is now a named tickMsg (no cross-message collision). The
test-alert handler's raw goroutine becomes a tea.Cmd.
Adds the package's first Update()-driven tests: tab-data load + apply,
error-keeps-previous-data, detail cache with a store-hit counter proving
View does zero IO across repeated renders, and the handleTick throttle.
Full suite green under -race; golangci-lint clean.
Four fixes hardening the secrets and rate-limit posture a prior audit
left or that regressed:
X-Forwarded-For rate-limit bypass + memory DoS (ratelimit.go): clientIP
returned the raw XFF header, so an attacker rotating it minted unlimited
distinct limiter keys — never tripping the limit and growing the visitors
map without bound. XFF is now honored only when the immediate peer is a
configured trusted proxy (UPTOP_TRUSTED_PROXIES, CIDRs or bare IPs), using
the right-most non-trusted hop; otherwise the key is the real RemoteAddr.
The visitors map is bounded with LRU eviction as defense in depth.
Export redaction denylist -> per-provider allowlist (server.go): the old
six-key denylist missed the actual credentials — the webhook URL for
discord/slack/webhook/ntfy/gotify and api_key for opsgenie — exporting
them in the clear. redactByProvider keeps only known-safe keys per
provider type and redacts everything else, so unknown/new keys fail safe.
ImportData plaintext secrets (sqlstore.go): import inserted raw
json.Marshal(settings), bypassing the encryption AddAlert/UpdateAlert
use. It now routes through marshalSettings, so a restore with
UPTOP_ENCRYPTION_KEY set stores enc:-prefixed ciphertext, not plaintext.
Alert error credential leak (alert.go): provider Send returned the raw
*url.Error, whose URL carries the secret (Telegram bot token in the path,
webhook secrets in the URL); it was persisted to AlertHealth.LastError
and shown in the TUI. sanitizeError strips the URL, keeping the operation
and underlying cause.
Tests cover trusted/untrusted XFF + spoofed-bypass + map bound, the
allowlist per provider, encrypted-on-import round-trip, and URL-stripped
errors. README documents UPTOP_TRUSTED_PROXIES. Full suite green under
-race; golangci-lint clean.
Every check spawned `go e.db.Save*(...)` with the error discarded: a
fire-and-forget goroutine per log line, check, state change, and alert
health update. SaveLog ran a full-table prune DELETE on every insert and
SaveCheck a COUNT + conditional prune on every check, so the hot path
amplified each write into several statements. Nothing tracked these
goroutines, so at shutdown they raced the store's Close() — writes to a
closing DB, silently swallowed.
Introduce a single writer goroutine that drains a buffered channel of
typed dbWrite values (log/check/state-change/alert-health). Writes are
enqueued non-blocking; a saturated queue drops and notes it in the
in-memory log rather than blocking the check loop. Write errors are now
logged instead of discarded. Retention moves off the hot path: SaveLog
and SaveCheck become plain INSERTs, and PruneLogs/PruneCheckHistory/
PruneStateChanges run on a 10-minute timer inside the writer (single
keep-newest-N-per-site pass via a window function). state_changes was
previously never pruned — now bounded.
Add Engine.Stop(): cancels the engine's context, then waits for the
writer to drain every buffered write before returning. main wires it in
before the deferred store Close() so no write races a closed DB.
SQLite gains busy_timeout=5000 and synchronous=NORMAL, applied via the
DSN so every pooled connection inherits them (a post-open PRAGMA only
touches one connection); WAL moves to the DSN too. :memory: test DBs are
left as-is.
Tests: writer drains on Stop, Stop is idempotent, and the prune queries
keep newest-N per site / N logs on real SQLite. Full suite green under
-race.
checkByID snapshotted a Site under RLock, ran a network check for
seconds, then handleStatusChange wrote the entire stale struct back into
liveState. Any concurrent mutation during the check — a user pause, a
config edit, or a push heartbeat — was silently reverted. Worst case: a
heartbeat set UP and an in-flight checkPush overwrote it with a stale
DOWN, firing a false alert.
Introduce applyState(id, mutate): a single read-modify-write helper that
runs the mutator against the CURRENT live entry under the write lock, so
config and Paused are preserved automatically and status transitions are
computed from the true current status. Route handleStatusChange,
RecordHeartbeat, ToggleSitePause and checkGroup through it. Logs and
alerts now fire after the lock is released, off the critical section.
Push false-DOWN is closed by a guard: a non-UP result whose snapshot
LastCheck predates the live LastCheck is dropped, since a heartbeat (or
newer check) superseded it. HTTP/probe stamp LastCheck=now before the
call, so they are unaffected (and serial per site anyway).
Also fixes a latent bug where RecordHeartbeat read StatusChangedAt after
overwriting it, always reporting "was down 0s"; downSince is now captured
before mutation.
Adds regression tests for pause/config-edit/heartbeat-during-check and
removed-site-dropped. Full suite green under -race.
Click any sparkline character to see data point details — approximate
time, latency, and up/down status. Esc dismisses tooltip without
leaving detail view. Uses existing BubbleZone infrastructure with
zone-relative coordinate math for index resolution.
Background goroutine runs every 15 minutes, deletes maintenance windows
that expired beyond the retention period (default 7 days). Configurable
via UPTOP_MAINT_RETENTION env var (Go duration format).
Closes#72
DeleteSite now removes maintenance_windows, check_history, and
state_changes for the site within a transaction before deleting
the site itself. Prevents orphaned rows.
Closes#71
Replace misleading relative-only sparkline with dual-channel design:
bar height uses relative scaling (shows stability and anomalies),
color+brightness uses absolute thresholds (shows fast vs slow).
- Add brightness gradient within color bands (dim→bright as latency
increases toward the next threshold)
- Pass row background through sparkline rendering so zebra stripes
and selection highlights carry through ANSI sequences
- Cap sparkline width to 60 (matches maxHistoryLen) and column
width to 62 to eliminate trailing dead space
- Quiet group sparkline: subtle dots for healthy, bold red for down
- Add braille subpixel canvas (ported from meridian) for future
multi-row graph use
- STATUS column shows icon + clean state only (▲ UP, ▼ DOWN, ◆ LATE,
◆ STALE, ◇ PAUSED, ◼ MAINT, ○ PENDING). Error classification
(DNS/TLS/TMO) removed from STATUS — stays in NAME inline hint.
- Detail panel Last Check shows relative time ("12s ago") instead of
absolute timestamp.
- Extract shared fmtTimeAgo() to format.go, consolidate duplicate
formatters in tab_alerts.go and tab_nodes.go.
Sites table with many rows exceeded the fixed content height,
pushing footer down. MaxHeight now clips content that overflows
while Height still pads shorter content upward.
Each tab returned different leading newlines (Sites/tables: 1,
Logs: 3, empty states: varies). TrimSpace content before layout
so JoinVertical controls all spacing. Remove leading \n from
footer since JoinVertical handles gaps.
Replace string concatenation layout with lipgloss.JoinVertical
and fixed-height content area. Footer now stays at the same
vertical position regardless of tab content height. Uses
lipgloss.Height() to dynamically measure header/footer and
fill remaining space.
Logs were dumping all lines directly, pushing the dashboard
footer off screen. Now uses logViewport with proper height
accounting so footer stays visible and scrolling works.
- Extract divider() and emptyState() helpers to format.go
- All empty states now use bordered box with accent color
- Detail and alert detail panels get header/section dividers
- SLA label width 14→16 to match detail/alert panels
- Logs key hints moved from content to dashboard footer
- History/SLA panels use shared divider helper
- Show version in dashboard footer (wired from goreleaser ldflags)
- Cap name column at 35, raise sparkline minimum to 15 chars
- Preserve zebra background on group rows (was lost by style override)
- Group sparkline uses bullet • instead of heavy circle ●
Full-screen SLA report accessible via [s] from detail panel.
Computes uptime%, downtime, outage count, longest outage, MTTR,
and MTBF from state_changes table. Includes daily breakdown with
bar chart, switchable time periods (24h/7d/30d/90d), and
scrollable viewport. LATE/STALE treated as UP for SLA purposes.
checkGroup only checked for DOWN/SSL EXP and PENDING. Groups
now reflect STALE and LATE children with proper priority:
DOWN > STALE > LATE > PENDING > UP.
New intermediate state between LATE and DOWN at the midpoint of
the grace period. Gives operators earlier warning that a push
monitor has gone quiet. Includes dedicated orange theme color
across all 5 themes and proper styling in dashboard, detail
panel, and history view.
Support Opsgenie Alert API v2 with US/EU endpoint selection,
configurable priority (P1-P5), and GenieKey auth. TUI form
includes API key, priority picker, and EU instance toggle.
Update() routed form and confirm-delete states before handling
time.Time ticks, so those handlers swallowed the tick command and
permanently broke the refresh loop. After opening any form or
delete dialog, the TUI stopped auto-refreshing until restarted.
Move time.Time and WindowSizeMsg handling before state dispatch
so ticks always fire regardless of view state.
Also fix fmtRetries off-by-one: FailureCount=1 displayed as 0/N
instead of 1/N due to an erroneous subtract-one.
Monitor goroutine slept for the full check interval after a config
edit, so hostname/URL changes wouldn't take effect until the next
scheduled check. Added per-site recheck channel that wakes the
goroutine immediately when UpdateSiteConfig is called.
- Replace deprecated LineUp/LineDown/HalfViewUp/HalfViewDown with
ScrollUp/ScrollDown/HalfPageUp/HalfPageDown
- Use tagged switch for mouse button dispatch
- Use fmt.Fprintf instead of WriteString(Sprintf)
Full-screen scrollable history view accessible via [h] from detail
panel. Shows all state transitions with computed outage durations,
event density sparkline for flapping detection, and summary stats.
- Detail panel STATE CHANGES now shows outage duration per recovery
- Event density sparkline highlights flapping periods
- Summary footer: event count, outage count, avg outage duration
- Vim-style navigation (j/k/g/G) + mouse scroll in history view
Categorize raw error strings into DNS/TCP/TLS/HTTP/ICMP/TMO/PRIV
so users get instant triage from the monitor list without opening
the detail panel.
- Status column shows DOWN:DNS, DOWN:TLS, DOWN:HTTP, etc.
- Inline NAME column errors prefixed with category tag [DNS], [TLS]
- Detail panel shows connection chain checklist for HTTP monitors
(✓ DNS → ✓ TCP → ✗ TLS → · HTTP) pinpointing failure layer
- All display-side only — no database or model changes
tui.go (1032→164) and tab_sites.go (993→482) violated "small functions"
and "testable in isolation" standards. Extracted 6 new files by concern:
- format.go: pure formatting functions (fmtLatency, fmtUptime, etc.)
- sparkline.go: sparkline rendering (latency, heartbeat, group)
- update.go: Update method decomposed into 15 named handlers
- view_dashboard.go: View, dashboard composition, tab bar, footer
- view_detail.go: site detail panel
- data.go: data refresh with extracted sortSitesForDisplay/filterSites
Added 17 unit tests for the newly-testable pure functions covering
format, sparkline, sort ordering, and filter logic. No behavioral
changes — strict move-and-extract refactor.