feat: show error reason when monitors go DOWN #33

Merged
lerko merged 1 commit from feat/error-reason into main 2026-05-27 23:38:26 +00:00

Summary

When a monitor goes DOWN, the app now tells you WHY — the #1 UX gap from the screenshot review.

What changed

Error propagation — checker captures specific failure reasons at all 6 check types:

  • HTTP: connection refused, timeout, HTTP 503 (expected 200-299), SSL certificate expired
  • Ping: no ICMP response, ping failed: <reason>
  • Port: dial tcp: connection refused
  • DNS: DNS query failed: <reason>, DNS RCODE: NXDOMAIN
  • Push: heartbeat missed
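
As an illustration of how an HTTP check might map failures onto the reason strings listed above, here is a minimal sketch. The helper name `httpFailureReason` and its signature are assumptions made for this example and are not taken from the PR; only the output strings come from the description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"syscall"
)

// httpFailureReason is a hypothetical helper (not from the PR). It turns a
// transport error or an unexpected status code into a short reason string.
// An empty return value means the check passed.
func httpFailureReason(err error, statusCode int) string {
	if err != nil {
		var netErr net.Error
		switch {
		case errors.Is(err, syscall.ECONNREFUSED):
			return "connection refused"
		case errors.Is(err, context.DeadlineExceeded),
			errors.As(err, &netErr) && netErr.Timeout():
			return "timeout"
		default:
			// Fall back to the raw error text for anything unclassified.
			return err.Error()
		}
	}
	if statusCode < 200 || statusCode > 299 {
		return fmt.Sprintf("HTTP %d (expected 200-299)", statusCode)
	}
	return ""
}

func main() {
	fmt.Println(httpFailureReason(nil, 503))
	fmt.Println(httpFailureReason(syscall.ECONNREFUSED, 0))
	fmt.Println(httpFailureReason(context.DeadlineExceeded, 0))
}
```

Using `errors.Is` and `errors.As` rather than string matching keeps the classification working when the underlying error is wrapped, for example inside a `*net.OpError`.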

State tracking — engine tracks per-monitor:

  • LastError — most recent error string
  • StatusChangedAt — when status last changed
  • LastSuccessAt — last UP check timestamp
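
The three fields above could be maintained roughly as follows. This is a sketch under assumptions: the `MonitorState` type, the `Status` constants, and the `Apply` method are hypothetical names for illustration; only the field names `LastError`, `StatusChangedAt`, and `LastSuccessAt` come from the PR. It also assumes `LastError` is kept after recovery, which the description does not specify.

```go
package main

import (
	"fmt"
	"time"
)

// Status and MonitorState are hypothetical types for this sketch.
type Status string

const (
	StatusUp   Status = "UP"
	StatusDown Status = "DOWN"
)

type MonitorState struct {
	Status          Status
	LastError       string    // most recent error string
	StatusChangedAt time.Time // when Status last changed
	LastSuccessAt   time.Time // time of the last UP check
}

// Apply records one check result and reports whether the status changed.
// The very first result always counts as a change because Status starts empty.
func (s *MonitorState) Apply(up bool, reason string, now time.Time) bool {
	next := StatusDown
	if up {
		next = StatusUp
		s.LastSuccessAt = now
	} else {
		s.LastError = reason
	}
	if next == s.Status {
		return false
	}
	s.Status = next
	s.StatusChangedAt = now
	return true
}

func main() {
	var s MonitorState
	now := time.Now()
	s.Apply(true, "", now)
	changed := s.Apply(false, "timeout", now.Add(time.Minute))
	fmt.Println(changed, s.Status, s.LastError)
}
```

Returning a boolean from `Apply` gives the caller a single place to decide whether to persist a transition or send an alert.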

State change persistence — new state_changes table records every UP↔DOWN transition with timestamp + error. Survives restarts.
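
To make the persistence contract concrete, the sketch below pairs a `StateChange` record with a store interface and an in-memory implementation. The names `StateChange`, `SaveStateChange`, and `GetStateChanges` appear in the PR, but the struct fields, method signatures, and the `memStore` type are assumptions for illustration; the real implementation writes to the `state_changes` table in SQLite or Postgres.

```go
package main

import (
	"fmt"
	"time"
)

// StateChange fields are assumed; the PR only names the struct.
type StateChange struct {
	SiteID    int
	From, To  string
	Reason    string
	ChangedAt time.Time
}

// StateChangeStore uses hypothetical signatures for the two methods
// the PR adds to the store interface.
type StateChangeStore interface {
	SaveStateChange(sc StateChange) error
	GetStateChanges(siteID, limit int) ([]StateChange, error)
}

// memStore is an in-memory stand-in for a SQL-backed store.
type memStore struct{ rows []StateChange }

func (m *memStore) SaveStateChange(sc StateChange) error {
	m.rows = append(m.rows, sc)
	return nil
}

// GetStateChanges returns up to limit rows for one site, newest first,
// assuming rows were saved in chronological order.
func (m *memStore) GetStateChanges(siteID, limit int) ([]StateChange, error) {
	var out []StateChange
	for i := len(m.rows) - 1; i >= 0 && len(out) < limit; i-- {
		if m.rows[i].SiteID == siteID {
			out = append(out, m.rows[i])
		}
	}
	return out, nil
}

func main() {
	var store StateChangeStore = &memStore{}
	now := time.Now()
	store.SaveStateChange(StateChange{SiteID: 1, From: "UP", To: "DOWN", Reason: "timeout", ChangedAt: now})
	store.SaveStateChange(StateChange{SiteID: 1, From: "DOWN", To: "UP", ChangedAt: now.Add(time.Minute)})
	recent, _ := store.GetStateChanges(1, 5)
	for _, sc := range recent {
		fmt.Printf("%s -> %s %q\n", sc.From, sc.To, sc.Reason)
	}
}
```

A `limit` argument matches the detail panel's need for only the last 5 events without loading the full history.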

TUI detail panel — now shows:

  • Error reason (red, when DOWN)
  • HTTP status code
  • State duration ("State Since: 2026-05-27 14:32 (14m)")
  • Last successful check time
  • Last 5 state change events with timestamps + reasons
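
The "State Since" line could be produced along these lines. The PR mentions a `fmtDuration` helper in `tab_sites.go`, but its actual behavior is not shown; the version below is a hypothetical reimplementation, and the `stateSince` function and the chosen unit thresholds are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// fmtDuration renders a compact duration such as "14m" or "1h30m".
// This is a guess at the behavior, not the PR's implementation.
func fmtDuration(d time.Duration) string {
	if d < 0 {
		d = 0
	}
	switch {
	case d < time.Minute:
		return fmt.Sprintf("%ds", int(d.Seconds()))
	case d < time.Hour:
		return fmt.Sprintf("%dm", int(d.Minutes()))
	case d < 24*time.Hour:
		return fmt.Sprintf("%dh%dm", int(d.Hours()), int(d.Minutes())%60)
	default:
		return fmt.Sprintf("%dd%dh", int(d.Hours())/24, int(d.Hours())%24)
	}
}

// stateSince is a hypothetical helper that builds the detail-panel line.
func stateSince(changedAt, now time.Time) string {
	return fmt.Sprintf("State Since: %s (%s)",
		changedAt.Format("2006-01-02 15:04"), fmtDuration(now.Sub(changedAt)))
}

func main() {
	changedAt := time.Date(2026, 5, 27, 14, 32, 0, 0, time.UTC)
	fmt.Println(stateSince(changedAt, changedAt.Add(14*time.Minute)))
}
```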

Monitor table — DOWN rows show truncated error inline after the name in subtle gray
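
One way to truncate the inline error so it fits the table column is sketched below. The function name `truncateReason` and the width parameter are hypothetical; the PR does not say how or at what width the text is cut.

```go
package main

import "fmt"

// truncateReason shortens s to at most max runes, ending with an ellipsis
// when text was cut. It counts runes, not bytes, so multi-byte characters
// are never split.
func truncateReason(s string, max int) string {
	if max < 1 {
		return ""
	}
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max-1]) + "…"
}

func main() {
	fmt.Println(truncateReason("dial tcp: connection refused", 16))
	fmt.Println(truncateReason("timeout", 16))
}
```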

Alert messages — include error reason: Monitor 'X' is DOWN: connection refused
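
A minimal sketch of building that alert text follows. The `alertMessage` function is a hypothetical name, and the choice to omit the suffix when no reason is available is an assumption rather than documented behavior.

```go
package main

import "fmt"

// alertMessage formats an alert line, appending the error reason
// only for DOWN alerts that actually carry one.
func alertMessage(name, status, reason string) string {
	msg := fmt.Sprintf("Monitor '%s' is %s", name, status)
	if status == "DOWN" && reason != "" {
		msg += ": " + reason
	}
	return msg
}

func main() {
	fmt.Println(alertMessage("X", "DOWN", "connection refused"))
	fmt.Println(alertMessage("X", "UP", ""))
}
```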

Probe nodes — forward error reasons to leader via error_reason JSON field (backward-compatible)
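
The backward compatibility claimed above can be illustrated with an optional JSON field. Only the `error_reason` key comes from the PR; the `ProbeResult` type, its other fields, and the helper functions are assumptions for this sketch. With `omitempty`, a probe that has no reason emits the same payload shape as before, and a leader decoding a payload from an older probe simply sees an empty string.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProbeResult is a hypothetical payload; only error_reason is from the PR.
type ProbeResult struct {
	MonitorID   int    `json:"monitor_id"`
	Up          bool   `json:"up"`
	ErrorReason string `json:"error_reason,omitempty"`
}

func encodeProbeResult(r ProbeResult) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func decodeProbeResult(data []byte) (ProbeResult, error) {
	var r ProbeResult
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	// A payload from an older probe that predates error_reason still decodes.
	old, _ := decodeProbeResult([]byte(`{"monitor_id":1,"up":false}`))
	fmt.Printf("%+v\n", old)

	// A newer probe includes the reason only when there is one.
	out, _ := encodeProbeResult(ProbeResult{MonitorID: 1, ErrorReason: "timeout"})
	fmt.Println(out)
}
```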

Files touched (15)

  • internal/models/models.go — Site fields + StateChange struct
  • internal/monitor/checker.go — ErrorReason on CheckResult
  • internal/monitor/monitor.go — handleStatusChange, state tracking, state persistence
  • internal/monitor/aggregator.go — ErrorReason on NodeResult
  • internal/store/store.go — SaveStateChange + GetStateChanges interface
  • internal/store/sqlstore.go — implementations
  • internal/store/sqlite.go + postgres.go — state_changes table
  • internal/tui/tab_sites.go — detail panel, inline error, fmtDuration
  • internal/cluster/probe.go — error_reason in probe results
  • internal/server/server.go — accept + forward probe errors
  • 4 test files — mock store updates
lerko added 1 commit 2026-05-27 23:32:57 +00:00
feat: show error reason when monitors go DOWN
CI / test (pull_request) Successful in 2m42s
CI / lint (pull_request) Successful in 1m11s
CI / vulncheck (pull_request) Successful in 51s
bc3a44beac
Propagate check failure reasons through the entire stack:
- Checker captures specific errors (DNS, timeout, HTTP status, SSL, etc.)
- Engine tracks LastError, StatusChangedAt, LastSuccessAt per monitor
- State transitions persisted to new state_changes table
- Detail panel shows error reason, HTTP code, state duration, last
  success time, and last 5 state change events
- Monitor table shows inline error preview for DOWN services
- Alert messages include error reason
- Probe nodes forward error reasons to leader

15 files changed across models, checker, engine, store, TUI, and probes.
lerko merged commit 63773b13d0 into main 2026-05-27 23:38:26 +00:00
lerko deleted branch feat/error-reason 2026-05-27 23:38:27 +00:00
Reference: lerkolabs/uptop#33