Commit Graph

13 Commits

Author SHA1 Message Date
lerko 0396acdc59 feat(cluster): add region affinity, Nodes TUI tab, and probe metrics
Phase 3 of distributed probing:
- Add regions column to sites table for per-monitor probe affinity
- Region-filtered probe assignments (empty regions = all probes)
- New Nodes TUI tab showing connected probes with status/region/last-seen
- Regions input field in site form for configuring probe affinity
- Config-as-code support for regions (export/import/diff)
- Prometheus upkeep_probe_up metric with per-node labels
- Reindex TUI tabs: Sites, Alerts, Logs, Nodes, Users
2026-05-16 11:50:16 -04:00
lerko ca9faa0acd feat(cluster): add distributed probing foundation — schema, models, and probe APIs
Add node-aware check history and probe registration infrastructure:
- ProbeNode model and nodes table (SQLite + Postgres)
- node_id column on check_history for multi-source tracking
- Store interface: RegisterNode, GetNode, GetAllNodes, DeleteNode, SaveCheckFromNode
- Dialect: UpsertNodeSQL (INSERT OR REPLACE / ON CONFLICT)
- API endpoints: POST /api/probe/register, GET /api/probe/assignments, POST /api/probe/results
- Backward compatible: existing SaveCheck wraps SaveCheckFromNode with empty node_id
2026-05-16 11:05:06 -04:00
lerko 5b01b9ee30 feat(config): add config-as-code YAML import/export
Add declarative config-as-code support via YAML files. Monitors and
alerts can be exported, version controlled, and applied across instances.

- goupkeep export [-o file.yaml] dumps current state
- goupkeep apply -f file.yaml creates/updates to match desired state
- --dry-run shows planned changes without applying
- --prune deletes monitors/alerts not in the YAML
- Matching by name, alert references by name, nested group children
- CLI refactored to subcommands (apply, export, serve) with backward compat
- 24 tests covering apply, export, validation, round-trip idempotency
2026-05-15 20:40:49 -04:00
lerko f023e38fdc refactor(monitor): encapsulate engine state, add graceful shutdown and tests
Replace all monitor package-level mutable state with Engine struct.
All state (liveState, logStore, histories, tokenIndex, HTTP clients)
is now encapsulated in Engine, created via NewEngine(store).

Key changes:
- Engine struct holds all monitor state with proper mutex protection
- Engine.Start(ctx) and monitorRoutine respect context cancellation
  for graceful shutdown — no more leaked goroutines
- cluster.runFollowerLoop also respects context for clean exit
- Token index (map[string]int) for O(1) push heartbeat lookup,
  replacing O(n) linear scan through LiveState
- UpdateSiteConfig preserves 8 runtime fields instead of copying
  17 config fields individually
- triggerAlert goroutines get 30s timeout context
- All consumers (TUI, server, cluster, main) receive *Engine via
  constructor/parameter — no package-level state access
- main.go creates context.WithCancel, passes to engine and cluster

First test suite: 12 tests across store and alert packages
- Store: CRUD for sites/alerts/users, push token generation,
  import/export round-trip, check history persistence
- Alert: Discord/Slack/Webhook payload format, HTTP 4xx error
  propagation, Ntfy headers, unknown provider returns nil
2026-05-15 08:21:17 -04:00
lerko a6bb9a7aff refactor(core): remove store global singleton, thread store explicitly
Remove store.Get()/SetGlobal()/Current. Store is now passed explicitly
to all consumers via constructor parameters and function arguments.

- TUI Model holds store field, set via InitialModel(isAdmin, store)
- monitor.StartEngine(s) and InitHistoryFromStore(s) accept store
- server.Start(cfg, s) closes over store in HTTP handlers
- main.go threads store to SSH server, TUI, monitor, server
- isKeyAllowed receives store as parameter

No more hidden dependency on package-level mutable state in store pkg.
Monitor package still uses package-level state (LiveState, etc.) — will
be encapsulated into Engine struct in Phase 7.
2026-05-15 00:45:07 -04:00
lerko d4f4012c8a refactor(store): add error returns to all Store interface methods
Every Store method now returns an error. Callers handle errors
gracefully — TUI logs to event log, server returns HTTP 500,
monitor engine logs and retries. All rows.Scan() errors are now
checked in sqlstore.go instead of silently appending corrupt data.

- GetSites, GetAllAlerts, GetAllUsers return ([]T, error)
- GetAlert returns (AlertConfig, error) instead of (AlertConfig, bool)
- AddSite, UpdateSite, DeleteSite, etc. all return error
- SaveCheck, LoadAllHistory, ExportData return error
- ~25 caller sites updated across tui, server, monitor, main
2026-05-15 00:37:20 -04:00
lerko ab75f61c6b refactor(store): unify SQLite and Postgres into dialect-based SQLStore
Extract shared SQLStore with Dialect interface for the ~5% that
differs between backends (DDL, placeholders, sequence resets).

- New dialect.go: Dialect interface + placeholder rewriter (? → $N)
- New sqlstore.go: single implementation of all 19 Store methods
- sqlite.go: reduced from 286 to 83 lines (SQLiteDialect only)
- postgres.go: reduced from 266 to 78 lines (PostgresDialect only)
- main.go: use NewSQLiteStore/NewPostgresStore constructors

Zero CRUD logic duplication. Every future schema change written once.
2026-05-15 00:31:44 -04:00
lerko 4d5116644f fix(core): correctness and robustness fixes across all subsystems
- Move status page template to package-level template.Must (panic on
  parse error at init instead of nil deref at runtime)
- Fix XSS in import error responses (log detail server-side, return
  generic message to client)
- Handle ListenAndServe errors in HTTP and SSH servers
- Use defer resp.Body.Close() in all alert providers, check
  json.Marshal errors
- Share HTTP clients across checks instead of creating per-request
- Use http.NewRequestWithContext for per-site timeout control
- Support HTTP method field (was always GET despite DB storing method)
- Implement AcceptedCodes validation (was hardcoded >= 400 despite DB
  storing accepted code ranges)
- Add defer tx.Rollback() to ImportData for transaction safety
2026-05-15 00:00:02 -04:00
lerko e97780ad38 fix(tui,status,store): add delete confirm, input validation, XSS fix, history persistence
Prevent accidental deletes with y/n confirmation dialog. Validate all
numeric form inputs (interval, port, timeout, threshold, retries) with
range checks instead of silently defaulting to zero. Escape user-supplied
data in status page JavaScript to close XSS via monitor names. Persist
check history to new check_history table so sparklines and uptime
percentages survive restarts.
2026-05-14 20:51:06 -04:00
lerko d5ab3a18a4 feat(tui,status): add per-site pause, fix viewport, polish status page
Per-site pause: [p] key toggles pause for selected monitor in TUI.
Paused monitors skip checks, persist to DB, show on status page.

Status page: replace full-page reload with fetch-based DOM updates
to eliminate scroll-jump on refresh. Add summary bar (UP/DOWN/PAUSED
counts), stale-data indicator, and fix SSL EXP CSS class bug.

TUI: constrain tables to terminal width via lipgloss .Width() to
prevent row wrapping that pushed header off-screen. Add MaxHeight
safety net. Bump subtle style from #383838 to #565f89 for
readability on dark terminals.
2026-05-14 18:46:17 -04:00
lerko f06dd5702b feat(models): widen Site struct and DB schema for ping, port, dns, group monitor types
Add Hostname, Port, Timeout, Method, Description, ParentID, AcceptedCodes,
DNSResolveType, DNSServer, and IgnoreTLS fields. Refactor AddSite/UpdateSite
to accept models.Site instead of individual params. Includes DB migrations
for existing databases, per-monitor timeout/TLS in the engine, new type
options in TUI forms, and TYPE column in the sites table.
2026-05-14 17:10:56 -04:00
lerko b7592ee9e5 feat(tui): upgrade users tab with lipgloss table, edit support, role select
Users tab now matches sites/alerts quality: lipgloss bordered table,
click-to-select zones, edit form with role picker, and UpdateUser
support across both store backends.
2026-05-14 15:28:04 -04:00
lerko 02f0a39d97 feat: initial commit — uptime monitor (forked from go-upkeep)
Go-based uptime monitor with SQLite/Postgres storage, TUI dashboard,
SSH server, alerting, and clustering support.
2026-05-14 11:05:10 -04:00