docs: publish 2026-04-26

2026-04-26 22:26:14 -04:00
parent 6730781dd0
commit a061d37602
22 changed files with 388 additions and 2032 deletions
@@ -1,15 +1,39 @@
 # Authentication Flow

+Forward-auth path for an internal service that doesn't speak OIDC natively. OIDC-native services skip Caddy's forward-auth check and run the auth handshake against Authentik themselves: same trust boundary, fewer round trips.
+
 ```mermaid
 sequenceDiagram
-  User->>Caddy: HTTPS request
-  Caddy->>Authentik: Forward auth check
-  Authentik-->>Caddy: 401 if unauthenticated
-  Caddy-->>User: Redirect to auth.lerkolabs.com
-  User->>Authentik: Login (OIDC or forward auth)
-  Authentik-->>User: Session cookie
-  User->>Caddy: HTTPS request + cookie
-  Caddy->>Authentik: Forward auth check
-  Authentik-->>Caddy: 200 OK
-  Caddy->>Service: Proxy request
+  participant U as User
+  participant C as Caddy<br/>(reverse proxy)
+  participant A as Authentik<br/>(IdP)
+  participant S as Internal service
+
+  U->>C: HTTPS request
+  C->>A: Forward auth check
+  A-->>C: 401 (no session)
+  C-->>U: 302 → auth.lerkolabs.com
+
+  U->>A: Login (OIDC or password)
+  A-->>U: Set session cookie
+
+  U->>C: HTTPS request + cookie
+  C->>A: Forward auth check
+  A-->>C: 200 OK + identity headers
+  C->>S: Proxy request<br/>(plain HTTP, internal hop)
+  S-->>U: Response
 ```
+
+## How this ended up here
+
+I originally stood Authentik up only because Outline needed an IdP to function. I figured I'd run it for that one app and forget about it. Then I started noticing how many other apps had OIDC support sitting right there, and integrating each new one was cheap once the IdP was already in place. After a few months, "every internal service goes through SSO" had become the default without me ever sitting down to decide it.
+
+The forward-auth path in this diagram is the catch-all for apps that don't speak OIDC. Caddy intercepts the request, asks Authentik whether the user is logged in, and either proxies through or bounces them to the login page. Less elegant than native OIDC, but it means the "log in with the app's local account" bypass simply doesn't exist anywhere. Every door uses the same key.
+
+The Discord music bot is the one case where this same flow is reachable from the public internet. That started because I wanted my friends to be able to use the bot, which meant the dashboard had to be hittable from outside the LAN. The DMZ Caddy gates each request through Authentik, and the policy only lets through specific Discord user IDs. Friends get in, randoms don't, and the same auth machinery I was already running handles it.
+
+## Notes
+
+- **Edge terminates TLS; internal hops are HTTP.** Trust on internal hops is established by segmentation and identity, not by re-encrypting every jump.
+- **Identity headers from Authentik** are passed to the upstream service so it can attribute requests to a user without implementing its own auth.
+- **No anonymous access path.** If Authentik is down, internal services are unreachable. Accepted tradeoff over the alternative — a fallback "skip auth" mode would inevitably get used and would inevitably be the thing that got abused.
@@ -1,9 +1,58 @@
-# DNS Resolution Chain
+# DNS Resolution
+
+Two flows, one resolver chain. Splitting them apart because the interesting part of the design is what *doesn't* go to the upstream.
+
+## External resolution
+
+What happens when a client asks for a public domain.

 ```mermaid
 graph LR
-  D[Device] --> PH[Pi-hole\n10.2.0.11]
-  PH --> UB[pfSense Unbound\n10.x.0.1]
-  UB --> CF[Cloudflare\n1.1.1.1]
-  PH -- "*.lerkolabs.com" --> CADDY[Caddy\n10.2.0.20]
+  CLIENT[Client<br/>most VLANs] --> PIHOLE[Pi-hole<br/>filtering + cache]
+  PIHOLE -->|miss| UNBOUND[Unbound on firewall<br/>recursive + DNSSEC]
+  UNBOUND --> UPSTREAM[Cloudflare<br/>fallback only]
+
+  PIHOLE -.->|blocked| BLOCKED[Ad/tracker<br/>domains]
+
+  classDef client fill:#1f2f3a,stroke:#3a6b8b,color:#d0e0f0
+  classDef resolver fill:#1f3a2f,stroke:#3a8b6b,color:#d0f0e0
+  classDef upstream fill:#3a2f1f,stroke:#8b6b3a,color:#f0e0d0
+  classDef blocked fill:#3a1f1f,stroke:#8b3a3a,color:#f0d0d0
+
+  class CLIENT client
+  class PIHOLE,UNBOUND resolver
+  class UPSTREAM upstream
+  class BLOCKED blocked
 ```
+
+## Local hostname resolution (split-horizon)
+
+What happens when a client asks for an internal hostname. The query never leaves the LAN — Pi-hole answers from its local A records, and the client connects to the internal reverse proxy directly.
+
+```mermaid
+graph LR
+  CLIENT[Client] -->|asks for<br/>app.lerkolabs.com| PIHOLE[Pi-hole<br/>local A records]
+  PIHOLE -->|returns<br/>internal IP| CLIENT
+  CLIENT -->|HTTPS<br/>valid public cert| CADDY[Internal Caddy<br/>reverse proxy]
+  CADDY --> SVC[Internal service]
+
+  classDef client fill:#1f2f3a,stroke:#3a6b8b,color:#d0e0f0
+  classDef resolver fill:#1f3a2f,stroke:#3a8b6b,color:#d0f0e0
+  classDef edge fill:#2f1f3a,stroke:#6b3a8b,color:#e0d0f0
+
+  class CLIENT client
+  class PIHOLE resolver
+  class CADDY,SVC edge
+```
+
+## Why this design
+
+A few things are doing more work here than they look.
+
+**Pi-hole is the only authoritative source for internal names.** One source of truth for hostname → IP, one place to update when something moves. It's also a documented single point of failure: if it dies, internal hostnames stop resolving. I considered mirroring the records into Unbound on the firewall as a fallback and decided not to. I'd rather know Pi-hole is unhealthy than paper over it with a fallback that hides the problem.
+
+**Internal services get valid public certs without ever being exposed to the internet.** The ACME DNS-01 challenge, answered through Cloudflare DNS, proves I control the domain via a TXT record, so issuance never needs a publicly reachable HTTP-01 endpoint. Combined with split-horizon DNS, a VPN or LAN client browsing to `app.lerkolabs.com` gets a real cert chain on a connection that never leaves the network. The cert proves identity on the client-to-proxy leg; segmentation handles confidentiality on the plain-HTTP hops behind it.
+
+**Bootstrap exception.** The host running Pi-hole has to resolve through the firewall directly, not through itself, or nothing comes up at boot. Took a power outage to learn that one cleanly.
+
+**WFH and Management tiers don't use Pi-hole.** Different reasons, both deliberate — see private repo for detail. The short version: the WFH laptop shouldn't see the local hostname inventory, and Management hosts can't depend on Pi-hole being up.
@@ -1,15 +1,93 @@
 # Network Topology

+Two views of the same network. The trust-tier diagram is how I *reason* about it. The physical-flow one is for when someone asks "but where does it actually plug in."
+
+## Trust tiers and policy
+
+Seven VLANs grouped by how much I trust what's on them. Edges are allowed inter-tier flows; everything else is default-deny.
+
 ```mermaid
-graph TD
-  ONT[AT&T Fiber ONT] --> BGW[BGW320 IP Passthrough]
-  BGW --> PF[pfSense N100]
-  PF --> SW[Omada Switch]
-  SW --> MGMT[VLAN 1000 MGMT\n10.0.0.0/24]
-  SW --> LAN[VLAN 1010 LAN\n10.1.0.0/24]
-  SW --> HL[VLAN 1020 Homelab\n10.2.0.0/24]
-  SW --> GUEST[VLAN 1030 Guests\n10.3.0.0/24]
-  SW --> IOT[VLAN 1040 IoT\n10.4.0.0/24]
-  SW --> WFH[VLAN 1050 WFH\n10.5.0.0/24]
-  SW --> DMZ[VLAN 1 DMZ\n10.99.0.0/24]
+graph TB
+  subgraph UNTRUSTED["Untrusted — internet only, no internal access"]
+    GUEST[Guest WiFi]
+    IOT[IoT]
+    WFH[Work-from-home]
+  end
+
+  subgraph PUBLIC["Public-facing"]
+    DMZ[DMZ<br/>reverse proxy + public services]
+  end
+
+  subgraph TRUSTED["Trusted"]
+    LAN[LAN<br/>personal devices]
+    INT[Internal services<br/>app stack]
+  end
+
+  subgraph MGMT["Management — VPN-only"]
+    ADMIN[Hypervisor, firewall,<br/>backup, switches, APs]
+  end
+
+  subgraph REMOTE["Remote"]
+    VPN[WireGuard clients]
+  end
+
+  INTERNET((Internet))
+
+  UNTRUSTED -->|outbound only| INTERNET
+  INTERNET -->|HTTP/HTTPS<br/>tight allowlist| DMZ
+  INTERNET -->|WireGuard<br/>UDP| VPN
+
+  DMZ -.->|narrow allowlist<br/>firewall-enforced| INT
+  LAN -->|consume services| INT
+  VPN -->|LAN-equivalent +<br/>admin access| INT
+  VPN --> ADMIN
+
+  classDef untrusted fill:#3a1f1f,stroke:#8b3a3a,color:#f0d0d0
+  classDef public fill:#3a2f1f,stroke:#8b6b3a,color:#f0e0d0
+  classDef trusted fill:#1f3a2f,stroke:#3a8b6b,color:#d0f0e0
+  classDef mgmt fill:#1f2f3a,stroke:#3a6b8b,color:#d0e0f0
+  classDef remote fill:#2f1f3a,stroke:#6b3a8b,color:#e0d0f0
+
+  class GUEST,IOT,WFH untrusted
+  class DMZ public
+  class LAN,INT trusted
+  class ADMIN mgmt
+  class VPN remote
 ```
+
+## Physical flow
+
+What plugs into what. Tier labels, not addresses.
+
+```mermaid
+graph LR
+  ISP[ISP] --> GW[Carrier gateway<br/>passthrough mode]
+  GW --> FW[pfSense firewall]
+  FW --> SW[Managed switch<br/>VLAN-aware]
+
+  SW --> T_MGMT[MGMT tier]
+  SW --> T_INT[Internal services tier]
+  SW --> T_LAN[LAN tier]
+  SW --> T_WFH[WFH tier]
+  SW --> T_IOT[IoT tier]
+  SW --> T_GUEST[Guest tier]
+  SW --> T_DMZ[DMZ tier]
+
+  FW -.->|VPN concentrator| VPN[WireGuard]
+```
+
+## Why two reverse proxies, not one
+
+The DMZ-to-internal arrow above does a lot of work, so worth being explicit. There are two Caddy instances:
+
+- One in the DMZ, internet-facing, fronting a deliberately small set of public services.
+- One in the internal services tier, LAN/VPN only, fronting everything else.
+
+The first version of this was a single Caddy cloned into the DMZ doing both jobs. It "worked," in the sense that nothing was on fire — but every internal admin surface was technically internet-reachable, gated only by app-level auth. Once I drew it out I realized I'd built exactly the thing the DMZ was supposed to prevent. Splitting them was the fix: the DMZ Caddy can only see the small handful of backends it's allowed to reach, the firewall enforces that independently of the proxy config, and VPN clients hit the internal Caddy directly without ever touching the DMZ.
+
+This is the layered-controls principle from `SECURITY.md` made concrete. Both the proxy and the firewall enforce the same allowlist. Misconfiguring the proxy is way easier than misconfiguring the firewall, so the firewall is the backstop when the proxy config is wrong.
+
+## Notes
+
+- Inter-tier policy is enforced at the firewall. Intra-tier traffic between hosts on the same bridge is not filtered there; see `NETWORK.md` for why that matters when reasoning about blast radius.
+- Subnets, VLAN IDs, hardware models, and ISP details live in the private repo. The trust tiers are the part worth publishing; the IP plan isn't.