Certificate Architecture Documentation
Overview
This ticket documents the Tinkero certificate architecture, which enables automatic SSL certificate acquisition for deployed sites. It explains the problem, the solution, and the implementation, with a diagram for each part.
The Problem: Why HostRegexp Doesn't Work
Original Architecture (Broken)
The initial Tinkero deployment used a single catch-all Traefik router to handle all subdomains:
traefik.http.routers.caddy.rule=HostRegexp(`{subdomain:[a-z0-9-]+}.tkr.lair.nntin.xyz`)
traefik.http.routers.caddy.tls.certresolver=letsencrypt
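The Issue: Traefik derives the domain names for automatic certificate acquisition from explicit Host() rules. A pattern-matched HostRegexp() rule gives it no concrete domain to request, so no certificate is acquired for the sites it matches.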
Result: All deployed sites present Traefik's default self-signed certificate, causing browser security warnings and making HTTPS unusable.
Why This Limitation Exists
Let's Encrypt requires domain validation before issuing certificates. Traefik needs to know the exact domain name to request a certificate for. With HostRegexp(), Traefik only knows the pattern, not the specific domains that will match it.
Analogy: It's like asking for a driver's license without providing your name - the pattern matches many possibilities, but the certificate authority needs a specific identity.
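The difference between the two rule types can be sketched in Go. This is a simplified illustration, not Traefik's actual rule parser: the helper `explicitHosts` is hypothetical and only handles single-domain Host() matchers. It shows that a Host() rule yields a concrete domain to request a certificate for, while a HostRegexp() rule yields none.

```go
package main

import (
	"fmt"
	"regexp"
)

// hostMatcher finds Host(`...`) matchers. It does not match HostRegexp(`...`),
// because in that matcher "Host" is followed by "Regexp", not by "(".
var hostMatcher = regexp.MustCompile("Host\\(`([^`]+)`\\)")

// explicitHosts returns the concrete domain names named by Host() matchers in
// a rule. Hypothetical helper for illustration only.
func explicitHosts(rule string) []string {
	var hosts []string
	for _, m := range hostMatcher.FindAllStringSubmatch(rule, -1) {
		hosts = append(hosts, m[1])
	}
	return hosts
}

func main() {
	// An explicit rule names one domain that a certificate can be requested for.
	fmt.Println(explicitHosts("Host(`site1.tkr.lair.nntin.xyz`)"))
	// A pattern rule names no concrete domain at all.
	fmt.Println(explicitHosts("HostRegexp(`{subdomain:[a-z0-9-]+}.tkr.lair.nntin.xyz`)"))
}
```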
The Solution: Per-Site Routers with Sidecar Containers
Architecture Overview
Instead of one catch-all router, we create individual Traefik routers for each deployed site using lightweight Docker sidecar containers.
graph TD
subgraph "Internet"
User[User Browser]
end
subgraph "Traefik Reverse Proxy"
Traefik[Traefik]
Router1[Router: site1.tkr...]
Router2[Router: site2.tkr...]
Router3[Router: site3.tkr...]
Catchall[Catchall Router: *.tkr...]
end
subgraph "Docker Network: lair-network"
Sidecar1[Sidecar Container 1]
Sidecar2[Sidecar Container 2]
Sidecar3[Sidecar Container 3]
end
subgraph "Docker Network: tinkero-network"
Caddy[Caddy Static Server]
Files1[Site 1 Files]
Files2[Site 2 Files]
Files3[Site 3 Files]
end
subgraph "Certificate Authority"
LE[Let's Encrypt]
end
User -->|HTTPS Request| Traefik
Traefik -->|Discovers| Sidecar1
Traefik -->|Discovers| Sidecar2
Traefik -->|Discovers| Sidecar3
Sidecar1 -.->|Defines| Router1
Sidecar2 -.->|Defines| Router2
Sidecar3 -.->|Defines| Router3
Router1 -->|Routes to| Caddy
Router2 -->|Routes to| Caddy
Router3 -->|Routes to| Caddy
Catchall -->|Fallback| Caddy
Caddy --> Files1
Caddy --> Files2
Caddy --> Files3
Traefik <-->|TLS-ALPN-01| LE
Key Components
Sidecar Containers:
- Minimal Alpine Linux containers (~10MB each)
- Run sleep infinity - exist only for their Docker labels
- Carry Traefik routing configuration as labels
- Join both lair-network (for Traefik discovery) and tinkero-network (for routing)
Traefik Docker Provider:
- Watches Docker daemon for new containers
- Reads labels from sidecar containers
- Automatically creates routers with explicit Host() rules
- Triggers Let's Encrypt certificate requests
Graceful Degradation:
- Catch-all HostRegexp() router remains active (priority 1)
- Per-site routers have higher priority (explicit Host() rules naturally win)
- If per-site router fails, catch-all handles traffic (HTTP works, HTTPS shows warning)
Network Topology
Multi-Network Architecture
graph TB
subgraph "External"
Internet[Internet]
DNS[DNS: *.tkr.lair.nntin.xyz]
end
subgraph "lair-network (Traefik Discovery)"
Traefik[Traefik Proxy]
Sidecar1[Sidecar: site1]
Sidecar2[Sidecar: site2]
Grafana[Grafana]
Prometheus[Prometheus]
end
subgraph "tinkero-network (Service Communication)"
Caddy[Caddy Server]
Webhook[Webhook Handler]
Redis[Redis]
Loki[Loki]
end
subgraph "Host Filesystem"
Sites[/data/tinkero/sites/]
Certs["/data/traefik/acme.json"]
end
Internet -->|HTTPS| Traefik
DNS -.->|Resolves to| Traefik
Traefik -->|Discovers| Sidecar1
Traefik -->|Discovers| Sidecar2
Traefik -->|Routes to| Caddy
Caddy -->|Serves from| Sites
Webhook -->|Creates| Sidecar1
Webhook -->|Creates| Sidecar2
Webhook -->|Stores metadata| Redis
Traefik -->|Stores certs| Certs
Traefik -->|Metrics| Prometheus
Webhook -->|Logs| Loki
Grafana -->|Queries| Prometheus
Grafana -->|Queries| Loki
style Sidecar1 fill:#e1f5ff
style Sidecar2 fill:#e1f5ff
style Traefik fill:#fff3cd
style Caddy fill:#d4edda
Why Two Networks?
- lair-network: Traefik's Docker provider watches this network for containers with routing labels
- tinkero-network: Internal service communication (Webhook Handler → Caddy, Redis, etc.)
- Sidecar containers join BOTH: Enables Traefik to discover them AND route traffic to Caddy
Component Architecture
System Components and Interactions
graph TD
subgraph "Webhook Handler Service"
Server[HTTP Server]
Deploy[Deployment Orchestrator]
TraefikClient[Traefik Client]
CaddyClient[Caddy Client]
RedisClient[Redis Client]
Metrics[Prometheus Metrics]
end
subgraph "External Services"
GitHub[GitHub Webhooks]
Docker[Docker Daemon]
RedisDB[(Redis Database)]
TraefikProxy[Traefik Proxy]
CaddyServer[Caddy Server]
end
subgraph "Observability Stack"
Prom[Prometheus]
Graf[Grafana]
LokiDB[Loki]
end
GitHub -->|Webhook| Server
Server -->|Orchestrate| Deploy
Deploy -->|Build| Docker
Deploy -->|Deploy Files| CaddyServer
Deploy -->|Configure Route| CaddyClient
Deploy -->|Create Router| TraefikClient
TraefikClient -->|Create Container| Docker
Docker -->|Discover| TraefikProxy
Deploy -->|Store Metadata| RedisClient
RedisClient -->|Read/Write| RedisDB
Server -->|Record| Metrics
Metrics -->|Expose| Prom
Graf -->|Query| Prom
Graf -->|Query| LokiDB
Server -->|Logs| LokiDB
style TraefikClient fill:#e1f5ff
style Deploy fill:#fff3cd
Traefik Client Package Structure
The new internal/traefik/ package follows the pattern established by internal/caddy/:
internal/traefik/
├── client.go // Docker-based router creation
├── router.go // Label builder for Traefik config
├── types.go // RouterConfig struct and constants
├── reconcile.go // Startup cleanup of orphaned containers
└── client_test.go // Unit tests
Key Responsibilities:
- Client: Creates/removes Docker sidecar containers
- RouterConfig: Builds Traefik label map for a site
- reconcile: Cleans up orphaned containers on startup (idempotent recovery)
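A label builder along these lines might look as follows. This is a minimal sketch, not the contents of router.go: the field names, the `Labels` method, and the service name `caddy` are assumptions. The label keys themselves (`traefik.enable`, `traefik.http.routers.<name>.rule`, `.tls.certresolver`, `.service`) are standard Traefik Docker-provider labels.

```go
package main

import (
	"fmt"
	"sort"
)

// RouterConfig describes one per-site router. Field names are assumptions
// made for this sketch.
type RouterConfig struct {
	SiteName     string // used as the Traefik router name, e.g. "site1"
	Domain       string // fully qualified domain for the explicit Host() rule
	Service      string // assumed name of the existing Traefik service in front of Caddy
	CertResolver string // e.g. "letsencrypt"
}

// Labels builds the Docker label map that a sidecar container would carry.
func (c RouterConfig) Labels() map[string]string {
	prefix := "traefik.http.routers." + c.SiteName
	return map[string]string{
		"traefik.enable":             "true",
		prefix + ".rule":             "Host(`" + c.Domain + "`)",
		prefix + ".tls.certresolver": c.CertResolver,
		prefix + ".service":          c.Service,
	}
}

func main() {
	cfg := RouterConfig{
		SiteName:     "site1",
		Domain:       "site1.tkr.lair.nntin.xyz",
		Service:      "caddy",
		CertResolver: "letsencrypt",
	}
	labels := cfg.Labels()
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, labels[k])
	}
}
```

Because the rule is an explicit Host() matcher, Traefik has a concrete domain to request a certificate for as soon as it reads these labels.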
Deployment Flow
Complete Deployment Sequence
sequenceDiagram
participant Dev as Developer
participant GH as GitHub
participant WH as Webhook Handler
participant Build as Build System
participant FS as Filesystem
participant Caddy as Caddy Server
participant Docker as Docker Daemon
participant Redis as Redis DB
participant Traefik as Traefik Proxy
participant LE as Let's Encrypt
Dev->>GH: git push
GH->>WH: POST /webhook (push event)
WH->>WH: Verify webhook signature
WH->>GH: Create deployment (pending)
Note over WH,Build: Build Phase
WH->>Build: Clone repository
Build->>Build: npm install && npm run build
Build-->>WH: Build artifacts
Note over WH,FS: Deploy Phase
WH->>FS: Copy files to release directory
WH->>FS: Atomic symlink swap (current → new release)
WH->>Caddy: Configure route via API
Caddy-->>WH: Route configured
Note over WH,Redis: Metadata Phase
WH->>Redis: Store deployment metadata
Redis-->>WH: Metadata stored
WH->>GH: Update deployment (success)
Note over WH: Site accessible via HTTP
Note over WH,Docker: Router Creation (Async)
WH->>Redis: Update RouterStatus = "pending"
WH->>Docker: Create sidecar container with labels
Docker-->>WH: Container ID
WH->>Redis: Update RouterStatus = "ready"
Note over Docker,Traefik: Certificate Acquisition (Automatic)
Docker->>Traefik: Container discovered (Docker provider)
Traefik->>Traefik: Create router from labels
Traefik->>LE: Request certificate (TLS-ALPN-01)
LE->>Traefik: Domain validation challenge
Traefik->>LE: Challenge response
LE-->>Traefik: Certificate issued
Traefik->>Traefik: Store certificate
Note over Traefik: Site accessible via HTTPS
Deployment Timeline
| Time | Event | Site Status |
|---|---|---|
| T+0s | Webhook received | Processing |
| T+5s | Build completes | Building |
| T+6s | Files deployed | HTTP accessible ✅ |
| T+6s | Router creation starts (async) | HTTP accessible ✅ |
| T+7s | Sidecar container created | HTTP accessible ✅ |
| T+8s | Traefik discovers router | HTTP accessible ✅ |
| T+10s | Certificate request sent | HTTP accessible ✅ |
| T+40s | Certificate acquired | HTTPS accessible ✅ |
Key Insight: Deployment completes in ~6 seconds. HTTPS becomes available at about T+40s, roughly 34 seconds after the deployment completes, and acquiring the certificate never blocks the deployment.
Certificate Acquisition Flow
Detailed Certificate Request Process
sequenceDiagram
participant Sidecar as Sidecar Container
participant Traefik as Traefik Proxy
participant LE as Let's Encrypt
participant Browser as User Browser
Note over Sidecar: Container created with labels
Sidecar->>Traefik: Docker event: container started
Traefik->>Traefik: Read container labels
Traefik->>Traefik: Create router: Host(`site.tkr...`)
Traefik->>Traefik: Router has certResolver=letsencrypt
Note over Traefik,LE: TLS-ALPN-01 Challenge
Traefik->>LE: Request certificate for site.tkr...
LE->>Traefik: Challenge: Prove domain ownership
Traefik->>Traefik: Generate challenge response
LE->>Traefik: TLS connection to verify challenge
Traefik->>LE: Present challenge certificate
LE->>LE: Validate domain ownership
LE-->>Traefik: Issue certificate (valid 90 days)
Traefik->>Traefik: Store cert in acme.json
Note over Traefik: Router active with valid cert
Browser->>Traefik: HTTPS request to site.tkr...
Traefik->>Browser: Present Let's Encrypt certificate
Browser->>Browser: Validate certificate
Browser->>Traefik: Encrypted request
Traefik->>Browser: Encrypted response
Note over Browser: Green lock icon ✅
Why TLS-ALPN-01 Challenge?
Challenge Types Comparison:
| Challenge | Works with CNAME? | Supports Wildcards? | Requires API? | Chosen? |
|---|---|---|---|---|
| DNS-01 | ❌ No* | ✅ Yes | ✅ Yes (Cloudflare) | ❌ |
| TLS-ALPN-01 | ✅ Yes | ❌ No | ❌ No | ✅ |
| HTTP-01 | ✅ Yes | ❌ No | ❌ No | ❌ |
* DNS-01 requires API access to the authoritative DNS zone (not possible with CNAME to myfritz.net)
Decision: TLS-ALPN-01 works with our CNAME setup (*.tkr.lair.nntin.xyz → myfritz.net) and doesn't require Cloudflare API access.
Router State Machine
Router Lifecycle States
stateDiagram-v2
[*] --> NotCreated: Site deployed
NotCreated --> Pending: Router creation starts
Pending --> Ready: Container created successfully
Pending --> Failed: Container creation error
Failed --> Pending: Redeploy (retry)
Ready --> Ready: Redeployment (skip creation)
Ready --> Failed: Container manually deleted
Failed --> [*]: Site removed
Ready --> [*]: Site removed
note right of NotCreated
RouterStatus = ""
No router exists yet
end note
note right of Pending
RouterStatus = "pending"
Goroutine creating container
end note
note right of Ready
RouterStatus = "ready"
Container exists
Certificate acquired by Traefik
end note
note right of Failed
RouterStatus = "failed"
RouterError contains details
Site uses catch-all router
end note
State Transitions
State Definitions:
- NotCreated (RouterStatus = ""):
  - Initial state after deployment
  - No router container exists
  - Site accessible via catch-all router (HTTP works, HTTPS shows warning)
- Pending (RouterStatus = "pending"):
  - Router creation goroutine running
  - Docker API calls in progress
  - Typically lasts 1-2 seconds
- Ready (RouterStatus = "ready"):
  - Sidecar container created successfully
  - Traefik has discovered router
  - Certificate acquired (or in progress)
  - Site fully functional via HTTPS
- Failed (RouterStatus = "failed"):
  - Container creation failed
  - RouterError field contains error message
  - Site falls back to catch-all router
  - Redeploy to retry
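The lifecycle can be expressed as a small transition table. The sketch below encodes the transitions from the state diagram; the type, constant, and function names are assumptions, and the "site removed" exits are left out because they end the lifecycle instead of moving to another status.

```go
package main

import "fmt"

// RouterStatus mirrors the string values stored in RouterStatus.
type RouterStatus string

const (
	StatusNotCreated RouterStatus = ""
	StatusPending    RouterStatus = "pending"
	StatusReady      RouterStatus = "ready"
	StatusFailed     RouterStatus = "failed"
)

// transitions lists the allowed next states for each state, as drawn in the
// state diagram.
var transitions = map[RouterStatus][]RouterStatus{
	StatusNotCreated: {StatusPending},              // router creation starts
	StatusPending:    {StatusReady, StatusFailed},  // container created or creation error
	StatusReady:      {StatusReady, StatusFailed},  // redeployment, or container manually deleted
	StatusFailed:     {StatusPending},              // redeploy (retry)
}

// canTransition reports whether moving from one status to another is allowed.
func canTransition(from, to RouterStatus) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(StatusPending, StatusReady))
	fmt.Println(canTransition(StatusNotCreated, StatusReady))
	fmt.Println(canTransition(StatusFailed, StatusPending))
}
```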
Data Model
Redis Metadata Structure
classDiagram
class DeploymentMetadata {
+string Path
+string Status
+time.Time LastDeployed
+string CommitHash
+string Repository
+string Branch
+string NodeVersion
+string OutputDir
+string RouterStatus
+time.Time RouterCreatedAt
+time.Time RouterCompletedAt
+string RouterError
+string RouterContainerID
}
class RouterStatus {
<<enumeration>>
EMPTY
PENDING
READY
FAILED
}
class RedisClient {
+UpdateRouterStatus(siteName, status, timestamp, error)
+GetDeploymentMetadata(siteName)
+StoreDeploymentMetadata(metadata)
}
DeploymentMetadata --> RouterStatus
RedisClient --> DeploymentMetadata
Field Semantics:
- RouterStatus: Current state of router creation
  - "" (empty): Not attempted
  - "pending": In progress
  - "ready": Container created (cert acquisition happens in Traefik)
  - "failed": Creation failed
- RouterCreatedAt: When router creation started (used for metrics)
- RouterCompletedAt: When creation finished (success or failure)
- RouterError: Human-readable error message if failed
- RouterContainerID: Docker container ID for inspection/cleanup
Redis Key Pattern: deployment:metadata:{siteName}
Error Handling
Failure Scenarios and Recovery
flowchart TD
Start[Router Creation Starts] --> CheckDocker{Docker API Available?}
CheckDocker -->|No| LogError1[Log: Docker API unavailable]
LogError1 --> UpdateFailed1[Update Redis: status=failed]
UpdateFailed1 --> Fallback1[Site uses catch-all router]
CheckDocker -->|Yes| CreateContainer{Create Container}
CreateContainer -->|Success| UpdateReady[Update Redis: status=ready]
UpdateReady --> TraefikDiscover[Traefik discovers container]
TraefikDiscover --> CertRequest[Certificate requested]
CertRequest --> Success[HTTPS works ✅]
CreateContainer -->|Failure| Retry{Retry Count < 2?}
Retry -->|Yes| CreateContainer
Retry -->|No| LogError2[Log: Container creation failed]
LogError2 --> UpdateFailed2[Update Redis: status=failed]
UpdateFailed2 --> Fallback2[Site uses catch-all router]
Fallback1 --> HTTPWorks1[HTTP accessible ✅]
Fallback2 --> HTTPWorks2[HTTP accessible ✅]
HTTPWorks1 --> HTTPSWarning1[HTTPS shows cert warning ⚠️]
HTTPWorks2 --> HTTPSWarning1
HTTPSWarning1 --> Redeploy[Operator redeploys to retry]
Redeploy --> Start
style Success fill:#d4edda
style Fallback1 fill:#fff3cd
style Fallback2 fill:#fff3cd
style LogError1 fill:#f8d7da
style LogError2 fill:#f8d7da
Error Recovery Mechanisms
Automatic Recovery:
1. Docker healthcheck and restart policy: Restart crashed sidecar containers
   - Healthcheck: ["CMD", "true"] every 30s
   - Restart policy: unless-stopped
2. Startup reconciliation: Cleans up orphaned containers
   - Runs on webhook-handler startup
   - Removes containers without matching Redis metadata
   - Syncs Redis state with Docker reality

Manual Recovery:
1. Redeploy: Push new commit or trigger webhook manually
   - Retries router creation
   - Updates Redis metadata
   - Most common recovery method
2. State divergence: If Redis says "ready" but container missing
   - Trust Redis state (accept occasional divergence)
   - Redeploy forces recreation
   - No automatic reconciliation during runtime
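The core of startup reconciliation is a set difference between the sidecar containers that exist and the sites that have metadata. The sketch below shows only that comparison, with plain slices and maps standing in for the Docker and Redis lookups. The function name `orphanedRouters` and the `router-` name prefix are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// routerPrefix is the assumed naming prefix of sidecar containers.
const routerPrefix = "router-"

// orphanedRouters returns the sidecar container names that have no matching
// site in the metadata store. Containers without the prefix are ignored.
func orphanedRouters(containerNames []string, knownSites map[string]bool) []string {
	var orphans []string
	for _, name := range containerNames {
		if !strings.HasPrefix(name, routerPrefix) {
			continue // not a sidecar container
		}
		site := strings.TrimPrefix(name, routerPrefix)
		if !knownSites[site] {
			orphans = append(orphans, name)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	containers := []string{"router-site1", "router-old-site", "caddy", "traefik"}
	known := map[string]bool{"site1": true}
	// In the real service, each orphan would then be removed via the Docker API.
	fmt.Println(orphanedRouters(containers, known))
}
```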
Monitoring and Observability
Prometheus Metrics
graph LR
subgraph "Webhook Handler"
Code[Router Creation Code]
end
subgraph "Prometheus Metrics"
M1[tinkero_router_creation_total]
M2[tinkero_router_creation_duration_seconds]
M3[tinkero_router_status]
end
subgraph "Grafana Dashboard"
P1[Overview Panel: Total/Ready/Failed/Pending]
P2[Time Series: Success/Failure Rate]
P3[Table: Per-Site Status]
P4[Histogram: Acquisition Time]
P5[Logs: Recent Errors]
end
Code -->|Increment| M1
Code -->|Observe| M2
Code -->|Set| M3
M1 --> P1
M1 --> P2
M2 --> P4
M3 --> P3
P5 -.->|Query| Loki[Loki Logs]
Metrics Exposed:
- tinkero_router_creation_total (Counter)
  - Labels: status (success/failure), site
  - Tracks total router creation attempts
- tinkero_router_creation_duration_seconds (Histogram)
  - Labels: status
  - Measures time to create router container
  - Buckets: 0.1s, 0.5s, 1s, 2s, 5s, 10s
- tinkero_router_status (Gauge)
  - Labels: site
  - Current status: 0=none, 1=pending, 2=ready, 3=failed
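The numeric encoding of tinkero_router_status can be captured in a small mapping function. This sketch deliberately avoids the Prometheus client library and shows only the value mapping and the bucket boundaries listed above; the names `routerStatusGaugeValue` and `creationDurationBuckets` are hypothetical.

```go
package main

import "fmt"

// creationDurationBuckets holds the documented histogram buckets, in seconds.
var creationDurationBuckets = []float64{0.1, 0.5, 1, 2, 5, 10}

// routerStatusGaugeValue maps a RouterStatus string to the documented gauge
// value. The second return value is false for an unknown status.
func routerStatusGaugeValue(status string) (float64, bool) {
	switch status {
	case "":
		return 0, true
	case "pending":
		return 1, true
	case "ready":
		return 2, true
	case "failed":
		return 3, true
	}
	return 0, false
}

func main() {
	for _, status := range []string{"", "pending", "ready", "failed", "bogus"} {
		value, ok := routerStatusGaugeValue(status)
		fmt.Printf("%q -> %v (known: %v)\n", status, value, ok)
	}
	fmt.Println("buckets:", creationDurationBuckets)
}
```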
Grafana Dashboard Panels
Dashboard: configs/grafana/dashboards/tinkero/certificates.json
- Overview Stats (top row):
  - Total sites deployed
  - Sites with ready routers
  - Sites with failed routers
  - Sites with pending routers
- Router Creation Rate (time-series):
  - Success rate over time
  - Failure rate over time
- Router Status Table (main panel):
  - Site name, status, last deployment, creation time, error message
- Certificate Acquisition Time (histogram):
  - Distribution of router creation duration
- Recent Errors (logs panel):
  - Loki query: {service="webhook-handler"} |= "Router creation failed"
Troubleshooting Guide
Common Issues and Solutions
Issue 1: HTTPS shows certificate warning
Symptoms:
- Site accessible via HTTP
- HTTPS shows "Your connection is not private"
- Certificate issuer: "TRAEFIK DEFAULT CERT"
Diagnosis:
1. Check Grafana dashboard → Tinkero Certificates
2. Find site in status table
3. Check RouterStatus:
- "pending": Wait 1-2 minutes, refresh
- "failed": Check error message
- "" (empty): Router creation never attempted
Solution:
- If "failed": Fix underlying issue (see error message), then redeploy
- If "": Redeploy to trigger router creation
- If "pending" for > 5 minutes: Check Docker daemon, redeploy
Issue 2: Router creation fails with "Docker API unavailable"
Symptoms:
- Grafana shows RouterStatus = "failed"
- Error: "Docker API error: connection refused"
Diagnosis:
1. Check if Docker daemon is running: systemctl status docker
2. Check webhook-handler has Docker socket access: docker exec webhook-handler ls -la /var/run/docker.sock
Solution:
- Restart Docker daemon: systemctl restart docker
- Verify webhook-handler container has Docker socket mounted
- Redeploy site to retry
Issue 3: Certificate acquired but HTTPS still fails
Symptoms:
- Grafana shows RouterStatus = "ready"
- Certificate exists in Traefik dashboard
- HTTPS still shows warning or connection refused
Diagnosis:
1. Check Traefik dashboard: https://lair.nntin.xyz/traefik
2. Verify router exists with correct Host() rule
3. Check router priority (should be higher than catch-all)
4. Verify sidecar container is running: docker ps | grep router-
Solution:
- If router missing: Redeploy to recreate
- If container stopped: Check logs, restart container
- If priority wrong: Update router configuration, redeploy
Implementation References
Key Files
Traefik Client Package:
- file:projects/Tinkero/services/webhook-handler/internal/traefik/client.go
- file:projects/Tinkero/services/webhook-handler/internal/traefik/router.go
- file:projects/Tinkero/services/webhook-handler/internal/traefik/types.go
- file:projects/Tinkero/services/webhook-handler/internal/traefik/reconcile.go

Deployment Integration:
- file:projects/Tinkero/services/webhook-handler/internal/server/handlers.go (router creation trigger)
- file:projects/Tinkero/services/webhook-handler/internal/server/server.go (client initialization)
- file:projects/Tinkero/services/webhook-handler/internal/server/reconcile.go (startup cleanup)

Metadata and Metrics:
- file:projects/Tinkero/services/webhook-handler/internal/redis/metadata.go (DeploymentMetadata)
- file:projects/Tinkero/services/webhook-handler/internal/server/metrics.go (Prometheus metrics)

Observability:
- file:configs/grafana/dashboards/tinkero/certificates.json (Grafana dashboard)
Summary
This documentation provides a comprehensive visual guide to the Tinkero certificate architecture. The per-site router approach using Docker sidecar containers enables automatic SSL certificate acquisition while maintaining operational simplicity and full observability.
Key Takeaways:
1. Problem: HostRegexp() doesn't trigger certificate acquisition
2. Solution: Per-site routers with explicit Host() rules
3. Implementation: Lightweight sidecar containers with Traefik labels
4. Observability: Prometheus metrics + Grafana dashboard
5. Recovery: Redeploy to retry, graceful degradation via catch-all router