Certificate Architecture Documentation

Overview

This ticket documents the Tinkero certificate architecture that enables automatic SSL certificate acquisition for deployed sites. It explains the problem, solution, and implementation with comprehensive visual diagrams.


The Problem: Why HostRegexp Doesn't Work

Original Architecture (Broken)

The initial Tinkero deployment used a single catch-all Traefik router to handle all subdomains:

traefik.http.routers.caddy.rule=HostRegexp(`{subdomain:[a-z0-9-]+}.tkr.lair.nntin.xyz`)
traefik.http.routers.caddy.tls.certresolver=letsencrypt

The Issue: Traefik's automatic certificate acquisition only works with explicit Host() rules, not pattern-matched HostRegexp() rules.

Result: All deployed sites present Traefik's default self-signed certificate, causing browser security warnings and making HTTPS unusable.

Why This Limitation Exists

Let's Encrypt requires domain validation before issuing certificates. Traefik needs to know the exact domain name to request a certificate for. With HostRegexp(), Traefik only knows the pattern, not the specific domains that will match it.

Analogy: It's like asking for a driver's license without providing your name - the pattern matches many possibilities, but the certificate authority needs a specific identity.
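The difference between the two rule types can be illustrated with a small Go sketch. It is a simplified stand-in for illustration only, not Traefik's actual rule parser; `certDomains` and both regular expressions are hypothetical names. It shows that an explicit Host() rule yields a concrete name to request a certificate for, while the HostRegexp() rule yields none.

```go
package main

import (
	"fmt"
	"regexp"
)

// hostCall matches a Host(...) matcher. The literal "Host(" cannot occur
// inside "HostRegexp(", so pattern-based matchers are skipped.
var hostCall = regexp.MustCompile("Host\\(([^)]*)\\)")

// quotedName matches one backtick-quoted domain inside the matcher.
var quotedName = regexp.MustCompile("`([^`]+)`")

// certDomains returns the concrete domain names a rule states explicitly.
// Simplified illustration only; this is not Traefik's parser.
func certDomains(rule string) []string {
	var domains []string
	for _, call := range hostCall.FindAllStringSubmatch(rule, -1) {
		for _, name := range quotedName.FindAllStringSubmatch(call[1], -1) {
			domains = append(domains, name[1])
		}
	}
	return domains
}

func main() {
	explicit := "Host(`site1.tkr.lair.nntin.xyz`)"
	pattern := "HostRegexp(`{subdomain:[a-z0-9-]+}.tkr.lair.nntin.xyz`)"

	fmt.Println("explicit rule domains:", certDomains(explicit))
	fmt.Println("pattern rule domains: ", certDomains(pattern))
}
```

With no domain extracted from the pattern rule, there is nothing to hand to the certificate resolver, which matches the behavior described above.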


The Solution: Per-Site Routers with Sidecar Containers

Architecture Overview

Instead of one catch-all router, we create individual Traefik routers for each deployed site using lightweight Docker sidecar containers.

graph TD
    subgraph "Internet"
        User[User Browser]
    end

    subgraph "Traefik Reverse Proxy"
        Traefik[Traefik]
        Router1[Router: site1.tkr...]
        Router2[Router: site2.tkr...]
        Router3[Router: site3.tkr...]
        Catchall[Catchall Router: *.tkr...]
    end

    subgraph "Docker Network: lair-network"
        Sidecar1[Sidecar Container 1]
        Sidecar2[Sidecar Container 2]
        Sidecar3[Sidecar Container 3]
    end

    subgraph "Docker Network: tinkero-network"
        Caddy[Caddy Static Server]
        Files1[Site 1 Files]
        Files2[Site 2 Files]
        Files3[Site 3 Files]
    end

    subgraph "Certificate Authority"
        LE[Let's Encrypt]
    end

    User -->|HTTPS Request| Traefik
    Traefik -->|Discovers| Sidecar1
    Traefik -->|Discovers| Sidecar2
    Traefik -->|Discovers| Sidecar3
    Sidecar1 -.->|Defines| Router1
    Sidecar2 -.->|Defines| Router2
    Sidecar3 -.->|Defines| Router3
    Router1 -->|Routes to| Caddy
    Router2 -->|Routes to| Caddy
    Router3 -->|Routes to| Caddy
    Catchall -->|Fallback| Caddy
    Caddy --> Files1
    Caddy --> Files2
    Caddy --> Files3
    Traefik <-->|TLS-ALPN-01| LE

Key Components

Sidecar Containers:

  • Minimal Alpine Linux containers (~10MB each)
  • Run sleep infinity and exist only for their Docker labels
  • Carry Traefik routing configuration as labels
  • Join both lair-network (for Traefik discovery) and tinkero-network (for routing)

Traefik Docker Provider:

  • Watches the Docker daemon for new containers
  • Reads labels from sidecar containers
  • Automatically creates routers with explicit Host() rules
  • Triggers Let's Encrypt certificate requests

Graceful Degradation:

  • Catch-all HostRegexp() router remains active (priority 1)
  • Per-site routers take precedence: Traefik's default priority is the length of the rule, so an explicit Host() rule always ranks above the catch-all's priority of 1
  • If a per-site router fails, the catch-all handles traffic (HTTP works, HTTPS shows a warning)


Network Topology

Multi-Network Architecture

graph TB
    subgraph "External"
        Internet[Internet]
        DNS[DNS: *.tkr.lair.nntin.xyz]
    end

    subgraph "lair-network (Traefik Discovery)"
        Traefik[Traefik Proxy]
        Sidecar1[Sidecar: site1]
        Sidecar2[Sidecar: site2]
        Grafana[Grafana]
        Prometheus[Prometheus]
    end

    subgraph "tinkero-network (Service Communication)"
        Caddy[Caddy Server]
        Webhook[Webhook Handler]
        Redis[Redis]
        Loki[Loki]
    end

    subgraph "Host Filesystem"
        Sites["/data/tinkero/sites/"]
        Certs["/data/traefik/acme.json"]
    end

    Internet -->|HTTPS| Traefik
    DNS -.->|Resolves to| Traefik
    Traefik -->|Discovers| Sidecar1
    Traefik -->|Discovers| Sidecar2
    Traefik -->|Routes to| Caddy
    Caddy -->|Serves from| Sites
    Webhook -->|Creates| Sidecar1
    Webhook -->|Creates| Sidecar2
    Webhook -->|Stores metadata| Redis
    Traefik -->|Stores certs| Certs
    Traefik -->|Metrics| Prometheus
    Webhook -->|Logs| Loki
    Grafana -->|Queries| Prometheus
    Grafana -->|Queries| Loki

    style Sidecar1 fill:#e1f5ff
    style Sidecar2 fill:#e1f5ff
    style Traefik fill:#fff3cd
    style Caddy fill:#d4edda

Why Two Networks?

  1. lair-network: Traefik's Docker provider watches this network for containers with routing labels
  2. tinkero-network: Internal service communication (Webhook Handler → Caddy, Redis, etc.)
  3. Sidecar containers join BOTH: Enables Traefik to discover them AND route traffic to Caddy

Component Architecture

System Components and Interactions

graph TD
    subgraph "Webhook Handler Service"
        Server[HTTP Server]
        Deploy[Deployment Orchestrator]
        TraefikClient[Traefik Client]
        CaddyClient[Caddy Client]
        RedisClient[Redis Client]
        Metrics[Prometheus Metrics]
    end

    subgraph "External Services"
        GitHub[GitHub Webhooks]
        Docker[Docker Daemon]
        RedisDB[(Redis Database)]
        TraefikProxy[Traefik Proxy]
        CaddyServer[Caddy Server]
    end

    subgraph "Observability Stack"
        Prom[Prometheus]
        Graf[Grafana]
        LokiDB[Loki]
    end

    GitHub -->|Webhook| Server
    Server -->|Orchestrate| Deploy
    Deploy -->|Build| Docker
    Deploy -->|Deploy Files| CaddyServer
    Deploy -->|Configure Route| CaddyClient
    Deploy -->|Create Router| TraefikClient
    TraefikClient -->|Create Container| Docker
    Docker -->|Discover| TraefikProxy
    Deploy -->|Store Metadata| RedisClient
    RedisClient -->|Read/Write| RedisDB
    Server -->|Record| Metrics
    Metrics -->|Expose| Prom
    Graf -->|Query| Prom
    Graf -->|Query| LokiDB
    Server -->|Logs| LokiDB

    style TraefikClient fill:#e1f5ff
    style Deploy fill:#fff3cd

Traefik Client Package Structure

The new internal/traefik/ package follows the pattern established by internal/caddy/:

internal/traefik/
├── client.go       // Docker-based router creation
├── router.go       // Label builder for Traefik config
├── types.go        // RouterConfig struct and constants
├── reconcile.go    // Startup cleanup of orphaned containers
└── client_test.go  // Unit tests

Key Responsibilities:

  • Client: Creates/removes Docker sidecar containers
  • RouterConfig: Builds Traefik label map for a site
  • reconcile: Cleans up orphaned containers on startup (idempotent recovery)
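A minimal sketch of the label-building responsibility might look like the following. The field names, the `Labels` method, and the `caddy@docker` service reference are assumptions, not the actual contents of router.go or types.go; only the label keys themselves are standard Traefik Docker-provider labels.

```go
package main

import (
	"fmt"
	"sort"
)

// RouterConfig holds what is needed to describe one per-site router.
// Field names are illustrative assumptions.
type RouterConfig struct {
	SiteName     string // router name, e.g. "site1"
	Domain       string // explicit domain, e.g. "site1.tkr.lair.nntin.xyz"
	CertResolver string // e.g. "letsencrypt"
	Service      string // assumed name of the Traefik service in front of Caddy
}

// Labels builds the Traefik label map for the sidecar container.
func (c RouterConfig) Labels() map[string]string {
	router := "traefik.http.routers." + c.SiteName
	return map[string]string{
		"traefik.enable":             "true",
		router + ".rule":             "Host(`" + c.Domain + "`)",
		router + ".tls":              "true",
		router + ".tls.certresolver": c.CertResolver,
		router + ".service":          c.Service,
	}
}

func main() {
	cfg := RouterConfig{
		SiteName:     "site1",
		Domain:       "site1.tkr.lair.nntin.xyz",
		CertResolver: "letsencrypt",
		Service:      "caddy@docker", // hypothetical service reference
	}

	labels := cfg.Labels()
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, labels[k])
	}
}
```

Because the rule is an explicit Host() matcher, Traefik has a concrete domain to request a certificate for as soon as it discovers the container.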


Deployment Flow

Complete Deployment Sequence

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub
    participant WH as Webhook Handler
    participant Build as Build System
    participant FS as Filesystem
    participant Caddy as Caddy Server
    participant Docker as Docker Daemon
    participant Redis as Redis DB
    participant Traefik as Traefik Proxy
    participant LE as Let's Encrypt

    Dev->>GH: git push
    GH->>WH: POST /webhook (push event)
    WH->>WH: Verify webhook signature
    WH->>GH: Create deployment (pending)

    Note over WH,Build: Build Phase
    WH->>Build: Clone repository
    Build->>Build: npm install && npm run build
    Build-->>WH: Build artifacts

    Note over WH,FS: Deploy Phase
    WH->>FS: Copy files to release directory
    WH->>FS: Atomic symlink swap (current → new release)
    WH->>Caddy: Configure route via API
    Caddy-->>WH: Route configured

    Note over WH,Redis: Metadata Phase
    WH->>Redis: Store deployment metadata
    Redis-->>WH: Metadata stored
    WH->>GH: Update deployment (success)

    Note over WH: Site accessible via HTTP

    Note over WH,Docker: Router Creation (Async)
    WH->>Redis: Update RouterStatus = "pending"
    WH->>Docker: Create sidecar container with labels
    Docker-->>WH: Container ID
    WH->>Redis: Update RouterStatus = "ready"

    Note over Docker,Traefik: Certificate Acquisition (Automatic)
    Docker->>Traefik: Container discovered (Docker provider)
    Traefik->>Traefik: Create router from labels
    Traefik->>LE: Request certificate (TLS-ALPN-01)
    LE->>Traefik: Domain validation challenge
    Traefik->>LE: Challenge response
    LE-->>Traefik: Certificate issued
    Traefik->>Traefik: Store certificate

    Note over Traefik: Site accessible via HTTPS

Deployment Timeline

| Time  | Event                          | Site Status         |
|-------|--------------------------------|---------------------|
| T+0s  | Webhook received               | Processing          |
| T+5s  | Build completes                | Building            |
| T+6s  | Files deployed                 | HTTP accessible ✅  |
| T+6s  | Router creation starts (async) | HTTP accessible ✅  |
| T+7s  | Sidecar container created      | HTTP accessible ✅  |
| T+8s  | Traefik discovers router       | HTTP accessible ✅  |
| T+10s | Certificate request sent       | HTTP accessible ✅  |
| T+40s | Certificate acquired           | HTTPS accessible ✅ |

Key Insight: Deployment completes in ~6 seconds. HTTPS becomes available ~40 seconds later (non-blocking).


Certificate Acquisition Flow

Detailed Certificate Request Process

sequenceDiagram
    participant Sidecar as Sidecar Container
    participant Traefik as Traefik Proxy
    participant LE as Let's Encrypt
    participant Browser as User Browser

    Note over Sidecar: Container created with labels
    Sidecar->>Traefik: Docker event: container started
    Traefik->>Traefik: Read container labels
    Traefik->>Traefik: Create router: Host(`site.tkr...`)
    Traefik->>Traefik: Router has certResolver=letsencrypt

    Note over Traefik,LE: TLS-ALPN-01 Challenge
    Traefik->>LE: Request certificate for site.tkr...
    LE->>Traefik: Challenge: Prove domain ownership
    Traefik->>Traefik: Generate challenge response
    LE->>Traefik: TLS connection to verify challenge
    Traefik->>LE: Present challenge certificate
    LE->>LE: Validate domain ownership
    LE-->>Traefik: Issue certificate (valid 90 days)
    Traefik->>Traefik: Store cert in acme.json

    Note over Traefik: Router active with valid cert

    Browser->>Traefik: HTTPS request to site.tkr...
    Traefik->>Browser: Present Let's Encrypt certificate
    Browser->>Browser: Validate certificate
    Browser->>Traefik: Encrypted request
    Traefik->>Browser: Encrypted response

    Note over Browser: Green lock icon ✅

Why TLS-ALPN-01 Challenge?

Challenge Types Comparison:

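| Challenge   | Works with CNAME? | Supports Wildcards? | Requires API?       | Chosen? |
|-------------|-------------------|---------------------|---------------------|---------|
| DNS-01      | ❌ No*            | ✅ Yes              | ✅ Yes (Cloudflare) | ❌ No   |
| TLS-ALPN-01 | ✅ Yes            | ❌ No               | ❌ No               | ✅ Yes  |
| HTTP-01     | ✅ Yes            | ❌ No               | ❌ No               | ❌ No   |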

* DNS-01 requires API access to the authoritative DNS zone (not possible with CNAME to myfritz.net)

Decision: TLS-ALPN-01 works with our CNAME setup (*.tkr.lair.nntin.xyz → myfritz.net) and doesn't require Cloudflare API access.


Router State Machine

Router Lifecycle States

stateDiagram-v2
    [*] --> NotCreated: Site deployed
    NotCreated --> Pending: Router creation starts
    Pending --> Ready: Container created successfully
    Pending --> Failed: Container creation error
    Failed --> Pending: Redeploy (retry)
    Ready --> Ready: Redeployment (skip creation)
    Ready --> Failed: Container manually deleted
    Failed --> [*]: Site removed
    Ready --> [*]: Site removed

    note right of NotCreated
        RouterStatus = ""
        No router exists yet
    end note

    note right of Pending
        RouterStatus = "pending"
        Goroutine creating container
    end note

    note right of Ready
        RouterStatus = "ready"
        Container exists
        Certificate acquired by Traefik
    end note

    note right of Failed
        RouterStatus = "failed"
        RouterError contains details
        Site uses catch-all router
    end note

State Transitions

State Definitions:

  1. NotCreated (RouterStatus = ""):
     • Initial state after deployment
     • No router container exists
     • Site accessible via catch-all router (HTTP works, HTTPS shows warning)

  2. Pending (RouterStatus = "pending"):
     • Router creation goroutine running
     • Docker API calls in progress
     • Typically lasts 1-2 seconds

  3. Ready (RouterStatus = "ready"):
     • Sidecar container created successfully
     • Traefik has discovered router
     • Certificate acquired (or in progress)
     • Site fully functional via HTTPS

  4. Failed (RouterStatus = "failed"):
     • Container creation failed
     • RouterError field contains error message
     • Site falls back to catch-all router
     • Redeploy to retry
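The lifecycle above can be expressed as a small transition table. The sketch below encodes only the edges shown in the state diagram (site removal is treated as leaving the machine and is not modeled); the type and function names are assumptions rather than the contents of types.go.

```go
package main

import "fmt"

// RouterStatus mirrors the RouterStatus values stored in Redis.
type RouterStatus string

const (
	StatusNotCreated RouterStatus = ""
	StatusPending    RouterStatus = "pending"
	StatusReady      RouterStatus = "ready"
	StatusFailed     RouterStatus = "failed"
)

// allowed lists the transitions drawn in the state diagram.
var allowed = map[RouterStatus][]RouterStatus{
	StatusNotCreated: {StatusPending},
	StatusPending:    {StatusReady, StatusFailed},
	StatusReady:      {StatusReady, StatusFailed}, // redeploy, or container deleted
	StatusFailed:     {StatusPending},             // redeploy retries creation
}

// canTransition reports whether moving from one status to another is
// one of the documented transitions.
func canTransition(from, to RouterStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(StatusNotCreated, StatusPending)) // documented edge
	fmt.Println(canTransition(StatusFailed, StatusReady))       // must pass through pending
}
```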

Data Model

Redis Metadata Structure

classDiagram
    class DeploymentMetadata {
        +string Path
        +string Status
        +time.Time LastDeployed
        +string CommitHash
        +string Repository
        +string Branch
        +string NodeVersion
        +string OutputDir
        +string RouterStatus
        +time.Time RouterCreatedAt
        +time.Time RouterCompletedAt
        +string RouterError
        +string RouterContainerID
    }

    class RouterStatus {
        <<enumeration>>
        EMPTY
        PENDING
        READY
        FAILED
    }

    class RedisClient {
        +UpdateRouterStatus(siteName, status, timestamp, error)
        +GetDeploymentMetadata(siteName)
        +StoreDeploymentMetadata(metadata)
    }

    DeploymentMetadata --> RouterStatus
    RedisClient --> DeploymentMetadata

Field Semantics:

  • RouterStatus: Current state of router creation
    • "" (empty): Not attempted
    • "pending": In progress
    • "ready": Container created (cert acquisition happens in Traefik)
    • "failed": Creation failed
  • RouterCreatedAt: When router creation started (used for metrics)
  • RouterCompletedAt: When creation finished (success or failure)
  • RouterError: Human-readable error message if failed
  • RouterContainerID: Docker container ID for inspection/cleanup

Redis Key Pattern: deployment:metadata:{siteName}


Error Handling

Failure Scenarios and Recovery

flowchart TD
    Start[Router Creation Starts] --> CheckDocker{Docker API Available?}
    CheckDocker -->|No| LogError1[Log: Docker API unavailable]
    LogError1 --> UpdateFailed1[Update Redis: status=failed]
    UpdateFailed1 --> Fallback1[Site uses catch-all router]

    CheckDocker -->|Yes| CreateContainer{Create Container}
    CreateContainer -->|Success| UpdateReady[Update Redis: status=ready]
    UpdateReady --> TraefikDiscover[Traefik discovers container]
    TraefikDiscover --> CertRequest[Certificate requested]
    CertRequest --> Success[HTTPS works ✅]

    CreateContainer -->|Failure| Retry{Retry Count < 2?}
    Retry -->|Yes| CreateContainer
    Retry -->|No| LogError2[Log: Container creation failed]
    LogError2 --> UpdateFailed2[Update Redis: status=failed]
    UpdateFailed2 --> Fallback2[Site uses catch-all router]

    Fallback1 --> HTTPWorks1[HTTP accessible ✅]
    Fallback2 --> HTTPWorks2[HTTP accessible ✅]
    HTTPWorks1 --> HTTPSWarning1[HTTPS shows cert warning ⚠️]
    HTTPWorks2 --> HTTPSWarning1
    HTTPSWarning1 --> Redeploy[Operator redeploys to retry]
    Redeploy --> Start

    style Success fill:#d4edda
    style Fallback1 fill:#fff3cd
    style Fallback2 fill:#fff3cd
    style LogError1 fill:#f8d7da
    style LogError2 fill:#f8d7da

Error Recovery Mechanisms

Automatic Recovery:

  1. Docker healthcheck and restart policy: Crashed sidecar containers are restarted
     • Healthcheck: ["CMD", "true"] every 30s
     • Restart policy: unless-stopped
  2. Startup reconciliation: Cleans up orphaned containers
     • Runs on webhook-handler startup
     • Removes containers without matching Redis metadata
     • Syncs Redis state with Docker reality

Manual Recovery:

  1. Redeploy: Push new commit or trigger webhook manually
     • Retries router creation
     • Updates Redis metadata
     • Most common recovery method
  2. State divergence: If Redis says "ready" but container missing
     • Trust Redis state (accept occasional divergence)
     • Redeploy forces recreation
     • No automatic reconciliation during runtime
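The core of startup reconciliation is a set difference between sidecar containers and sites that have metadata. The sketch below assumes sidecars are named "router-" followed by the site name (suggested by the `docker ps | grep router-` check in the troubleshooting section, but not confirmed); listing containers and reading Redis are left out, and `orphanedContainers` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// routerPrefix is the assumed naming convention for sidecar containers.
const routerPrefix = "router-"

// orphanedContainers returns the sidecar containers whose site has no
// matching deployment metadata. Containers without the prefix are ignored.
func orphanedContainers(containerNames []string, knownSites map[string]bool) []string {
	var orphans []string
	for _, name := range containerNames {
		if !strings.HasPrefix(name, routerPrefix) {
			continue // not a sidecar, leave it alone
		}
		site := strings.TrimPrefix(name, routerPrefix)
		if !knownSites[site] {
			orphans = append(orphans, name)
		}
	}
	return orphans
}

func main() {
	running := []string{"router-site1", "router-old-site", "caddy", "traefik"}
	known := map[string]bool{"site1": true}

	for _, name := range orphanedContainers(running, known) {
		fmt.Println("would remove:", name)
	}
}
```

Running this logic repeatedly gives the same result once the orphans are gone, which is what makes the startup cleanup idempotent.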

Monitoring and Observability

Prometheus Metrics

graph LR
    subgraph "Webhook Handler"
        Code[Router Creation Code]
    end

    subgraph "Prometheus Metrics"
        M1[tinkero_router_creation_total]
        M2[tinkero_router_creation_duration_seconds]
        M3[tinkero_router_status]
    end

    subgraph "Grafana Dashboard"
        P1[Overview Panel: Total/Ready/Failed/Pending]
        P2[Time Series: Success/Failure Rate]
        P3[Table: Per-Site Status]
        P4[Histogram: Acquisition Time]
        P5[Logs: Recent Errors]
    end

    Code -->|Increment| M1
    Code -->|Observe| M2
    Code -->|Set| M3
    M1 --> P1
    M1 --> P2
    M2 --> P4
    M3 --> P3
    P5 -.->|Query| Loki[Loki Logs]

Metrics Exposed:

  1. tinkero_router_creation_total (Counter)
     • Labels: status (success/failure), site
     • Tracks total router creation attempts

  2. tinkero_router_creation_duration_seconds (Histogram)
     • Labels: status
     • Measures time to create router container
     • Buckets: 0.1s, 0.5s, 1s, 2s, 5s, 10s

  3. tinkero_router_status (Gauge)
     • Labels: site
     • Current status: 0=none, 1=pending, 2=ready, 3=failed
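The gauge encoding is a simple mapping from the RouterStatus string to a number. A minimal sketch, with `routerStatusGauge` as a hypothetical helper and the Prometheus client calls deliberately left out:

```go
package main

import "fmt"

// routerStatusGauge converts a RouterStatus string into the numeric value
// documented for tinkero_router_status. The second result is false for an
// unknown status so callers can skip the update instead of exporting a
// misleading value.
func routerStatusGauge(status string) (float64, bool) {
	switch status {
	case "":
		return 0, true
	case "pending":
		return 1, true
	case "ready":
		return 2, true
	case "failed":
		return 3, true
	}
	return 0, false
}

func main() {
	for _, status := range []string{"", "pending", "ready", "failed", "unknown"} {
		value, ok := routerStatusGauge(status)
		fmt.Printf("%q -> %v (known=%v)\n", status, value, ok)
	}
}
```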

Grafana Dashboard Panels

Dashboard: configs/grafana/dashboards/tinkero/certificates.json

  1. Overview Stats (top row):
     • Total sites deployed
     • Sites with ready routers
     • Sites with failed routers
     • Sites with pending routers

  2. Router Creation Rate (time-series):
     • Success rate over time
     • Failure rate over time

  3. Router Status Table (main panel):
     • Site name, status, last deployment, creation time, error message

  4. Certificate Acquisition Time (histogram):
     • Distribution of router creation duration

  5. Recent Errors (logs panel):
     • Loki query: {service="webhook-handler"} |= "Router creation failed"

Troubleshooting Guide

Common Issues and Solutions

Issue 1: HTTPS shows certificate warning

Symptoms:

  • Site accessible via HTTP
  • HTTPS shows "Your connection is not private"
  • Certificate issuer: "TRAEFIK DEFAULT CERT"

Diagnosis:

  1. Check Grafana dashboard → Tinkero Certificates
  2. Find site in status table
  3. Check RouterStatus:
     • "pending": Wait 1-2 minutes, refresh
     • "failed": Check error message
     • "" (empty): Router creation never attempted

Solution:

  • If "failed": Fix underlying issue (see error message), then redeploy
  • If "": Redeploy to trigger router creation
  • If "pending" for > 5 minutes: Check Docker daemon, redeploy


Issue 2: Router creation fails with "Docker API unavailable"

Symptoms:

  • Grafana shows RouterStatus = "failed"
  • Error: "Docker API error: connection refused"

Diagnosis:

  1. Check if Docker daemon is running: systemctl status docker
  2. Check webhook-handler has Docker socket access: docker exec webhook-handler ls -la /var/run/docker.sock

Solution:

  • Restart Docker daemon: systemctl restart docker
  • Verify webhook-handler container has Docker socket mounted
  • Redeploy site to retry


Issue 3: Certificate acquired but HTTPS still fails

Symptoms:

  • Grafana shows RouterStatus = "ready"
  • Certificate exists in Traefik dashboard
  • HTTPS still shows warning or connection refused

Diagnosis:

  1. Check Traefik dashboard: https://lair.nntin.xyz/traefik
  2. Verify router exists with correct Host() rule
  3. Check router priority (should be higher than catch-all)
  4. Verify sidecar container is running: docker ps | grep router-

Solution:

  • If router missing: Redeploy to recreate
  • If container stopped: Check logs, restart container
  • If priority wrong: Update router configuration, redeploy


Implementation References

Key Files

Traefik Client Package:

  • file:projects/Tinkero/services/webhook-handler/internal/traefik/client.go
  • file:projects/Tinkero/services/webhook-handler/internal/traefik/router.go
  • file:projects/Tinkero/services/webhook-handler/internal/traefik/types.go
  • file:projects/Tinkero/services/webhook-handler/internal/traefik/reconcile.go

Deployment Integration:

  • file:projects/Tinkero/services/webhook-handler/internal/server/handlers.go (router creation trigger)
  • file:projects/Tinkero/services/webhook-handler/internal/server/server.go (client initialization)
  • file:projects/Tinkero/services/webhook-handler/internal/server/reconcile.go (startup cleanup)

Metadata and Metrics:

  • file:projects/Tinkero/services/webhook-handler/internal/redis/metadata.go (DeploymentMetadata)
  • file:projects/Tinkero/services/webhook-handler/internal/server/metrics.go (Prometheus metrics)

Observability:

  • file:configs/grafana/dashboards/tinkero/certificates.json (Grafana dashboard)

Summary

This documentation provides a comprehensive visual guide to the Tinkero certificate architecture. The per-site router approach using Docker sidecar containers enables automatic SSL certificate acquisition while maintaining operational simplicity and full observability.

Key Takeaways:

  1. Problem: HostRegexp() doesn't trigger certificate acquisition
  2. Solution: Per-site routers with explicit Host() rules
  3. Implementation: Lightweight sidecar containers with Traefik labels
  4. Observability: Prometheus metrics + Grafana dashboard
  5. Recovery: Redeploy to retry, graceful degradation via catch-all router