Skipper Architecture

Purpose

Skipper is a lightweight hosting orchestration system built for clarity, inspectability, and future AI-assisted operations.

  • Skipper = control plane
  • Skippy = host agent
  • Communication model = HTTPS pull from agent to controller
  • Implementation language = Node.js only
  • Persistence = file-based JSON storage only
  • Deployment target = Docker for all components

This document reflects the current implementation in this repository and marks the remaining gaps relative to the broader target architecture.

Design Principles

  • No hidden state
  • No implicit resource relationships
  • No shell-script driven control flow as the primary orchestration model
  • All operations must be idempotent and safe to retry
  • All state transitions must be inspectable via API or persisted state
  • Logs, events, and state must be sufficient for debugging
  • Schemas must be explicit and versioned
  • Extensibility is preferred over short-term convenience
  • Observability is a first-class requirement

System Topology

Skipper

Skipper is the controller and source of truth for desired state.

Current responsibilities:

  • Store resource definitions on disk
  • Store desired_state, current_state, and last_applied_state
  • Create declarative work orders targeted at nodes
  • Accept structured work-order results and state reports
  • Persist structured logs, events, idempotency records, and snapshots
  • Expose a versioned REST API under /v1

Skippy

Skippy is a node-local reconciliation agent.

Current responsibilities:

  • Authenticate to Skipper with a node token
  • Poll for work orders over HTTPS
  • Apply desired state locally through Node.js modules
  • Report structured results
  • Report updated resource state
  • Emit structured JSON logs locally while also driving persisted state changes through the API

Communication Model

Communication is LAN-first and HTTPS-oriented.

  • Agents initiate control-plane communication
  • Skipper does not require inbound connectivity to managed nodes
  • All implemented API endpoints are versioned under /v1
  • All implemented requests and responses use JSON
  • Authentication is token-based
  • Request tracing is propagated with request_id and correlation_id

The current local development stack uses plain HTTP inside Docker and during smoke tests. The architecture remains HTTPS-first for production deployment.

API Contract

Versioning

All implemented control-plane endpoints live under /v1.

Implemented endpoints:

  • GET /v1/health
  • GET /v1/resources
  • GET /v1/resources/:resourceType/:resourceId
  • GET /v1/work-orders
  • GET /v1/work-orders/:workOrderId
  • GET /v1/nodes/:nodeId/work-orders/next
  • POST /v1/nodes/:nodeId/heartbeat
  • POST /v1/work-orders/:workOrderId/result
  • POST /v1/deployments/:tenantId/apply
  • GET /v1/snapshots/system/latest
  • GET /v1/snapshots/tenants/:tenantId/latest

A compatibility GET /health endpoint also exists for simple health checks.

Request Metadata

Every request is processed with:

  • request_id
  • correlation_id

These are accepted from:

  • x-request-id
  • x-correlation-id

If absent, Skipper generates them and returns them in both response headers and the response body envelope.

Response Envelope

All API responses use a stable envelope:

{
  "schema_version": "v1",
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "data": {},
  "error": null,
  "metadata": {
    "timestamp": "2026-04-05T12:00:00.000Z"
  }
}

Error responses use the same envelope with data: null.

Authentication

Two auth modes are currently implemented:

  • admin API requests use x-admin-token
  • node API requests use Authorization: Bearer <node-token>

Node tokens are stored in:

  • /data/auth/nodes/<node_id>.json
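The two auth modes can be sketched as a single header classifier. The function name and return shape are illustrative; the real token lookup reads the per-node JSON file shown above:

```javascript
// Illustrative sketch of the two implemented auth modes.
function resolveAuthMode(headers, adminToken) {
  // Admin requests carry x-admin-token.
  if (headers["x-admin-token"] === adminToken) {
    return { mode: "admin" };
  }
  // Node requests carry Authorization: Bearer <node-token>; the token
  // would be validated against /data/auth/nodes/<node_id>.json.
  const bearer = headers["authorization"] || "";
  if (bearer.startsWith("Bearer ")) {
    return { mode: "node", token: bearer.slice("Bearer ".length) };
  }
  return { mode: "unauthenticated" };
}
```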

Idempotency

Idempotency is currently implemented for deployment apply requests through x-idempotency-key.

Persisted idempotency records live under:

  • /data/idempotency

The main implemented idempotent flow is:

  • POST /v1/deployments/:tenantId/apply
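The replay behavior can be sketched as follows, with an in-memory store standing in for the /data/idempotency directory; applyIdempotent is an illustrative name, not the real module:

```javascript
// Illustrative sketch of idempotency-key replay. On a repeated key,
// the stored response is returned instead of re-running the apply.
async function applyIdempotent(store, key, handler) {
  if (store.has(key)) {
    return { replayed: true, response: store.get(key) };
  }
  const response = await handler();
  store.set(key, response); // in practice: a JSON record under /data/idempotency
  return { replayed: false, response };
}
```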

Resource Model

All managed resources are explicit JSON documents.

Each resource document contains:

  • id
  • resource_type
  • schema_version
  • desired_state
  • current_state
  • last_applied_state
  • metadata
  • created_at
  • updated_at

The three-state model is implemented and central to the design:

  • desired_state: what Skipper wants
  • current_state: what Skippy or Skipper currently knows to be true
  • last_applied_state: what Skippy most recently attempted or enforced
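The three states make drift detection a pure comparison. The sketch below is illustrative only; it uses naive JSON stringification (key-order sensitive), whereas real diffing would normalize:

```javascript
// Illustrative drift check over the three-state model.
function detectDrift(resource) {
  const desired = JSON.stringify(resource.desired_state);
  const current = JSON.stringify(resource.current_state);
  const applied = JSON.stringify(resource.last_applied_state);
  return {
    needs_apply: desired !== current, // desired differs from what is observed
    apply_stale: desired !== applied, // last attempt used older desired state
  };
}
```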

Implemented Resource Types

The storage layer currently supports these resource types:

  • tenant
  • node
  • service
  • deployment
  • resource_limits
  • network
  • volume

At the moment, the repository ships example data for:

  • tenant
  • node
  • service

Deployment resources are created dynamically when a deployment apply request is issued.

Tenant

Current tenant usage:

  • deployment target policy
  • service references
  • compose project specification

Example:

{
  "id": "example-tenant",
  "resource_type": "tenant",
  "schema_version": "v1",
  "desired_state": {
    "display_name": "Example Tenant",
    "deployment_policy": {
      "target_node_id": "host-1"
    },
    "service_ids": ["service-web"],
    "compose": {
      "tenant_id": "example-tenant",
      "compose_file": "services:\n  web:\n    image: nginx:alpine\n",
      "env": {
        "NGINX_PORT": "8081"
      }
    }
  },
  "current_state": {},
  "last_applied_state": {},
  "metadata": {}
}

Node

Current node usage:

  • desired enablement and labels
  • heartbeat status
  • agent capabilities
  • agent version

Service

Current service usage:

  • tenant ownership
  • service kind
  • image
  • network and volume references
  • resource limit reference

Deployment

Current deployment usage:

  • created during POST /v1/deployments/:tenantId/apply
  • tracks deployment status
  • tracks associated work order
  • stores deployment-oriented desired state

File-Based Persistence Layout

The current on-disk layout is:

/data
  /resources
    /tenants
    /nodes
    /services
    /deployments
    /resource-limits
    /networks
    /volumes
  /work-orders
    /pending
    /running
    /finished
  /events
    /YYYY-MM-DD
  /logs
    /YYYY-MM-DD
  /snapshots
    /system
    /tenants
  /idempotency
  /auth
    /nodes

Rules implemented today:

  • one JSON document per state file
  • atomic JSON writes
  • append-only event and log history
  • stable file names derived from resource or work-order IDs

Work Order Model

Skipper does not send direct commands. It issues declarative work orders.

Work Order Schema

The implemented work-order model is:

{
  "id": "4b9f5e2a-cf65-4342-97f5-66f3fe5a54f7",
  "resource_type": "work_order",
  "schema_version": "v1",
  "type": "deploy_service",
  "target": {
    "tenant_id": "example-tenant",
    "node_id": "host-1"
  },
  "desired_state": {
    "deployment_id": "deployment-123",
    "tenant_id": "example-tenant",
    "service_ids": ["service-web"],
    "compose_project": {}
  },
  "status": "pending",
  "result": null,
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "created_at": "2026-04-05T12:00:00.000Z",
  "started_at": null,
  "finished_at": null,
  "metadata": {}
}

Status Values

Implemented status values:

  • pending
  • running
  • success
  • failed

Implemented Work Order Type

The current code implements:

  • deploy_service

This currently reconciles a tenant compose project by:

  1. writing docker-compose.yml
  2. writing .env
  3. running docker compose up -d
  4. reporting structured output and state changes

Work Order Result Schema

Work-order results are structured JSON, not free-form text:

{
  "success": true,
  "code": "APPLY_OK",
  "message": "Desired state applied",
  "details": {
    "duration_ms": 8,
    "compose_path": "/opt/skipper/tenants/example-tenant/docker-compose.yml",
    "changed_resources": ["service-web"],
    "unchanged_resources": [],
    "command": {
      "program": "docker",
      "args": ["compose", "-f", "...", "up", "-d"],
      "exit_code": 0,
      "stdout": "...",
      "stderr": ""
    }
  }
}

This keeps results machine-readable while still preserving execution detail.

Reconciliation Flow

The currently implemented flow is:

  1. Admin calls POST /v1/deployments/:tenantId/apply
  2. Skipper loads the tenant resource
  3. Skipper creates a deployment resource
  4. Skipper creates a deploy_service work order targeted at the node in tenant desired state
  5. Skippy sends heartbeat and polls GET /v1/nodes/:nodeId/work-orders/next
  6. Skippy claims and executes the work order
  7. Skippy reports POST /v1/work-orders/:workOrderId/result
  8. Skipper finishes the work order, updates resource state, writes events, and keeps logs/snapshots available
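The agent side of this flow (steps 5-7) can be sketched as a single poll iteration, with the HTTP client injected so the control flow is visible. Function and field names are illustrative, not the actual agent API:

```javascript
// Illustrative sketch of one Skippy poll iteration.
async function pollOnce(client, nodeId) {
  // Step 5: heartbeat, then ask for the next pending work order.
  await client.post(`/v1/nodes/${nodeId}/heartbeat`, {});
  const next = await client.get(`/v1/nodes/${nodeId}/work-orders/next`);
  if (!next.data) return null; // nothing pending for this node

  // Step 6: apply the desired state locally.
  const result = await client.execute(next.data);

  // Step 7: report the structured result back to Skipper.
  await client.post(`/v1/work-orders/${next.data.id}/result`, result);
  return next.data.id;
}
```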

Retry Safety

Current retry safety measures:

  • deployment apply is idempotent through x-idempotency-key
  • work-order execution is state-based and convergent at the compose level
  • work-order completion is safe against duplicate result submission
  • filesystem writes are atomic

Structured Logging

All implemented operational logs are structured JSON.

Each log entry includes:

  • timestamp
  • level
  • service
  • node_id
  • tenant_id
  • request_id
  • correlation_id
  • action
  • result
  • metadata

Logs are written to:

  • /data/logs/YYYY-MM-DD/<service>.ndjson

The logger also redacts common secret-shaped keys such as token, secret, password, and authorization.
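The redaction pass can be sketched as a recursive walk over log metadata; the real key list and traversal live in the shared logger module, so this is illustrative only:

```javascript
// Illustrative sketch of secret-shaped key redaction.
const SECRET_KEYS = new Set(["token", "secret", "password", "authorization"]);

function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === "object") {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SECRET_KEYS.has(key.toLowerCase())
        ? "[REDACTED]"
        : redact(inner);
    }
    return out;
  }
  return value;
}
```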

Event System

Every important state transition in the current flow emits an event.

Implemented event storage:

  • /data/events/YYYY-MM-DD/<timestamp>-<event_id>.json

Implemented event types in the current code path:

  • resource_created
  • work_order_created
  • work_order_started
  • work_order_succeeded
  • work_order_failed
  • deployment_started
  • deployment_succeeded
  • deployment_failed
  • node_heartbeat_received
  • snapshot_created

The broader architecture still expects additional event coverage for all future resource mutations.

State Snapshots

Snapshots are implemented and persisted as JSON documents.

Supported snapshot scopes:

  • system
  • per-tenant

Implemented endpoints:

  • GET /v1/snapshots/system/latest
  • GET /v1/snapshots/tenants/:tenantId/latest

Each snapshot includes:

  • snapshot_id
  • scope
  • created_at
  • request_id
  • correlation_id
  • resources
  • diffs

Snapshot files are currently stored as:

  • /data/snapshots/system/latest.json
  • /data/snapshots/tenants/<tenant_id>.json
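The diffs field can be understood as a desired-versus-current comparison per resource. The sketch below shows a shallow top-level-key diff for illustration; the real snapshot diffing may differ:

```javascript
// Illustrative shallow diff between desired and current state.
function diffStates(desired, current) {
  const keys = new Set([...Object.keys(desired), ...Object.keys(current)]);
  const changed = [];
  for (const key of keys) {
    // Compare serialized values; keys present on only one side also count.
    if (JSON.stringify(desired[key]) !== JSON.stringify(current[key])) {
      changed.push(key);
    }
  }
  return changed;
}
```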

Observability and AI Readiness

The current implementation is AI-ready at the core workflow level because it now preserves:

  • request-level tracing across API and agent boundaries
  • structured work-order lifecycle data
  • historical logs
  • historical events
  • explicit desired/current/last-applied state
  • exportable JSON/NDJSON persistence

For the implemented deployment path, the system can answer:

  • what changed
  • which work order applied it
  • which node applied it
  • what desired state was targeted
  • what current and last applied state were recorded

Error Handling

All API errors are structured and envelope-wrapped.

Implemented error shape:

{
  "code": "RESOURCE_NOT_FOUND",
  "message": "Tenant not found",
  "details": {
    "resource_type": "tenant",
    "resource_id": "example-tenant"
  }
}

Implemented machine-readable error codes include:

  • INVALID_REQUEST
  • UNAUTHORIZED
  • RESOURCE_NOT_FOUND
  • WORK_ORDER_NOT_CLAIMABLE
  • INTERNAL_ERROR

Raw stack traces are not returned in API responses.

Security Model

Implemented

  • node token authentication
  • admin token authentication
  • correlation-aware structured logging
  • redaction of common secret-shaped log fields

Not Yet Implemented

  • role-based authorization
  • secret rotation workflows
  • mTLS
  • per-resource authorization policies

Extensibility Model

The code is currently structured so new resource types and work-order types can be added without replacing the whole control flow.

Current extensibility anchors:

  • resource storage by resource_type
  • work-order execution by type
  • stable response envelope
  • versioned schemas
  • shared storage and telemetry modules in /shared

Implemented Internal Modules

Shared

Storage and telemetry modules in /shared, used by both components.

Skipper API

The /v1 control-plane HTTP API and its persistence described above.

Skippy Agent

The node-local heartbeat, polling, and reconciliation loop.

Current Gaps

The code is now aligned with the architecture for the core deployment path, but it is not feature-complete across the full long-term vision.

Not yet implemented:

  • full CRUD APIs for all resource types
  • generic reconciliation across all future services
  • resource_updated and desired_state_changed event coverage for every mutation path
  • persisted state reports for all future resource kinds
  • richer diffing beyond snapshot-level desired/current comparisons
  • RBAC and richer authorization
  • production HTTPS termination inside the app itself
  • additional work-order types such as restart, migrate, nginx management, mysql provisioning, and systemd integration

Current Compliance Summary

Implemented and aligned:

  • /v1 API contract
  • request and correlation ID propagation
  • envelope-based responses
  • structured errors
  • declarative work orders
  • three-state resource model
  • structured JSON logging
  • event persistence
  • snapshot persistence
  • idempotent deployment apply
  • token-based controller/agent auth

Still incomplete relative to the full target:

  • broader resource coverage
  • broader reconciliation coverage
  • broader auth model
  • full event coverage for every possible state mutation