Skipper Architecture

Purpose

Skipper is a lightweight hosting orchestration system built for clarity, inspectability, and future AI-assisted operations.

  • Skipper = control plane
  • Skippy = host agent
  • Communication model = HTTPS pull from agent to controller
  • Implementation language = Node.js only
  • Persistence = file-based JSON storage only
  • Deployment target = Docker for all components

This document reflects the current implementation in this repository and marks the remaining gaps relative to the broader target architecture.

Design Principles

  • No hidden state
  • No implicit resource relationships
  • No shell-script driven control flow as the primary orchestration model
  • All operations must be idempotent and safe to retry
  • All state transitions must be inspectable via API or persisted state
  • Logs, events, and state must be sufficient for debugging
  • Schemas must be explicit and versioned
  • Extensibility is preferred over short-term convenience
  • Observability is a first-class requirement

System Topology

Skipper

Skipper is the controller and source of truth for desired state.

Current responsibilities:

  • Store resource definitions on disk
  • Store desired_state, current_state, and last_applied_state
  • Create declarative work orders targeted at nodes
  • Accept structured work-order results and state reports
  • Persist structured logs, events, idempotency records, and snapshots
  • Expose a versioned REST API under /v1

Skippy

Skippy is a node-local reconciliation agent.

Current responsibilities:

  • Authenticate to Skipper with a node token
  • Poll for work orders over HTTPS
  • Apply desired state locally through Node.js modules
  • Report structured results
  • Report updated resource state
  • Emit structured JSON logs locally while also driving persisted state changes through the API

Communication Model

Communication is LAN-first and HTTPS-oriented.

  • Agents initiate control-plane communication
  • Skipper does not require inbound connectivity to managed nodes
  • All implemented API endpoints are versioned under /v1
  • All implemented requests and responses use JSON
  • Authentication is token-based
  • Request tracing is propagated with request_id and correlation_id

The current local development stack uses plain HTTP inside Docker and during smoke tests. The architecture remains HTTPS-first for production deployment.

API Contract

Versioning

All implemented control-plane endpoints live under /v1.

Implemented endpoints:

  • GET /v1/health
  • GET /v1/resources
  • GET /v1/resources/:resourceType/:resourceId
  • GET /v1/work-orders
  • GET /v1/work-orders/:workOrderId
  • GET /v1/nodes/:nodeId/work-orders/next
  • POST /v1/nodes/:nodeId/heartbeat
  • POST /v1/work-orders/:workOrderId/result
  • POST /v1/deployments/:tenantId/apply
  • GET /v1/snapshots/system/latest
  • GET /v1/snapshots/tenants/:tenantId/latest

A compatibility GET /health endpoint also exists for simple health checks.

Request Metadata

Every request is processed with:

  • request_id
  • correlation_id

These are accepted from:

  • x-request-id
  • x-correlation-id

If absent, Skipper generates them and returns them in both response headers and the response body envelope.

Response Envelope

All API responses use a stable envelope:

{
  "schema_version": "v1",
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "data": {},
  "error": null,
  "metadata": {
    "timestamp": "2026-04-05T12:00:00.000Z"
  }
}

Error responses use the same envelope with data: null.

Authentication

Two auth modes are currently implemented:

  • admin API requests use x-admin-token
  • node API requests use Authorization: Bearer <node-token>

Node tokens are stored in:

  • /data/auth/nodes/<node_id>.json
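The two auth modes can be sketched as a single header classifier. The function name and return shape are illustrative; the real token lookup reads the per-node JSON file shown above:

```javascript
// Illustrative sketch of the two implemented auth modes.
function resolveAuthMode(headers, adminToken) {
  // Admin requests carry x-admin-token.
  if (headers["x-admin-token"] === adminToken) {
    return { mode: "admin" };
  }
  // Node requests carry Authorization: Bearer <node-token>; the token
  // would be validated against /data/auth/nodes/<node_id>.json.
  const bearer = headers["authorization"] || "";
  if (bearer.startsWith("Bearer ")) {
    return { mode: "node", token: bearer.slice("Bearer ".length) };
  }
  return { mode: "unauthenticated" };
}
```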

Idempotency

Idempotency is currently implemented for deployment apply requests through x-idempotency-key.

Persisted idempotency records live under:

  • /data/idempotency

The main implemented idempotent flow is:

  • POST /v1/deployments/:tenantId/apply
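The replay behavior can be sketched as follows, with an in-memory store standing in for the /data/idempotency directory; applyIdempotent is an illustrative name, not the real module:

```javascript
// Illustrative sketch of idempotency-key replay. On a repeated key,
// the stored response is returned instead of re-running the apply.
async function applyIdempotent(store, key, handler) {
  if (store.has(key)) {
    return { replayed: true, response: store.get(key) };
  }
  const response = await handler();
  store.set(key, response); // in practice: a JSON record under /data/idempotency
  return { replayed: false, response };
}
```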

Resource Model

All managed resources are explicit JSON documents.

Each resource document contains:

  • id
  • resource_type
  • schema_version
  • desired_state
  • current_state
  • last_applied_state
  • metadata
  • created_at
  • updated_at

The three-state model is implemented and central to the design:

  • desired_state: what Skipper wants
  • current_state: what Skippy or Skipper currently knows to be true
  • last_applied_state: what Skippy most recently attempted or enforced
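The three states make drift detection a pure comparison. The sketch below is illustrative only; it uses naive JSON stringification (key-order sensitive), whereas real diffing would normalize:

```javascript
// Illustrative drift check over the three-state model.
function detectDrift(resource) {
  const desired = JSON.stringify(resource.desired_state);
  const current = JSON.stringify(resource.current_state);
  const applied = JSON.stringify(resource.last_applied_state);
  return {
    needs_apply: desired !== current, // desired differs from what is observed
    apply_stale: desired !== applied, // last attempt used older desired state
  };
}
```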

Implemented Resource Types

The storage layer currently supports these resource types:

  • tenant
  • node
  • service
  • deployment
  • resource_limits
  • network
  • volume

At the moment, the repository ships example data for:

  • tenant
  • node
  • service

Deployment resources are created dynamically when a deployment apply request is issued.

Tenant

Current tenant usage:

  • deployment target policy
  • service references
  • compose project specification

Example:

{
  "id": "example-tenant",
  "resource_type": "tenant",
  "schema_version": "v1",
  "desired_state": {
    "display_name": "Example Tenant",
    "deployment_policy": {
      "target_node_id": "host-1"
    },
    "service_ids": ["service-web"],
    "compose": {
      "tenant_id": "example-tenant",
      "compose_file": "services:\n  web:\n    image: nginx:alpine\n",
      "env": {
        "NGINX_PORT": "8081"
      }
    }
  },
  "current_state": {},
  "last_applied_state": {},
  "metadata": {}
}

Node

Current node usage:

  • desired enablement and labels
  • heartbeat status
  • agent capabilities
  • agent version

Service

Current service usage:

  • tenant ownership
  • service kind
  • image
  • network and volume references
  • resource limit reference

Deployment

Current deployment usage:

  • created during POST /v1/deployments/:tenantId/apply
  • tracks deployment status
  • tracks associated work order
  • stores deployment-oriented desired state

File-Based Persistence Layout

The current on-disk layout is:

/data
  /resources
    /tenants
    /nodes
    /services
    /deployments
    /resource-limits
    /networks
    /volumes
  /work-orders
    /pending
    /running
    /finished
  /events
    /YYYY-MM-DD
  /logs
    /YYYY-MM-DD
  /snapshots
    /system
    /tenants
  /idempotency
  /auth
    /nodes

Rules implemented today:

  • one JSON document per state file
  • atomic JSON writes
  • append-only event and log history
  • stable file names derived from resource or work-order IDs

Work Order Model

Skipper does not send direct commands. It issues declarative work orders.

Work Order Schema

The implemented work-order model is:

{
  "id": "4b9f5e2a-cf65-4342-97f5-66f3fe5a54f7",
  "resource_type": "work_order",
  "schema_version": "v1",
  "type": "deploy_service",
  "target": {
    "tenant_id": "example-tenant",
    "node_id": "host-1"
  },
  "desired_state": {
    "deployment_id": "deployment-123",
    "tenant_id": "example-tenant",
    "service_ids": ["service-web"],
    "compose_project": {}
  },
  "status": "pending",
  "result": null,
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "created_at": "2026-04-05T12:00:00.000Z",
  "started_at": null,
  "finished_at": null,
  "metadata": {}
}

Status Values

Implemented status values:

  • pending
  • running
  • success
  • failed

Implemented Work Order Type

The current code implements:

  • deploy_service

This currently reconciles a tenant compose project by:

  1. writing docker-compose.yml
  2. writing .env
  3. running docker compose up -d
  4. reporting structured output and state changes

Work Order Result Schema

Work-order results are structured JSON, not free-form text:

{
  "success": true,
  "code": "APPLY_OK",
  "message": "Desired state applied",
  "details": {
    "duration_ms": 8,
    "compose_path": "/opt/skipper/tenants/example-tenant/docker-compose.yml",
    "changed_resources": ["service-web"],
    "unchanged_resources": [],
    "command": {
      "program": "docker",
      "args": ["compose", "-f", "...", "up", "-d"],
      "exit_code": 0,
      "stdout": "...",
      "stderr": ""
    }
  }
}

This keeps results machine-readable while still preserving execution detail.

Reconciliation Flow

The currently implemented flow is:

  1. Admin calls POST /v1/deployments/:tenantId/apply
  2. Skipper loads the tenant resource
  3. Skipper creates a deployment resource
  4. Skipper creates a deploy_service work order targeted at the node in tenant desired state
  5. Skippy sends heartbeat and polls GET /v1/nodes/:nodeId/work-orders/next
  6. Skippy claims and executes the work order
  7. Skippy reports POST /v1/work-orders/:workOrderId/result
  8. Skipper finishes the work order, updates resource state, writes events, and keeps logs/snapshots available
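The agent side of this flow (steps 5-7) can be sketched as a single poll iteration, with the HTTP client injected so the control flow is visible. Function and field names are illustrative, not the actual agent API:

```javascript
// Illustrative sketch of one Skippy poll iteration.
async function pollOnce(client, nodeId) {
  // Step 5: heartbeat, then ask for the next pending work order.
  await client.post(`/v1/nodes/${nodeId}/heartbeat`, {});
  const next = await client.get(`/v1/nodes/${nodeId}/work-orders/next`);
  if (!next.data) return null; // nothing pending for this node

  // Step 6: apply the desired state locally.
  const result = await client.execute(next.data);

  // Step 7: report the structured result back to Skipper.
  await client.post(`/v1/work-orders/${next.data.id}/result`, result);
  return next.data.id;
}
```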

Retry Safety

Current retry safety measures:

  • deployment apply is idempotent through x-idempotency-key
  • work-order execution is state-based and convergent at the compose level
  • work-order completion is safe against duplicate result submission
  • filesystem writes are atomic

Structured Logging

All implemented operational logs are structured JSON.

Each log entry includes:

  • timestamp
  • level
  • service
  • node_id
  • tenant_id
  • request_id
  • correlation_id
  • action
  • result
  • metadata

Logs are written to:

  • /data/logs/YYYY-MM-DD/<service>.ndjson

The logger also redacts common secret-shaped keys such as token, secret, password, and authorization.
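The redaction pass can be sketched as a recursive walk over log metadata; the real key list and traversal live in the shared logger module, so this is illustrative only:

```javascript
// Illustrative sketch of secret-shaped key redaction.
const SECRET_KEYS = new Set(["token", "secret", "password", "authorization"]);

function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === "object") {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SECRET_KEYS.has(key.toLowerCase())
        ? "[REDACTED]"
        : redact(inner);
    }
    return out;
  }
  return value;
}
```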

Event System

Every important state transition in the current flow emits an event.

Implemented event storage:

  • /data/events/YYYY-MM-DD/<timestamp>-<event_id>.json

Implemented event types in the current code path:

  • resource_created
  • work_order_created
  • work_order_started
  • work_order_succeeded
  • work_order_failed
  • deployment_started
  • deployment_succeeded
  • deployment_failed
  • node_heartbeat_received
  • snapshot_created

The broader architecture still expects additional event coverage for all future resource mutations.

State Snapshots

Snapshots are implemented and persisted as JSON documents.

Supported snapshot scopes:

  • system
  • per-tenant

Implemented endpoints:

  • GET /v1/snapshots/system/latest
  • GET /v1/snapshots/tenants/:tenantId/latest

Each snapshot includes:

  • snapshot_id
  • scope
  • created_at
  • request_id
  • correlation_id
  • resources
  • diffs

Snapshot files are currently stored as:

  • /data/snapshots/system/latest.json
  • /data/snapshots/tenants/<tenant_id>.json
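The diffs field can be understood as a desired-versus-current comparison per resource. The sketch below shows a shallow top-level-key diff for illustration; the real snapshot diffing may differ:

```javascript
// Illustrative shallow diff between desired and current state.
function diffStates(desired, current) {
  const keys = new Set([...Object.keys(desired), ...Object.keys(current)]);
  const changed = [];
  for (const key of keys) {
    // Compare serialized values; keys present on only one side also count.
    if (JSON.stringify(desired[key]) !== JSON.stringify(current[key])) {
      changed.push(key);
    }
  }
  return changed;
}
```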

Observability and AI Readiness

The current implementation is AI-ready at the core workflow level because it now preserves:

  • request-level tracing across API and agent boundaries
  • structured work-order lifecycle data
  • historical logs
  • historical events
  • explicit desired/current/last-applied state
  • exportable JSON/NDJSON persistence

For the implemented deployment path, the system can answer:

  • what changed
  • which work order applied it
  • which node applied it
  • what desired state was targeted
  • what current and last applied state were recorded

Error Handling

All API errors are structured and envelope-wrapped.

Implemented error shape:

{
  "code": "RESOURCE_NOT_FOUND",
  "message": "Tenant not found",
  "details": {
    "resource_type": "tenant",
    "resource_id": "example-tenant"
  }
}

Implemented machine-readable error codes include:

  • INVALID_REQUEST
  • UNAUTHORIZED
  • RESOURCE_NOT_FOUND
  • WORK_ORDER_NOT_CLAIMABLE
  • INTERNAL_ERROR

Raw stack traces are not returned in API responses.

Security Model

Implemented

  • node token authentication
  • admin token authentication
  • correlation-aware structured logging
  • redaction of common secret-shaped log fields

Not Yet Implemented

  • role-based authorization
  • secret rotation workflows
  • mTLS
  • per-resource authorization policies

Extensibility Model

The code is currently structured so new resource types and work-order types can be added without replacing the whole control flow.

Current extensibility anchors:

  • resource storage by resource_type
  • work-order execution by type
  • stable response envelope
  • versioned schemas
  • shared storage and telemetry modules in /shared

Implemented Internal Modules

Shared

Storage and telemetry modules in /shared, used by both components.

Skipper API

The /v1 control-plane HTTP API and its persistence described above.

Skippy Agent

The node-local heartbeat, polling, and reconciliation loop.

Current Gaps

The code is now aligned with the architecture for the core deployment path, but it is not feature-complete across the full long-term vision.

Not yet implemented:

  • full CRUD APIs for all resource types
  • generic reconciliation across all future services
  • resource_updated and desired_state_changed event coverage for every mutation path
  • persisted state reports for all future resource kinds
  • richer diffing beyond snapshot-level desired/current comparisons
  • RBAC and richer authorization
  • production HTTPS termination inside the app itself
  • additional work-order types such as restart, migrate, nginx management, mysql provisioning, and systemd integration

Current Compliance Summary

Implemented and aligned:

  • /v1 API contract
  • request and correlation ID propagation
  • envelope-based responses
  • structured errors
  • declarative work orders
  • three-state resource model
  • structured JSON logging
  • event persistence
  • snapshot persistence
  • idempotent deployment apply
  • token-based controller/agent auth

Still incomplete relative to the full target:

  • broader resource coverage
  • broader reconciliation coverage
  • broader auth model
  • full event coverage for every possible state mutation