Skipper Architecture
Purpose
Skipper is a lightweight hosting orchestration system built for clarity, inspectability, and future AI-assisted operations.
- Skipper = control plane
- Skippy = host agent
- Communication model = HTTPS pull from agent to controller
- Implementation language = Node.js only
- Persistence = file-based JSON storage only
- Deployment target = Docker for all components
This document now reflects the current implementation in this repository, while also marking the remaining gaps to the broader target architecture.
Design Principles
- No hidden state
- No implicit resource relationships
- No shell-script driven control flow as the primary orchestration model
- All operations must be idempotent and safe to retry
- All state transitions must be inspectable via API or persisted state
- Logs, events, and state must be sufficient for debugging
- Schemas must be explicit and versioned
- Extensibility is preferred over short-term convenience
- Observability is a first-class requirement
System Topology
Skipper
Skipper is the controller and source of truth for desired state.
Current responsibilities:
- Store resource definitions on disk
- Store desired_state, current_state, and last_applied_state
- Create declarative work orders targeted at nodes
- Accept structured work-order results and state reports
- Persist structured logs, events, idempotency records, and snapshots
- Expose a versioned REST API under /v1
Skippy
Skippy is a node-local reconciliation agent.
Current responsibilities:
- Authenticate to Skipper with a node token
- Poll for work orders over HTTPS
- Apply desired state locally through Node.js modules
- Report structured results
- Report updated resource state
- Emit structured JSON logs locally while also driving persisted state changes through the API
Communication Model
Communication is LAN-first and HTTPS-oriented.
- Agents initiate control-plane communication
- Skipper does not require inbound connectivity to managed nodes
- All implemented API endpoints are versioned under /v1
- All implemented requests and responses use JSON
- Authentication is token-based
- Request tracing is propagated with request_id and correlation_id
The current local development stack uses plain HTTP inside Docker and during smoke tests. The architecture remains HTTPS-first for production deployment.
API Contract
Versioning
All implemented control-plane endpoints live under /v1.
Implemented endpoints:
- GET /v1/health
- GET /v1/resources
- GET /v1/resources/:resourceType/:resourceId
- GET /v1/work-orders
- GET /v1/work-orders/:workOrderId
- GET /v1/nodes/:nodeId/work-orders/next
- POST /v1/nodes/:nodeId/heartbeat
- POST /v1/work-orders/:workOrderId/result
- POST /v1/deployments/:tenantId/apply
- GET /v1/snapshots/system/latest
- GET /v1/snapshots/tenants/:tenantId/latest
A compatibility GET /health endpoint also exists for simple health checks.
Request Metadata
Every request is processed with:
- request_id
- correlation_id
These are accepted from the request headers:
- x-request-id
- x-correlation-id
If absent, Skipper generates them and returns them in both response headers and the response body envelope.
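The header handling described above could look roughly like the following. This is a minimal sketch, not the repository's code: the helper name resolveRequestContext is hypothetical, and generating a separate UUID for a missing correlation ID (rather than reusing the request ID) is an assumption.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical helper: resolves tracing IDs from request headers.
// Node's http module lower-cases incoming header names, so lower-case keys are used.
function resolveRequestContext(headers = {}) {
  const requestId = headers['x-request-id'] || randomUUID();
  // Assumption: a missing correlation ID gets its own fresh UUID.
  const correlationId = headers['x-correlation-id'] || randomUUID();
  return { request_id: requestId, correlation_id: correlationId };
}

// Caller-supplied IDs are kept; missing ones are generated.
const provided = resolveRequestContext({
  'x-request-id': 'req-1',
  'x-correlation-id': 'corr-1',
});
const generated = resolveRequestContext({});
console.log(provided, generated);
```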
Response Envelope
All API responses use a stable envelope:
{
  "schema_version": "v1",
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "data": {},
  "error": null,
  "metadata": {
    "timestamp": "2026-04-05T12:00:00.000Z"
  }
}
Error responses use the same envelope with data: null.
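As an illustration of the envelope rules, the sketch below builds both success and error responses from one function. The function name buildEnvelope and its parameter names are hypothetical; only the field names come from the envelope shown above.

```javascript
// Hypothetical envelope builder; field names follow the documented envelope.
function buildEnvelope({ requestId, correlationId, data = null, error = null }) {
  return {
    schema_version: 'v1',
    request_id: requestId,
    correlation_id: correlationId,
    // Error responses always carry data: null.
    data: error ? null : data,
    error,
    metadata: { timestamp: new Date().toISOString() },
  };
}

const ok = buildEnvelope({
  requestId: 'req-1',
  correlationId: 'corr-1',
  data: { status: 'ok' },
});

const failed = buildEnvelope({
  requestId: 'req-2',
  correlationId: 'corr-1',
  data: { ignored: true },
  error: {
    code: 'RESOURCE_NOT_FOUND',
    message: 'Tenant not found',
    details: { resource_type: 'tenant', resource_id: 'example-tenant' },
  },
});

console.log(JSON.stringify(ok, null, 2));
console.log(JSON.stringify(failed, null, 2));
```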
Authentication
Two auth modes are currently implemented:
- admin API requests use x-admin-token
- node API requests use Authorization: Bearer <node-token>
Node tokens are stored in:
/data/auth/nodes/<node_id>.json
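A node-token check might be sketched as follows. The contents of the token file are not described in this document, so the record shape (a token field) is an assumption, as are the helper names. Comparing SHA-256 digests with timingSafeEqual is one common way to avoid leaking timing information; it is not necessarily what the repository does.

```javascript
const { createHash, timingSafeEqual } = require('node:crypto');

// Extracts the token from an "Authorization: Bearer <node-token>" header value.
function extractBearerToken(authorizationHeader) {
  const match = /^Bearer\s+(.+)$/i.exec(authorizationHeader || '');
  return match ? match[1].trim() : null;
}

// Compares two tokens in constant time by comparing fixed-length digests.
function tokensMatch(presented, expected) {
  if (typeof presented !== 'string' || typeof expected !== 'string') return false;
  const a = createHash('sha256').update(presented).digest();
  const b = createHash('sha256').update(expected).digest();
  return timingSafeEqual(a, b);
}

// Hypothetical record, standing in for the parsed node token file.
const nodeRecord = { node_id: 'host-1', token: 'example-node-token' };

const accepted = tokensMatch(extractBearerToken('Bearer example-node-token'), nodeRecord.token);
const rejected = tokensMatch(extractBearerToken('Bearer wrong-token'), nodeRecord.token);
const missing = tokensMatch(extractBearerToken(undefined), nodeRecord.token);
console.log({ accepted, rejected, missing });
```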
Idempotency
Idempotency is currently implemented for deployment apply requests through x-idempotency-key.
Persisted idempotency records live under:
/data/idempotency
The main implemented idempotent flow is:
POST /v1/deployments/:tenantId/apply
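One way to picture a file-backed idempotency record is the sketch below. The record layout, the hashed file name, and the function name applyOnce are assumptions made for illustration; the demo writes to a temporary directory rather than /data/idempotency.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const { createHash } = require('node:crypto');

// Hypothetical helper: runs `apply` once per idempotency key and replays the
// stored response on later calls with the same key.
function applyOnce(dir, key, apply) {
  // Hash the key so arbitrary header values map to safe file names.
  const name = createHash('sha256').update(key).digest('hex');
  const file = path.join(dir, `${name}.json`);
  if (fs.existsSync(file)) {
    const record = JSON.parse(fs.readFileSync(file, 'utf8'));
    return { replayed: true, response: record.response };
  }
  const response = apply();
  fs.writeFileSync(file, JSON.stringify({ key, response }));
  return { replayed: false, response };
}

const idempotencyDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skipper-idem-'));
let applyCalls = 0;
const runApply = () =>
  applyOnce(idempotencyDir, 'deploy-example-tenant-1', () => {
    applyCalls += 1;
    return { deployment_id: `deployment-${applyCalls}` };
  });

const first = runApply();
const second = runApply();
console.log(first, second, applyCalls);
```

The existence check followed by a write is not safe against two concurrent requests with the same key; a real implementation would need to close that window, for example with an exclusive-create write.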
Resource Model
All managed resources are explicit JSON documents.
Each resource document contains:
- id
- resource_type
- schema_version
- desired_state
- current_state
- last_applied_state
- metadata
- created_at
- updated_at
The three-state model is implemented and central:
- desired_state: what Skipper wants
- current_state: what Skippy or Skipper currently knows to be true
- last_applied_state: what Skippy most recently attempted or enforced
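The three states make drift detection a matter of comparison. The sketch below assumes the three states share a comparable shape, which this document does not state; the function name describeDrift and its output fields are hypothetical.

```javascript
const { isDeepStrictEqual } = require('node:util');

// Hypothetical helper: summarizes how the three states relate.
function describeDrift(resource) {
  const {
    desired_state: desired,
    current_state: current,
    last_applied_state: lastApplied,
  } = resource;
  return {
    // True when the observed state already matches what Skipper wants.
    converged: isDeepStrictEqual(desired, current),
    // True when the desired state has not yet been attempted by the agent.
    apply_pending: !isDeepStrictEqual(desired, lastApplied),
  };
}

const beforeApply = describeDrift({
  desired_state: { image: 'nginx:alpine' },
  current_state: {},
  last_applied_state: {},
});

const afterApply = describeDrift({
  desired_state: { image: 'nginx:alpine' },
  current_state: { image: 'nginx:alpine' },
  last_applied_state: { image: 'nginx:alpine' },
});

console.log(beforeApply, afterApply);
```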
Implemented Resource Types
The storage layer currently supports these resource types:
- tenant
- node
- service
- deployment
- resource_limits
- network
- volume
At the moment, the repository ships example data for:
- tenant
- node
- service
deployment resources are created dynamically when a deployment apply request is issued.
Tenant
Current tenant usage:
- deployment target policy
- service references
- compose project specification
Example:
{
  "id": "example-tenant",
  "resource_type": "tenant",
  "schema_version": "v1",
  "desired_state": {
    "display_name": "Example Tenant",
    "deployment_policy": {
      "target_node_id": "host-1"
    },
    "service_ids": ["service-web"],
    "compose": {
      "tenant_id": "example-tenant",
      "compose_file": "services:\n  web:\n    image: nginx:alpine\n",
      "env": {
        "NGINX_PORT": "8081"
      }
    }
  },
  "current_state": {},
  "last_applied_state": {},
  "metadata": {}
}
Node
Current node usage:
- desired enablement and labels
- heartbeat status
- agent capabilities
- agent version
Service
Current service usage:
- tenant ownership
- service kind
- image
- network and volume references
- resource limit reference
Deployment
Current deployment usage:
- created during POST /v1/deployments/:tenantId/apply
- tracks deployment status
- tracks associated work order
- stores deployment-oriented desired state
File-Based Persistence Layout
The current on-disk layout is:
/data
  /resources
    /tenants
    /nodes
    /services
    /deployments
    /resource-limits
    /networks
    /volumes
  /work-orders
    /pending
    /running
    /finished
  /events
    /YYYY-MM-DD
  /logs
    /YYYY-MM-DD
  /snapshots
    /system
    /tenants
  /idempotency
  /auth
    /nodes
Rules implemented today:
- one JSON document per state file
- atomic JSON writes
- append-only event and log history
- stable file names derived from resource or work-order IDs
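Atomic JSON writes are commonly done by writing a temporary file and renaming it over the target. The sketch below shows that technique under stated assumptions: the function name writeJsonAtomic, the temporary-file naming, and the <id>.json file name are illustrative, and the demo uses a temporary directory instead of /data.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: writes a JSON document so readers never see a partial file.
function writeJsonAtomic(filePath, document) {
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  // The temp file lives in the same directory so the rename stays on one filesystem.
  const tempPath = `${filePath}.${process.pid}.${Date.now()}.tmp`;
  fs.writeFileSync(tempPath, `${JSON.stringify(document, null, 2)}\n`);
  // rename replaces the target in a single step on POSIX filesystems.
  fs.renameSync(tempPath, filePath);
}

const dataDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skipper-data-'));
const tenantPath = path.join(dataDir, 'resources', 'tenants', 'example-tenant.json');

writeJsonAtomic(tenantPath, { id: 'example-tenant', resource_type: 'tenant' });

const stored = JSON.parse(fs.readFileSync(tenantPath, 'utf8'));
const leftovers = fs.readdirSync(path.dirname(tenantPath));
console.log(stored, leftovers);
```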
Work Order Model
Skipper does not send direct commands. It issues declarative work orders.
Work Order Schema
The implemented work-order model is:
{
  "id": "4b9f5e2a-cf65-4342-97f5-66f3fe5a54f7",
  "resource_type": "work_order",
  "schema_version": "v1",
  "type": "deploy_service",
  "target": {
    "tenant_id": "example-tenant",
    "node_id": "host-1"
  },
  "desired_state": {
    "deployment_id": "deployment-123",
    "tenant_id": "example-tenant",
    "service_ids": ["service-web"],
    "compose_project": {}
  },
  "status": "pending",
  "result": null,
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "created_at": "2026-04-05T12:00:00.000Z",
  "started_at": null,
  "finished_at": null,
  "metadata": {}
}
Status Values
Implemented status values:
- pending
- running
- success
- failed
Implemented Work Order Type
The current code implements:
deploy_service
This currently reconciles a tenant compose project by:
- writing docker-compose.yml
- writing .env
- running docker compose up -d
- reporting structured output and state changes
Work Order Result Schema
Work-order results are structured JSON, not free-form text:
{
  "success": true,
  "code": "APPLY_OK",
  "message": "Desired state applied",
  "details": {
    "duration_ms": 8,
    "compose_path": "/opt/skipper/tenants/example-tenant/docker-compose.yml",
    "changed_resources": ["service-web"],
    "unchanged_resources": [],
    "command": {
      "program": "docker",
      "args": ["compose", "-f", "...", "up", "-d"],
      "exit_code": 0,
      "stdout": "...",
      "stderr": ""
    }
  }
}
This keeps results machine-readable while still preserving execution detail.
Reconciliation Flow
The currently implemented flow is:
- Admin calls POST /v1/deployments/:tenantId/apply
- Skipper loads the tenant resource
- Skipper creates a deployment resource
- Skipper creates a deploy_service work order targeted at the node in tenant desired state
- Skippy sends heartbeat and polls GET /v1/nodes/:nodeId/work-orders/next
- Skippy claims and executes the work order
- Skippy reports POST /v1/work-orders/:workOrderId/result
- Skipper finishes the work order, updates resource state, writes events, and keeps logs/snapshots available
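The agent side of this flow can be pictured as a single cycle: heartbeat, poll, execute, report. The sketch below is illustrative only. The client and handlers objects are injected stand-ins rather than the real Skippy modules, their method names are hypothetical, and the APPLY_FAILED code is invented for the example.

```javascript
// Hypothetical agent cycle; `client` wraps the three node-facing endpoints.
async function runAgentCycle(client, handlers, nodeId) {
  await client.heartbeat(nodeId); // POST /v1/nodes/:nodeId/heartbeat
  const workOrder = await client.nextWorkOrder(nodeId); // GET .../work-orders/next
  if (!workOrder) return null; // nothing to do this cycle

  let result;
  try {
    const handler = handlers[workOrder.type];
    if (!handler) throw new Error(`Unsupported work order type: ${workOrder.type}`);
    result = await handler(workOrder);
  } catch (err) {
    // Failures are still reported as structured results, never as free text.
    result = { success: false, code: 'APPLY_FAILED', message: err.message, details: {} };
  }
  await client.reportResult(workOrder.id, result); // POST /v1/work-orders/:id/result
  return result;
}

// Stubbed demo: no network calls are made.
const reported = [];
const stubClient = {
  heartbeat: async () => {},
  nextWorkOrder: async () => ({ id: 'wo-1', type: 'deploy_service', desired_state: {} }),
  reportResult: async (id, result) => {
    reported.push({ id, result });
  },
};
const stubHandlers = {
  deploy_service: async () => ({
    success: true,
    code: 'APPLY_OK',
    message: 'Desired state applied',
    details: {},
  }),
};

const cycle = runAgentCycle(stubClient, stubHandlers, 'host-1');
cycle.then((result) => console.log(result.code, reported.length));
```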
Retry Safety
Current retry safety measures:
- deployment apply is idempotent through x-idempotency-key
- work-order execution is state-based and convergent at the compose level
- work-order completion is safe against duplicate result submission
- filesystem writes are atomic
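Safety against duplicate result submission can be sketched as a guard on terminal status. How the repository actually handles a second submission is not stated here; the sketch assumes the first result wins, and the function name finishWorkOrder and the APPLY_FAILED code are hypothetical.

```javascript
const TERMINAL_STATUSES = new Set(['success', 'failed']);

// Hypothetical completion step: applies a result once, ignores later duplicates.
function finishWorkOrder(workOrder, result, now = new Date().toISOString()) {
  if (TERMINAL_STATUSES.has(workOrder.status)) {
    // Assumption: the first recorded result wins; duplicates change nothing.
    return { workOrder, applied: false };
  }
  return {
    workOrder: {
      ...workOrder,
      status: result.success ? 'success' : 'failed',
      result,
      finished_at: now,
    },
    applied: true,
  };
}

const running = { id: 'wo-1', status: 'running', result: null, finished_at: null };
const firstSubmit = finishWorkOrder(running, { success: true, code: 'APPLY_OK' });
const duplicateSubmit = finishWorkOrder(firstSubmit.workOrder, {
  success: false,
  code: 'APPLY_FAILED',
});

console.log(firstSubmit.workOrder.status, duplicateSubmit.applied);
```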
Structured Logging
All implemented operational logs are structured JSON.
Each log entry includes:
- timestamp
- level
- service
- node_id
- tenant_id
- request_id
- correlation_id
- action
- result
- metadata
Logs are written to:
/data/logs/YYYY-MM-DD/<service>.ndjson
The logger also redacts common secret-shaped keys such as token, secret, password, and authorization.
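Key-based redaction followed by NDJSON formatting might look like the sketch below. The key pattern mirrors the four names listed above; the [REDACTED] placeholder, the recursive traversal, and the function names are assumptions, not the behavior of shared/logs.js.

```javascript
// Matches keys that contain any of the secret-shaped names listed above.
const SECRET_KEY_PATTERN = /token|secret|password|authorization/i;

// Hypothetical redactor: replaces values of secret-shaped keys at any depth.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        SECRET_KEY_PATTERN.test(key) ? [key, '[REDACTED]'] : [key, redact(inner)],
      ),
    );
  }
  return value;
}

// Formats one NDJSON line: a single JSON object terminated by a newline.
function formatLogLine(entry) {
  return `${JSON.stringify(redact({ timestamp: new Date().toISOString(), ...entry }))}\n`;
}

const line = formatLogLine({
  level: 'info',
  service: 'skipper',
  node_id: 'host-1',
  action: 'node.authenticate',
  result: 'ok',
  metadata: { authorization: 'Bearer abc', node_token: 'abc', attempt: 1 },
});
const parsed = JSON.parse(line);
console.log(line);
```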
Event System
Every important state transition in the current flow emits an event.
Implemented event storage:
/data/events/YYYY-MM-DD/<timestamp>-<event_id>.json
Implemented event types in the current code path:
- resource_created
- work_order_created
- work_order_started
- work_order_succeeded
- work_order_failed
- deployment_started
- deployment_succeeded
- deployment_failed
- node_heartbeat_received
- snapshot_created
The broader architecture still expects additional event coverage for all future resource mutations.
State Snapshots
Snapshots are implemented and persisted as JSON documents.
Supported snapshot scopes:
- system
- per-tenant
Implemented endpoints:
- GET /v1/snapshots/system/latest
- GET /v1/snapshots/tenants/:tenantId/latest
Each snapshot includes:
- snapshot_id
- scope
- created_at
- request_id
- correlation_id
- resources
- diffs
Snapshot files are currently stored as:
/data/snapshots/system/latest.json
/data/snapshots/tenants/<tenant_id>.json
Observability and AI Readiness
The current implementation is AI-ready at the core workflow level because it now preserves:
- request-level tracing across API and agent boundaries
- structured work-order lifecycle data
- historical logs
- historical events
- explicit desired/current/last-applied state
- exportable JSON/NDJSON persistence
For the implemented deployment path, the system can answer:
- what changed
- which work order applied it
- which node applied it
- what desired state was targeted
- what current and last applied state were recorded
Error Handling
All API errors are structured and envelope-wrapped.
Implemented error shape:
{
  "code": "RESOURCE_NOT_FOUND",
  "message": "Tenant not found",
  "details": {
    "resource_type": "tenant",
    "resource_id": "example-tenant"
  }
}
Implemented machine-readable error codes include:
- INVALID_REQUEST
- UNAUTHORIZED
- RESOURCE_NOT_FOUND
- WORK_ORDER_NOT_CLAIMABLE
- INTERNAL_ERROR
Raw stack traces are not returned in API responses.
Security Model
Implemented
- node token authentication
- admin token authentication
- correlation-aware structured logging
- redaction of common secret-shaped log fields
Not Yet Implemented
- role-based authorization
- secret rotation workflows
- mTLS
- per-resource authorization policies
Extensibility Model
The code is currently structured so new resource types and work-order types can be added without replacing the whole control flow.
Current extensibility anchors:
- resource storage by resource_type
- work-order execution by type
- stable response envelope
- versioned schemas
- shared storage and telemetry modules in /shared
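Storage keyed by resource_type can be illustrated with a small lookup from type to directory. The directory names below are copied from the on-disk layout in this document; the lookup table, the resourcePath function, the ID validation rule, and the <id>.json file name are assumptions. Adding a resource type would then mean adding one table entry.

```javascript
const path = require('node:path');

// Directory names taken from the documented layout; the mapping itself is hypothetical.
const RESOURCE_DIRECTORIES = {
  tenant: 'tenants',
  node: 'nodes',
  service: 'services',
  deployment: 'deployments',
  resource_limits: 'resource-limits',
  network: 'networks',
  volume: 'volumes',
};

// Hypothetical helper: resolves the state file for one resource.
function resourcePath(dataDir, resourceType, resourceId) {
  const directory = RESOURCE_DIRECTORIES[resourceType];
  if (!directory) throw new Error(`Unknown resource type: ${resourceType}`);
  // Reject IDs that could escape the resource directory (assumed rule).
  if (!/^[A-Za-z0-9_-][A-Za-z0-9._-]*$/.test(resourceId)) {
    throw new Error(`Invalid resource id: ${resourceId}`);
  }
  return path.posix.join(dataDir, 'resources', directory, `${resourceId}.json`);
}

const tenantFile = resourcePath('/data', 'tenant', 'example-tenant');
const limitsFile = resourcePath('/data', 'resource_limits', 'small');
console.log(tenantFile, limitsFile);
```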
Implemented Internal Modules
Shared
- shared/context.js
- shared/errors.js
- shared/auth.js
- shared/resources.js
- shared/work-orders.js
- shared/logs.js
- shared/events.js
- shared/idempotency.js
- shared/snapshots.js
Skipper API
Skippy Agent
Current Gaps
The code is now aligned with the architecture for the core deployment path, but it is not feature-complete across the full long-term vision.
Not yet implemented:
- full CRUD APIs for all resource types
- generic reconciliation across all future services
- resource_updated and desired_state_changed event coverage for every mutation path
- persisted state reports for all future resource kinds
- richer diffing beyond snapshot-level desired/current comparisons
- RBAC and richer authorization
- production HTTPS termination inside the app itself
- additional work-order types such as restart, migrate, nginx management, mysql provisioning, and systemd integration
Current Compliance Summary
Implemented and aligned:
- /v1 API contract
- request and correlation ID propagation
- envelope-based responses
- structured errors
- declarative work orders
- three-state resource model
- structured JSON logging
- event persistence
- snapshot persistence
- idempotent deployment apply
- token-based controller/agent auth
Still incomplete relative to the full target:
- broader resource coverage
- broader reconciliation coverage
- broader auth model
- full event coverage for every possible state mutation