# Skipper Architecture

## Purpose

Skipper is a lightweight hosting orchestration system built for clarity, inspectability, and future AI-assisted operations.

- `Skipper` = control plane
- `Skippy` = host agent
- Communication model = HTTPS pull from agent to controller
- Implementation language = Node.js only
- Persistence = file-based JSON storage only
- Deployment target = Docker for all components

This document now reflects the current implementation in this repository, while also marking the remaining gaps to the broader target architecture.

## Design Principles

- No hidden state
- No implicit resource relationships
- No shell-script driven control flow as the primary orchestration model
- All operations must be idempotent and safe to retry
- All state transitions must be inspectable via API or persisted state
- Logs, events, and state must be sufficient for debugging
- Schemas must be explicit and versioned
- Extensibility is preferred over short-term convenience
- Observability is a first-class requirement

## System Topology

### Skipper

Skipper is the controller and source of truth for desired state.

Current responsibilities:

- Store resource definitions on disk
- Store `desired_state`, `current_state`, and `last_applied_state`
- Create declarative work orders targeted at nodes
- Accept structured work-order results and state reports
- Persist structured logs, events, idempotency records, and snapshots
- Expose a versioned REST API under `/v1`

### Skippy

Skippy is a node-local reconciliation agent.

Current responsibilities:

- Authenticate to Skipper with a node token
- Poll for work orders over HTTPS
- Apply desired state locally through Node.js modules
- Report structured results
- Report updated resource state
- Emit structured JSON logs locally while also driving persisted state changes through the API

## Communication Model

Communication is LAN-first and HTTPS-oriented.

- Agents initiate control-plane communication
- Skipper does not require inbound connectivity to managed nodes
- All implemented API endpoints are versioned under `/v1`
- All implemented requests and responses use JSON
- Authentication is token-based
- Request tracing is propagated with `request_id` and `correlation_id`

The current local development stack uses plain HTTP inside Docker and during smoke tests. The architecture remains HTTPS-first for production deployment.

## API Contract

### Versioning

All implemented control-plane endpoints live under `/v1`.

Implemented endpoints:

- `GET /v1/health`
- `GET /v1/resources`
- `GET /v1/resources/:resourceType/:resourceId`
- `GET /v1/work-orders`
- `GET /v1/work-orders/:workOrderId`
- `GET /v1/nodes/:nodeId/work-orders/next`
- `POST /v1/nodes/:nodeId/heartbeat`
- `POST /v1/work-orders/:workOrderId/result`
- `POST /v1/deployments/:tenantId/apply`
- `GET /v1/snapshots/system/latest`
- `GET /v1/snapshots/tenants/:tenantId/latest`

A compatibility `GET /health` endpoint also exists for simple health checks.

### Request Metadata

Every request is processed with:

- `request_id`
- `correlation_id`

These are accepted from:

- `x-request-id`
- `x-correlation-id`

If absent, Skipper generates them and returns them in both response headers and the response body envelope.

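
The header handling described above could be sketched roughly as follows. This is an illustration only, not the code in `shared/context.js`: the function names are hypothetical, and it assumes UUIDs for generated IDs and that response headers reuse the request header names.

```javascript
const { randomUUID } = require("node:crypto");

// Hypothetical helper: resolves tracing IDs from incoming HTTP headers.
// Node.js lowercases incoming header names, so lookups use lowercase keys.
function resolveRequestContext(headers = {}) {
  const requestId = headers["x-request-id"] || randomUUID();
  // Assumption: a missing correlation ID gets its own freshly generated UUID.
  const correlationId = headers["x-correlation-id"] || randomUUID();
  return { request_id: requestId, correlation_id: correlationId };
}

// Assumption: the IDs are echoed back under the same header names.
function tracingResponseHeaders(context) {
  return {
    "x-request-id": context.request_id,
    "x-correlation-id": context.correlation_id,
  };
}
```

A caller that wants to follow one deployment across the API and the agent would send the same `x-correlation-id` on every related request and let `x-request-id` vary per call.
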
### Response Envelope

All API responses use a stable envelope:

```json
{
  "schema_version": "v1",
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "data": {},
  "error": null,
  "metadata": {
    "timestamp": "2026-04-05T12:00:00.000Z"
  }
}
```

Error responses use the same envelope with `data: null`.

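
A minimal sketch of a helper that produces this envelope is shown here. The name `buildEnvelope` and its parameter names are assumptions made for illustration; only the output field names come from the example above.

```javascript
// Hypothetical helper that wraps a payload (or an error) in the v1 envelope.
function buildEnvelope({ requestId, correlationId, data = null, error = null, now = new Date() }) {
  return {
    schema_version: "v1",
    request_id: requestId,
    correlation_id: correlationId,
    // Error responses carry `data: null`, so the payload is dropped when an error is set.
    data: error ? null : data,
    error,
    metadata: { timestamp: now.toISOString() },
  };
}
```

Routing every success and failure through one function like this is one way to keep the envelope stable across endpoints.
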
### Authentication

Two auth modes are currently implemented:

- admin API requests use `x-admin-token`
- node API requests use `Authorization: Bearer <node-token>`

Node tokens are stored in:

- `/data/auth/nodes/<node_id>.json`

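
As an illustration of the node auth mode, the sketch below parses the bearer header and compares tokens in constant time. It is not the code in `shared/auth.js`: the function names are hypothetical, and the `token` field on the stored node record is an assumed name, since the file's schema is not given here.

```javascript
const { timingSafeEqual } = require("node:crypto");

// Extracts the token from an `Authorization: Bearer <node-token>` header value.
function extractBearerToken(authorizationHeader) {
  if (typeof authorizationHeader !== "string") return null;
  const match = /^Bearer\s+(\S+)$/i.exec(authorizationHeader.trim());
  return match ? match[1] : null;
}

// Constant-time comparison so response timing does not leak token contents.
function tokensMatch(presented, expected) {
  if (typeof expected !== "string" || expected.length === 0) return false;
  const a = Buffer.from(presented);
  const b = Buffer.from(expected);
  // timingSafeEqual throws on length mismatch, so lengths are checked first.
  return a.length === b.length && timingSafeEqual(a, b);
}

// Assumption: `nodeRecord` is the parsed /data/auth/nodes/<node_id>.json document
// and stores the token under a `token` field (hypothetical field name).
function authenticateNode(headers, nodeRecord) {
  const presented = extractBearerToken(headers.authorization);
  return presented !== null && tokensMatch(presented, nodeRecord.token);
}
```

A failed check would typically map to the `UNAUTHORIZED` error code.
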
### Idempotency

Idempotency is currently implemented for deployment apply requests through `x-idempotency-key`.

Persisted idempotency records live under:

- `/data/idempotency`

The main implemented idempotent flow is:

- `POST /v1/deployments/:tenantId/apply`

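
One way a file-backed idempotency record could work is sketched below. This is an assumption-laden illustration rather than the contents of `shared/idempotency.js`: the record shape, the hashed file name, and the `applyOnce` function are all hypothetical.

```javascript
const fs = require("node:fs");
const path = require("node:path");
const { createHash } = require("node:crypto");

// Hypothetical layout: one JSON record per idempotency key under `dir`.
// Hashing the key keeps arbitrary header values out of file names.
function recordPath(dir, key) {
  const digest = createHash("sha256").update(key).digest("hex");
  return path.join(dir, `${digest}.json`);
}

// Runs `handler` once per key; repeated calls replay the stored response.
function applyOnce(dir, key, handler) {
  const file = recordPath(dir, key);
  if (fs.existsSync(file)) {
    const record = JSON.parse(fs.readFileSync(file, "utf8"));
    return { replayed: true, response: record.response };
  }
  const response = handler();
  fs.mkdirSync(dir, { recursive: true });
  fs.writeFileSync(file, JSON.stringify({ key, response }, null, 2));
  return { replayed: false, response };
}
```

The sketch does not guard against two concurrent requests with the same key; a real implementation would need a lock or an exclusive-create write for that case.
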
## Resource Model

All managed resources are explicit JSON documents.

Each resource document contains:

- `id`
- `resource_type`
- `schema_version`
- `desired_state`
- `current_state`
- `last_applied_state`
- `metadata`
- `created_at`
- `updated_at`

The three-state model is implemented and central:

- `desired_state`: what Skipper wants
- `current_state`: what Skippy or Skipper currently knows to be true
- `last_applied_state`: what Skippy most recently attempted or enforced

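
To show why keeping the three states separate is useful, here is a small sketch that reports drift between `desired_state` and `current_state`. The `diffState` helper is hypothetical and compares top-level keys only; it is not the diffing used by the snapshot code.

```javascript
// Hypothetical helper: lists top-level keys whose desired and current values differ.
function diffState(desired = {}, current = {}) {
  const keys = new Set([...Object.keys(desired), ...Object.keys(current)]);
  const drift = [];
  for (const key of [...keys].sort()) {
    // Comparing JSON strings is a simplification: it is sensitive to the
    // order of keys inside nested objects.
    if (JSON.stringify(desired[key]) !== JSON.stringify(current[key])) {
      drift.push({ key, desired: desired[key] ?? null, current: current[key] ?? null });
    }
  }
  return drift;
}
```

An empty result means the resource has converged. Running the same comparison against `last_applied_state` helps distinguish "never applied" from "applied but drifted".
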
### Implemented Resource Types

The storage layer currently supports these resource types:

- `tenant`
- `node`
- `service`
- `deployment`
- `resource_limits`
- `network`
- `volume`

At the moment, the repository ships example data for:

- `tenant`
- `node`
- `service`

`deployment` resources are created dynamically when a deployment apply request is issued.

### Tenant

Current tenant usage:

- deployment target policy
- service references
- compose project specification

Example:

```json
{
  "id": "example-tenant",
  "resource_type": "tenant",
  "schema_version": "v1",
  "desired_state": {
    "display_name": "Example Tenant",
    "deployment_policy": {
      "target_node_id": "host-1"
    },
    "service_ids": ["service-web"],
    "compose": {
      "tenant_id": "example-tenant",
      "compose_file": "services:\n  web:\n    image: nginx:alpine\n",
      "env": {
        "NGINX_PORT": "8081"
      }
    }
  },
  "current_state": {},
  "last_applied_state": {},
  "metadata": {}
}
```

### Node

Current node usage:

- desired enablement and labels
- heartbeat status
- agent capabilities
- agent version

### Service

Current service usage:

- tenant ownership
- service kind
- image
- network and volume references
- resource limit reference

### Deployment

Current deployment usage:

- created during `POST /v1/deployments/:tenantId/apply`
- tracks deployment status
- tracks associated work order
- stores deployment-oriented desired state

## File-Based Persistence Layout

The current on-disk layout is:

```text
/data
  /resources
    /tenants
    /nodes
    /services
    /deployments
    /resource-limits
    /networks
    /volumes
  /work-orders
    /pending
    /running
    /finished
  /events
    /YYYY-MM-DD
  /logs
    /YYYY-MM-DD
  /snapshots
    /system
    /tenants
  /idempotency
  /auth
    /nodes
```

Rules implemented today:

- one JSON document per state file
- atomic JSON writes
- append-only event and log history
- stable file names derived from resource or work-order IDs

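
The "atomic JSON writes" rule is commonly implemented as write-to-temp-then-rename. The sketch below shows that technique under stated assumptions: `writeJsonAtomic` and `resourcePath` are hypothetical names, and the `<id>.json` file naming is inferred from the layout and the rules above rather than confirmed.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Writes JSON to a temporary sibling file, then renames it over the target.
// A rename within one filesystem is atomic on POSIX systems, so readers see
// either the old document or the new one, never a partial write.
function writeJsonAtomic(filePath, document) {
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  const tmpPath = `${filePath}.${process.pid}.${Date.now()}.tmp`;
  fs.writeFileSync(tmpPath, `${JSON.stringify(document, null, 2)}\n`);
  fs.renameSync(tmpPath, filePath);
}

// Hypothetical path helper: assumes the file name is the resource ID plus `.json`.
function resourcePath(dataDir, collection, id) {
  return path.join(dataDir, "resources", collection, `${id}.json`);
}
```

The temporary file lives in the same directory as the target so that the rename never crosses a filesystem boundary, which would break atomicity.
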
## Work Order Model

Skipper does not send direct commands. It issues declarative work orders.

### Work Order Schema

The implemented work-order model is:

```json
{
  "id": "4b9f5e2a-cf65-4342-97f5-66f3fe5a54f7",
  "resource_type": "work_order",
  "schema_version": "v1",
  "type": "deploy_service",
  "target": {
    "tenant_id": "example-tenant",
    "node_id": "host-1"
  },
  "desired_state": {
    "deployment_id": "deployment-123",
    "tenant_id": "example-tenant",
    "service_ids": ["service-web"],
    "compose_project": {}
  },
  "status": "pending",
  "result": null,
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "created_at": "2026-04-05T12:00:00.000Z",
  "started_at": null,
  "finished_at": null,
  "metadata": {}
}
```

### Status Values

Implemented status values:

- `pending`
- `running`
- `success`
- `failed`

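
The statuses above suggest a small state machine. The sketch below encodes one plausible transition table; the allowed moves are an assumption (this document lists the statuses but not the transitions), and `transition` is a hypothetical helper.

```javascript
// Assumed transition table: pending -> running -> success | failed.
const TRANSITIONS = {
  pending: ["running"],
  running: ["success", "failed"],
  success: [],
  failed: [],
};

// Returns a copy of the work order moved to `nextStatus`, or throws if the move is not allowed.
function transition(workOrder, nextStatus, now = new Date()) {
  const allowed = TRANSITIONS[workOrder.status] || [];
  if (!allowed.includes(nextStatus)) {
    throw new Error(`Cannot move work order from ${workOrder.status} to ${nextStatus}`);
  }
  const timestamp = now.toISOString();
  return {
    ...workOrder,
    status: nextStatus,
    // `started_at` is stamped on claim, `finished_at` on either terminal status.
    started_at: nextStatus === "running" ? timestamp : workOrder.started_at,
    finished_at: nextStatus === "running" ? workOrder.finished_at : timestamp,
  };
}
```

Rejecting moves out of a terminal status is one way to make duplicate result submissions harmless.
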
### Implemented Work Order Type

The current code implements:

- `deploy_service`

This currently reconciles a tenant compose project by:

1. writing `docker-compose.yml`
2. writing `.env`
3. running `docker compose up -d`
4. reporting structured output and state changes

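
The first three steps can be expressed as a pure planning function that decides which files to write and which command to run, leaving the side effects to the caller. This is a sketch, not the code in `skippy-agent/src/modules/docker.js`: `planComposeApply` and `renderEnvFile` are hypothetical, and the `compose_project` shape is assumed to match the tenant `compose` object (`tenant_id`, `compose_file`, `env`).

```javascript
const path = require("node:path");

// Serializes an env map into `.env` file content, one KEY=value pair per line.
function renderEnvFile(env = {}) {
  return Object.entries(env).map(([key, value]) => `${key}=${value}\n`).join("");
}

// Builds the files to write and the command to run, without touching disk or Docker.
// Assumption: `baseDir` is the directory that holds one subdirectory per tenant.
function planComposeApply(composeProject, baseDir) {
  const projectDir = path.posix.join(baseDir, composeProject.tenant_id);
  const composePath = path.posix.join(projectDir, "docker-compose.yml");
  return {
    files: [
      { path: composePath, content: composeProject.compose_file },
      { path: path.posix.join(projectDir, ".env"), content: renderEnvFile(composeProject.env) },
    ],
    command: { program: "docker", args: ["compose", "-f", composePath, "up", "-d"] },
  };
}
```

Because `docker compose up -d` converges on the state described by the compose file, re-running the same plan is safe. Note that `renderEnvFile` does no quoting, so values containing spaces or newlines would need extra handling.
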
### Work Order Result Schema

Work-order results are structured JSON, not free-form text:

```json
{
  "success": true,
  "code": "APPLY_OK",
  "message": "Desired state applied",
  "details": {
    "duration_ms": 8,
    "compose_path": "/opt/skipper/tenants/example-tenant/docker-compose.yml",
    "changed_resources": ["service-web"],
    "unchanged_resources": [],
    "command": {
      "program": "docker",
      "args": ["compose", "-f", "...", "up", "-d"],
      "exit_code": 0,
      "stdout": "...",
      "stderr": ""
    }
  }
}
```

This keeps results machine-readable while still preserving execution detail.

## Reconciliation Flow

The currently implemented flow is:

1. Admin calls `POST /v1/deployments/:tenantId/apply`
2. Skipper loads the tenant resource
3. Skipper creates a deployment resource
4. Skipper creates a `deploy_service` work order targeted at the node in tenant desired state
5. Skippy sends heartbeat and polls `GET /v1/nodes/:nodeId/work-orders/next`
6. Skippy claims and executes the work order
7. Skippy reports `POST /v1/work-orders/:workOrderId/result`
8. Skipper finishes the work order, updates resource state, writes events, and keeps logs/snapshots available

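
Steps 5 to 7 describe one agent cycle, which might look roughly like the sketch below. It is illustrative only: `client` and `execute` are injected, hypothetical dependencies, the empty heartbeat body and the "no work order means `null`" convention are assumptions, and `APPLY_FAILED` is an invented code used only for this example.

```javascript
// One agent cycle: heartbeat, poll for the next work order, execute, report.
async function reconcileOnce(client, nodeId, execute) {
  await client.post(`/v1/nodes/${nodeId}/heartbeat`, {});
  const workOrder = await client.get(`/v1/nodes/${nodeId}/work-orders/next`);
  if (!workOrder) return null; // assumption: nothing queued for this node

  let result;
  try {
    result = await execute(workOrder);
  } catch (err) {
    // Failures are reported as structured results instead of being dropped.
    result = { success: false, code: "APPLY_FAILED", message: err.message, details: {} };
  }
  await client.post(`/v1/work-orders/${workOrder.id}/result`, result);
  return result;
}
```

An agent would call this on a timer. Because the agent initiates every request, the controller never needs inbound connectivity to the node.
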
### Retry Safety

Current retry safety measures:

- deployment apply is idempotent through `x-idempotency-key`
- work-order execution is state-based and convergent at the compose level
- work-order completion is safe against duplicate result submission
- filesystem writes are atomic

## Structured Logging

All implemented operational logs are structured JSON.

Each log entry includes:

- `timestamp`
- `level`
- `service`
- `node_id`
- `tenant_id`
- `request_id`
- `correlation_id`
- `action`
- `result`
- `metadata`

Logs are written to:

- `/data/logs/YYYY-MM-DD/<service>.ndjson`

The logger also redacts common secret-shaped keys such as `token`, `secret`, `password`, and `authorization`.

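
Key-based redaction and NDJSON formatting could be sketched as follows. This is not the code in `shared/logs.js`: the helper names, the substring-matching rule, and the `[REDACTED]` placeholder are assumptions.

```javascript
// Assumption: a key is secret-shaped if it contains one of these words, case-insensitively.
const SECRET_KEY_PATTERN = /token|secret|password|authorization/i;

// Recursively replaces values under secret-shaped keys with a placeholder.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value === null || typeof value !== "object") return value;
  return Object.fromEntries(
    Object.entries(value).map(([key, inner]) =>
      SECRET_KEY_PATTERN.test(key) ? [key, "[REDACTED]"] : [key, redact(inner)]
    )
  );
}

// Serializes one log entry as a single NDJSON line (one JSON object, one trailing newline).
function formatLogLine(entry, now = new Date()) {
  return `${JSON.stringify(redact({ timestamp: now.toISOString(), ...entry }))}\n`;
}
```

Substring matching errs on the side of over-redaction: a harmless key such as `tokenizer` would also be masked.
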
## Event System

Every important state transition in the current flow emits an event.

Implemented event storage:

- `/data/events/YYYY-MM-DD/<timestamp>-<event_id>.json`

Implemented event types in the current code path:

- `resource_created`
- `work_order_created`
- `work_order_started`
- `work_order_succeeded`
- `work_order_failed`
- `deployment_started`
- `deployment_succeeded`
- `deployment_failed`
- `node_heartbeat_received`
- `snapshot_created`

The broader architecture still expects additional event coverage for all future resource mutations.

## State Snapshots

Snapshots are implemented and persisted as JSON documents.

Supported snapshot scopes:

- system
- per-tenant

Implemented endpoints:

- `GET /v1/snapshots/system/latest`
- `GET /v1/snapshots/tenants/:tenantId/latest`

Each snapshot includes:

- `snapshot_id`
- `scope`
- `created_at`
- `request_id`
- `correlation_id`
- `resources`
- `diffs`

Snapshot files are currently stored as:

- `/data/snapshots/system/latest.json`
- `/data/snapshots/tenants/<tenant_id>.json`

## Observability and AI Readiness

The current implementation is AI-ready at the core workflow level because it now preserves:

- request-level tracing across API and agent boundaries
- structured work-order lifecycle data
- historical logs
- historical events
- explicit desired/current/last-applied state
- exportable JSON/NDJSON persistence

For the implemented deployment path, the system can answer:

- what changed
- which work order applied it
- which node applied it
- what desired state was targeted
- what current and last applied state were recorded

## Error Handling

All API errors are structured and envelope-wrapped.

Implemented error shape:

```json
{
  "code": "RESOURCE_NOT_FOUND",
  "message": "Tenant not found",
  "details": {
    "resource_type": "tenant",
    "resource_id": "example-tenant"
  }
}
```

Implemented machine-readable error codes include:

- `INVALID_REQUEST`
- `UNAUTHORIZED`
- `RESOURCE_NOT_FOUND`
- `WORK_ORDER_NOT_CLAIMABLE`
- `INTERNAL_ERROR`

Raw stack traces are not returned in API responses.

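
A sketch of how thrown errors might be mapped to this shape without leaking stack traces is shown below. The `ApiError` class and `toErrorBody` function are hypothetical and not taken from `shared/errors.js`; the HTTP status field and the generic fallback message are assumptions.

```javascript
// Hypothetical error type carrying a machine-readable code and an HTTP status.
class ApiError extends Error {
  constructor(code, message, details = {}, status = 500) {
    super(message);
    this.code = code;
    this.details = details;
    this.status = status;
  }
}

// Converts any thrown value into the public error shape; the stack stays server-side.
function toErrorBody(err) {
  if (err instanceof ApiError) {
    return { code: err.code, message: err.message, details: err.details };
  }
  // Unknown failures are masked so internal details do not reach clients.
  return { code: "INTERNAL_ERROR", message: "Internal server error", details: {} };
}
```

The returned object would go into the `error` field of the response envelope, with `data` set to `null`.
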
## Security Model

### Implemented

- node token authentication
- admin token authentication
- correlation-aware structured logging
- redaction of common secret-shaped log fields

### Not Yet Implemented

- role-based authorization
- secret rotation workflows
- mTLS
- per-resource authorization policies

## Extensibility Model

The code is currently structured so new resource types and work-order types can be added without replacing the whole control flow.

Current extensibility anchors:

- resource storage by `resource_type`
- work-order execution by `type`
- stable response envelope
- versioned schemas
- shared storage and telemetry modules in `/shared`

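
Dispatching work-order execution by `type` can be pictured as a handler registry. The sketch below is a hypothetical illustration of that pattern, not the agent's actual structure; `UNSUPPORTED_WORK_ORDER_TYPE` is an invented code, and the registered handler is a stub.

```javascript
// Hypothetical registry mapping a work-order `type` to its handler function.
const handlers = new Map();

function registerHandler(type, handler) {
  if (handlers.has(type)) throw new Error(`Handler already registered for ${type}`);
  handlers.set(type, handler);
}

// Dispatches by `type`; unknown types yield a structured failure instead of throwing.
function executeWorkOrder(workOrder) {
  const handler = handlers.get(workOrder.type);
  if (!handler) {
    return {
      success: false,
      code: "UNSUPPORTED_WORK_ORDER_TYPE", // invented for this sketch
      message: `No handler for ${workOrder.type}`,
      details: {},
    };
  }
  return handler(workOrder);
}

// Stub handler standing in for the real deploy logic.
registerHandler("deploy_service", (workOrder) => ({
  success: true,
  code: "APPLY_OK",
  message: "Desired state applied",
  details: { changed_resources: workOrder.desired_state.service_ids },
}));
```

Under this pattern, adding a new work-order type means registering one more handler, while the polling and reporting loop stays unchanged.
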
## Implemented Internal Modules

### Shared

- [`shared/context.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/context.js)
- [`shared/errors.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/errors.js)
- [`shared/auth.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/auth.js)
- [`shared/resources.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/resources.js)
- [`shared/work-orders.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/work-orders.js)
- [`shared/logs.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/logs.js)
- [`shared/events.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/events.js)
- [`shared/idempotency.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/idempotency.js)
- [`shared/snapshots.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/snapshots.js)

### Skipper API

- [`skipper-api/src/index.js`](/home/sundown/Projekter/nodeJS/Skipper/skipper-api/src/index.js)

### Skippy Agent

- [`skippy-agent/src/index.js`](/home/sundown/Projekter/nodeJS/Skipper/skippy-agent/src/index.js)
- [`skippy-agent/src/lib/http.js`](/home/sundown/Projekter/nodeJS/Skipper/skippy-agent/src/lib/http.js)
- [`skippy-agent/src/modules/docker.js`](/home/sundown/Projekter/nodeJS/Skipper/skippy-agent/src/modules/docker.js)

## Current Gaps

The code is now aligned with the architecture for the core deployment path, but it is not feature-complete across the full long-term vision.

Not yet implemented:

- full CRUD APIs for all resource types
- generic reconciliation across all future services
- `resource_updated` and `desired_state_changed` event coverage for every mutation path
- persisted state reports for all future resource kinds
- richer diffing beyond snapshot-level desired/current comparisons
- RBAC and richer authorization
- production HTTPS termination inside the app itself
- additional work-order types such as restart, migrate, nginx management, mysql provisioning, and systemd integration

## Current Compliance Summary

Implemented and aligned:

- `/v1` API contract
- request and correlation ID propagation
- envelope-based responses
- structured errors
- declarative work orders
- three-state resource model
- structured JSON logging
- event persistence
- snapshot persistence
- idempotent deployment apply
- token-based controller/agent auth

Still incomplete relative to the full target:

- broader resource coverage
- broader reconciliation coverage
- broader auth model
- full event coverage for every possible state mutation