skipper/docs/architecture.md
2026-04-05 15:28:04 +02:00

# Skipper Architecture
## Purpose
Skipper is a lightweight hosting orchestration system built for clarity, inspectability, and future AI-assisted operations.
- `Skipper` = control plane
- `Skippy` = host agent
- Communication model = HTTPS pull from agent to controller
- Implementation language = Node.js only
- Persistence = file-based JSON storage only
- Deployment target = Docker for all components
This document describes the current implementation in this repository and marks the remaining gaps relative to the broader target architecture.
## Design Principles
- No hidden state
- No implicit resource relationships
- No shell-script driven control flow as the primary orchestration model
- All operations must be idempotent and safe to retry
- All state transitions must be inspectable via API or persisted state
- Logs, events, and state must be sufficient for debugging
- Schemas must be explicit and versioned
- Extensibility is preferred over short-term convenience
- Observability is a first-class requirement
## System Topology
### Skipper
Skipper is the controller and source of truth for desired state.
Current responsibilities:
- Store resource definitions on disk
- Store `desired_state`, `current_state`, and `last_applied_state`
- Create declarative work orders targeted at nodes
- Accept structured work-order results and state reports
- Persist structured logs, events, idempotency records, and snapshots
- Expose a versioned REST API under `/v1`
### Skippy
Skippy is a node-local reconciliation agent.
Current responsibilities:
- Authenticate to Skipper with a node token
- Poll for work orders over HTTPS
- Apply desired state locally through Node.js modules
- Report structured results
- Report updated resource state
- Emit structured JSON logs locally while also driving persisted state changes through the API
## Communication Model
Communication is LAN-first and HTTPS-oriented.
- Agents initiate control-plane communication
- Skipper does not require inbound connectivity to managed nodes
- All implemented API endpoints are versioned under `/v1`
- All implemented requests and responses use JSON
- Authentication is token-based
- Request tracing is propagated with `request_id` and `correlation_id`
The current local development stack uses plain HTTP inside Docker and during smoke tests. The architecture remains HTTPS-first for production deployment.
## API Contract
### Versioning
All implemented control-plane endpoints live under `/v1`.
Implemented endpoints:
- `GET /v1/health`
- `GET /v1/resources`
- `GET /v1/resources/:resourceType/:resourceId`
- `GET /v1/work-orders`
- `GET /v1/work-orders/:workOrderId`
- `GET /v1/nodes/:nodeId/work-orders/next`
- `POST /v1/nodes/:nodeId/heartbeat`
- `POST /v1/work-orders/:workOrderId/result`
- `POST /v1/deployments/:tenantId/apply`
- `GET /v1/snapshots/system/latest`
- `GET /v1/snapshots/tenants/:tenantId/latest`
A compatibility `GET /health` endpoint also exists for simple health checks.
### Request Metadata
Every request is processed with:
- `request_id`
- `correlation_id`
These are accepted from:
- `x-request-id`
- `x-correlation-id`
If absent, Skipper generates them and returns them in both response headers and the response body envelope.
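As a rough illustration of the rule above, the following sketch resolves both IDs from incoming headers and generates any that are missing. The function name `resolveRequestContext` is hypothetical, and generating the two IDs independently of each other is an assumption; the actual logic in `shared/context.js` may differ.

```javascript
const { randomUUID } = require("node:crypto");

// Hypothetical helper (not taken from the repository).
// Node's http module lower-cases incoming header names, so the lookups
// use lower-case keys.
function resolveRequestContext(headers = {}) {
  return {
    // Accept the caller's ID when present, otherwise generate a fresh UUID.
    request_id: headers["x-request-id"] || randomUUID(),
    // Assumption: a missing correlation ID is generated independently.
    correlation_id: headers["x-correlation-id"] || randomUUID(),
  };
}
```

The returned object can then be echoed in response headers and copied into the response envelope.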
### Response Envelope
All API responses use a stable envelope:
```json
{
  "schema_version": "v1",
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "data": {},
  "error": null,
  "metadata": {
    "timestamp": "2026-04-05T12:00:00.000Z"
  }
}
```
Error responses use the same envelope with `data: null`.
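A minimal sketch of how such an envelope could be assembled is shown below. `buildEnvelope` is a hypothetical name, and forcing `data` to `null` whenever an error is present is this sketch's reading of the rule above rather than confirmed repository behavior.

```javascript
// Hypothetical helper (not taken from the repository): wraps either a
// payload or an error object in the documented response envelope.
function buildEnvelope(context, { data = null, error = null } = {}) {
  return {
    schema_version: "v1",
    request_id: context.request_id,
    correlation_id: context.correlation_id,
    // Error responses carry `data: null`, as described above.
    data: error ? null : data,
    error,
    metadata: { timestamp: new Date().toISOString() },
  };
}
```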
### Authentication
Two auth modes are currently implemented:
- admin API requests use `x-admin-token`
- node API requests use `Authorization: Bearer <node-token>`
Node tokens are stored in:
- `/data/auth/nodes/<node_id>.json`
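The node-side check could look roughly like the sketch below. Both function names are hypothetical, the layout of the stored token file is not described here and is therefore not read, and the constant-time comparison is a suggested practice rather than a statement about what `shared/auth.js` does.

```javascript
const { timingSafeEqual } = require("node:crypto");

// Hypothetical helper: pull the token out of `Authorization: Bearer <token>`.
function extractBearerToken(headers = {}) {
  const match = /^Bearer\s+(\S+)$/i.exec(headers.authorization || "");
  return match ? match[1] : null;
}

// Hypothetical helper: compare two tokens without leaking timing information.
// timingSafeEqual throws on unequal lengths, so the length is checked first.
function tokensMatch(presented, expected) {
  const a = Buffer.from(String(presented));
  const b = Buffer.from(String(expected));
  return a.length === b.length && timingSafeEqual(a, b);
}
```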
### Idempotency
Idempotency is currently implemented for deployment apply requests through `x-idempotency-key`.
Persisted idempotency records live under:
- `/data/idempotency`
The main implemented idempotent flow is:
- `POST /v1/deployments/:tenantId/apply`
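One way to picture the idempotent flow is a file-backed "run once per key" wrapper, sketched below. The names `recordPath` and `applyOnce`, the hashed file name, and the record shape `{ key, response }` are all assumptions made for illustration; the real record format under `/data/idempotency` is not specified in this document.

```javascript
const fs = require("node:fs");
const path = require("node:path");
const { createHash } = require("node:crypto");

// Assumption: hash the key so any header value maps to a safe file name.
function recordPath(dir, key) {
  const name = createHash("sha256").update(key).digest("hex");
  return path.join(dir, `${name}.json`);
}

// Hypothetical helper: run `apply` once per key and replay the stored
// response when the same key is seen again.
function applyOnce(dir, key, apply) {
  const file = recordPath(dir, key);
  if (fs.existsSync(file)) {
    const record = JSON.parse(fs.readFileSync(file, "utf8"));
    return { replayed: true, response: record.response };
  }
  const response = apply();
  fs.mkdirSync(dir, { recursive: true });
  fs.writeFileSync(file, JSON.stringify({ key, response }, null, 2));
  return { replayed: false, response };
}
```

This sketch does not guard against two concurrent requests that carry the same key; a real implementation would need an exclusive create or a lock for that case.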
## Resource Model
All managed resources are explicit JSON documents.
Each resource document contains:
- `id`
- `resource_type`
- `schema_version`
- `desired_state`
- `current_state`
- `last_applied_state`
- `metadata`
- `created_at`
- `updated_at`
The three-state model is implemented and central:
- `desired_state`: what Skipper wants
- `current_state`: what Skippy or Skipper currently knows to be true
- `last_applied_state`: what Skippy most recently attempted or enforced
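To make the three-state model concrete, the sketch below lists the top-level keys where `desired_state` and `current_state` disagree. `driftKeys` is a hypothetical helper, and comparing only top-level keys is a simplification chosen for brevity.

```javascript
const { isDeepStrictEqual } = require("node:util");

// Hypothetical helper: report top-level desired_state keys whose value
// differs from the value recorded in current_state.
function driftKeys(resource) {
  const desired = resource.desired_state || {};
  const current = resource.current_state || {};
  return Object.keys(desired).filter(
    (key) => !isDeepStrictEqual(desired[key], current[key])
  );
}
```

An empty result means the resource has converged for every key the controller cares about.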
### Implemented Resource Types
The storage layer currently supports these resource types:
- `tenant`
- `node`
- `service`
- `deployment`
- `resource_limits`
- `network`
- `volume`
At the moment, the repository ships example data for:
- `tenant`
- `node`
- `service`
`deployment` resources are created dynamically when a deployment apply request is issued.
### Tenant
Current tenant usage:
- deployment target policy
- service references
- compose project specification
Example:
```json
{
  "id": "example-tenant",
  "resource_type": "tenant",
  "schema_version": "v1",
  "desired_state": {
    "display_name": "Example Tenant",
    "deployment_policy": {
      "target_node_id": "host-1"
    },
    "service_ids": ["service-web"],
    "compose": {
      "tenant_id": "example-tenant",
      "compose_file": "services:\n  web:\n    image: nginx:alpine\n",
      "env": {
        "NGINX_PORT": "8081"
      }
    }
  },
  "current_state": {},
  "last_applied_state": {},
  "metadata": {}
}
```
### Node
Current node usage:
- desired enablement and labels
- heartbeat status
- agent capabilities
- agent version
### Service
Current service usage:
- tenant ownership
- service kind
- image
- network and volume references
- resource limit reference
### Deployment
Current deployment usage:
- created during `POST /v1/deployments/:tenantId/apply`
- tracks deployment status
- tracks associated work order
- stores deployment-oriented desired state
## File-Based Persistence Layout
The current on-disk layout is:
```text
/data
  /resources
    /tenants
    /nodes
    /services
    /deployments
    /resource-limits
    /networks
    /volumes
  /work-orders
    /pending
    /running
    /finished
  /events
    /YYYY-MM-DD
  /logs
    /YYYY-MM-DD
  /snapshots
    /system
    /tenants
  /idempotency
  /auth
    /nodes
```
Rules implemented today:
- one JSON document per state file
- atomic JSON writes
- append-only event and log history
- stable file names derived from resource or work-order IDs
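The atomic-write rule is commonly implemented by writing to a temporary file and renaming it over the target. The sketch below shows that general technique; `writeJsonAtomic` and the temporary file naming are assumptions, not the repository's code.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: write a JSON document so readers never observe a
// partially written file.
function writeJsonAtomic(filePath, document) {
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  // The temp file lives in the same directory so the rename stays on one
  // filesystem, where rename replaces the target atomically on POSIX systems.
  const tempPath = `${filePath}.${process.pid}.${Date.now()}.tmp`;
  fs.writeFileSync(tempPath, JSON.stringify(document, null, 2) + "\n");
  fs.renameSync(tempPath, filePath);
}
```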
## Work Order Model
Skipper does not send direct commands. It issues declarative work orders.
### Work Order Schema
The implemented work-order model is:
```json
{
  "id": "4b9f5e2a-cf65-4342-97f5-66f3fe5a54f7",
  "resource_type": "work_order",
  "schema_version": "v1",
  "type": "deploy_service",
  "target": {
    "tenant_id": "example-tenant",
    "node_id": "host-1"
  },
  "desired_state": {
    "deployment_id": "deployment-123",
    "tenant_id": "example-tenant",
    "service_ids": ["service-web"],
    "compose_project": {}
  },
  "status": "pending",
  "result": null,
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "created_at": "2026-04-05T12:00:00.000Z",
  "started_at": null,
  "finished_at": null,
  "metadata": {}
}
```
### Status Values
Implemented status values:
- `pending`
- `running`
- `success`
- `failed`
### Implemented Work Order Type
The current code implements:
- `deploy_service`
This currently reconciles a tenant compose project by:
1. writing `docker-compose.yml`
2. writing `.env`
3. running `docker compose up -d`
4. reporting structured output and state changes
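The four steps above can be sketched as follows. `applyComposeProject` and `renderEnv` are hypothetical names, the `.env` rendering does no quoting or escaping, and the `run` parameter is an assumption added so the steps can be exercised without Docker; the actual logic lives in `skippy-agent/src/modules/docker.js` and may differ.

```javascript
const fs = require("node:fs");
const path = require("node:path");
const { execFileSync } = require("node:child_process");

// Render an env object as KEY=value lines (assumes simple values that
// need no quoting).
function renderEnv(env = {}) {
  return Object.entries(env).map(([key, value]) => `${key}=${value}`).join("\n") + "\n";
}

// Hypothetical sketch of the deploy_service steps.
function applyComposeProject(projectDir, compose, run = execFileSync) {
  fs.mkdirSync(projectDir, { recursive: true });
  // Step 1: write docker-compose.yml from the desired state.
  const composePath = path.join(projectDir, "docker-compose.yml");
  fs.writeFileSync(composePath, compose.compose_file);
  // Step 2: write .env next to it.
  fs.writeFileSync(path.join(projectDir, ".env"), renderEnv(compose.env));
  // Step 3: run `docker compose up -d`; execFileSync throws on non-zero exit.
  const args = ["compose", "-f", composePath, "up", "-d"];
  const stdout = String(run("docker", args, { cwd: projectDir }));
  // Step 4: return structured output for the result report.
  return { compose_path: composePath, command: { program: "docker", args, stdout } };
}
```

Passing arguments as an array to `execFileSync` avoids shell interpolation, which fits the principle of not using shell scripts as the primary control flow.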
### Work Order Result Schema
Work-order results are structured JSON, not free-form text:
```json
{
  "success": true,
  "code": "APPLY_OK",
  "message": "Desired state applied",
  "details": {
    "duration_ms": 8,
    "compose_path": "/opt/skipper/tenants/example-tenant/docker-compose.yml",
    "changed_resources": ["service-web"],
    "unchanged_resources": [],
    "command": {
      "program": "docker",
      "args": ["compose", "-f", "...", "up", "-d"],
      "exit_code": 0,
      "stdout": "...",
      "stderr": ""
    }
  }
}
```
This keeps results machine-readable while still preserving execution detail.
## Reconciliation Flow
The currently implemented flow is:
1. Admin calls `POST /v1/deployments/:tenantId/apply`
2. Skipper loads the tenant resource
3. Skipper creates a deployment resource
4. Skipper creates a `deploy_service` work order targeted at the node in tenant desired state
5. Skippy sends heartbeat and polls `GET /v1/nodes/:nodeId/work-orders/next`
6. Skippy claims and executes the work order
7. Skippy reports `POST /v1/work-orders/:workOrderId/result`
8. Skipper finishes the work order, updates resource state, writes events, and keeps logs/snapshots available
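From the agent's side, steps 5 to 7 could be sketched as a single poll iteration like the one below. `pollOnce`, the injectable `fetchImpl`, and the assumption that the envelope's `data` field holds the next work order (or `null` when none is pending) are illustrative only. The heartbeat call is omitted because its payload is not described here.

```javascript
// Hypothetical sketch of one agent poll iteration (Node 18+ for global fetch).
async function pollOnce({ baseUrl, nodeId, token, execute, fetchImpl = fetch }) {
  const headers = {
    authorization: `Bearer ${token}`,
    "content-type": "application/json",
  };

  // Ask the controller for the next work order targeted at this node.
  const next = await fetchImpl(`${baseUrl}/v1/nodes/${nodeId}/work-orders/next`, { headers });
  const { data: workOrder } = await next.json();
  if (!workOrder) return null; // Assumption: null data means nothing to do.

  // Apply the desired state locally, then report the structured result.
  const result = await execute(workOrder);
  await fetchImpl(`${baseUrl}/v1/work-orders/${workOrder.id}/result`, {
    method: "POST",
    headers,
    body: JSON.stringify(result),
  });
  return workOrder.id;
}
```

An agent would call this on a timer and treat network errors as retryable, since the controller never needs to reach the node directly.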
### Retry Safety
Current retry safety measures:
- deployment apply is idempotent through `x-idempotency-key`
- work-order execution is state-based and convergent at the compose level
- work-order completion is safe against duplicate result submission
- filesystem writes are atomic
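One plausible way to make duplicate result submission safe is to ignore results for work orders that already reached a terminal status. The sketch below illustrates that idea; `finishWorkOrder` and the "first result wins" policy are assumptions, not confirmed behavior of the controller.

```javascript
const TERMINAL_STATUSES = new Set(["success", "failed"]);

// Hypothetical helper: apply a result to a work order exactly once.
function finishWorkOrder(workOrder, result, now = new Date().toISOString()) {
  if (TERMINAL_STATUSES.has(workOrder.status)) {
    // Duplicate submission: keep the first recorded result untouched.
    return { applied: false, workOrder };
  }
  return {
    applied: true,
    workOrder: {
      ...workOrder,
      status: result.success ? "success" : "failed",
      result,
      finished_at: now,
    },
  };
}
```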
## Structured Logging
All implemented operational logs are structured JSON.
Each log entry includes:
- `timestamp`
- `level`
- `service`
- `node_id`
- `tenant_id`
- `request_id`
- `correlation_id`
- `action`
- `result`
- `metadata`
Logs are written to:
- `/data/logs/YYYY-MM-DD/<service>.ndjson`
The logger also redacts common secret-shaped keys such as `token`, `secret`, `password`, and `authorization`.
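A minimal sketch of key-based redaction and NDJSON formatting follows. The names `redact` and `formatLogLine`, the `[REDACTED]` placeholder, and the substring matching on key names are assumptions; the actual behavior of `shared/logs.js` may be stricter or looser.

```javascript
// Assumption: any key containing one of these words is treated as secret.
const SECRET_KEY = /token|secret|password|authorization/i;

// Hypothetical helper: recursively replace secret-shaped values.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [
        key,
        SECRET_KEY.test(key) ? "[REDACTED]" : redact(inner),
      ])
    );
  }
  return value;
}

// Hypothetical helper: one JSON object per line, as NDJSON expects.
function formatLogLine(entry) {
  return JSON.stringify(redact({ timestamp: new Date().toISOString(), ...entry })) + "\n";
}
```

Each line produced this way can be appended to the daily `<service>.ndjson` file without rewriting earlier entries.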
## Event System
Every important state transition in the current flow emits an event.
Implemented event storage:
- `/data/events/YYYY-MM-DD/<timestamp>-<event_id>.json`
Implemented event types in the current code path:
- `resource_created`
- `work_order_created`
- `work_order_started`
- `work_order_succeeded`
- `work_order_failed`
- `deployment_started`
- `deployment_succeeded`
- `deployment_failed`
- `node_heartbeat_received`
- `snapshot_created`
The broader architecture still expects additional event coverage for all future resource mutations.
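As an illustration of the storage path above, the sketch below derives an event file path from an event object. `eventFilePath`, the `timestamp` field name, and the way the timestamp is made filename-safe are assumptions; only the overall `YYYY-MM-DD/<timestamp>-<event_id>.json` shape comes from this document.

```javascript
const path = require("node:path");

// Hypothetical helper: build the on-disk path for one event document.
function eventFilePath(root, event) {
  // The first ten characters of an ISO 8601 timestamp are the UTC date.
  const day = event.timestamp.slice(0, 10);
  // Assumption: replace ":" and "." so the name is safe on all filesystems.
  const stamp = event.timestamp.replace(/[:.]/g, "-");
  return path.posix.join(root, "events", day, `${stamp}-${event.event_id}.json`);
}
```

Prefixing the file name with the timestamp keeps a plain directory listing in chronological order.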
## State Snapshots
Snapshots are implemented and persisted as JSON documents.
Supported snapshot scopes:
- system
- per-tenant
Implemented endpoints:
- `GET /v1/snapshots/system/latest`
- `GET /v1/snapshots/tenants/:tenantId/latest`
Each snapshot includes:
- `snapshot_id`
- `scope`
- `created_at`
- `request_id`
- `correlation_id`
- `resources`
- `diffs`
Snapshot files are currently stored as:
- `/data/snapshots/system/latest.json`
- `/data/snapshots/tenants/<tenant_id>.json`
## Observability and AI Readiness
The current implementation supports AI-assisted analysis of the core workflow because it preserves:
- request-level tracing across API and agent boundaries
- structured work-order lifecycle data
- historical logs
- historical events
- explicit desired/current/last-applied state
- exportable JSON/NDJSON persistence
For the implemented deployment path, the system can answer:
- what changed
- which work order applied it
- which node applied it
- what desired state was targeted
- what current and last applied state were recorded
## Error Handling
All API errors are structured and envelope-wrapped.
Implemented error shape:
```json
{
  "code": "RESOURCE_NOT_FOUND",
  "message": "Tenant not found",
  "details": {
    "resource_type": "tenant",
    "resource_id": "example-tenant"
  }
}
```
Implemented machine-readable error codes include:
- `INVALID_REQUEST`
- `UNAUTHORIZED`
- `RESOURCE_NOT_FOUND`
- `WORK_ORDER_NOT_CLAIMABLE`
- `INTERNAL_ERROR`
Raw stack traces are not returned in API responses.
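The rule about stack traces can be sketched as a mapping from thrown errors to the documented error shape. `ApiError` and `toErrorBody` are hypothetical names, and the generic message used for unexpected errors is an assumption; `shared/errors.js` may structure this differently.

```javascript
// Hypothetical error class carrying a machine-readable code and details.
class ApiError extends Error {
  constructor(code, message, details = {}) {
    super(message);
    this.code = code;
    this.details = details;
  }
}

// Hypothetical helper: convert any thrown value into the documented shape.
function toErrorBody(err) {
  if (err instanceof ApiError) {
    return { code: err.code, message: err.message, details: err.details };
  }
  // Unexpected errors are reduced to a generic body so neither the stack
  // trace nor the internal message reaches the client.
  return { code: "INTERNAL_ERROR", message: "Internal server error", details: {} };
}
```

The returned object would be placed in the `error` field of the response envelope.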
## Security Model
### Implemented
- node token authentication
- admin token authentication
- correlation-aware structured logging
- redaction of common secret-shaped log fields
### Not Yet Implemented
- role-based authorization
- secret rotation workflows
- mTLS
- per-resource authorization policies
## Extensibility Model
The code is currently structured so new resource types and work-order types can be added without replacing the whole control flow.
Current extensibility anchors:
- resource storage by `resource_type`
- work-order execution by `type`
- stable response envelope
- versioned schemas
- shared storage and telemetry modules in `/shared`
## Implemented Internal Modules
### Shared
- [`shared/context.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/context.js)
- [`shared/errors.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/errors.js)
- [`shared/auth.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/auth.js)
- [`shared/resources.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/resources.js)
- [`shared/work-orders.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/work-orders.js)
- [`shared/logs.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/logs.js)
- [`shared/events.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/events.js)
- [`shared/idempotency.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/idempotency.js)
- [`shared/snapshots.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/snapshots.js)
### Skipper API
- [`skipper-api/src/index.js`](/home/sundown/Projekter/nodeJS/Skipper/skipper-api/src/index.js)
### Skippy Agent
- [`skippy-agent/src/index.js`](/home/sundown/Projekter/nodeJS/Skipper/skippy-agent/src/index.js)
- [`skippy-agent/src/lib/http.js`](/home/sundown/Projekter/nodeJS/Skipper/skippy-agent/src/lib/http.js)
- [`skippy-agent/src/modules/docker.js`](/home/sundown/Projekter/nodeJS/Skipper/skippy-agent/src/modules/docker.js)
## Current Gaps
The code is now aligned with the architecture for the core deployment path, but it is not feature-complete across the full long-term vision.
Not yet implemented:
- full CRUD APIs for all resource types
- generic reconciliation across all future services
- `resource_updated` and `desired_state_changed` event coverage for every mutation path
- persisted state reports for all future resource kinds
- richer diffing beyond snapshot-level desired/current comparisons
- RBAC and richer authorization
- production HTTPS termination inside the app itself
- additional work-order types such as restart, migrate, nginx management, mysql provisioning, and systemd integration
## Current Compliance Summary
Implemented and aligned:
- `/v1` API contract
- request and correlation ID propagation
- envelope-based responses
- structured errors
- declarative work orders
- three-state resource model
- structured JSON logging
- event persistence
- snapshot persistence
- idempotent deployment apply
- token-based controller/agent auth
Still incomplete relative to the full target:
- broader resource coverage
- broader reconciliation coverage
- broader auth model
- full event coverage for every possible state mutation