# Skipper Architecture

## Purpose

Skipper is a lightweight hosting orchestration system built for clarity, inspectability, and future AI-assisted operations.

- `Skipper` = control plane
- `Skippy` = host agent
- Communication model = HTTPS pull from agent to controller
- Implementation language = Node.js only
- Persistence = file-based JSON storage only
- Deployment target = Docker for all components

This document reflects the current implementation in this repository, while also marking the remaining gaps to the broader target architecture.

## Design Principles

- No hidden state
- No implicit resource relationships
- No shell-script-driven control flow as the primary orchestration model
- All operations must be idempotent and safe to retry
- All state transitions must be inspectable via API or persisted state
- Logs, events, and state must be sufficient for debugging
- Schemas must be explicit and versioned
- Extensibility is preferred over short-term convenience
- Observability is a first-class requirement

## System Topology

### Skipper

Skipper is the controller and the source of truth for desired state.

Current responsibilities:

- Store resource definitions on disk
- Store `desired_state`, `current_state`, and `last_applied_state`
- Create declarative work orders targeted at nodes
- Accept structured work-order results and state reports
- Persist structured logs, events, idempotency records, and snapshots
- Expose a versioned REST API under `/v1`

### Skippy

Skippy is a node-local reconciliation agent.

Current responsibilities:

- Authenticate to Skipper with a node token
- Poll for work orders over HTTPS
- Apply desired state locally through Node.js modules
- Report structured results
- Report updated resource state
- Emit structured JSON logs locally while also driving persisted state changes through the API

## Communication Model

Communication is LAN-first and HTTPS-oriented.
- Agents initiate control-plane communication
- Skipper does not require inbound connectivity to managed nodes
- All implemented API endpoints are versioned under `/v1`
- All implemented requests and responses use JSON
- Authentication is token-based
- Request tracing is propagated with `request_id` and `correlation_id`

The current local development stack uses plain HTTP inside Docker and during smoke tests. The architecture remains HTTPS-first for production deployment.

## API Contract

### Versioning

All implemented control-plane endpoints live under `/v1`.

Implemented endpoints:

- `GET /v1/health`
- `GET /v1/resources`
- `GET /v1/resources/:resourceType/:resourceId`
- `GET /v1/work-orders`
- `GET /v1/work-orders/:workOrderId`
- `GET /v1/nodes/:nodeId/work-orders/next`
- `POST /v1/nodes/:nodeId/heartbeat`
- `POST /v1/work-orders/:workOrderId/result`
- `POST /v1/deployments/:tenantId/apply`
- `GET /v1/snapshots/system/latest`
- `GET /v1/snapshots/tenants/:tenantId/latest`

A compatibility `GET /health` endpoint also exists for simple health checks.

### Request Metadata

Every request is processed with:

- `request_id`
- `correlation_id`

These are accepted from:

- `x-request-id`
- `x-correlation-id`

If absent, Skipper generates them and returns them in both response headers and the response body envelope.

### Response Envelope

All API responses use a stable envelope:

```json
{
  "schema_version": "v1",
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "data": {},
  "error": null,
  "metadata": {
    "timestamp": "2026-04-05T12:00:00.000Z"
  }
}
```

Error responses use the same envelope with `data: null`.

### Authentication

Two auth modes are currently implemented:

- admin API requests use `x-admin-token`
- node API requests use `Authorization: Bearer `

Node tokens are stored in:

- `/data/auth/nodes/.json`

### Idempotency

Idempotency is currently implemented for deployment apply requests through `x-idempotency-key`.
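The dedupe behavior behind `x-idempotency-key` can be sketched as follows. This is an illustrative in-memory model with hypothetical helper names, not the repository's actual code; the real implementation persists records on disk under `/data/idempotency`:

```javascript
// In-memory sketch of idempotent apply handling. A replayed request with a
// known key returns the recorded result instead of re-running the apply.
const records = new Map();

function applyWithIdempotency(key, applyFn) {
  if (records.has(key)) {
    // Known key: return the recorded result, do not re-apply.
    return { replayed: true, result: records.get(key) };
  }
  const result = applyFn();
  records.set(key, result);
  return { replayed: false, result };
}

// First apply runs; the retry with the same key is a safe no-op.
const first = applyWithIdempotency('deploy-abc', () => ({ deployment_id: 'deployment-123' }));
const retry = applyWithIdempotency('deploy-abc', () => ({ deployment_id: 'deployment-456' }));
// retry returns the original result: deployment-123, with replayed === true
```

This is what makes `POST /v1/deployments/:tenantId/apply` safe to retry after a network failure: the client resends with the same key and observes the original outcome.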
Persisted idempotency records live under:

- `/data/idempotency`

The main implemented idempotent flow is:

- `POST /v1/deployments/:tenantId/apply`

## Resource Model

All managed resources are explicit JSON documents.

Each resource document contains:

- `id`
- `resource_type`
- `schema_version`
- `desired_state`
- `current_state`
- `last_applied_state`
- `metadata`
- `created_at`
- `updated_at`

The three-state model is implemented and central:

- `desired_state`: what Skipper wants
- `current_state`: what Skippy or Skipper currently knows to be true
- `last_applied_state`: what Skippy most recently attempted or enforced

### Implemented Resource Types

The storage layer currently supports these resource types:

- `tenant`
- `node`
- `service`
- `deployment`
- `resource_limits`
- `network`
- `volume`

At the moment, the repository ships example data for:

- `tenant`
- `node`
- `service`

`deployment` resources are created dynamically when a deployment apply request is issued.

### Tenant

Current tenant usage:

- deployment target policy
- service references
- compose project specification

Example:

```json
{
  "id": "example-tenant",
  "resource_type": "tenant",
  "schema_version": "v1",
  "desired_state": {
    "display_name": "Example Tenant",
    "deployment_policy": {
      "target_node_id": "host-1"
    },
    "service_ids": ["service-web"],
    "compose": {
      "tenant_id": "example-tenant",
      "compose_file": "services:\n  web:\n    image: nginx:alpine\n",
      "env": {
        "NGINX_PORT": "8081"
      }
    }
  },
  "current_state": {},
  "last_applied_state": {},
  "metadata": {}
}
```

### Node

Current node usage:

- desired enablement and labels
- heartbeat status
- agent capabilities
- agent version

### Service

Current service usage:

- tenant ownership
- service kind
- image
- network and volume references
- resource limit reference

### Deployment

Current deployment usage:

- created during `POST /v1/deployments/:tenantId/apply`
- tracks deployment status
- tracks the associated work order
- stores deployment-oriented desired state

## File-Based Persistence Layout

The current on-disk layout is:

```text
/data
  /resources
    /tenants
    /nodes
    /services
    /deployments
    /resource-limits
    /networks
    /volumes
  /work-orders
    /pending
    /running
    /finished
  /events
    /YYYY-MM-DD
  /logs
    /YYYY-MM-DD
  /snapshots
    /system
    /tenants
  /idempotency
  /auth
    /nodes
```

Rules implemented today:

- one JSON document per state file
- atomic JSON writes
- append-only event and log history
- stable file names derived from resource or work-order IDs

## Work Order Model

Skipper does not send direct commands. It issues declarative work orders.

### Work Order Schema

The implemented work-order model is:

```json
{
  "id": "4b9f5e2a-cf65-4342-97f5-66f3fe5a54f7",
  "resource_type": "work_order",
  "schema_version": "v1",
  "type": "deploy_service",
  "target": {
    "tenant_id": "example-tenant",
    "node_id": "host-1"
  },
  "desired_state": {
    "deployment_id": "deployment-123",
    "tenant_id": "example-tenant",
    "service_ids": ["service-web"],
    "compose_project": {}
  },
  "status": "pending",
  "result": null,
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "created_at": "2026-04-05T12:00:00.000Z",
  "started_at": null,
  "finished_at": null,
  "metadata": {}
}
```

### Status Values

Implemented status values:

- `pending`
- `running`
- `success`
- `failed`

### Implemented Work Order Type

The current code implements:

- `deploy_service`

This currently reconciles a tenant compose project by:

1. writing `docker-compose.yml`
2. writing `.env`
3. running `docker compose up -d`
4. reporting structured output and state changes

### Work Order Result Schema

Work-order results are structured JSON, not free-form text:

```json
{
  "success": true,
  "code": "APPLY_OK",
  "message": "Desired state applied",
  "details": {
    "duration_ms": 8,
    "compose_path": "/opt/skipper/tenants/example-tenant/docker-compose.yml",
    "changed_resources": ["service-web"],
    "unchanged_resources": [],
    "command": {
      "program": "docker",
      "args": ["compose", "-f", "...", "up", "-d"],
      "exit_code": 0,
      "stdout": "...",
      "stderr": ""
    }
  }
}
```

This keeps results machine-readable while still preserving execution detail.

## Reconciliation Flow

The currently implemented flow is:

1. Admin calls `POST /v1/deployments/:tenantId/apply`
2. Skipper loads the tenant resource
3. Skipper creates a deployment resource
4. Skipper creates a `deploy_service` work order targeted at the node in tenant desired state
5. Skippy sends a heartbeat and polls `GET /v1/nodes/:nodeId/work-orders/next`
6. Skippy claims and executes the work order
7. Skippy reports `POST /v1/work-orders/:workOrderId/result`
8. Skipper finishes the work order, updates resource state, writes events, and keeps logs and snapshots available

### Retry Safety

Current retry safety measures:

- deployment apply is idempotent through `x-idempotency-key`
- work-order execution is state-based and convergent at the compose level
- work-order completion is safe against duplicate result submission
- filesystem writes are atomic

## Structured Logging

All implemented operational logs are structured JSON.

Each log entry includes:

- `timestamp`
- `level`
- `service`
- `node_id`
- `tenant_id`
- `request_id`
- `correlation_id`
- `action`
- `result`
- `metadata`

Logs are written to:

- `/data/logs/YYYY-MM-DD/.ndjson`

The logger also redacts common secret-shaped keys such as `token`, `secret`, `password`, and `authorization`.

## Event System

Every important state transition in the current flow emits an event.
Implemented event storage:

- `/data/events/YYYY-MM-DD/-.json`

Implemented event types in the current code path:

- `resource_created`
- `work_order_created`
- `work_order_started`
- `work_order_succeeded`
- `work_order_failed`
- `deployment_started`
- `deployment_succeeded`
- `deployment_failed`
- `node_heartbeat_received`
- `snapshot_created`

The broader architecture still expects additional event coverage for all future resource mutations.

## State Snapshots

Snapshots are implemented and persisted as JSON documents.

Supported snapshot scopes:

- system
- per-tenant

Implemented endpoints:

- `GET /v1/snapshots/system/latest`
- `GET /v1/snapshots/tenants/:tenantId/latest`

Each snapshot includes:

- `snapshot_id`
- `scope`
- `created_at`
- `request_id`
- `correlation_id`
- `resources`
- `diffs`

Snapshot files are currently stored as:

- `/data/snapshots/system/latest.json`
- `/data/snapshots/tenants/.json`

## Observability and AI Readiness

The current implementation is AI-ready at the core workflow level because it preserves:

- request-level tracing across API and agent boundaries
- structured work-order lifecycle data
- historical logs
- historical events
- explicit desired/current/last-applied state
- exportable JSON/NDJSON persistence

For the implemented deployment path, the system can answer:

- what changed
- which work order applied it
- which node applied it
- what desired state was targeted
- what current and last-applied state were recorded

## Error Handling

All API errors are structured and envelope-wrapped.

Implemented error shape:

```json
{
  "code": "RESOURCE_NOT_FOUND",
  "message": "Tenant not found",
  "details": {
    "resource_type": "tenant",
    "resource_id": "example-tenant"
  }
}
```

Implemented machine-readable error codes include:

- `INVALID_REQUEST`
- `UNAUTHORIZED`
- `RESOURCE_NOT_FOUND`
- `WORK_ORDER_NOT_CLAIMABLE`
- `INTERNAL_ERROR`

Raw stack traces are not returned in API responses.
## Security Model

### Implemented

- node token authentication
- admin token authentication
- correlation-aware structured logging
- redaction of common secret-shaped log fields

### Not Yet Implemented

- role-based authorization
- secret rotation workflows
- mTLS
- per-resource authorization policies

## Extensibility Model

The code is currently structured so that new resource types and work-order types can be added without replacing the whole control flow.

Current extensibility anchors:

- resource storage keyed by `resource_type`
- work-order execution dispatched by `type`
- a stable response envelope
- versioned schemas
- shared storage and telemetry modules in `/shared`

## Implemented Internal Modules

### Shared

- [`shared/context.js`](shared/context.js)
- [`shared/errors.js`](shared/errors.js)
- [`shared/auth.js`](shared/auth.js)
- [`shared/resources.js`](shared/resources.js)
- [`shared/work-orders.js`](shared/work-orders.js)
- [`shared/logs.js`](shared/logs.js)
- [`shared/events.js`](shared/events.js)
- [`shared/idempotency.js`](shared/idempotency.js)
- [`shared/snapshots.js`](shared/snapshots.js)

### Skipper API

- [`skipper-api/src/index.js`](skipper-api/src/index.js)

### Skippy Agent

- [`skippy-agent/src/index.js`](skippy-agent/src/index.js)
- [`skippy-agent/src/lib/http.js`](skippy-agent/src/lib/http.js)
- [`skippy-agent/src/modules/docker.js`](skippy-agent/src/modules/docker.js)

## Current Gaps

The code is aligned with the architecture for the core deployment path, but it is not feature-complete across the full long-term vision.

Not yet implemented:

- full CRUD APIs for all resource types
- generic reconciliation across all future services
- `resource_updated` and `desired_state_changed` event coverage for every mutation path
- persisted state reports for all future resource kinds
- richer diffing beyond snapshot-level desired/current comparisons
- RBAC and richer authorization
- production HTTPS termination inside the app itself
- additional work-order types such as restart, migrate, nginx management, mysql provisioning, and systemd integration

## Current Compliance Summary

Implemented and aligned:

- `/v1` API contract
- request and correlation ID propagation
- envelope-based responses
- structured errors
- declarative work orders
- three-state resource model
- structured JSON logging
- event persistence
- snapshot persistence
- idempotent deployment apply
- token-based controller/agent auth

Still incomplete relative to the full target:

- broader resource coverage
- broader reconciliation coverage
- broader auth model
- full event coverage for every possible state mutation
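The three-state resource model listed in the summary above is what makes convergence checks cheap. A minimal sketch of such a check (an illustrative helper, not the repository's actual diff logic, which operates on the full resource documents described in the Resource Model section):

```javascript
// Sketch: a resource needs reconciling when reality drifts from intent, or
// when the last apply attempt targeted something other than current intent.
// Note: JSON.stringify comparison assumes stable key order; a real
// implementation would diff structurally.
function needsReconcile(resource) {
  const desired = JSON.stringify(resource.desired_state);
  return desired !== JSON.stringify(resource.current_state)
      || desired !== JSON.stringify(resource.last_applied_state);
}

// Converged: current and last-applied both match desired state.
const converged = needsReconcile({
  desired_state: { image: 'nginx:alpine' },
  current_state: { image: 'nginx:alpine' },
  last_applied_state: { image: 'nginx:alpine' },
});
// converged === false

// Drifted: desired state changed but the node still runs the old image.
const drifted = needsReconcile({
  desired_state: { image: 'nginx:1.27' },
  current_state: { image: 'nginx:alpine' },
  last_applied_state: { image: 'nginx:1.27' },
});
// drifted === true
```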