Initial commit
docs/architecture.md
@@ -0,0 +1,605 @@
# Skipper Architecture

## Purpose

Skipper is a lightweight hosting orchestration system built for clarity, inspectability, and future AI-assisted operations.

- `Skipper` = control plane
- `Skippy` = host agent
- Communication model = HTTPS pull from agent to controller
- Implementation language = Node.js only
- Persistence = file-based JSON storage only
- Deployment target = Docker for all components

This document describes the current implementation in this repository and marks the remaining gaps relative to the broader target architecture.

## Design Principles

- No hidden state
- No implicit resource relationships
- No shell-script-driven control flow as the primary orchestration model
- All operations must be idempotent and safe to retry
- All state transitions must be inspectable via API or persisted state
- Logs, events, and state must be sufficient for debugging
- Schemas must be explicit and versioned
- Extensibility is preferred over short-term convenience
- Observability is a first-class requirement

## System Topology

### Skipper

Skipper is the controller and source of truth for desired state.

Current responsibilities:

- Store resource definitions on disk
- Store `desired_state`, `current_state`, and `last_applied_state`
- Create declarative work orders targeted at nodes
- Accept structured work-order results and state reports
- Persist structured logs, events, idempotency records, and snapshots
- Expose a versioned REST API under `/v1`

### Skippy

Skippy is a node-local reconciliation agent.

Current responsibilities:

- Authenticate to Skipper with a node token
- Poll for work orders over HTTPS
- Apply desired state locally through Node.js modules
- Report structured results
- Report updated resource state
- Emit structured JSON logs locally while also driving persisted state changes through the API

## Communication Model

Communication is LAN-first and HTTPS-oriented.

- Agents initiate control-plane communication
- Skipper does not require inbound connectivity to managed nodes
- All implemented API endpoints are versioned under `/v1`
- All implemented requests and responses use JSON
- Authentication is token-based
- Request tracing is propagated with `request_id` and `correlation_id`

The current local development stack uses plain HTTP inside Docker and during smoke tests. The architecture remains HTTPS-first for production deployment.

## API Contract

### Versioning

All implemented control-plane endpoints live under `/v1`.

Implemented endpoints:

- `GET /v1/health`
- `GET /v1/resources`
- `GET /v1/resources/:resourceType/:resourceId`
- `GET /v1/work-orders`
- `GET /v1/work-orders/:workOrderId`
- `GET /v1/nodes/:nodeId/work-orders/next`
- `POST /v1/nodes/:nodeId/heartbeat`
- `POST /v1/work-orders/:workOrderId/result`
- `POST /v1/deployments/:tenantId/apply`
- `GET /v1/snapshots/system/latest`
- `GET /v1/snapshots/tenants/:tenantId/latest`

A compatibility `GET /health` endpoint also exists for simple health checks.

### Request Metadata

Every request is processed with:

- `request_id`
- `correlation_id`

These are accepted from:

- `x-request-id`
- `x-correlation-id`

If absent, Skipper generates them and returns them in both response headers and the response body envelope.
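
The header handling described above could look roughly like the sketch below. The function names `resolveRequestContext` and `tracingHeaders` are hypothetical, and reusing the request header names for the response is an assumption; the actual logic in `shared/context.js` may differ.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical helper: take tracing IDs from the incoming headers and
// generate a fresh UUID for any that are missing.
function resolveRequestContext(headers = {}) {
  // Node's http module lower-cases incoming header names.
  const requestId = headers['x-request-id'] || randomUUID();
  const correlationId = headers['x-correlation-id'] || randomUUID();
  return { request_id: requestId, correlation_id: correlationId };
}

// Assumption: the response echoes the IDs under the same header names.
function tracingHeaders(context) {
  return {
    'x-request-id': context.request_id,
    'x-correlation-id': context.correlation_id,
  };
}
```

A caller that supplies only `x-request-id` keeps its own value and receives a generated `correlation_id`.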

### Response Envelope

All API responses use a stable envelope:

```json
{
  "schema_version": "v1",
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "data": {},
  "error": null,
  "metadata": {
    "timestamp": "2026-04-05T12:00:00.000Z"
  }
}
```

Error responses use the same envelope with `data: null`.
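
As an illustration, the envelope could be produced by two small helpers. `successEnvelope` and `errorEnvelope` are hypothetical names chosen for this sketch, not functions confirmed to exist in the repository.

```javascript
// Hypothetical helper: wrap a payload in the documented envelope.
function successEnvelope(context, data) {
  return {
    schema_version: 'v1',
    request_id: context.request_id,
    correlation_id: context.correlation_id,
    data,
    error: null,
    metadata: { timestamp: new Date().toISOString() },
  };
}

// Error responses keep the same shape, with data set to null.
function errorEnvelope(context, error) {
  return { ...successEnvelope(context, null), error };
}
```

Building the error variant from the success variant keeps the two shapes from drifting apart when fields are added.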

### Authentication

Two auth modes are currently implemented:

- admin API requests use `x-admin-token`
- node API requests use `Authorization: Bearer <node-token>`

Node tokens are stored in:

- `/data/auth/nodes/<node_id>.json`
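
A minimal sketch of the two checks follows. It assumes the expected tokens have already been loaded (the layout of the node token file is not specified here, so it is left out), and it uses a constant-time comparison, which is a choice made for this example rather than something the document confirms. All function names are hypothetical.

```javascript
const { timingSafeEqual } = require('node:crypto');

// Constant-time string comparison. The length check comes first because
// timingSafeEqual throws when the buffers differ in length.
function tokensMatch(presented, expected) {
  const a = Buffer.from(String(presented));
  const b = Buffer.from(String(expected));
  return a.length === b.length && timingSafeEqual(a, b);
}

// Admin mode: compare the x-admin-token header with the configured token.
function isAdminRequest(headers, adminToken) {
  const presented = headers['x-admin-token'];
  return typeof presented === 'string' && tokensMatch(presented, adminToken);
}

// Node mode: extract the bearer token and compare it with the stored one.
function isNodeRequest(headers, storedNodeToken) {
  const match = /^Bearer (.+)$/.exec(headers.authorization || '');
  return match !== null && tokensMatch(match[1], storedNodeToken);
}
```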

### Idempotency

Idempotency is currently implemented for deployment apply requests through `x-idempotency-key`.

Persisted idempotency records live under:

- `/data/idempotency`

The main implemented idempotent flow is:

- `POST /v1/deployments/:tenantId/apply`
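
One way to picture a file-backed idempotency check is sketched below. The record layout, the hashed file name, and the `applyOnce` helper are assumptions made for illustration; `dir` stands in for `/data/idempotency`.

```javascript
const fs = require('node:fs');
const path = require('node:path');
const { createHash } = require('node:crypto');

// Run `operation` at most once per key and replay the stored response
// on later calls. The on-disk record layout here is an assumption.
function applyOnce(dir, key, operation) {
  // Hash the key so arbitrary header values map to safe file names.
  const name = createHash('sha256').update(key).digest('hex') + '.json';
  const file = path.join(dir, name);
  if (fs.existsSync(file)) {
    const record = JSON.parse(fs.readFileSync(file, 'utf8'));
    return { replayed: true, response: record.response };
  }
  const response = operation();
  fs.mkdirSync(dir, { recursive: true });
  fs.writeFileSync(file, JSON.stringify({ key, response }, null, 2));
  return { replayed: false, response };
}
```

A retried apply request with the same `x-idempotency-key` would then receive the first response instead of creating a second deployment. This sketch does not guard against two concurrent first requests with the same key.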

## Resource Model

All managed resources are explicit JSON documents.

Each resource document contains:

- `id`
- `resource_type`
- `schema_version`
- `desired_state`
- `current_state`
- `last_applied_state`
- `metadata`
- `created_at`
- `updated_at`

The three-state model is implemented and central:

- `desired_state`: what Skipper wants
- `current_state`: what Skippy or Skipper currently knows to be true
- `last_applied_state`: what Skippy most recently attempted or enforced
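
To make the three-state model concrete, the sketch below lists the top-level keys where `desired_state` and `current_state` disagree. `driftedKeys` is a hypothetical helper; the comparison rules of the real snapshot diffing are not described here and may differ.

```javascript
const { isDeepStrictEqual } = require('node:util');

// Hypothetical helper: return the top-level desired_state keys whose
// value differs from current_state. Keys that exist only in
// current_state are ignored in this sketch.
function driftedKeys(resource) {
  const desired = resource.desired_state || {};
  const current = resource.current_state || {};
  return Object.keys(desired).filter(
    (key) => !isDeepStrictEqual(desired[key], current[key])
  );
}
```

An empty result means the resource has converged for every key that Skipper expressed an opinion about.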

### Implemented Resource Types

The storage layer currently supports these resource types:

- `tenant`
- `node`
- `service`
- `deployment`
- `resource_limits`
- `network`
- `volume`

At the moment, the repository ships example data for:

- `tenant`
- `node`
- `service`

`deployment` resources are created dynamically when a deployment apply request is issued.

### Tenant

Current tenant usage:

- deployment target policy
- service references
- compose project specification

Example:

```json
{
  "id": "example-tenant",
  "resource_type": "tenant",
  "schema_version": "v1",
  "desired_state": {
    "display_name": "Example Tenant",
    "deployment_policy": {
      "target_node_id": "host-1"
    },
    "service_ids": ["service-web"],
    "compose": {
      "tenant_id": "example-tenant",
      "compose_file": "services:\n  web:\n    image: nginx:alpine\n",
      "env": {
        "NGINX_PORT": "8081"
      }
    }
  },
  "current_state": {},
  "last_applied_state": {},
  "metadata": {}
}
```

### Node

Current node usage:

- desired enablement and labels
- heartbeat status
- agent capabilities
- agent version

### Service

Current service usage:

- tenant ownership
- service kind
- image
- network and volume references
- resource limit reference

### Deployment

Current deployment usage:

- created during `POST /v1/deployments/:tenantId/apply`
- tracks deployment status
- tracks associated work order
- stores deployment-oriented desired state

## File-Based Persistence Layout

The current on-disk layout is:

```text
/data
  /resources
    /tenants
    /nodes
    /services
    /deployments
    /resource-limits
    /networks
    /volumes
  /work-orders
    /pending
    /running
    /finished
  /events
    /YYYY-MM-DD
  /logs
    /YYYY-MM-DD
  /snapshots
    /system
    /tenants
  /idempotency
  /auth
    /nodes
```

Rules implemented today:

- one JSON document per state file
- atomic JSON writes
- append-only event and log history
- stable file names derived from resource or work-order IDs
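
Atomic JSON writes are commonly done by writing a temporary file and renaming it over the target. The sketch below shows that pattern; `writeJsonAtomic` is a hypothetical name, and the repository's storage module may implement the rule differently.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Write JSON atomically: write a temp file next to the target, then
// rename it into place. On POSIX filesystems the rename is atomic when
// both paths are on the same filesystem, so readers see either the old
// document or the new one, never a partial write.
function writeJsonAtomic(file, value) {
  fs.mkdirSync(path.dirname(file), { recursive: true });
  const tmp = `${file}.${process.pid}.${Date.now()}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(value, null, 2) + '\n');
  fs.renameSync(tmp, file);
}
```

Keeping the temporary file in the same directory as the target is what keeps the rename on a single filesystem.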

## Work Order Model

Skipper does not send direct commands. It issues declarative work orders.

### Work Order Schema

The implemented work-order model is:

```json
{
  "id": "4b9f5e2a-cf65-4342-97f5-66f3fe5a54f7",
  "resource_type": "work_order",
  "schema_version": "v1",
  "type": "deploy_service",
  "target": {
    "tenant_id": "example-tenant",
    "node_id": "host-1"
  },
  "desired_state": {
    "deployment_id": "deployment-123",
    "tenant_id": "example-tenant",
    "service_ids": ["service-web"],
    "compose_project": {}
  },
  "status": "pending",
  "result": null,
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "created_at": "2026-04-05T12:00:00.000Z",
  "started_at": null,
  "finished_at": null,
  "metadata": {}
}
```

### Status Values

Implemented status values:

- `pending`
- `running`
- `success`
- `failed`
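
The allowed transitions between these statuses are not spelled out above. The sketch below assumes the simplest lifecycle consistent with the `pending`, `running`, and `finished` storage directories: a work order is claimed once and then ends in `success` or `failed`. The table and the `canTransition` helper are assumptions.

```javascript
// Assumed transition table; success and failed are treated as terminal.
const TRANSITIONS = {
  pending: ['running'],
  running: ['success', 'failed'],
  success: [],
  failed: [],
};

// Return true when moving from `from` to `to` is allowed by the table.
function canTransition(from, to) {
  return (TRANSITIONS[from] || []).includes(to);
}
```

Under this assumption, claiming a work order that is already `running` would be rejected, which is the kind of situation the `WORK_ORDER_NOT_CLAIMABLE` error code suggests.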

### Implemented Work Order Type

The current code implements:

- `deploy_service`

This currently reconciles a tenant compose project by:

1. writing `docker-compose.yml`
2. writing `.env`
3. running `docker compose up -d`
4. reporting structured output and state changes
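
The four steps above can be sketched as a single function. This is not the code in `skippy-agent/src/modules/docker.js`: the `deployService` name, the assumption that the compose project carries `compose_file` and `env` fields (as in the tenant example), and the `APPLY_FAILED` code are all illustrative. The command runner is injectable so the flow can be exercised without Docker.

```javascript
const fs = require('node:fs');
const path = require('node:path');
const { spawnSync } = require('node:child_process');

// Default runner: execute the command and capture its output.
function runCommand(program, args) {
  const r = spawnSync(program, args, { encoding: 'utf8' });
  return { program, args, exit_code: r.status, stdout: r.stdout, stderr: r.stderr };
}

function deployService(projectDir, composeProject, run = runCommand) {
  fs.mkdirSync(projectDir, { recursive: true });
  const composePath = path.join(projectDir, 'docker-compose.yml');
  // Steps 1 and 2: materialize the compose file and the .env file.
  fs.writeFileSync(composePath, composeProject.compose_file);
  const envLines = Object.entries(composeProject.env || {}).map(([k, v]) => `${k}=${v}`);
  fs.writeFileSync(path.join(projectDir, '.env'), envLines.join('\n') + '\n');
  // Step 3: converge the project with docker compose.
  const command = run('docker', ['compose', '-f', composePath, 'up', '-d']);
  // Step 4: return a structured result rather than free-form text.
  const success = command.exit_code === 0;
  return {
    success,
    code: success ? 'APPLY_OK' : 'APPLY_FAILED', // APPLY_FAILED is hypothetical
    details: { compose_path: composePath, command },
  };
}
```

Because both files are rewritten from desired state on every run, repeating the same work order converges to the same project on disk.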

### Work Order Result Schema

Work-order results are structured JSON, not free-form text:

```json
{
  "success": true,
  "code": "APPLY_OK",
  "message": "Desired state applied",
  "details": {
    "duration_ms": 8,
    "compose_path": "/opt/skipper/tenants/example-tenant/docker-compose.yml",
    "changed_resources": ["service-web"],
    "unchanged_resources": [],
    "command": {
      "program": "docker",
      "args": ["compose", "-f", "...", "up", "-d"],
      "exit_code": 0,
      "stdout": "...",
      "stderr": ""
    }
  }
}
```

This keeps results machine-readable while still preserving execution detail.

## Reconciliation Flow

The currently implemented flow is:

1. Admin calls `POST /v1/deployments/:tenantId/apply`
2. Skipper loads the tenant resource
3. Skipper creates a deployment resource
4. Skipper creates a `deploy_service` work order targeted at the node in tenant desired state
5. Skippy sends heartbeat and polls `GET /v1/nodes/:nodeId/work-orders/next`
6. Skippy claims and executes the work order
7. Skippy reports `POST /v1/work-orders/:workOrderId/result`
8. Skipper finishes the work order, updates resource state, writes events, and keeps logs/snapshots available
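
Steps 5 to 7 describe the agent side of the loop. A single polling cycle might be sketched as follows, where `api` is a hypothetical object wrapping the three HTTP calls and `execute` applies a work order. Returning `null` when no work order is available, and the `APPLY_FAILED` code, are assumptions of this sketch.

```javascript
// One polling cycle of a hypothetical agent.
async function pollOnce(api, nodeId, execute) {
  // POST /v1/nodes/:nodeId/heartbeat
  await api.heartbeat(nodeId);
  // GET /v1/nodes/:nodeId/work-orders/next
  const workOrder = await api.nextWorkOrder(nodeId);
  if (!workOrder) return null; // nothing to do this cycle
  let result;
  try {
    result = await execute(workOrder);
  } catch (err) {
    // Failures are reported as structured results instead of crashing the loop.
    result = { success: false, code: 'APPLY_FAILED', message: String(err.message) };
  }
  // POST /v1/work-orders/:workOrderId/result
  await api.reportResult(workOrder.id, result);
  return result;
}
```

Passing `api` and `execute` in as parameters keeps the cycle testable without a running controller.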

### Retry Safety

Current retry safety measures:

- deployment apply is idempotent through `x-idempotency-key`
- work-order execution is state-based and convergent at the compose level
- work-order completion is safe against duplicate result submission
- filesystem writes are atomic

## Structured Logging

All implemented operational logs are structured JSON.

Each log entry includes:

- `timestamp`
- `level`
- `service`
- `node_id`
- `tenant_id`
- `request_id`
- `correlation_id`
- `action`
- `result`
- `metadata`

Logs are written to:

- `/data/logs/YYYY-MM-DD/<service>.ndjson`

The logger also redacts common secret-shaped keys such as `token`, `secret`, `password`, and `authorization`.
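
The redaction rule can be illustrated with a small recursive helper. The key pattern, the `[REDACTED]` placeholder, and the function names are assumptions; the logger in `shared/logs.js` may match keys differently.

```javascript
// Assumed pattern: any key containing one of these words is redacted.
const SECRET_KEY_PATTERN = /token|secret|password|authorization/i;

// Recursively replace values stored under secret-shaped keys.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value === null || typeof value !== 'object') return value;
  return Object.fromEntries(
    Object.entries(value).map(([key, inner]) =>
      SECRET_KEY_PATTERN.test(key) ? [key, '[REDACTED]'] : [key, redact(inner)]
    )
  );
}

// Serialize one redacted log entry as a single NDJSON line.
function toLogLine(entry) {
  return JSON.stringify({ timestamp: new Date().toISOString(), ...redact(entry) }) + '\n';
}
```

Each call yields exactly one line, so appending the output to a `<service>.ndjson` file preserves the one-entry-per-line format.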

## Event System

Every important state transition in the current flow emits an event.

Implemented event storage:

- `/data/events/YYYY-MM-DD/<timestamp>-<event_id>.json`

Implemented event types in the current code path:

- `resource_created`
- `work_order_created`
- `work_order_started`
- `work_order_succeeded`
- `work_order_failed`
- `deployment_started`
- `deployment_succeeded`
- `deployment_failed`
- `node_heartbeat_received`
- `snapshot_created`

The broader architecture still expects additional event coverage for all future resource mutations.

## State Snapshots

Snapshots are implemented and persisted as JSON documents.

Supported snapshot scopes:

- system
- per-tenant

Implemented endpoints:

- `GET /v1/snapshots/system/latest`
- `GET /v1/snapshots/tenants/:tenantId/latest`

Each snapshot includes:

- `snapshot_id`
- `scope`
- `created_at`
- `request_id`
- `correlation_id`
- `resources`
- `diffs`

Snapshot files are currently stored as:

- `/data/snapshots/system/latest.json`
- `/data/snapshots/tenants/<tenant_id>.json`

## Observability and AI Readiness

The current implementation is AI-ready at the core workflow level because it now preserves:

- request-level tracing across API and agent boundaries
- structured work-order lifecycle data
- historical logs
- historical events
- explicit desired/current/last-applied state
- exportable JSON/NDJSON persistence

For the implemented deployment path, the system can answer:

- what changed
- which work order applied it
- which node applied it
- what desired state was targeted
- what current and last applied state were recorded

## Error Handling

All API errors are structured and envelope-wrapped.

Implemented error shape:

```json
{
  "code": "RESOURCE_NOT_FOUND",
  "message": "Tenant not found",
  "details": {
    "resource_type": "tenant",
    "resource_id": "example-tenant"
  }
}
```

Implemented machine-readable error codes include:

- `INVALID_REQUEST`
- `UNAUTHORIZED`
- `RESOURCE_NOT_FOUND`
- `WORK_ORDER_NOT_CLAIMABLE`
- `INTERNAL_ERROR`

Raw stack traces are not returned in API responses.
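
One way to guarantee that stack traces never reach a response is to funnel every thrown value through a single converter. The `ApiError` class, the `toErrorBody` function, and the generic `Internal error` message below are hypothetical; the contents of `shared/errors.js` are not shown in this document.

```javascript
// Hypothetical error type carrying a machine-readable code.
class ApiError extends Error {
  constructor(code, message, details = {}) {
    super(message);
    this.code = code;
    this.details = details;
  }
}

// Convert any thrown value into the documented error shape. Unknown
// errors collapse to INTERNAL_ERROR, so their message and stack are
// never copied into the response.
function toErrorBody(err) {
  if (err instanceof ApiError) {
    return { code: err.code, message: err.message, details: err.details };
  }
  return { code: 'INTERNAL_ERROR', message: 'Internal error', details: {} };
}
```

The returned object is what would be placed in the `error` field of the response envelope.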

## Security Model

### Implemented

- node token authentication
- admin token authentication
- correlation-aware structured logging
- redaction of common secret-shaped log fields

### Not Yet Implemented

- role-based authorization
- secret rotation workflows
- mTLS
- per-resource authorization policies

## Extensibility Model

The code is currently structured so new resource types and work-order types can be added without replacing the whole control flow.

Current extensibility anchors:

- resource storage by `resource_type`
- work-order execution by `type`
- stable response envelope
- versioned schemas
- shared storage and telemetry modules in `/shared`
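
Dispatching work-order execution by `type` can be pictured as a handler registry, sketched below. Whether the agent uses a registry, a switch statement, or something else is not stated here; `registerHandler`, `executeWorkOrder`, and the `UNSUPPORTED_TYPE` code are hypothetical.

```javascript
// Registry mapping a work-order `type` to its handler function.
const handlers = new Map();

function registerHandler(type, handler) {
  if (handlers.has(type)) throw new Error(`Handler already registered: ${type}`);
  handlers.set(type, handler);
}

// Look up the handler for the work order's type and run it. Unknown
// types produce a structured failure instead of an exception.
async function executeWorkOrder(workOrder) {
  const handler = handlers.get(workOrder.type);
  if (!handler) {
    return {
      success: false,
      code: 'UNSUPPORTED_TYPE', // hypothetical code
      message: `No handler for ${workOrder.type}`,
    };
  }
  return handler(workOrder);
}
```

With this shape, adding a future work-order type such as a restart would mean registering one more handler, leaving the polling loop untouched.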

## Implemented Internal Modules

### Shared

- [`shared/context.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/context.js)
- [`shared/errors.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/errors.js)
- [`shared/auth.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/auth.js)
- [`shared/resources.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/resources.js)
- [`shared/work-orders.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/work-orders.js)
- [`shared/logs.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/logs.js)
- [`shared/events.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/events.js)
- [`shared/idempotency.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/idempotency.js)
- [`shared/snapshots.js`](/home/sundown/Projekter/nodeJS/Skipper/shared/snapshots.js)

### Skipper API

- [`skipper-api/src/index.js`](/home/sundown/Projekter/nodeJS/Skipper/skipper-api/src/index.js)

### Skippy Agent

- [`skippy-agent/src/index.js`](/home/sundown/Projekter/nodeJS/Skipper/skippy-agent/src/index.js)
- [`skippy-agent/src/lib/http.js`](/home/sundown/Projekter/nodeJS/Skipper/skippy-agent/src/lib/http.js)
- [`skippy-agent/src/modules/docker.js`](/home/sundown/Projekter/nodeJS/Skipper/skippy-agent/src/modules/docker.js)

## Current Gaps

The code is now aligned with the architecture for the core deployment path, but it is not feature-complete across the full long-term vision.

Not yet implemented:

- full CRUD APIs for all resource types
- generic reconciliation across all future services
- `resource_updated` and `desired_state_changed` event coverage for every mutation path
- persisted state reports for all future resource kinds
- richer diffing beyond snapshot-level desired/current comparisons
- RBAC and richer authorization
- production HTTPS termination inside the app itself
- additional work-order types such as restart, migrate, nginx management, mysql provisioning, and systemd integration

## Current Compliance Summary

Implemented and aligned:

- `/v1` API contract
- request and correlation ID propagation
- envelope-based responses
- structured errors
- declarative work orders
- three-state resource model
- structured JSON logging
- event persistence
- snapshot persistence
- idempotent deployment apply
- token-based controller/agent auth

Still incomplete relative to the full target:

- broader resource coverage
- broader reconciliation coverage
- broader auth model
- full event coverage for every possible state mutation