# Skipper Architecture

## Purpose

Skipper is a lightweight hosting orchestration system built for clarity, inspectability, and future AI-assisted operations.

- `Skipper` = control plane
- `Skippy` = host agent
- Communication model = HTTPS pull from agent to controller
- Implementation language = Node.js only
- Persistence = file-based JSON storage only
- Deployment target = Docker for all components

This document reflects the current implementation in this repository, while also marking the remaining gaps to the broader target architecture.

## Design Principles

- No hidden state
- No implicit resource relationships
- No shell-script-driven control flow as the primary orchestration model
- All operations must be idempotent and safe to retry
- All state transitions must be inspectable via API or persisted state
- Logs, events, and state must be sufficient for debugging
- Schemas must be explicit and versioned
- Extensibility is preferred over short-term convenience
- Observability is a first-class requirement

## System Topology

### Skipper

Skipper is the controller and the source of truth for desired state.

Current responsibilities:

- Store resource definitions on disk
- Store `desired_state`, `current_state`, and `last_applied_state`
- Create declarative work orders targeted at nodes
- Accept structured work-order results and state reports
- Persist structured logs, events, idempotency records, and snapshots
- Expose a versioned REST API under `/v1`

### Skippy

Skippy is a node-local reconciliation agent.

Current responsibilities:

- Authenticate to Skipper with a node token
- Poll for work orders over HTTPS
- Apply desired state locally through Node.js modules
- Report structured results
- Report updated resource state
- Emit structured JSON logs locally while also driving persisted state changes through the API

## Communication Model

Communication is LAN-first and HTTPS-oriented.
- Agents initiate control-plane communication
- Skipper does not require inbound connectivity to managed nodes
- All implemented API endpoints are versioned under `/v1`
- All implemented requests and responses use JSON
- Authentication is token-based
- Request tracing is propagated with `request_id` and `correlation_id`

The current local development stack uses plain HTTP inside Docker and during smoke tests. The architecture remains HTTPS-first for production deployment.

## API Contract

### Versioning

All implemented control-plane endpoints live under `/v1`.

Implemented endpoints:

- `GET /v1/health`
- `GET /v1/resources`
- `GET /v1/resources/:resourceType/:resourceId`
- `GET /v1/work-orders`
- `GET /v1/work-orders/:workOrderId`
- `GET /v1/nodes/:nodeId/work-orders/next`
- `POST /v1/nodes/:nodeId/heartbeat`
- `POST /v1/work-orders/:workOrderId/result`
- `POST /v1/deployments/:tenantId/apply`
- `GET /v1/snapshots/system/latest`
- `GET /v1/snapshots/tenants/:tenantId/latest`

A compatibility `GET /health` endpoint also exists for simple health checks.

### Request Metadata

Every request is processed with:

- `request_id`
- `correlation_id`

These are accepted from:

- `x-request-id`
- `x-correlation-id`

If absent, Skipper generates them and returns them in both response headers and the response body envelope.

### Response Envelope

All API responses use a stable envelope:

```json
{
  "schema_version": "v1",
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "data": {},
  "error": null,
  "metadata": {
    "timestamp": "2026-04-05T12:00:00.000Z"
  }
}
```

Error responses use the same envelope with `data: null`.

### Authentication

Two auth modes are currently implemented:

- admin API requests use `x-admin-token`
- node API requests use `Authorization: Bearer `

Node tokens are stored in:

- `/data/auth/nodes/.json`

### Idempotency

Idempotency is currently implemented for deployment apply requests through `x-idempotency-key`.
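The dedupe behavior behind `x-idempotency-key` can be sketched as follows. This is an illustrative in-memory model with hypothetical helper names, not the repository's actual code; the real implementation persists records on disk under `/data/idempotency`:

```javascript
// In-memory sketch of idempotent apply handling. A replayed request with a
// known key returns the recorded result instead of re-running the apply.
const records = new Map();

function applyWithIdempotency(key, applyFn) {
  if (records.has(key)) {
    // Known key: return the recorded result, do not re-apply.
    return { replayed: true, result: records.get(key) };
  }
  const result = applyFn();
  records.set(key, result);
  return { replayed: false, result };
}

// First apply runs; the retry with the same key is a safe no-op.
const first = applyWithIdempotency('deploy-abc', () => ({ deployment_id: 'deployment-123' }));
const retry = applyWithIdempotency('deploy-abc', () => ({ deployment_id: 'deployment-456' }));
// retry returns the original result: deployment-123, with replayed === true
```

This is what makes `POST /v1/deployments/:tenantId/apply` safe to retry after a network failure: the client resends with the same key and observes the original outcome.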
Persisted idempotency records live under:

- `/data/idempotency`

The main implemented idempotent flow is:

- `POST /v1/deployments/:tenantId/apply`

## Resource Model

All managed resources are explicit JSON documents.

Each resource document contains:

- `id`
- `resource_type`
- `schema_version`
- `desired_state`
- `current_state`
- `last_applied_state`
- `metadata`
- `created_at`
- `updated_at`

The three-state model is implemented and central:

- `desired_state`: what Skipper wants
- `current_state`: what Skippy or Skipper currently knows to be true
- `last_applied_state`: what Skippy most recently attempted or enforced

### Implemented Resource Types

The storage layer currently supports these resource types:

- `tenant`
- `node`
- `service`
- `deployment`
- `resource_limits`
- `network`
- `volume`

At the moment, the repository ships example data for:

- `tenant`
- `node`
- `service`

`deployment` resources are created dynamically when a deployment apply request is issued.

### Tenant

Current tenant usage:

- deployment target policy
- service references
- compose project specification

Example:

```json
{
  "id": "example-tenant",
  "resource_type": "tenant",
  "schema_version": "v1",
  "desired_state": {
    "display_name": "Example Tenant",
    "deployment_policy": {
      "target_node_id": "host-1"
    },
    "service_ids": ["service-web"],
    "compose": {
      "tenant_id": "example-tenant",
      "compose_file": "services:\n  web:\n    image: nginx:alpine\n",
      "env": {
        "NGINX_PORT": "8081"
      }
    }
  },
  "current_state": {},
  "last_applied_state": {},
  "metadata": {}
}
```

### Node

Current node usage:

- desired enablement and labels
- heartbeat status
- agent capabilities
- agent version

### Service

Current service usage:

- tenant ownership
- service kind
- image
- network and volume references
- resource limit reference

### Deployment

Current deployment usage:

- created during `POST /v1/deployments/:tenantId/apply`
- tracks deployment status
- tracks the associated work order
- stores deployment-oriented desired state

## File-Based Persistence Layout

The current on-disk layout is:

```text
/data
  /resources
    /tenants
    /nodes
    /services
    /deployments
    /resource-limits
    /networks
    /volumes
  /work-orders
    /pending
    /running
    /finished
  /events
    /YYYY-MM-DD
  /logs
    /YYYY-MM-DD
  /snapshots
    /system
    /tenants
  /idempotency
  /auth
    /nodes
```

Rules implemented today:

- one JSON document per state file
- atomic JSON writes
- append-only event and log history
- stable file names derived from resource or work-order IDs

## Work Order Model

Skipper does not send direct commands. It issues declarative work orders.

### Work Order Schema

The implemented work-order model is:

```json
{
  "id": "4b9f5e2a-cf65-4342-97f5-66f3fe5a54f7",
  "resource_type": "work_order",
  "schema_version": "v1",
  "type": "deploy_service",
  "target": {
    "tenant_id": "example-tenant",
    "node_id": "host-1"
  },
  "desired_state": {
    "deployment_id": "deployment-123",
    "tenant_id": "example-tenant",
    "service_ids": ["service-web"],
    "compose_project": {}
  },
  "status": "pending",
  "result": null,
  "request_id": "6c4a5b1f-7f91-42cc-aef5-5ea4248fb2e8",
  "correlation_id": "a0f84ecf-f8d6-4c4e-97a3-0ed68eb9c95d",
  "created_at": "2026-04-05T12:00:00.000Z",
  "started_at": null,
  "finished_at": null,
  "metadata": {}
}
```

### Status Values

Implemented status values:

- `pending`
- `running`
- `success`
- `failed`

### Implemented Work Order Type

The current code implements:

- `deploy_service`

This currently reconciles a tenant compose project by:

1. writing `docker-compose.yml`
2. writing `.env`
3. running `docker compose up -d`
4. reporting structured output and state changes

### Work Order Result Schema

Work-order results are structured JSON, not free-form text:

```json
{
  "success": true,
  "code": "APPLY_OK",
  "message": "Desired state applied",
  "details": {
    "duration_ms": 8,
    "compose_path": "/opt/skipper/tenants/example-tenant/docker-compose.yml",
    "changed_resources": ["service-web"],
    "unchanged_resources": [],
    "command": {
      "program": "docker",
      "args": ["compose", "-f", "...", "up", "-d"],
      "exit_code": 0,
      "stdout": "...",
      "stderr": ""
    }
  }
}
```

This keeps results machine-readable while still preserving execution detail.

## Reconciliation Flow

The currently implemented flow is:

1. Admin calls `POST /v1/deployments/:tenantId/apply`
2. Skipper loads the tenant resource
3. Skipper creates a deployment resource
4. Skipper creates a `deploy_service` work order targeted at the node in tenant desired state
5. Skippy sends a heartbeat and polls `GET /v1/nodes/:nodeId/work-orders/next`
6. Skippy claims and executes the work order
7. Skippy reports `POST /v1/work-orders/:workOrderId/result`
8. Skipper finishes the work order, updates resource state, writes events, and keeps logs and snapshots available

### Retry Safety

Current retry safety measures:

- deployment apply is idempotent through `x-idempotency-key`
- work-order execution is state-based and convergent at the compose level
- work-order completion is safe against duplicate result submission
- filesystem writes are atomic

## Structured Logging

All implemented operational logs are structured JSON.

Each log entry includes:

- `timestamp`
- `level`
- `service`
- `node_id`
- `tenant_id`
- `request_id`
- `correlation_id`
- `action`
- `result`
- `metadata`

Logs are written to:

- `/data/logs/YYYY-MM-DD/.ndjson`

The logger also redacts common secret-shaped keys such as `token`, `secret`, `password`, and `authorization`.

## Event System

Every important state transition in the current flow emits an event.
Implemented event storage:

- `/data/events/YYYY-MM-DD/-.json`

Implemented event types in the current code path:

- `resource_created`
- `work_order_created`
- `work_order_started`
- `work_order_succeeded`
- `work_order_failed`
- `deployment_started`
- `deployment_succeeded`
- `deployment_failed`
- `node_heartbeat_received`
- `snapshot_created`

The broader architecture still expects additional event coverage for all future resource mutations.

## State Snapshots

Snapshots are implemented and persisted as JSON documents.

Supported snapshot scopes:

- system
- per-tenant

Implemented endpoints:

- `GET /v1/snapshots/system/latest`
- `GET /v1/snapshots/tenants/:tenantId/latest`

Each snapshot includes:

- `snapshot_id`
- `scope`
- `created_at`
- `request_id`
- `correlation_id`
- `resources`
- `diffs`

Snapshot files are currently stored as:

- `/data/snapshots/system/latest.json`
- `/data/snapshots/tenants/.json`

## Observability and AI Readiness

The current implementation is AI-ready at the core workflow level because it preserves:

- request-level tracing across API and agent boundaries
- structured work-order lifecycle data
- historical logs
- historical events
- explicit desired/current/last-applied state
- exportable JSON/NDJSON persistence

For the implemented deployment path, the system can answer:

- what changed
- which work order applied it
- which node applied it
- what desired state was targeted
- what current and last-applied state were recorded

## Error Handling

All API errors are structured and envelope-wrapped.

Implemented error shape:

```json
{
  "code": "RESOURCE_NOT_FOUND",
  "message": "Tenant not found",
  "details": {
    "resource_type": "tenant",
    "resource_id": "example-tenant"
  }
}
```

Implemented machine-readable error codes include:

- `INVALID_REQUEST`
- `UNAUTHORIZED`
- `RESOURCE_NOT_FOUND`
- `WORK_ORDER_NOT_CLAIMABLE`
- `INTERNAL_ERROR`

Raw stack traces are not returned in API responses.
## Security Model

### Implemented

- node token authentication
- admin token authentication
- correlation-aware structured logging
- redaction of common secret-shaped log fields

### Not Yet Implemented

- role-based authorization
- secret rotation workflows
- mTLS
- per-resource authorization policies

## Extensibility Model

The code is currently structured so that new resource types and work-order types can be added without replacing the whole control flow.

Current extensibility anchors:

- resource storage keyed by `resource_type`
- work-order execution dispatched by `type`
- a stable response envelope
- versioned schemas
- shared storage and telemetry modules in `/shared`

## Implemented Internal Modules

### Shared

- [`shared/context.js`](shared/context.js)
- [`shared/errors.js`](shared/errors.js)
- [`shared/auth.js`](shared/auth.js)
- [`shared/resources.js`](shared/resources.js)
- [`shared/work-orders.js`](shared/work-orders.js)
- [`shared/logs.js`](shared/logs.js)
- [`shared/events.js`](shared/events.js)
- [`shared/idempotency.js`](shared/idempotency.js)
- [`shared/snapshots.js`](shared/snapshots.js)

### Skipper API

- [`skipper-api/src/index.js`](skipper-api/src/index.js)

### Skippy Agent

- [`skippy-agent/src/index.js`](skippy-agent/src/index.js)
- [`skippy-agent/src/lib/http.js`](skippy-agent/src/lib/http.js)
- [`skippy-agent/src/modules/docker.js`](skippy-agent/src/modules/docker.js)

## Current Gaps

The code is aligned with the architecture for the core deployment path, but it is not feature-complete across the full long-term vision.

Not yet implemented:

- full CRUD APIs for all resource types
- generic reconciliation across all future services
- `resource_updated` and `desired_state_changed` event coverage for every mutation path
- persisted state reports for all future resource kinds
- richer diffing beyond snapshot-level desired/current comparisons
- RBAC and richer authorization
- production HTTPS termination inside the app itself
- additional work-order types such as restart, migrate, nginx management, mysql provisioning, and systemd integration

## Current Compliance Summary

Implemented and aligned:

- `/v1` API contract
- request and correlation ID propagation
- envelope-based responses
- structured errors
- declarative work orders
- three-state resource model
- structured JSON logging
- event persistence
- snapshot persistence
- idempotent deployment apply
- token-based controller/agent auth

Still incomplete relative to the full target:

- broader resource coverage
- broader reconciliation coverage
- broader auth model
- full event coverage for every possible state mutation
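The three-state resource model listed in the summary above is what makes convergence checks cheap. A minimal sketch of such a check (an illustrative helper, not the repository's actual diff logic, which operates on the full resource documents described in the Resource Model section):

```javascript
// Sketch: a resource needs reconciling when reality drifts from intent, or
// when the last apply attempt targeted something other than current intent.
// Note: JSON.stringify comparison assumes stable key order; a real
// implementation would diff structurally.
function needsReconcile(resource) {
  const desired = JSON.stringify(resource.desired_state);
  return desired !== JSON.stringify(resource.current_state)
      || desired !== JSON.stringify(resource.last_applied_state);
}

// Converged: current and last-applied both match desired state.
const converged = needsReconcile({
  desired_state: { image: 'nginx:alpine' },
  current_state: { image: 'nginx:alpine' },
  last_applied_state: { image: 'nginx:alpine' },
});
// converged === false

// Drifted: desired state changed but the node still runs the old image.
const drifted = needsReconcile({
  desired_state: { image: 'nginx:1.27' },
  current_state: { image: 'nginx:alpine' },
  last_applied_state: { image: 'nginx:1.27' },
});
// drifted === true
```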