From cbef8f8a09b8382567ae3f88dbd5c3eb39412e7b Mon Sep 17 00:00:00 2001 From: Andrei Kvapil Date: Tue, 30 Jun 2026 17:13:32 +0200 Subject: [PATCH 1/3] docs(design-proposals): add ephemeral VM sessions (VMSession) proposal Propose VMSession, an on-demand short-lived isolated VM created by cloning an existing master VM as-is, starting it, and reclaiming it on teardown. Built on existing primitives (vm-disk source.disk clone, vm-instance, tenant-level Cilium isolation). Includes an alternative VMTemplate + templateRef shape and open questions on placement, A<->B isolation, and cross-tenancy. Co-Authored-By: Claude Signed-off-by: Andrei Kvapil --- .../ephemeral-vm-sessions/README.md | 113 ++++++++++++++++++ 1 file changed, 113 insertions(+) create mode 100644 design-proposals/ephemeral-vm-sessions/README.md diff --git a/design-proposals/ephemeral-vm-sessions/README.md b/design-proposals/ephemeral-vm-sessions/README.md new file mode 100644 index 0000000..a9a7e7e --- /dev/null +++ b/design-proposals/ephemeral-vm-sessions/README.md @@ -0,0 +1,113 @@ +# Ephemeral VM sessions + +- **Title:** `Ephemeral VM sessions (VMSession)` +- **Author(s):** `@kvaps` +- **Date:** `2026-06-30` +- **Status:** Draft + +## Overview + +This proposal introduces an on-demand, short-lived isolated VM created by cloning an existing VM. A new resource — working name `VMSession` — clones a designated "master" VM as-is into a fresh, isolated VM, starts it, and reclaims it when the session ends. You keep one VM as the master, configured however you like (image, tooling, packages); each session is a disposable, full clone of it. + +The goal is to offer per-user / per-workspace / per-session isolated environments as a first-class capability, instead of every product hand-rolling clone-and-lifecycle on top of the VM primitives. 
+ +## Context + +Teams building developer-facing products repeatedly need to hand an end user a freshly isolated Linux box, on demand, and throw it away afterwards: + +- **AI coding agents** — each workspace runs *untrusted, machine-generated* code: read/write files, install dependencies from the public internet, run builds and tests. This needs a real escape boundary — the box must not reach the host, the orchestrator, the container runtime socket, internal networks, secrets, or other workspaces. +- **VDI** — a user connects, gets a clone of a desktop master; on disconnect it is destroyed. +- **Interactive playgrounds** (Katacoda/Killercoda-style) — a scenario clones a prepared master, accessible from the browser, destroyed after a time box. + +These are different products with one shared engine: an ephemeral, isolated VM cloned from a master, created on session start and reclaimed on session end. + +### The problem + +Cozystack already ships the building blocks — `vm-disk`, `vm-instance`, CSI disk cloning, tenant-level network isolation — but there is no single resource that ties them into "give me a throwaway clone of this VM and clean it up afterwards". Each consumer reimplements the clone, the boot, the lifecycle, and the teardown. + +## Goals + +- A single resource that clones a designated VM and runs the clone on demand. +- Start the clone even when the master is kept stopped as a template. +- Reclaim the clone and its ephemeral disks on teardown. +- Build on existing primitives; minimize net-new machinery. +- Strong isolation by default — the VM boundary plus the existing tenant network policy. + +### Non-goals + +- The in-VM contract (file access, shell/exec, the agent) — that stays the consuming product's responsibility. +- Warm, sub-second RAM resume with frozen processes — a possible later extension, not part of this proposal. +- Cluster-level or multi-node ephemeral environments — out of scope; this is VM-granularity only. 
+ +## Design + +### The isolation boundary + +The unit of isolation is a virtual machine — its own kernel behind a hardware-virtualization (KVM) boundary, native on bare-metal Cozystack hosts with no nested virtualization. This is microVM-grade and self-hosted / EU-deployable. Container/namespace sandboxes (bubblewrap, even gVisor) are deliberately treated as a weaker boundary; the VM is what this design relies on. + +### Mapping to existing primitives + +The clone-and-run loop is already expressible: + +- `vm-disk` with `source.disk: ` produces a CSI fast-clone of the master's disk. +- `vm-instance` referencing the cloned disk, with `running: true`, boots a VM just like the master — even if the master itself is stopped. +- On teardown, the clone is deleted. + +`VMSession` is the thin wrapper that performs exactly this from a single master reference and reclaims the clone on delete. KubeVirt also ships a native `VirtualMachineClone` that copies a whole VM object at once — an alternative engine under the hood. + +### Reachability + +The session VM is reached through the VM's own Service: `vm-instance` exposes ports with `external: true` and `externalPorts` (e.g. `[22]` for SSH, with public keys injected via cloud-init `nocloud`). A browser web-terminal is just another exposed port. + +### Persistence (optional) + +"Clone as-is" is stateless by default — the clone dies with the session. If a workspace must persist across sessions, its files live on a separate disk that outlives the clone and is re-attached to the next session's clone. This gives resume semantics — the project is where you left it — without a snapshot resource. + +## User-facing changes + +A tenant creates a `VMSession` (or, in the alternative shape below, references a `VMTemplate` from a `VMInstance`), pointing at a master VM, and gets back a running, isolated clone with a known entry point. Deleting the session reclaims the clone. 
There is no change to existing VM workflows; the capability is additive and opt-in. + +## Upgrade and rollback compatibility + +Additive: a new, optional resource. Existing clusters, manifests, and the current `vm-instance` / `vm-disk` apps keep working unchanged. Removing the feature leaves existing VMs untouched; only sessions, which are ephemeral by definition, are affected. + +## Security + +The clone runs behind a VM (KVM) boundary and inherits Cozystack's tenant-level Cilium isolation, because it is a pod in the tenant namespace: by default a tenant may egress to the public internet (`toEntities: world`) but not to other tenants or arbitrary cluster pods, and the kube-apiserver is unreachable unless a workload is explicitly labelled `policy.cozystack.io/allow-to-apiserver`. + +One caveat: this isolation is tenant-scoped, not per-session. The default intra-tenant policy allows all pod-to-pod traffic inside a tenant, so two sessions in the same tenant are not isolated from each other today. Per-session (A↔B) isolation needs either one tenant/namespace per session, or a new per-session network-policy knob (see open questions). + +Accepted residual risk: a session can exfiltrate its own data over the permitted public path — inherent to "internet access is required". + +VMs are configured via cloud-init `nocloud` (user-data and SSH keys delivered through a Secret mounted as a disk); there is no metadata IMDS endpoint in this architecture. + +## Failure and edge cases + +- Master is stopped → the session clone is still created and started (`running: true`). +- Master is running while cloned → the clone is crash-consistent at best; clean clones expect a stopped or quiesced master (see open questions). +- Session deleted → the clone and its ephemeral disks are garbage-collected; a separate persistent workspace disk, if used, is retained. 
+- Two sessions in one tenant → reachable to each other today under the tenant-scoped policy; A↔B isolation is an open design point. + +## Testing + +A manual vertical slice on the existing apps first: clone a master's disk → boot a `vm-instance` with SSH exposed → install dependencies and build → verify the boundary (the kube-apiserver, internal services, and another tenant's session are unreachable; the public internet works) → tear the clone down (optionally re-attach a persistent workspace disk for stateful resume) → measure spin-up. Once the loop holds, the `VMSession` controller gets unit and e2e coverage for clone, start, teardown/GC, and policy. + +## Rollout + +1. Prototype the loop on existing apps with no new CRD, validating isolation and spin-up. +2. Introduce the chosen resource shape (`VMSession`, or `VMTemplate` + `templateRef`). +3. Layer optional persistence and, if wanted, cross-tenancy or a warm pool. + +## Open questions + +1. **Entity shape.** `VMSession` cloning an existing VM, or `VMTemplate` + a `templateRef` on `VMInstance` (see *Alternatives considered*)? +2. **Placement / A↔B granularity.** Shared tenant plus a new per-session network policy, vs one tenant/namespace per session. +3. **Cross-tenancy.** Do we need it at all — a master/template in one tenant cloned into sessions in other tenants, or a single source shared across tenants — and if so, how to implement it (cross-namespace clone permissions, a `cozy-public`-style shared catalog, or sub-tenants)? +4. **Spin-up latency.** Cold clone+boot is seconds; acceptable per session, or is a pre-warmed pool wanted? +5. **Running source.** Is cloning a running master/template (crash-consistency) needed, or is a stopped one enough? 
+ +## Alternatives considered + +- **`VMTemplate` + a reference from `VMInstance`.** Instead of a session resource cloning a live VM, introduce a `VMTemplate` holding the VM settings (instance type, networking, base disks) and a `templateRef` on `VMInstance` that takes its settings from the template and clones the template's disks on creation. This gives a clean separation between an (almost) immutable template and the running clones, closer to VMware-template / image-and-flavor models, and keeps cloning inside the existing `VMInstance` rather than adding a session lifecycle object. The trade-off is a new template concept and a schema change to `VMInstance`, versus `VMSession` where the master is simply an existing VM you can boot and edit in place. Either shape can carry the ephemeral lifecycle; the choice is about whether the source of truth is a dedicated template or an existing VM. +- **Container / userspace sandboxes (bubblewrap, gVisor, Kata).** Weaker or operationally heavier isolation than a full VM; gVisor's userspace kernel and Kata's nested-virtualization requirement make them a worse fit for a strong, bare-metal, no-nesting boundary. +- **Managed microVM offerings (Firecracker-based).** Clean isolation, but not self-hosted and not EU-deployable; this proposal targets a self-hosted equivalent on Cozystack's existing KubeVirt-on-bare-metal substrate. From a1c6e625806f7b9dd38a3f684ef196e27f8840ae Mon Sep 17 00:00:00 2001 From: Andrei Kvapil Date: Tue, 30 Jun 2026 17:38:18 +0200 Subject: [PATCH 2/3] docs(design-proposals): add spec.state lifecycle and SecurityGroup network isolation Add a declarative spec.state (Running/Paused/Stopped) reconciled via KubeVirt runStrategy and the pause/unpause subresource, with an honest note on the warm-resume gap (no suspend-to-disk-and-free on stable KubeVirt). 
Add spec.networkIsolation realised through the merged SecurityGroup (sdn.cozystack.io/v1alpha1), noting that hard deny/A<->B enforcement depends on the planned default-deny tenant baseline. Co-Authored-By: Claude Signed-off-by: Andrei Kvapil --- .../ephemeral-vm-sessions/README.md | 21 ++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/design-proposals/ephemeral-vm-sessions/README.md b/design-proposals/ephemeral-vm-sessions/README.md index a9a7e7e..4ef5bed 100644 --- a/design-proposals/ephemeral-vm-sessions/README.md +++ b/design-proposals/ephemeral-vm-sessions/README.md @@ -55,6 +55,18 @@ The clone-and-run loop is already expressible: `VMSession` is the thin wrapper that performs exactly this from a single master reference and reclaims the clone on delete. KubeVirt also ships a native `VirtualMachineClone` that copies a whole VM object at once — an alternative engine under the hood. +### Lifecycle and desired state + +`VMSession` carries a declarative desired state — `spec.state: Running | Paused | Stopped` — and the controller reconciles the underlying VM to it: + +- **Running** — the clone runs (`vm-instance` `runStrategy: Always`). +- **Paused** — the guest is frozen via KubeVirt's `pause` subresource: vCPUs stop and resume is near-instant. Note that pause keeps the guest RAM resident on the node — node resources are *not* freed. Good for short idle with instant resume, not for scale-to-zero. +- **Stopped** — the VM is halted (`runStrategy: Halted`): node resources are freed, but the guest cold-boots on the next start; process and RAM state are lost, while the disk (and any persistent workspace volume) survive. + +Because pause is a KubeVirt subresource rather than a spec field, the controller invokes `pause`/`unpause` to drive the `Paused` state and uses `runStrategy` for `Running`/`Stopped`. 
+ +**Limitation — warm resume.** Stable KubeVirt has no suspend-to-disk that both frees node resources *and* restores RAM/process state (a Firecracker-style snapshot). `VirtualMachineSnapshot` captures disk only; `memory-dump` is diagnostic and not restorable. So today `VMSession` offers fast-resume-but-resources-held (`Paused`) or resources-freed-but-cold (`Stopped`), not both. True warm, scale-to-zero resume with live processes is a gap that would need upstream KubeVirt work or an external CRIU-style layer; it is out of scope for v1 and tracked as a follow-up. + ### Reachability The session VM is reached through the VM's own Service: `vm-instance` exposes ports with `external: true` and `externalPorts` (e.g. `[22]` for SSH, with public keys injected via cloud-init `nocloud`). A browser web-terminal is just another exposed port. @@ -65,7 +77,7 @@ The session VM is reached through the VM's own Service: `vm-instance` exposes po ## User-facing changes -A tenant creates a `VMSession` (or, in the alternative shape below, references a `VMTemplate` from a `VMInstance`), pointing at a master VM, and gets back a running, isolated clone with a known entry point. Deleting the session reclaims the clone. There is no change to existing VM workflows; the capability is additive and opt-in. +A tenant creates a `VMSession` (or, in the alternative shape below, references a `VMTemplate` from a `VMInstance`), pointing at a master VM, and gets back a running, isolated clone with a known entry point. `spec.state` (`Running` / `Paused` / `Stopped`) controls the clone's lifecycle, and `spec.networkIsolation` opts the session into per-session network rules. Deleting the session reclaims the clone. There is no change to existing VM workflows; the capability is additive and opt-in. ## Upgrade and rollback compatibility @@ -75,7 +87,9 @@ Additive: a new, optional resource. 
Existing clusters, manifests, and the curren The clone runs behind a VM (KVM) boundary and inherits Cozystack's tenant-level Cilium isolation, because it is a pod in the tenant namespace: by default a tenant may egress to the public internet (`toEntities: world`) but not to other tenants or arbitrary cluster pods, and the kube-apiserver is unreachable unless a workload is explicitly labelled `policy.cozystack.io/allow-to-apiserver`. -One caveat: this isolation is tenant-scoped, not per-session. The default intra-tenant policy allows all pod-to-pod traffic inside a tenant, so two sessions in the same tenant are not isolated from each other today. Per-session (A↔B) isolation needs either one tenant/namespace per session, or a new per-session network-policy knob (see open questions). +Per-session isolation is expressed through `spec.networkIsolation`, which the controller realises by creating a `SecurityGroup` (`sdn.cozystack.io/v1alpha1`) attached to the session's VM — the workload-level mechanism for declaring which peers a session may reach (public internet, named apps, CIDRs, FQDNs, other groups). + +One honest caveat: `SecurityGroup` is allow-only and additive, and the current tenant baseline is blanket-allow. So an allow-list does not yet *subtract* from the baseline — hard "deny RFC1918 / deny A↔B" enforcement depends on the platform's planned **default-deny tenant baseline** (future work in the SDN design). The clean end state is a default-deny baseline plus a per-session `SecurityGroup` that opens only the public internet: internet yes; private/internal/metadata/other-sessions no. `VMSession` adopts `SecurityGroup` now, so the intent is captured and ready, and gains full enforcement when default-deny lands; until then, hard A↔B isolation is available by running one tenant/namespace per session. Accepted residual risk: a session can exfiltrate its own data over the permitted public path — inherent to "internet access is required". 
@@ -101,10 +115,11 @@ A manual vertical slice on the existing apps first: clone a master's disk → bo ## Open questions 1. **Entity shape.** `VMSession` cloning an existing VM, or `VMTemplate` + a `templateRef` on `VMInstance` (see *Alternatives considered*)? -2. **Placement / A↔B granularity.** Shared tenant plus a new per-session network policy, vs one tenant/namespace per session. +2. **Placement / A↔B granularity.** One tenant/namespace per session (hard isolation today), or a shared tenant relying on per-session `SecurityGroup`s — which needs the default-deny tenant baseline (SDN future work) to actually enforce. 3. **Cross-tenancy.** Do we need it at all — a master/template in one tenant cloned into sessions in other tenants, or a single source shared across tenants — and if so, how to implement it (cross-namespace clone permissions, a `cozy-public`-style shared catalog, or sub-tenants)? 4. **Spin-up latency.** Cold clone+boot is seconds; acceptable per session, or is a pre-warmed pool wanted? 5. **Running source.** Is cloning a running master/template (crash-consistency) needed, or is a stopped one enough? +6. **Warm resume.** Is `Paused` (instant resume, but node resources stay held) enough, or is true scale-to-zero warm resume with live processes required — which is not available on stable KubeVirt and would need upstream or CRIU-style work? ## Alternatives considered From d78105c2ede33c54ff16f690bf64d3ce1739b79b Mon Sep 17 00:00:00 2001 From: Andrei Kvapil Date: Tue, 30 Jun 2026 17:47:04 +0200 Subject: [PATCH 3/3] docs(design-proposals): ship per-session CiliumNetworkPolicy in the chart Drop the SecurityGroup dependency. Bake a per-session CiliumNetworkPolicy with deny rules into the VMSession chart instead: Cilium deny takes precedence over allow, so deny-private / deny-A<->B enforces regardless of the blanket-allow tenant baseline, with no dependency on a default-deny baseline. Precedent: tenant and cilium-networkpolicy charts. 
Co-Authored-By: Claude Signed-off-by: Andrei Kvapil --- design-proposals/ephemeral-vm-sessions/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/design-proposals/ephemeral-vm-sessions/README.md b/design-proposals/ephemeral-vm-sessions/README.md index 4ef5bed..d08ceab 100644 --- a/design-proposals/ephemeral-vm-sessions/README.md +++ b/design-proposals/ephemeral-vm-sessions/README.md @@ -87,9 +87,9 @@ Additive: a new, optional resource. Existing clusters, manifests, and the curren The clone runs behind a VM (KVM) boundary and inherits Cozystack's tenant-level Cilium isolation, because it is a pod in the tenant namespace: by default a tenant may egress to the public internet (`toEntities: world`) but not to other tenants or arbitrary cluster pods, and the kube-apiserver is unreachable unless a workload is explicitly labelled `policy.cozystack.io/allow-to-apiserver`. -Per-session isolation is expressed through `spec.networkIsolation`, which the controller realises by creating a `SecurityGroup` (`sdn.cozystack.io/v1alpha1`) attached to the session's VM — the workload-level mechanism for declaring which peers a session may reach (public internet, named apps, CIDRs, FQDNs, other groups). +Per-session isolation ships inside the `VMSession` chart as a `CiliumNetworkPolicy` scoped to the session's VM (an `endpointSelector` on the session label), toggled by `spec.networkIsolation`. It allows egress to the public internet and DNS, and uses Cilium **deny rules** (`egressDeny` / `ingressDeny`) to block private and internal ranges (RFC1918, link-local), the cluster pod/service CIDRs and the kube-apiserver, and other sessions in the same tenant (A↔B). -One honest caveat: `SecurityGroup` is allow-only and additive, and the current tenant baseline is blanket-allow. 
So an allow-list does not yet *subtract* from the baseline — hard "deny RFC1918 / deny A↔B" enforcement depends on the platform's planned **default-deny tenant baseline** (future work in the SDN design). The clean end state is a default-deny baseline plus a per-session `SecurityGroup` that opens only the public internet: internet yes; private/internal/metadata/other-sessions no. `VMSession` adopts `SecurityGroup` now, so the intent is captured and ready, and gains full enforcement when default-deny lands; until then, hard A↔B isolation is available by running one tenant/namespace per session. +This holds regardless of the blanket-allow tenant baseline, because Cilium deny rules take precedence over allow rules — so the policy enforces "internet yes; private/internal/metadata/other-sessions no" without depending on a platform-wide default-deny baseline. Shipping the policy in the chart follows existing precedent: the tenant chart and the `cilium-networkpolicy` system package already ship CiliumNetworkPolicies, the latter using `ingressDeny`. Accepted residual risk: a session can exfiltrate its own data over the permitted public path — inherent to "internet access is required". @@ -115,7 +115,7 @@ A manual vertical slice on the existing apps first: clone a master's disk → bo ## Open questions 1. **Entity shape.** `VMSession` cloning an existing VM, or `VMTemplate` + a `templateRef` on `VMInstance` (see *Alternatives considered*)? -2. **Placement / A↔B granularity.** One tenant/namespace per session (hard isolation today), or a shared tenant relying on per-session `SecurityGroup`s — which needs the default-deny tenant baseline (SDN future work) to actually enforce. +2. **Placement / A↔B granularity.** The chart-shipped per-session `CiliumNetworkPolicy` (deny rules) isolates A from B within a shared tenant directly; is that enough, or is one tenant/namespace per session still wanted as a coarser, simpler boundary? 3. 
**Cross-tenancy.** Do we need it at all — a master/template in one tenant cloned into sessions in other tenants, or a single source shared across tenants — and if so, how to implement it (cross-namespace clone permissions, a `cozy-public`-style shared catalog, or sub-tenants)? 4. **Spin-up latency.** Cold clone+boot is seconds; acceptable per session, or is a pre-warmed pool wanted? 5. **Running source.** Is cloning a running master/template (crash-consistency) needed, or is a stopped one enough?
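
Note on the series (not part of the diff): the `spec.state` reconciliation described in patch 2 ("Lifecycle and desired state") can be sketched as a pure decision function. This is a minimal sketch under stated assumptions, not the proposed controller: the names `State`, `Observed`, and `plan`, and the action strings, are hypothetical. Only the `Running` / `Paused` / `Stopped` states, the `runStrategy` values `Always` / `Halted`, and the `pause` / `unpause` subresource calls come from the proposal. It also assumes the guest can only be paused once the VM is actually running, so the function emits at most one step per reconcile pass and otherwise waits.

```python
from dataclasses import dataclass
from enum import Enum


class State(str, Enum):
    """Desired states named in the proposal's spec.state field."""
    RUNNING = "Running"
    PAUSED = "Paused"
    STOPPED = "Stopped"


@dataclass(frozen=True)
class Observed:
    """Hypothetical snapshot of the clone as a controller might observe it."""
    run_strategy: str  # "Always" or "Halted", as named in the proposal
    vmi_running: bool  # the VM has actually booted
    paused: bool       # the guest is currently frozen


def plan(desired: State, obs: Observed) -> list[str]:
    """Return the action(s) for one reconcile pass; an empty list means wait or done."""
    if desired is State.STOPPED:
        # Stopped is driven purely by runStrategy; RAM and process state are lost.
        return [] if obs.run_strategy == "Halted" else ["set runStrategy=Halted"]

    # Running and Paused both require the VM to be started first.
    if obs.run_strategy != "Always":
        return ["set runStrategy=Always"]
    if not obs.vmi_running:
        return []  # still booting: nothing to do until the next pass

    # Pause is a subresource call rather than a spec field.
    if desired is State.PAUSED and not obs.paused:
        return ["call pause"]
    if desired is State.RUNNING and obs.paused:
        return ["call unpause"]
    return []
```

Keeping the decision separate from the API calls makes the state table easy to unit-test before any controller code exists, which fits the rollout's "prototype the loop first" step.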