restore failure cleanup leaks runtime and volume state #2

Open
opened 2026-04-09 04:29:01 +00:00 by harivansh-afk · 0 comments

`RestoreSnapshot` has a post-boot failure window where the restored VM and its new system volume can outlive the control-plane write that was supposed to make them durable.

What happens today:

  • After the VM is already restored and reachable, the code writes the new system volume record and then the machine record (`internal/daemon/snapshot.go:233-284`).
  • If `store.CreateVolume` or `store.CreateMachine` fails, the function returns immediately without stopping the restored VM, removing the copied disk, or rolling back the just-created volume record (`internal/daemon/snapshot.go:257-267`, `internal/daemon/snapshot.go:282-283`).
  • On restart, `reconcileRestore` only removes the machine disk directory and runtime directory when the machine record is missing; it does not delete any leaked volume record or explicitly stop a partially restored VM (`internal/daemon/lifecycle.go:358-367`).

Impact:

  • A partially restored VM can keep running even though `RestoreSnapshot` returned an error.
  • The store can retain a leaked system volume record for a machine that does not exist.
  • Reconcile does not fully heal the partial state, so follow-up restores and deletes can become inconsistent.

Expected behavior:

  • Once the runtime is live, any subsequent persistence failure should trigger compensating cleanup of the VM process, runtime dir, copied disk, and any volume records already created.
  • Reconcile should also delete leaked system-volume records for incomplete restores.

Suggested follow-up:

  • Add rollback after `CreateVolume` and `CreateMachine` failures.
  • Keep the restore operation journal entry until both records are durable and cleanup has succeeded.
  • Add a test that injects store failure after `CreateVolume` and after `CreateMachine` to verify full rollback.
Reference: getcompanion-ai/computer-host#2