restore failure cleanup leaks runtime and volume state #2

Open
opened 2026-04-09 04:29:01 +00:00 by harivansh-afk · 0 comments

`RestoreSnapshot` has a post-boot failure window where the restored VM and its new system volume can outlive the control-plane write that was supposed to make them durable.

What happens today:

  • After the VM is already restored and reachable, the code writes the new system volume record and then the machine record (`internal/daemon/snapshot.go:233-284`).
  • If `store.CreateVolume` or `store.CreateMachine` fails, the function returns immediately without stopping the restored VM, removing the copied disk, or rolling back the just-created volume record (`internal/daemon/snapshot.go:257-267`, `internal/daemon/snapshot.go:282-283`).
  • On restart, `reconcileRestore` only removes the machine disk directory and runtime directory when the machine record is missing; it does not delete any leaked volume record or explicitly stop a partially restored VM (`internal/daemon/lifecycle.go:358-367`).

Impact:

  • A partially restored VM can keep running even though `RestoreSnapshot` returned an error.
  • The store can retain a leaked system volume record for a machine that does not exist.
  • Reconcile does not fully heal the partial state, so follow-up restores and deletes can become inconsistent.

Expected behavior:

  • Once the runtime is live, any subsequent persistence failure should trigger compensating cleanup of the VM process, runtime dir, copied disk, and any volume records already created.
  • Reconcile should also delete leaked system-volume records for incomplete restores.

Suggested follow-up:

  • Add rollback after `CreateVolume` and `CreateMachine` failures.
  • Keep the restore operation journal entry until both records are durable and cleanup has succeeded.
  • Add a test that injects store failure after `CreateVolume` and after `CreateMachine` to verify full rollback.
Reference: getcompanion-ai/computer-host#2