feat(backup): incremental NAS backup support for KVM #13074
jmsperu wants to merge 9 commits into apache:main
Conversation
Adds the design document for incremental NAS backups using QEMU dirty bitmaps and libvirt's backup-begin API. Reduces daily backup storage 80-95% for large VMs. Refs: apache#12899
NASBackupChainKeys defines the keys this provider stores under the existing backup_details kv table (parent_backup_id, bitmap_name, chain_id, chain_position, type). This keeps the backups table provider-agnostic, per the RFC review.
nas.backup.full.every is a zone-scoped ConfigKey that controls how often a full backup is taken; the remaining backups in the cycle are incremental. It counts backups (not days), so it works for hourly, daily, and ad-hoc schedules. Default 10. Set to 1 to disable incrementals (every backup is full).
Refs: apache#12899
Adds four new optional CLI flags to nasbackup.sh:
-M|--mode <full|incremental>
--bitmap-new <name> (checkpoint to create with this backup)
--bitmap-parent <name> (incremental: parent bitmap to read changes since)
--parent-path <path> (incremental: parent backup file for rebase)
Behavior:
- When -M is omitted, behavior is unchanged (legacy full-only, no checkpoint
created), so existing callers are not affected.
- With -M full + --bitmap-new, a full backup is taken AND a libvirt
checkpoint of that name is registered atomically (via backup-begin's
--checkpointxml), giving the next incremental its starting bitmap.
- With -M incremental, libvirt's <incremental> element references the
parent bitmap; only changed blocks are written. After completion,
qemu-img rebase wires the new file to its parent so the chain on the
NAS is self-describing for restore.
- Stopped VMs cannot use backup-begin; if -M incremental is requested
  while the VM is stopped, the script falls back to a full backup and emits
  INCREMENTAL_FALLBACK= on stderr so the orchestrator can record it
  correctly in the chain.
- The script echoes BITMAP_CREATED=<name> on success so the Java caller
can store it under backup_details (NASBackupChainKeys.BITMAP_NAME).
Works across local file, NFS-file, and LINSTOR primary storage. Ceph RBD
running-VM support is a pre-existing limitation of this script, not
affected by this change.
Refs: apache#12899
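To make the mode semantics concrete, here is a rough sketch of the libvirt calls the two modes boil down to. This is not the actual nasbackup.sh code: the domain name, paths, the disk target name (vda) and the checkpoint names are placeholders, the real script emits one <disk> element per attached volume, and it waits for the push-mode backup job to finish.

```bash
#!/bin/bash
# Sketch only (not nasbackup.sh). Either mode is used on its own; both are
# shown back-to-back purely for illustration.

DOM="i-2-1234-VM"
DEST="/mnt/nasbackup/$DOM/2026.04.27.13.45.00"
BITMAP_NEW="backup-$(date +%s)"        # matches the "backup-<epoch>" naming used by the provider
BITMAP_PARENT="backup-1714200000"      # checkpoint registered by the previous backup

# --- -M full --bitmap-new: full backup that also registers a checkpoint ---
# The checkpoint's dirty bitmap starts tracking writes from this point,
# seeding the next incremental.
cat > /tmp/backup.xml <<EOF
<domainbackup>
  <disks>
    <disk name='vda' type='file'>
      <target file='$DEST/root.VOLUUID.qcow2'/>
    </disk>
  </disks>
</domainbackup>
EOF
cat > /tmp/checkpoint.xml <<EOF
<domaincheckpoint>
  <name>$BITMAP_NEW</name>
  <disks>
    <disk name='vda' checkpoint='bitmap'/>
  </disks>
</domaincheckpoint>
EOF
virsh backup-begin "$DOM" /tmp/backup.xml /tmp/checkpoint.xml

# --- -M incremental --bitmap-parent ... --bitmap-new ... ---
# <incremental> names the parent checkpoint, so only blocks dirtied since that
# checkpoint are written; the same checkpoint XML mechanism registers the new one.
cat > /tmp/backup.xml <<EOF
<domainbackup>
  <incremental>$BITMAP_PARENT</incremental>
  <disks>
    <disk name='vda' type='file'>
      <target file='$DEST/root.VOLUUID.qcow2'/>
    </disk>
  </disks>
</domainbackup>
EOF
virsh backup-begin "$DOM" /tmp/backup.xml /tmp/checkpoint.xml
```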
Adds the Java side of the incremental NAS backup feature:
TakeBackupCommand
+ mode, bitmapNew, bitmapParent, parentPath fields (null for legacy
callers — script preserves its existing behaviour when these are
omitted).
BackupAnswer
+ bitmapCreated (echoed by the agent on success)
+ incrementalFallback (true when an incremental was requested but the
agent had to fall back to full because the VM was stopped).
LibvirtTakeBackupCommandWrapper
- Forwards the new fields to nasbackup.sh.
- Strips the new BITMAP_CREATED= / INCREMENTAL_FALLBACK= marker lines
out of stdout before the existing numeric-suffix size parser runs,
so the script can keep the same "size as last line(s)" contract.
- Surfaces both markers on the BackupAnswer.
NASBackupProvider
- decideChain(vm) walks backup_details (chain_id, chain_position,
bitmap_name) for the latest BackedUp backup of the VM and decides:
* Stopped VM -> full (libvirt backup-begin needs running QEMU)
* No prior chain -> full (chain_position=0)
* chain_position+1 >= nas.backup.full.every -> new full
* otherwise -> incremental, parent=last bitmap
- Generates timestamp-based bitmap names ("backup-<epoch>") matching
what the script then registers as the libvirt checkpoint name.
- persistChainMetadata() writes parent_backup_id, bitmap_name,
chain_id, chain_position, type into the existing backup_details
key/value table (per the RFC review — no new columns on backups).
- Honours the agent's INCREMENTAL_FALLBACK= signal: re-records the
backup as a full and starts a fresh chain.
- createBackupObject() now takes a type argument so the BackupVO
reflects the actual decision instead of always being "FULL".
Refs: apache#12899
CloudStack rebuilds the libvirt domain XML on every VM start, which means
persistent QEMU dirty bitmaps don't survive a stop/start cycle. Rather
than hooking into the VM start lifecycle (intrusive across the
orchestration layer), this commit handles the missing bitmap *lazily* at
the next backup attempt:
nasbackup.sh
- When -M incremental is requested, the script first checks
`virsh checkpoint-list` for the parent bitmap. If absent, it
recreates the checkpoint on the running domain so libvirt accepts
the <incremental> reference. The next incremental will be larger
than usual (it captures all writes since recreate, not since the
previous incremental) but is correct; subsequent ones return to
normal size.
- On recreation, emits BITMAP_RECREATED=<name> on stdout for the
orchestrator to record.
BackupAnswer
+ bitmapRecreated field surfaced from the agent.
LibvirtTakeBackupCommandWrapper
- Strips BITMAP_RECREATED= line from stdout before size parsing.
- Sets answer.setBitmapRecreated(...).
NASBackupChainKeys
+ BITMAP_RECREATED key for backup_details.
NASBackupProvider
- When the agent reports a recreated bitmap, persists it under
backup_details and logs an info-level message so operators can
correlate larger-than-usual incrementals with VM restarts.
This satisfies the bitmap-loss-on-VM-restart concern from the RFC review
without touching VirtualMachineManager / StartCommand / agent lifecycle.
Refs: apache#12899
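Roughly, the lazy recreation amounts to the following sketch (placeholder names, single-disk domain assumed; the real script's checkpoint XML and error handling differ):

```bash
#!/bin/bash
# Sketch of the lazy checkpoint check: if the parent checkpoint is gone after a
# VM restart, recreate it on the running domain before attempting the incremental.

DOM="i-2-1234-VM"
PARENT_BITMAP="backup-1714200000"   # placeholder name

if ! virsh checkpoint-list "$DOM" --name | grep -qx "$PARENT_BITMAP"; then
    # Recreate the checkpoint now. The next incremental then covers all writes
    # since this point, so it is larger than usual, as described above.
    cat > /tmp/checkpoint.xml <<EOF
<domaincheckpoint>
  <name>$PARENT_BITMAP</name>
  <disks>
    <disk name='vda' checkpoint='bitmap'/>
  </disks>
</domaincheckpoint>
EOF
    virsh checkpoint-create "$DOM" /tmp/checkpoint.xml
    echo "BITMAP_RECREATED=$PARENT_BITMAP"
fi
```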
Two changes that together let an incremental NAS backup be restored
without manual chain assembly:
scripts/vm/hypervisor/kvm/nasbackup.sh
- qemu-img rebase now writes a backing-file path that is RELATIVE to
the new qcow2's directory (e.g. ../<parent-ts>/root.<uuid>.qcow2)
rather than the absolute path on the current mount point. NAS mount
points are ephemeral (mktemp -d), so an absolute reference would
not resolve when the backup is re-mounted at restore time. Relative
references are resolved by qemu-img against the file's own
directory, so the chain stays valid no matter where the NAS is
mounted next.
- Verifies the parent file exists on the NAS before rebasing.
LibvirtRestoreBackupCommandWrapper
- For file-based primary storage (local, NFS-file), the existing
code rsync'd the source qcow2 to the volume. That copies only the
differential blocks of an incremental, leaving a volume whose
backing-file reference points at a path the primary storage host
doesn't have. Now: detect a backing-chain via qemu-img info JSON
and flatten via 'qemu-img convert -O qcow2', which follows the
chain and produces a self-contained qcow2. Full backups continue
to use rsync (faster, no chain to flatten).
- The block-storage path (RBD/Linstor) already used qemu-img convert
via the QemuImg helper, which auto-flattens chains, so that path
needed no change.
Refs: apache#12899
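As a sketch, with placeholder paths (the actual script derives them from the VM and backup layout), the two qemu-img operations look like this:

```bash
#!/bin/bash
# Sketch of the backup-side relative rebase and the restore-side flatten.

MOUNT=/mnt/nasbackup   # an ephemeral mktemp -d mount point in practice
CHILD="$MOUNT/i-2-1234-VM/2026.04.28.13.45.00/root.VOLUUID.qcow2"
PARENT="$MOUNT/i-2-1234-VM/2026.04.27.13.45.00/root.VOLUUID.qcow2"

# Backup side: record the parent as a path RELATIVE to the child's directory,
# so the chain still resolves when the NAS is mounted somewhere else later.
# -u only rewrites the backing reference; no data is copied.
parent_rel=$(realpath --relative-to="$(dirname "$CHILD")" "$PARENT")
qemu-img rebase -u -b "$parent_rel" -F qcow2 "$CHILD"

# Restore side: flatten the whole chain into a self-contained qcow2, so the
# primary storage host never needs the NAS-side backing files.
qemu-img convert -O qcow2 "$CHILD" /var/lib/libvirt/images/restored-root.qcow2

# The chain the relative reference produces can be inspected with:
qemu-img info --backing-chain "$CHILD"
```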
Adds the delete-with-chain-repair semantics agreed in the RFC review:
scripts/vm/hypervisor/kvm/nasbackup.sh
- New '-o rebase' operation: rebases an existing on-NAS qcow2 onto
a new backing parent. Uses a SAFE rebase (no -u) so the target
absorbs blocks of the about-to-be-deleted parent before the
backing pointer is moved up to the grandparent. Writes the new
backing reference relative to the target's directory so it
survives mount-point changes.
- New CLI flags --rebase-target, --rebase-new-backing (both passed
mount-relative).
RebaseBackupCommand + LibvirtRebaseBackupCommandWrapper
- New agent command that wraps the script's rebase operation. The
provider sends one of these per child that needs re-pointing.
NASBackupProvider.deleteBackup
- Now plans the chain repair before touching files via
computeChainRepair():
  * No chain metadata -> single-file delete (legacy behaviour)
  * Tail incremental -> single delete, no rebase
  * Middle incremental -> rebase immediate child onto our parent, then delete; shift chain_position of all later descendants by -1
  * Full with descendants -> refuse unless forced=true; with forced=true delete full + every descendant newest-first
- Updates parent_backup_id, chain_position metadata in
backup_details after each rebase so the model in the DB matches
the on-disk chain.
This implements the cascade-delete behaviour requested in @abh1sar's
review point apache#7.
Refs: apache#12899
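For illustration, the '-o rebase' repair for the middle-delete case amounts to something like the following sketch with placeholder paths; in the real flow the provider drives this through RebaseBackupCommand rather than invoking the script by hand.

```bash
#!/bin/bash
# Sketch of the safe-rebase repair step for deleting a middle incremental
# (FULL <- INC1 <- INC2, deleting INC1). File names are illustrative.

MOUNT=/mnt/nasbackup/i-2-1234-VM
TARGET="$MOUNT/2026.04.29.13.45.00/root.VOLUUID.qcow2"        # INC2, the surviving child
NEW_BACKING="$MOUNT/2026.04.27.13.45.00/root.VOLUUID.qcow2"   # FULL, the grandparent

# SAFE rebase (no -u): qemu-img copies any blocks that INC2 currently reads
# from INC1 into INC2 itself, then points INC2's backing file at FULL.
# The reference is written relative to INC2's directory so it survives
# mount-point changes.
new_rel=$(realpath --relative-to="$(dirname "$TARGET")" "$NEW_BACKING")
qemu-img rebase -b "$new_rel" -F qcow2 "$TARGET"

# Only after the rebase succeeds is INC1's directory safe to delete.
rm -rf "$MOUNT/2026.04.28.13.45.00"
```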
Adds five new test cases to test_backup_recovery_nas.py covering the
end-to-end behaviour of the incremental NAS backup feature:
* test_incremental_chain_cadence
- Sets nas.backup.full.every=3, takes 5 backups, verifies the
type pattern is FULL, INC, INC, FULL, INC.
* test_restore_from_incremental
- FULL + 2 INCs, each with a marker file. Restores from the
latest INC and verifies all three markers are present
(i.e. qemu-img convert flattened the chain correctly).
* test_delete_middle_incremental_repairs_chain
- Builds FULL, INC1, INC2; deletes INC1 (no force needed);
restores from the surviving INC2 and verifies that markers
from FULL, INC1 (which was deleted), and INC2 are all present
— proving the rebase merged INC1's blocks into INC2.
* test_refuse_delete_full_with_children
- Verifies plain delete of a FULL that has children fails, and
delete with forced=true succeeds and removes the whole chain.
* test_stopped_vm_falls_back_to_full
- Sets cadence to 2, takes one backup (FULL), stops the VM,
triggers another (cadence would say INC). Verifies the second
backup is recorded as FULL because the agent fell back when
backup-begin couldn't run on a stopped VM.
All tests restore nas.backup.full.every to 10 in finally blocks.
Refs: apache#12899
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

    @@ Coverage Diff @@
    ##              main   #13074       +/-   ##
    =============================================
    - Coverage    18.02%    3.52%   -14.51%
    =============================================
      Files         6029      464     -5565
      Lines       542184    40137   -502047
      Branches     66451     7555    -58896
    =============================================
    - Hits         97740     1415    -96325
    + Misses      433428    38534   -394894
    + Partials     11016      188    -10828
@jmsperu can you check the build failure? Thanks.
@jmsperu
Phase 6 added a hasBackingChain() check before rsync that uses qemu-img info to detect chained incrementals. The existing testExecuteWithRsyncFailure test mocks Script.runSimpleBashScriptForExitValue to return 0 for any command, so the new qemu-img info check incorrectly evaluates as "has backing chain" and routes the test through the chain-flatten path instead of rsync — the test then asserts a failure that never occurs. Add a clause to the mock that returns 1 (no backing chain) for the qemu-img info backing-filename probe, so the test continues to exercise the rsync path it was designed for.
@weizhouapache yes — ready for review. @sureshanaparti — apologies, I missed your earlier ping. The build failure was a unit test failure; fixed in d80ed16, so CI should be green on the next run. Cc @abh1sar @JoaoJandre @harikrishna-patnala in case you also want to take a look.
Pull request overview
Adds incremental backup-chain support to the NAS backup provider for KVM by leveraging libvirt backup-begin with checkpoints/dirty-bitmaps, plus restore/flatten and chain-aware delete/repair semantics.
Changes:
- Introduces backup-chain metadata keys (NASBackupChainKeys) and zone-scoped cadence config nas.backup.full.every, with orchestration logic to choose full vs incremental and persist chain details in backup_details.
- Extends the KVM agent + nasbackup.sh to support full-with-checkpoint and incremental-with-rebase, plus a new "rebase" operation used for chain repair during delete.
- Updates restore logic to detect qcow2 backing chains and flatten via qemu-img convert, and adds new integration smoke tests for incremental-chain behavior.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| test/integration/smoke/test_backup_recovery_nas.py | Adds incremental-chain smoke tests (cadence, restore, delete-middle repair, forced delete behavior, stopped-VM fallback). |
| scripts/vm/hypervisor/kvm/nasbackup.sh | Adds mode-aware backup (full/incremental), checkpoint creation, incremental rebase, and a new rebase operation for delete-middle chain repair. |
| plugins/hypervisors/kvm/src/test/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtRestoreBackupCommandWrapperTest.java | Extends restore wrapper tests to exercise the “no backing chain => rsync” path. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtTakeBackupCommandWrapper.java | Passes incremental args to nasbackup.sh and parses bitmap/fallback markers from script output. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtRestoreBackupCommandWrapper.java | Detects qcow2 backing chains and flattens incrementals during restore using qemu-img convert. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtRebaseBackupCommandWrapper.java | New wrapper to run nasbackup.sh -o rebase for chain repair. |
| plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/NASBackupProvider.java | Implements full-vs-incremental decisions, stores chain metadata in backup_details, and adds chain-aware delete/repair logic. |
| plugins/backup/nas/src/main/java/org/apache/cloudstack/backup/NASBackupChainKeys.java | Defines backup_details keys for chain id/position/type/bitmap/parent linkage. |
| docs/rfcs/incremental-nas-backup.md | Adds an RFC document describing incremental NAS backup approach (needs alignment with final implementation). |
| core/src/main/java/org/apache/cloudstack/backup/TakeBackupCommand.java | Adds optional incremental-mode fields (mode/bitmap names/parent path). |
| core/src/main/java/org/apache/cloudstack/backup/RebaseBackupCommand.java | New agent command to rebase a backup qcow2 onto a new backing file for chain repair. |
| core/src/main/java/org/apache/cloudstack/backup/BackupAnswer.java | Adds fields to return bitmap creation/recreation and incremental-fallback markers back to orchestration. |
     * Compose the on-NAS path of a parent backup's root-disk qcow2. Relative to the NAS mount,
     * matches the layout written by {@code nasbackup.sh} ({@code <backupPath>/root.<volUuid>.qcow2}).
     */
    private String composeParentBackupPath(Backup parent) {
        // backupPath is stored as externalId by createBackupObject — e.g. "i-2-1234-VM/2026.04.27.13.45.00".
        // Volume UUID for the root volume is what the script keys backup files on.
        VolumeVO rootVolume = volumeDao.getInstanceRootVolume(parent.getVmId());
        String volUuid = rootVolume == null ? "root" : rootVolume.getUuid();
        return parent.getExternalId() + "/root." + volUuid + ".qcow2";

composeParentBackupPath (and decideChain) only computes a parent qcow2 path for the root volume. However nasbackup.sh exports one qcow2 per disk (root + datadisks), and each disk's incremental qcow2 needs to reference its own parent qcow2 file. As-is, incrementals for VMs with data disks cannot be rebased/flattened correctly. Consider passing a per-disk parent mapping (volUuid->path) or passing the parent backup directory and deriving root.<uuid>.qcow2 / datadisk.<uuid>.qcow2 for each disk.

Suggested change:

     * Compose the on-NAS path of a parent backup. Relative to the NAS mount, this is the backup
     * directory written by {@code nasbackup.sh} (for example
     * {@code <vmBackupDir>/<timestamp>}).
     *
     * The backup layout contains one qcow2 per disk ({@code root.<volUuid>.qcow2} and
     * {@code datadisk.<volUuid>.qcow2}). Returning the parent backup directory instead of a
     * root-disk-specific qcow2 path allows downstream code to derive the correct parent qcow2
     * for each disk during incremental export/rebase/flatten operations.
     */
    private String composeParentBackupPath(Backup parent) {
        // backupPath is stored as externalId by createBackupObject — e.g. "i-2-1234-VM/2026.04.27.13.45.00".
        // Return the parent backup directory so each disk can resolve its own parent qcow2 file.
        return parent.getExternalId();
    VolumeVO rootVolume = volumeDao.getInstanceRootVolume(backup.getVmId());
    String volUuid = rootVolume == null ? "root" : rootVolume.getUuid();
    String childPath = immediateChild.getExternalId() + "/root." + volUuid + ".qcow2";
    String parentPath = ourParent.getExternalId() + "/root." + volUuid + ".qcow2";

    RebaseStep step = new RebaseStep(immediateChild.getId(), childPath, parentPath,
            ourParent.getUuid(), chainPosition(immediateChild) - 1);

    // After we delete the middle backup, every descendant's numeric chain_position
    // becomes stale (off by one). Their backing-file pointers don't need re-writing
    // (only the immediate child changed parents) but their position metadata does.
    return ChainRepairPlan.proceed(
            Collections.singletonList(step),

Chain repair on delete-middle currently rebases only the root qcow2 (.../root.<volUuid>.qcow2). For VMs with additional data disks, the corresponding datadisk.<uuid>.qcow2 layers will still reference the deleted backup, breaking restores. The repair plan needs to generate RebaseSteps per backed-up disk (and delete the per-disk files consistently).

Suggested change:

    List<RebaseStep> rebaseSteps = new ArrayList<>();
    List<VolumeVO> volumes = volumeDao.findByInstance(backup.getVmId());
    if (CollectionUtils.isNotEmpty(volumes)) {
        for (VolumeVO volume : volumes) {
            String diskPrefix = Volume.Type.ROOT.equals(volume.getVolumeType()) ? "root." : "datadisk.";
            String volumeUuid = volume.getUuid();
            if (volumeUuid == null) {
                LOG.warn("Skipping rebase step for volume id={} in backup chain {} because UUID is null",
                        volume.getId(), chainId);
                continue;
            }
            String childPath = immediateChild.getExternalId() + "/" + diskPrefix + volumeUuid + ".qcow2";
            String parentPath = ourParent.getExternalId() + "/" + diskPrefix + volumeUuid + ".qcow2";
            rebaseSteps.add(new RebaseStep(immediateChild.getId(), childPath, parentPath,
                    ourParent.getUuid(), chainPosition(immediateChild) - 1));
        }
    }
    if (rebaseSteps.isEmpty()) {
        VolumeVO rootVolume = volumeDao.getInstanceRootVolume(backup.getVmId());
        String volUuid = rootVolume == null ? "root" : rootVolume.getUuid();
        String childPath = immediateChild.getExternalId() + "/root." + volUuid + ".qcow2";
        String parentPath = ourParent.getExternalId() + "/root." + volUuid + ".qcow2";
        rebaseSteps.add(new RebaseStep(immediateChild.getId(), childPath, parentPath,
                ourParent.getUuid(), chainPosition(immediateChild) - 1));
    }
    // After we delete the middle backup, every descendant's numeric chain_position
    // becomes stale (off by one). Their backing-file pointers don't need re-writing
    // (only the immediate child changed parents) but their position metadata does.
    return ChainRepairPlan.proceed(
            rebaseSteps,
    |-------------------|-------------|------------------------------------------------|
    | `id`              | BIGINT (PK) | Auto-increment primary key                     |
    | `vm_instance_id`  | BIGINT (FK) | The VM this chain belongs to                   |
    | `full_backup_id`  | BIGINT (FK) | The full backup anchoring this chain           |
    | `state`           | VARCHAR(16) | `ACTIVE`, `SEALED`, `EXPIRED`                  |
    | `created`         | DATETIME    | When the chain was started                     |

    **Schema migration** will be provided as a Liquibase changeset, consistent with CloudStack's existing migration framework. The new columns are nullable to maintain backward compatibility with existing backup records.

    #### 5. Management Server Changes

    **`BackupManagerImpl` (orchestration):**

    - Before taking a backup, query the active chain for the VM.
    - If no active chain exists, or the chain has reached `nas.backup.incremental.max.chain` incrementals, or `nas.backup.full.interval` days have elapsed since the last full backup: schedule a full backup and start a new chain.
    - Otherwise: schedule an incremental backup linked to the previous backup in the chain.
    - On backup failure: if the bitmap is suspected corrupt, mark the chain as `SEALED` and force a full backup on the next run.

This RFC section is now inconsistent with the implemented approach in this PR: it describes new backups columns / a backup_chains table and refers to nas.backup.full.interval / nas.backup.incremental.max.chain, but the code stores chain metadata in backup_details (NASBackupChainKeys) and uses nas.backup.full.every. Please update the RFC to reflect the current design to avoid misleading readers.

Suggested change:

    Instead of adding new chain-tracking columns to `backups` or introducing a separate `backup_chains` table, the implementation stores incremental-chain metadata in `backup_details`, keyed by `NASBackupChainKeys`. This keeps the schema aligned with the existing backup metadata model while allowing each backup record to carry the information needed to identify its chain, parent backup, bitmap, and chain state.

    **Schema migration** will be provided as a Liquibase changeset, consistent with CloudStack's existing migration framework. The migration adds the required `backup_details` entries for chain metadata and remains backward-compatible with existing backup records.

    #### 5. Management Server Changes

    **`BackupManagerImpl` (orchestration):**

    - Before taking a backup, query the current chain metadata for the VM from `backup_details` (`NASBackupChainKeys`).
    - If no usable active chain exists, or `nas.backup.full.every` requires a new full backup based on the most recent full backup in the chain: schedule a full backup and start a new chain.
    - Otherwise: schedule an incremental backup linked to the previous backup in the chain.
    - On backup failure: if the bitmap is suspected corrupt, mark the chain metadata as sealed/invalid in `backup_details` and force a full backup on the next run.
    # All tests set nas.backup.full.every to a small value (3) so a chain
    # forms quickly without needing many backup iterations. They restore
    # the original value at teardown.

    def _set_full_every(self, value):
        Configurations.update(self.apiclient, name='nas.backup.full.every',
                              value=str(value), zoneid=self.zone.id)

The incremental tests always reset nas.backup.full.every to 10 in finally, but they never read/preserve the original zone value. If a test environment has a non-default value configured, these tests will leave the zone config changed. Consider capturing the current value once (e.g., via Configurations.list(...) in setup) and restoring that exact value in each finally (or in tearDown).
    # to full and signal the fallback so the orchestrator can record it as a full
    # in the chain.
    if [[ "$MODE" == "incremental" ]]; then
        echo "INCREMENTAL_FALLBACK=full (VM stopped — incremental requires running VM)" >&2

INCREMENTAL_FALLBACK=... is printed to stderr (>&2), but Script.executePipedCommands(...) only captures stdout. As a result, the management server will never see this marker and can't reliably record a stopped-VM backup as FULL / start a new chain. Emit the fallback marker on stdout (or change the Java wrapper to capture stderr too).

Suggested change:

    echo "INCREMENTAL_FALLBACK=full (VM stopped — incremental requires running VM)"
    command.setBitmapParent(decision.bitmapParent);
    command.setParentPath(decision.parentPath);

    if (VirtualMachine.State.Stopped.equals(vm.getState())) {

For stopped VMs decideChain returns fullStart(newBitmap) and takeBackup always sets command.mode="full" / bitmapNew. But nasbackup.sh's stopped-VM path doesn't create a checkpoint/bitmap (no BITMAP_CREATED=), so persisting nas.bitmap_name from the requested bitmap can cause the next backup to attempt an incremental against a bitmap that was never created. Consider clearing mode/bitmapNew (legacy full) for stopped VMs and/or only persisting nas.bitmap_name when the agent confirms it via BITMAP_CREATED=.

Suggested change:

    if (VirtualMachine.State.Stopped.equals(vm.getState())) {
        // Stopped-VM backups use the offline path and do not create checkpoints/bitmaps.
        // Clear chain metadata so a full backup does not imply a bitmap was created.
        command.setMode(null);
        command.setBitmapNew(null);
        command.setBitmapParent(null);
        command.setParentPath(null);
| if [[ "$effective_mode" == "incremental" ]]; then | ||
| volUuid="${fullpath##*/}" | ||
| if [[ "$fullpath" == /dev/drbd/by-res/* ]]; then | ||
| volUuid=$(get_linstor_uuid_from_path "$fullpath") | ||
| fi | ||
| # PARENT_PATH from the orchestrator is the parent backup's path relative to the | ||
| # NAS mount root (e.g. "i-2-X/2026.04.27.12.00.00/root.UUID.qcow2"). Convert it to | ||
| # a path relative to THIS new qcow2's directory so the backing reference resolves | ||
| # correctly the next time the NAS is mounted (mount points are ephemeral). | ||
| local parent_abs="$mount_point/$PARENT_PATH" | ||
| if [[ ! -f "$parent_abs" ]]; then | ||
| echo "Parent backup file does not exist on NAS: $parent_abs" | ||
| cleanup | ||
| exit 1 | ||
| fi | ||
| local parent_rel | ||
| parent_rel=$(realpath --relative-to="$dest" "$parent_abs") | ||
| if ! qemu-img rebase -u -b "$parent_rel" -F qcow2 "$dest/$name.$volUuid.qcow2" >> "$logFile" 2> >(cat >&2); then | ||
| echo "qemu-img rebase failed for $dest/$name.$volUuid.qcow2 onto $parent_rel" |
There was a problem hiding this comment.
In incremental mode, each exported qcow2 (root.* and datadisk.*) is rebased onto the same PARENT_PATH. For VMs with multiple volumes this will rebase data-disk incrementals onto the root-disk parent file, corrupting the chain for non-root volumes. PARENT_PATH needs to be per-disk (or pass the parent backup directory and derive the correct parent filename for each volUuid/disk role).
    // Strip out our incremental marker lines before parsing size, so the legacy
    // numeric-suffix parser keeps working.
    String stdout = result.second().trim();
    String bitmapCreated = null;
    String bitmapRecreated = null;
    boolean incrementalFallback = false;
    StringBuilder filtered = new StringBuilder();
    for (String line : stdout.split("\n")) {
        String trimmed = line.trim();
        if (trimmed.startsWith("BITMAP_CREATED=")) {
            bitmapCreated = trimmed.substring("BITMAP_CREATED=".length());
            continue;
        }
        if (trimmed.startsWith("BITMAP_RECREATED=")) {
            bitmapRecreated = trimmed.substring("BITMAP_RECREATED=".length());
            continue;
        }
        if (trimmed.startsWith("INCREMENTAL_FALLBACK=")) {
            incrementalFallback = true;
            continue;
        }

This wrapper tries to detect INCREMENTAL_FALLBACK= from result.second(), but Script.executePipedCommands(...) only returns stdout from the process pipeline (it does not capture stderr). Since nasbackup.sh currently prints the fallback marker to stderr, incrementalFallback will always remain false here. Align the script/wrapper so the marker is emitted on stdout (or update the execution helper to merge/capture stderr).
| echo "Incremental backup options (running VMs only; requires QEMU >= 4.2 and libvirt >= 7.2):" | ||
| echo " -M|--mode full Take a full backup AND create a checkpoint (--bitmap-new required) for future incrementals." | ||
| echo " -M|--mode incremental Take an incremental backup since --bitmap-parent and create new checkpoint --bitmap-new." | ||
| echo " Requires --bitmap-parent, --bitmap-new, and --parent-path (parent backup file for rebase)." | ||
| echo " Without -M, behaves as legacy full-only backup with no checkpoint creation." |
There was a problem hiding this comment.
The usage text says “Without -M, behaves as legacy full-only backup with no checkpoint creation”, but the script still runs sanity_checks unconditionally (QEMU>=4.2/libvirt>=7.2) even for legacy callers and non-backup ops (delete/stats/rebase). To preserve the documented legacy behavior/backward compatibility, gate the version checks to only the code paths that actually require backup-begin/incremental features (or only when MODE is set).
    # RFC: Incremental NAS Backup Support for KVM Hypervisor

    | Field         | Value                                      |
    |---------------|--------------------------------------------|
    | **Author**    | James Peru, Xcobean Systems Limited        |
    | **Status**    | Draft                                      |
    | **Created**   | 2026-03-27                                 |
    | **Target**    | Apache CloudStack 4.23+                    |
    | **Component** | Backup & Recovery (NAS Backup Provider)    |

@jmsperu, thanks for the PR.
We usually add the design documents/specs/RFCs in the project's wiki [1] or as a separate issue in GitHub [2].

Footnotes
1. See KBOSS spec, for example.
2. See Quota: custom tariffs #5891, for example.
Summary
Implements incremental backup support for the NAS backup provider on KVM, using QEMU dirty bitmaps and libvirt's backup-begin API. RFC: #12899. For large VMs this reduces daily backup storage 80–95% and shortens backup windows from hours to minutes (e.g. a 500 GB VM with moderate writes goes from ~500 GB/day to ~5–15 GB/day after the initial full backup).
What's in the PR
- f2a9202d74: docs/rfcs/incremental-nas-backup.md
- 1981469099: NASBackupChainKeys constants + zone-scoped nas.backup.full.every ConfigKey (default 10)
- fbb916b254: nasbackup.sh mode-aware: full+checkpoint or incremental+rebase via backup-begin
- 1f2aebca36: backup_details
- 43e2f7504a
- 39303fbf88: qemu-img convert flatten for file-based primary
- b8d069e127: RebaseBackupCommand, chain repair for delete-middle, refuse-delete-full-with-children
- 49edc7f22c: test/integration/smoke/test_backup_recovery_nas.py
Full diff: 11 files, +1617 / −30.
Review feedback addressed (all from #12899 thread)
- backups → backup_details kv table via NASBackupChainKeys
- nas.backup.full.interval (days) doesn't fit hourly/ad-hoc → nas.backup.full.every (default 10)
- backup-begin for full backups too; full omits <incremental>
- backup-<epoch> (System.currentTimeMillis()/1000)
- block-dirty-bitmap-add → --checkpointxml; manual bitmap commands removed
- qemu-img rebase after each incremental in nasbackup.sh, with relative backing path so chain survives mount-point churn
- INCREMENTAL_FALLBACK= if cadence asked for inc
- forced=true
- virsh checkpoint-list, recreates if missing, emits BITMAP_RECREATED=
- test_backup_recovery_nas.py
Backwards compatibility
- The -M / --bitmap-* flags on nasbackup.sh are optional. Without them, the script preserves the legacy full-only behaviour exactly (no checkpoint creation, same XML).
- TakeBackupCommand new fields default to null; LibvirtTakeBackupCommandWrapper only emits the new flags when set, so a 4.22 management server talking to a 4.23 agent still works.
- Backups without chain_id in backup_details are treated as standalone fulls by the cascade-delete logic — no migration needed.
Test plan
- NASBackupProviderTest should still pass
- test_backup_recovery_nas.py (5 new cases — require required_hardware="true")
- nas.backup.full.every=3, verify chain pattern FULL, INC, INC, FULL, INC
- diff the restored disk against the live disk at backup time
- BITMAP_RECREATED= shows up in agent logs
Refs