Skip to content

<fix>[vm]: ZSTAC-68874 always submit DeleteVmGC on create rollback#3464

Open
MatheMatrix wants to merge 1 commit into5.5.12from
sync/zstackio/fix/ZSTAC-68874@@2
Open

<fix>[vm]: ZSTAC-68874 always submit DeleteVmGC on create rollback#3464
MatheMatrix wants to merge 1 commit into5.5.12from
sync/zstackio/fix/ZSTAC-68874@@2

Conversation

@MatheMatrix
Copy link
Owner

ZSTAC-68874 推理服务断开vip导致GPU脏数据

Root Cause

VM创建失败rollback时,DestroyVm也失败的场景下,旧逻辑只在GC_ELIGIBLE错误时提交DeleteVmGC。
非GC_ELIGIBLE错误(如MN不可用)不创建GC,导致orphan VM残留在hypervisor上占用GPU资源。

Fix

rollback始终提交DeleteVmGC,无论错误类型。非GC_ELIGIBLE错误额外触发canonical event(保持告警行为不变)。

Cross-repo

  • zstack: VmCreateOnHypervisorFlow.java rollback逻辑
  • premium (fix/ZSTAC-68874@@2): VmCreateGCCase.groovy 测试断言更新

sync from gitlab !9323

Resolves: ZSTAC-68874

Change-Id: I49fd973a29ddea3299acfca2d9c70f88ea5e7906
@coderabbitai
Copy link

coderabbitai bot commented Mar 11, 2026

Walkthrough

回滚路径中对虚拟机删除的垃圾回收提交由有条件改为无条件提交;大量文件将通用的通配符导入替换为显式导入;在若干流程处加入“DEBT: NoRollbackFlow”注释;新增全局错误码国际化实现并替换 LocaleUtils;对若干 VO/消息添加注解/超时标注。

Changes

Cohort / File(s) Summary
VM GC 行为变更
compute/src/main/java/org/zstack/compute/vm/VmCreateOnHypervisorFlow.java
回滚分支:不再仅在 OPERATION_FAILURE_GC_ELIGIBLE 时才提交 DeleteVmGC,改为无条件创建并提交(仍用 existedAndNotCompleted() 避免重复),并移动 GC 提交顺序与注释说明。
新增/替换错误码 i18n 实现
core/src/main/java/org/zstack/core/errorcode/GlobalErrorCodeI18nServiceImpl.java, core/src/main/java/org/zstack/core/errorcode/LocaleUtils.java
新增 GlobalErrorCodeI18nServiceImpl 的完整实现(加载 classpath JSON、模板解析、本地化方法等);重写/替换 LocaleUtils 实现为显式导入版。
VO/消息注解或超时变更
core/src/main/java/org/zstack/core/gc/GarbageCollectorVO.java, header/src/main/java/org/zstack/header/volume/APICreateDataVolumeFromVolumeSnapshotMsg.java
为 GarbageCollectorVO 添加 @BaseResource 注解;为 APICreateDataVolumeFromVolumeSnapshotMsg 添加 @DefaultTimeout(timeunit = TimeUnit.HOURS, value = 3)
导入精简(大规模)
compute/..., core/..., network/..., header/..., identity/..., image/..., configuration/..., console/... (多文件)
大量文件将通配符导入(如 import java.util.* / import org.zstack.header.*)替换为显式类级导入,绝大多数为纯导入调整,无功能改动。
流程注释/NoRollbackFlow 标注
compute/..., core/..., network/..., image/..., configuration/... (多文件)
在若干 Flow/FlowChain/rollback 节点插入“DEBT: NoRollbackFlow …”注释或以 NoRollbackFlow 包裹步骤(注释表明回滚语义债务),不改变现有控制流。

预估代码审查工作量

🎯 3 (中等) | ⏱️ ~20 分钟

垃圾不再看门脸,
无论何故都要去见,
孤立虚机被追寻,GPU也净颜,
小兔蹦跳唱一遍,代码森森整齐鲜 🐰✨

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sync/zstackio/fix/ZSTAC-68874@@2

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/VmCreateOnHypervisorFlow.java`:
- Around line 85-96: In VmCreateOnHypervisorFlow, the DeleteVmGC is being
registered with the wrong host UUID: change gc.hostUuid from
spec.getVmInventory().getHostUuid() to spec.getDestHost().getUuid() so the GC
(DeleteVmGC) runs and listens for host events on the actual destination host
where the VM was created; update the assignment near gcName creation (refer to
DeleteVmGC, gc.hostUuid, gcName and the logger.debug call) to use
spec.getDestHost().getUuid().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: cc5753a6-8268-427c-b9cd-3e36cd1b39ce

📥 Commits

Reviewing files that changed from the base of the PR and between 8117a99 and 8797213.

📒 Files selected for processing (1)
  • compute/src/main/java/org/zstack/compute/vm/VmCreateOnHypervisorFlow.java

Comment on lines +85 to +96
String gcName = String.format("gc-vm-%s-on-host-%s", spec.getVmInventory().getUuid(), spec.getDestHost().getUuid());

DeleteVmGC gc = new DeleteVmGC();
gc.NAME = gcName;
gc.hostUuid = spec.getVmInventory().getHostUuid();
gc.inventory = spec.getVmInventory();
if (gc.existedAndNotCompleted()) {
logger.debug(String.format("There is already a DeleteVmGC of vm[uuid:%s] " +
"on host[uuid:%s], skip.", spec.getVmInventory().getUuid(), spec.getDestHost().getUuid()));
} else {
gc.submit();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# 查找 DeleteVmGC 类的实现
ast-grep --pattern $'class DeleteVmGC {
  $$$
}'

# 查找 existedAndNotCompleted 方法的实现
rg -n -A 10 'existedAndNotCompleted' --type java

Repository: MatheMatrix/zstack

Length of output: 6777


🏁 Script executed:

#!/bin/bash
# 查找 DeleteVmGC 类的定义和实现
find . -type f -name "*.java" | xargs grep -l "class DeleteVmGC"

# 查看 DeleteVmGC 的完整实现
rg -B 5 -A 30 "class DeleteVmGC" --type java

Repository: MatheMatrix/zstack

Length of output: 3642


🏁 Script executed:

#!/bin/bash
# 查看 VmCreateOnHypervisorFlow 完整代码段
rg -B 20 -A 10 "setHostUuid" compute/src/main/java/org/zstack/compute/vm/VmCreateOnHypervisorFlow.java

Repository: MatheMatrix/zstack

Length of output: 3295


gc.hostUuid 应使用 spec.getDestHost().getUuid() 而非 spec.getVmInventory().getHostUuid(),确保 GC 任务在正确的主机上执行。

在 VM 创建失败的回滚流程中,VM 被创建在 destHost 上。DeleteVmGChostUuid 字段被用于:

  1. 匹配主机事件(HOST_DELETED_PATHHOST_STATUS_CHANGED_PATH
  2. 查询主机状态以决定是否执行清理

第 85 行的 gcName 和第 93 行的日志都正确地使用了 spec.getDestHost().getUuid(),但第 89 行的 gc.hostUuid 使用了 spec.getVmInventory().getHostUuid(),导致 GC 可能在错误的主机上执行清理或无法正确监听主机事件。

建议修复
                     DeleteVmGC gc = new DeleteVmGC();
                     gc.NAME = gcName;
-                    gc.hostUuid = spec.getVmInventory().getHostUuid();
+                    gc.hostUuid = spec.getDestHost().getUuid();
                     gc.inventory = spec.getVmInventory();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmCreateOnHypervisorFlow.java`
around lines 85 - 96, In VmCreateOnHypervisorFlow, the DeleteVmGC is being
registered with the wrong host UUID: change gc.hostUuid from
spec.getVmInventory().getHostUuid() to spec.getDestHost().getUuid() so the GC
(DeleteVmGC) runs and listens for host events on the actual destination host
where the VM was created; update the assignment near gcName creation (refer to
DeleteVmGC, gc.hostUuid, gcName and the logger.debug call) to use
spec.getDestHost().getUuid().

@MatheMatrix MatheMatrix force-pushed the sync/zstackio/fix/ZSTAC-68874@@2 branch from e79c739 to 8797213 Compare March 12, 2026 02:19
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (15)
compute/src/main/java/org/zstack/compute/vm/VmExpungeCacheVolumeFlow.java (3)

32-32: ⚠️ Potential issue | 🟡 Minor

Logger 使用了错误的类引用。

当前 logger 使用 VmExpungeRootVolumeFlow.class,应该使用 VmExpungeCacheVolumeFlow.class

🐛 建议修复
-    private static final CLogger logger = Utils.getLogger(VmExpungeRootVolumeFlow.class);
+    private static final CLogger logger = Utils.getLogger(VmExpungeCacheVolumeFlow.class);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeCacheVolumeFlow.java` at
line 32, The logger is instantiated with the wrong class; update the logger
declaration in VmExpungeCacheVolumeFlow by replacing
Utils.getLogger(VmExpungeRootVolumeFlow.class) with
Utils.getLogger(VmExpungeCacheVolumeFlow.class) so the CLogger references the
current class (update the private static final CLogger logger assignment
accordingly).

64-66: ⚠️ Potential issue | 🟡 Minor

日志中记录的卷 UUID 不正确。

日志记录的是 getRootVolumeUuid(),但实际正在处理的是 cache volume,应该使用 vol.getUuid() 记录当前处理的缓存卷 UUID。

🐛 建议修复
                     if (!reply.isSuccess()) {
                         logger.warn(String.format("failed to expunge the cache volume[uuid:%s] of the vm[uuid:%s, name:%s], %s",
-                                spec.getVmInventory().getRootVolumeUuid(), spec.getVmInventory().getUuid(),
+                                vol.getUuid(), spec.getVmInventory().getUuid(),
                                 spec.getVmInventory().getName(), reply.getError()));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeCacheVolumeFlow.java`
around lines 64 - 66, The warning log in VmExpungeCacheVolumeFlow is using
spec.getVmInventory().getRootVolumeUuid() which is the root volume UUID instead
of the cache volume being processed; update the logger.warn call to log the
cache volume's UUID by using vol.getUuid() (keep the rest of the message the
same and continue to include the VM UUID/name via
spec.getVmInventory().getUuid()/getName() and reply.getError() for the error
details).

40-42: ⚠️ Potential issue | 🔴 Critical

缺少 return 语句导致潜在的 NullPointerException。

VmExpungeMemoryVolumeFlow 相同的问题:当 getAllVolumes() 返回 null 时,流程会继续执行到第 44 行,对 null 调用 stream() 抛出 NPE。

🐛 建议修复
         if (spec.getVmInventory().getAllVolumes() == null) {
             trigger.next();
+            return;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeCacheVolumeFlow.java`
around lines 40 - 42, 在 VmExpungeCacheVolumeFlow 中,当
spec.getVmInventory().getAllVolumes() 返回 null 时当前只调用了 trigger.next() 但没有
return,导致后续对 getAllVolumes().stream() 的调用抛出 NPE;修改方法是在调用 trigger.next()
后立即返回以中止当前 flow(或改为使用 if (getAllVolumes() == null) { trigger.next(); return;
}),确保在执行 stream() 前 getAllVolumes() 不为 null。
compute/src/main/java/org/zstack/compute/vm/VmExpungeMemoryVolumeFlow.java (3)

67-69: ⚠️ Potential issue | 🟡 Minor

日志消息内容与实际操作不符。

日志中写的是 "root volume",但这个流程处理的是 memory volume。同时 getRootVolumeUuid() 应该改为 vol.getUuid() 以正确记录实际处理的卷 UUID。

🐛 建议修复
-                        logger.warn(String.format("failed to expunge the root volume[uuid:%s] of the vm[uuid:%s, name:%s], %s",
-                                spec.getVmInventory().getRootVolumeUuid(), spec.getVmInventory().getUuid(),
+                        logger.warn(String.format("failed to expunge the memory volume[uuid:%s] of the vm[uuid:%s, name:%s], %s",
+                                vol.getUuid(), spec.getVmInventory().getUuid(),
                                 spec.getVmInventory().getName(), reply.getError()));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeMemoryVolumeFlow.java`
around lines 67 - 69, The log message in VmExpungeMemoryVolumeFlow incorrectly
references "root volume" and uses spec.getVmInventory().getRootVolumeUuid();
update the warning in the exception handling to reference the actual memory
volume by using vol.getUuid() and change the text to "memory volume" (e.g.,
"failed to expunge the memory volume[uuid:%s] of the vm[uuid:%s, name:%s], %s")
so that the flow (VmExpungeMemoryVolumeFlow) correctly records the volume UUID
and accurate operation context.

40-42: ⚠️ Potential issue | 🔴 Critical

缺少 return 语句导致潜在的 NullPointerException。

getAllVolumes() 返回 null 时,trigger.next() 被调用后流程没有返回,代码会继续执行到第 44 行,对 null 调用 stream() 方法将抛出 NPE。

🐛 建议修复
         if (spec.getVmInventory().getAllVolumes() == null) {
             trigger.next();
+            return;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeMemoryVolumeFlow.java`
around lines 40 - 42, In VmExpungeMemoryVolumeFlow, when
spec.getVmInventory().getAllVolumes() == null you call trigger.next() but do not
return, so execution continues and a subsequent call to
spec.getVmInventory().getAllVolumes().stream() can NPE; fix by returning
immediately after trigger.next() (add a return after the trigger.next() call) so
the method exits when getAllVolumes() is null.

32-32: ⚠️ Potential issue | 🟡 Minor

Logger 使用了错误的类引用。

当前 logger 使用 VmExpungeRootVolumeFlow.class,应该使用 VmExpungeMemoryVolumeFlow.class。这是复制粘贴遗留的错误,会导致日志归类不正确。

🐛 建议修复
-    private static final CLogger logger = Utils.getLogger(VmExpungeRootVolumeFlow.class);
+    private static final CLogger logger = Utils.getLogger(VmExpungeMemoryVolumeFlow.class);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeMemoryVolumeFlow.java`
at line 32, The logger is instantiated with the wrong class reference: replace
the incorrect VmExpungeRootVolumeFlow.class passed to Utils.getLogger with
VmExpungeMemoryVolumeFlow.class so the static field logger in
VmExpungeMemoryVolumeFlow references the correct class; update the line that
declares private static final CLogger logger = Utils.getLogger(...) accordingly.
compute/src/main/java/org/zstack/compute/vm/VmAllocateNicForStartingVmFlow.java (1)

192-192: ⚠️ Potential issue | 🟡 Minor

注释拼写错误:happendhappen

根据编码规范,注释应使用正确的英文拼写。建议在此次导入清理时一并修复。

🔧 建议修复
-        /* this case should not happend */
+        /* this case should not happen */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/VmAllocateNicForStartingVmFlow.java`
at line 192, Fix the typo in the inline comment inside the
VmAllocateNicForStartingVmFlow class: change "happend" to "happen" (search for
the comment string "this case should not happend" in
VmAllocateNicForStartingVmFlow.java and update it to "this case should not
happen"); keep the comment punctuation and spacing consistent with surrounding
comments and commit as a small cleanup.
header/src/main/java/org/zstack/header/volume/APICreateDataVolumeFromVolumeSnapshotMsg.java (1)

54-54: ⚠️ Potential issue | 🟡 Minor

默认超时改成 3 小时后,示例里的 timeout 已经过期。

当前类注释里的 @msg 示例仍是 1800000,但 Line 54 的默认值已经变成了 3 小时,即 10800000。如果 3 小时是预期行为,请同步更新示例文档;如果不是,这里的注解值需要再确认。

📝 建议同步的最小修改
- * "timeout": 1800000,
+ * "timeout": 10800000,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/volume/APICreateDataVolumeFromVolumeSnapshotMsg.java`
at line 54, The class APICreateDataVolumeFromVolumeSnapshotMsg has its
`@DefaultTimeout` annotation changed to 3 hours (TimeUnit.HOURS, value = 3) but
the `@msg` example in the class Javadoc still shows the old timeout value
(1800000); either update the example value to the equivalent millisecond value
(10800000) to reflect the new default or revert/adjust the `@DefaultTimeout`
annotation to the intended timeout (e.g., value matching 1800000ms) so code and
example stay consistent—locate the `@DefaultTimeout` on
APICreateDataVolumeFromVolumeSnapshotMsg and the `@msg` example in the class
comment and make the matching change.
core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java (1)

145-163: ⚠️ Potential issue | 🟡 Minor

修正代码格式不一致:Line 147 缺少空格

Line 147 的 __name__= 与 Line 131 的 __name__ = 格式不一致,应在 = 前添加空格。

🛠️ 建议的修复方案
                 // DEBT: NoRollbackFlow — in url
                 flow(new NoRollbackFlow() {
-                    String __name__= "init-server";
+                    String __name__ = "init-server";

                     `@Override`
                     public void run(FlowTrigger trigger, Map data) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java` around lines
145 - 163, The formatting of the anonymous NoRollbackFlow's name field is
inconsistent: update the declaration String __name__= "init-server"; inside the
NoRollbackFlow (the anonymous class passed to flow) to match the project's style
by adding a space before the equals sign (String __name__ = "init-server";),
ensuring consistency with other occurrences like the __name__ = usage earlier.
network/src/main/java/org/zstack/network/l3/NormalIpRangeFactory.java (1)

132-185: ⚠️ Potential issue | 🟠 Major

请为 SDN 侧副作用补偿/GC,否则失败后会留下外部脏状态。

这里把 enable-sdn-dhcpadd-sdn-subnet 都放进了 NoRollbackFlow。如果前者已经成功、后者再失败,链路只会回滚本地 save-db,但 SDN 控制器侧的 DHCP/子网状态不会被清理,最终会出现数据库与控制器状态不一致,后续重试也可能变得不确定。建议至少为这两个步骤补充补偿回滚,或者在失败时提交异步 GC / reconcile。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@network/src/main/java/org/zstack/network/l3/NormalIpRangeFactory.java` around
lines 132 - 185, The two NoRollbackFlow steps "enable-sdn-dhcp" and
"add-sdn-subnet" leave external SDN controller state if later steps fail; add
compensating cleanup or schedule async GC/reconcile on failure. Modify the flow
chain so that either these steps are implemented as RollbackFlow with rollback
logic that reverses SdnControllerUpdateDHCPMsg and SdnControllerL3.createIpRange
(e.g., send delete/disable messages to the SDN controller using the same service
IDs / resource UUIDs), or, in the enclosing error handler, detect whether
enable-sdn-dhcp and/or add-sdn-subnet succeeded and then invoke async cleanup
tasks (send delete DHCP msg and call SdnControllerL3.deleteIpRange or schedule
reconcile) for the resources created by SdnControllerUpdateDHCPMsg and
SdnControllerL3.createIpRange to ensure external state is removed or reconciled
on failure.
image/src/main/java/org/zstack/image/ImageBase.java (1)

492-535: ⚠️ Potential issue | 🟠 Major

这段删除前置卸载会在部分失败时留下半完成状态。

这里对多个 VM 逐个卸载 ISO,只要后面任意一个失败就会 trigger.fail(),但前面已经成功卸载的 VM 没有任何补偿。结果会变成:镜像删除失败,但部分 VM 已经丢失了 ISO 挂载。建议至少记录成功卸载的 VM,并在失败路径里补偿恢复;否则这一步更适合作为无副作用的前置校验,而不是直接改动 VM 状态。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@image/src/main/java/org/zstack/image/ImageBase.java` around lines 492 - 535,
The current NoRollbackFlow that iterates IsoOperator.getVmUuidByIsoUuid and
sends DetachIsoFromVmInstanceMsg via bus.send can leave partial state because it
calls trigger.fail() on any error without compensating for earlier successful
detaches; change the flow to track successfulVmUuids (List<String>) as you
iterate, and if any detach fails use those recorded UUIDs to send compensating
attach messages (e.g., create AttachIsoToVmInstanceMsg per UUID and bus.send
them) before calling trigger.fail(), or alternatively convert this step into a
pure validation check that only verifies all VMs can accept detach (no state
change) and move actual DetachIsoFromVmInstanceMsg logic into a rollback-capable
flow; update references in this block: NoRollbackFlow,
IsoOperator.getVmUuidByIsoUuid, DetachIsoFromVmInstanceMsg, bus.send,
trigger.fail, trigger.next accordingly.
compute/src/main/java/org/zstack/compute/vm/VmInstantiateOtherDiskFlow.java (3)

150-195: ⚠️ Potential issue | 🟠 Major

回滚释放容量时不要再读 diskAO.getSize()

Line 159 申请主存储用的是 diskSize,但 Line 189 回滚释放的是 diskAO.getSize()。走 diskOfferingUuid 分支时,diskAO.getSize() 很可能是 0,失败回滚会少释放容量,主存储账面空间会被污染。

建议修改
                     `@Override`
                     public void rollback(FlowRollback chain, Map data) {
                         if (!isSuccessAllocatePS) {
                             chain.rollback();
                             return;
                         }
                         ReleasePrimaryStorageSpaceMsg msg = new ReleasePrimaryStorageSpaceMsg();
-                        msg.setDiskSize(diskAO.getSize());
+                        msg.setDiskSize(diskSize);
                         msg.setAllocatedInstallUrl(allocatedInstallUrl);
                         msg.setPrimaryStorageUuid(allocatedPrimaryStorageUuid);
                         bus.makeTargetServiceIdByResourceUuid(msg, PrimaryStorageConstant.SERVICE_ID, allocatedPrimaryStorageUuid);
                         bus.send(msg);
                         chain.rollback();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstantiateOtherDiskFlow.java`
around lines 150 - 195, In setupCreateVolumeFromDiskSizeFlows, the rollback
currently uses diskAO.getSize() which can be 0 when size came from diskSize;
change the ReleasePrimaryStorageSpaceMsg to setDiskSize(diskSize) (use the same
diskSize passed into AllocatePrimaryStorageSpaceMsg) and keep
allocatedInstallUrl and allocatedPrimaryStorageUuid as-is so the rollback
releases the exact amount allocated; update any references to diskAO.getSize()
in the rollback block accordingly.

438-462: ⚠️ Potential issue | 🟠 Major

attachOtherDisk 现在没有任何补偿路径。

这段改成 NoRollbackFlow 后,只要后续步骤失败,外层 rollback() 既不会回调扩展点做 detach,也没有保存任何可用于补偿的数据;当前外层回滚只会处理 volumeInventory。对 VmAttachOtherDiskExtensionPoint 管理的磁盘/介质,这会遗留宿主机或控制面的关联状态。建议这里保留 rollback-capable Flow,或者给扩展点补一个对称的 detach 回调。

Also applies to: 468-493

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstantiateOtherDiskFlow.java`
around lines 438 - 462, The flow uses NoRollbackFlow for attaching external
media via VmAttachOtherDiskExtensionPoint.attachOtherDiskToVm, leaving no
compensation path if later steps fail; change to a rollback-capable Flow (not
NoRollbackFlow) and invoke the extension point’s detach on rollback or add a
symmetric detach callback on VmAttachOtherDiskExtensionPoint that the Flow calls
from its rollback method, ensuring any state created by attachOtherDiskToVm is
cleaned up and any needed compensation data (e.g., disk/attachment identifiers)
is saved in the Flow’s data map for the rollback to use.

120-123: ⚠️ Potential issue | 🔴 Critical

现有数据盘在失败回滚时会被误删。

Line 121-123 把外部传入的现有卷直接放进了 volumeInventory。现在 attach-volume-to-vm 改成 NoRollbackFlow 后,后续任一步失败都会落到外层 rollback(),而 Line 474-490 会无条件对 volumeInventory 发送 DeleteVolumeMsg。这会把“用户原本要挂载的现有数据盘”当成“本流程新建的临时卷”删除,而不是执行 detach。至少要区分“本流程创建的卷”和“外部传入的卷”,对后者走补偿卸载或跳过删除。

Also applies to: 413-434, 468-493

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstantiateOtherDiskFlow.java`
around lines 120 - 123, The flow currently treats any VolumeInventory in
VmInstantiateOtherDiskFlow (set when isAttachDataVolume() and via
setupAttachVolumeFlows()) as if it was created by this flow and unconditionally
sends DeleteVolumeMsg during rollback; change this so the code distinguishes
volumes created by this flow from externally provided volumes: when you load a
volume by diskAO.getSourceUuid() in the isAttachDataVolume() branch, mark it as
external (e.g., set a boolean like volumeCreatedByFlow=false or store
externalVolumeUuid) and ensure setupAttachVolumeFlows() does not mark it as
newly created; when creating volumes in other branches mark them
volumeCreatedByFlow=true; then update rollback() (the block that currently sends
DeleteVolumeMsg for volumeInventory) to only send DeleteVolumeMsg if
volumeCreatedByFlow==true, and for external volumes perform the correct
compensation (run detach logic or skip deletion) instead of deleting; keep
references to VmInstantiateOtherDiskFlow, isAttachDataVolume(),
setupAttachVolumeFlows(), volumeInventory, rollback(), and DeleteVolumeMsg to
locate the changes.
compute/src/main/java/org/zstack/compute/vm/VmInstanceManagerImpl.java (1)

1558-1580: ⚠️ Potential issue | 🟠 Major

start-vm 失败后仍可能绕过 DeleteVmGC 补偿。

Line 1560 这里新增的 NoRollbackFlow 一旦失败,会回滚到前一个 instantiate-new-created-vmInstance.rollback();但那个回滚在 DestroyVmInstanceMsg 失败时只是记录日志然后继续,没有统一提交 DeleteVmGC。这样“VM 已创建成功,但后续挂盘/启动失败”的分支里,仍可能留下宿主机上的孤儿 VM 和被占用的 GPU 资源,和本 PR 要修的问题是同一类泄漏。建议把这条回滚链也复用“无论错误类型都提交 DeleteVmGC,非 GC_ELIGIBLE 再补 canonical event”的处理。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceManagerImpl.java`
around lines 1558 - 1580, The new NoRollbackFlow named "start-vm" can fail and
skip the DeleteVmGC submission path used by
instantiate-new-created-vmInstance.rollback(), leaving orphaned VMs/GPU
allocations; modify the failure handling so that when StartVmInstanceMsg in the
"start-vm" flow fails you always enqueue a DeleteVmGC for instantiateVm (use the
same DeleteVmGC submission code/path as in
instantiate-new-created-vmInstance.rollback()), and if DestroyVmInstanceMsg
returns a non-GC_ELIGIBLE error follow the same logic to emit the canonical
event instead of silently logging; update the failure branch in the flow's
CloudBusCallBack (and/or the rollback hook tied to instantiateVm/finalVo) to
reuse the DeleteVmGC submission and GC_ELIGIBLE handling so orphan cleanup is
consistent.
🧹 Nitpick comments (12)
core/src/main/java/org/zstack/core/gc/GarbageCollectorVO.java (1)

8-8: 考虑将 javax.persistence.* 通配符导入替换为显式导入以保持一致性。

第 4-5 行已将 org.zstack.header.vo 的通配符导入改为显式导入,但第 8 行的 javax.persistence.* 仍使用通配符。为保持导入风格一致,建议将其改为显式导入实际使用的类:EntityTableColumnEnumeratedEnumTypePreUpdate

♻️ 建议的重构
-import javax.persistence.*;
+import javax.persistence.Column;
+import javax.persistence.Entity;
+import javax.persistence.EnumType;
+import javax.persistence.Enumerated;
+import javax.persistence.PreUpdate;
+import javax.persistence.Table;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/gc/GarbageCollectorVO.java` at line 8, 在
GarbageCollectorVO 中把通配符导入 javax.persistence.* 替换为显式导入所用的注解/类型以保持导入风格一致;具体在该文件(类
GarbageCollectorVO)中移除 javax.persistence.* 并添加显式导入
Entity、Table、Column、Enumerated、EnumType 和 PreUpdate(仅导入实际使用到的类型)。
core/src/main/java/org/zstack/core/errorcode/ElaborationManagerImpl.java (1)

131-132: 技术债务注释应更具描述性

多处 DEBT: NoRollbackFlow — reason TBD 注释(如第 131、147、180、203、237、254 行)缺乏实际意义。如果是在追踪技术债务,建议说明:

  1. 为什么这是技术债务
  2. 未来需要采取什么行动

例如:// DEBT: NoRollbackFlow — should evaluate if rollback logic is needed for file scanning errors

这样后续开发者能够理解并处理这些标记。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/errorcode/ElaborationManagerImpl.java`
around lines 131 - 132, Replace the vague "DEBT: NoRollbackFlow — reason TBD"
comments around the NoRollbackFlow anonymous flows in ElaborationManagerImpl
with specific, actionable debt notes: state why using NoRollbackFlow is
potentially risky here (e.g., "skips rollback for X operation such as file
scanning/state mutation"), describe what should be evaluated later (e.g.,
"determine if rollback is required when Y fails or add compensating action"),
and include an owner or ticket reference for follow-up; update each occurrence
adjacent to the flow(...) / new NoRollbackFlow() blocks so future developers can
understand the debt and planned remediation.
compute/src/main/java/org/zstack/compute/vm/VmDeleteVolumeFlow.java (1)

90-90: TBD 改成可执行的上下文。

这条注释现在只说明“这里有债务”,但没有解释为什么这里必须保留 NoRollbackFlow,后续维护时很难判断这是有意设计还是临时绕过。建议补上具体原因、风险边界,或者直接引用对应工单号。

As per coding guidelines,“对于较长的注释,需要仔细校对并随代码更新,确保内容正确。”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmDeleteVolumeFlow.java` at line
90, 在 VmDeleteVolumeFlow 类中把现有“// DEBT: NoRollbackFlow — reason
TBD”注释替换为可执行的上下文说明:在注释中明确指出为何该 Flow 必须使用
NoRollbackFlow(例如资源已被物理删除无法回滚、幂等性保障或已在上层事务中处理回滚),列出相关风险边界(什么情况下会导致资源泄漏或不一致)并引用相关工单或设计文档编号;如果决定不是永久方案,请注明预期的整改时间点或条件。定位符:VmDeleteVolumeFlow
类和该包含 “NoRollbackFlow” 的注释位置,直接更新注释内容以包含原因、风险和工单/文档引用。
core/src/main/java/org/zstack/core/cascade/CascadeFacadeImpl.java (1)

231-231: DEBT 注释补上可执行上下文。

这几处现在只标明了 NoRollbackFlow 是技术债,但没有说明这里为什么可以接受“无回滚”,也没有关联跟踪项;后续很容易变成长期悬空的注释。建议至少补一句风险/约束说明,或附上对应的 issue/ticket。

Also applies to: 290-290, 313-313

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/cascade/CascadeFacadeImpl.java` at line
231, 把只是写“DEBT: NoRollbackFlow”的注释替换为有执行上下文的说明:在 CascadeFacadeImpl 的
asyncCascade 路径(标记为
NoRollbackFlow)的地方,补充一到两句说明为什么这里可以接受无回滚(例如幂等性、外部补偿机制或可接受的不一致窗口)、相关风险/约束(可能的数据不一致场景、重试/补偿触发条件)以及关联跟踪项(Issue/任务编号或链接)以便后续可追溯;定位符包括类名
CascadeFacadeImpl、方法/流程标识 asyncCascade 和注释中的 NoRollbackFlow。
network/src/main/java/org/zstack/network/l3/L3NetworkManagerImpl.java (1)

619-620: 建议补充债务原因说明。

当前注释 reason TBD 含义模糊,建议明确说明为何此处使用 NoRollbackFlow 以及后续改进方向,例如:数据库持久化后的回滚策略、SDN 控制器状态回滚等。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@network/src/main/java/org/zstack/network/l3/L3NetworkManagerImpl.java` around
lines 619 - 620, 在 L3NetworkManagerImpl 中对使用 NoRollbackFlow 的注释过于模糊(当前为 "reason
TBD"),请替换该注释为明确说明为何选择 NoRollbackFlow(例如:该 flow 只做非持久化的本地状态变更或外部 SDN
操作无法回滚到数据库事务),并列出后续改进点(如在未来加入补偿逻辑、引入持久化回滚或对 SDN 控制器进行状态回滚方案)。确保注释引用
NoRollbackFlow 和相关上下文以便后续维护者理解设计权衡和待办项。
core/src/main/java/org/zstack/core/config/GlobalConfigFacadeImpl.java (1)

117-149: reason TBD 补成可执行的债务说明。

Line 148 这条注释太泛,后续维护时很难判断这里为什么必须保留 NoRollbackFlow,以及在什么条件下可以替换。建议至少补上当前约束或后续跟进入口。

✏️ 可选修改
-        // DEBT: NoRollbackFlow — reason TBD
+        // DEBT: `NoRollbackFlow` is intentional here because rollback semantics are not defined
+        // for partial failures across management nodes yet. Replace this after that path is designed.
As per coding guidelines, “对于较长的注释,需要仔细校对并随代码更新,确保内容正确。”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/config/GlobalConfigFacadeImpl.java` around
lines 117 - 149, 在这段 chain.then(new NoRollbackFlow() { ... }) 的注释“DEBT:
NoRollbackFlow — reason TBD”中写出可执行的债务说明:说明为什么这里必须使用
NoRollbackFlow(例如该流程不可回滚因为它异步对所有 ManagementNode 发送 RefreshGuestOsMetadataMsg
并在任一失败时仍需继续或统一汇总错误),列出当前的约束条件(依赖的消息类型 RefreshGuestOsMetadataMsg、使用 While<> 并汇总
ErrorCodeList 的行为、对 bus.send/CloudBusCallBack
的异常处理方式),以及何种条件下可以安全替换为可回滚流程(比如当消息调用改为幂等且支持事务补偿、或改为逐节点事务化并能回滚时),并在注释末尾加入一个跟进入口(例如关联的
issue/PR 编号或负责人)以便将来跟进和验证替换。
network/src/main/java/org/zstack/network/l2/L2VlanNetworkFactory.java (1)

91-92: 把这条债务注释放到 NoRollbackFlow 之前。

这条注释当前还挂在前一个匿名 Flow 的类体里,读起来更像是在说明 rollback(),而不是后面的 NoRollbackFlow。建议挪到 then(new NoRollbackFlow() { 的正上方,避免误导后续维护者。

可选调整示例
-        // DEBT: NoRollbackFlow — in rollback
-        }).then(new NoRollbackFlow() {
+        })
+        // DEBT: This step intentionally uses NoRollbackFlow; cleanup relies on the previous flow's rollback.
+        .then(new NoRollbackFlow() {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@network/src/main/java/org/zstack/network/l2/L2VlanNetworkFactory.java` around
lines 91 - 92, 将当前位于匿名 Flow 类体内、写作 “// DEBT: NoRollbackFlow — in rollback”
的债务注释移到 then(new NoRollbackFlow() { 这一行的正上方;具体来说:在包含匿名 Flow 的类体结束处删除该注释,并在调用
then(new NoRollbackFlow() { 之前插入该注释,以便该注释明确关联到 NoRollbackFlow 而非前一个 Flow 的
rollback() 方法(定位符:匿名 Flow 的 rollback() 与 NoRollbackFlow)。
core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java (1)

125-143: 注释含义不够清晰。

// DEBT: NoRollbackFlow — in url 这个注释的意图不够明确。如果是标记技术债务,建议说明具体需要改进的内容,例如:为什么是 debt?需要做什么改进?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java` around lines
125 - 143, Replace the vague comment "// DEBT: NoRollbackFlow — in url" with a
clear TODO that states why this is technical debt and what to change: mention
that the anonymous NoRollbackFlow named "echo-server" uses
restf.echo(url(ECHO_PATH)) inside run(FlowTrigger, Map) and that the problem is
either using an anonymous flow, mixing URL construction with flow logic, or
missing rollback semantics; specify the desired fix (e.g., extract URL
construction to a helper, make a named class instead of anonymous
NoRollbackFlow, or add proper rollback behavior) and reference the affected
symbols (NoRollbackFlow, __name__ = "echo-server", run, restf.echo,
url(ECHO_PATH), Completion) so future maintainers know the exact location and
intended refactor.
network/src/main/java/org/zstack/network/l3/FirstAvailableIpAllocatorStrategy.java (1)

3-3: 移除未使用的导入。

org.apache.commons.math3.analysis.function.Add 在此文件中未被使用,应当移除。

♻️ 建议修复
-import org.apache.commons.math3.analysis.function.Add;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@network/src/main/java/org/zstack/network/l3/FirstAvailableIpAllocatorStrategy.java`
at line 3, Remove the unused import
org.apache.commons.math3.analysis.function.Add from
FirstAvailableIpAllocatorStrategy by deleting that import statement; confirm no
other references to Add exist in the class (FirstAvailableIpAllocatorStrategy)
and run a build/IDE cleanup to ensure no unused imports remain.
compute/src/main/java/org/zstack/compute/host/HostBase.java (3)

630-631: DEBT 注释引用的方法名不正确

注释中标注 in stopFailedToMigrateVms,但此 flow 的 __name__"run-extension-point-after-maintenance"。建议修正注释以准确反映所在上下文:

-                // DEBT: NoRollbackFlow — in stopFailedToMigrateVms
+                // DEBT: NoRollbackFlow — in run-extension-point-after-maintenance
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/host/HostBase.java` around lines 630
- 631, The DEBT comment incorrectly references stopFailedToMigrateVms while the
flow being declared is a NoRollbackFlow with __name__
"run-extension-point-after-maintenance"; update the comment to accurately
describe this flow/context (e.g., "DEBT: NoRollbackFlow — in
run-extension-point-after-maintenance") so it matches the flow's __name__ and
avoids confusion when inspecting HostBase, NoRollbackFlow, and the flow block
with __name__ "run-extension-point-after-maintenance".

1391-1412: DEBT 注释引用了不相关的方法名

这两处注释都标注 in getSyncLevel,但实际 flow 名称分别是 "check-conditions-of-connection""connect-host"getSyncLevel 是当前 ChainTask 的一个重写方法,与这些 flow 的用途无关。

建议修正为更准确的描述:

-                        // DEBT: NoRollbackFlow — in getSyncLevel
+                        // DEBT: NoRollbackFlow — check-conditions-of-connection
                         flow(new NoRollbackFlow() {
                             String __name__ = "check-conditions-of-connection";
-                        // DEBT: NoRollbackFlow — in getSyncLevel
+                        // DEBT: NoRollbackFlow — connect-host
                         flow(new NoRollbackFlow() {
                             String __name__ = "connect-host";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/host/HostBase.java` around lines
1391 - 1412, The DEBT comments incorrectly reference "in getSyncLevel" even
though these NoRollbackFlow instances in HostBase pertain to the
"check-conditions-of-connection" and "connect-host" flows; update the comments
to accurately describe the debt (e.g., "DEBT: NoRollbackFlow — consider making
this a rollback-capable flow" or mention the specific flow purpose) adjacent to
the flow declarations for "check-conditions-of-connection" and "connect-host"
(in HostBase) so they no longer incorrectly reference getSyncLevel.

576-577: 建议为 "reason TBD" 注释添加更具体的说明

多处 DEBT 注释使用 reason TBD(如此处、773、1432、1465、1498 行),这类标注对后续维护帮助有限。建议:

  1. 如果已知不需要回滚的原因,直接写明(如 "no state to rollback" 或 "idempotent operation")
  2. 如果需要后续调查,考虑创建 issue 追踪,并在注释中引用 issue 编号

这样能让技术债务标记更具可操作性。

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/host/HostBase.java` around lines 576
- 577, Replace vague DEBT "reason TBD" comments on the flow(new
NoRollbackFlow()) usages in HostBase (and other DEBT entries in the same class)
with concrete guidance: either state the precise reason why rollback is
unnecessary (e.g., "no state to rollback" or "idempotent operation") or add a
reference to a tracking issue (e.g., "investigate rollback behavior — see
ISSUE-1234") so future maintainers have actionable context; update the comment
text adjacent to the NoRollbackFlow instantiation and any other DEBT comments in
this class accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@core/src/main/java/org/zstack/core/errorcode/GlobalErrorCodeI18nServiceImpl.java`:
- Line 24: The resource path constant I18N_FOLDER in
GlobalErrorCodeI18nServiceImpl currently uses File.separator which yields
platform-specific separators; change the constant to use the classpath separator
'/' instead (e.g., "i18n/globalErrorCodeMapping") so classpath resource loading
works on Windows; update any usages of I18N_FOLDER in the class to rely on the
new value.
- Around line 70-72: The outer catch in GlobalErrorCodeI18nServiceImpl currently
logs only e.getMessage() losing the stacktrace; update the logger.warn call in
that catch block (the one currently doing logger.warn(String.format("failed to
scan i18n folder: %s", e.getMessage()))) to pass the exception object itself so
the stack trace is recorded (e.g., logger.warn("failed to scan i18n folder",
e)); keep the original message text for context and only change the logger
invocation to include the Exception parameter.
- Around line 61-63: The JSON deserialization into LinkedHashMap.class in
JSONObjectUtil.toObject(content, LinkedHashMap.class) loses generic type info
and can cause ClassCastException when values aren't strings; update the
deserialization to preserve generics (e.g., use a TypeReference/TypeToken for
LinkedHashMap<String,String> or the library's equivalent) when constructing
Map<String,String> messages, and add a runtime validation step before putting
into localeMessages to ensure every key and value are strings (or convert
non-string values to their string representation) so localeMessages.put(locale,
messages) is safe; target the JSONObjectUtil.toObject call, the
Map<String,String> messages variable and the localeMessages.put usage for the
change.

In `@core/src/main/java/org/zstack/core/upgrade/AgentVersionVO.java`:
- Around line 3-5: AgentVersionVO's SoftDeletionCascade references joinColumn =
"resourceUuid" which doesn't exist on the class (it only declares uuid); update
the annotation to reference the actual field or add the missing
field—specifically, in the AgentVersionVO class locate the SoftDeletionCascade
annotation (joinColumn = "resourceUuid") and change joinColumn to "uuid" to
match the existing uuid field (or alternatively add a resourceUuid property if
semantics require it) so the soft-delete cascade can correctly find related
records.

---

Outside diff comments:
In
`@compute/src/main/java/org/zstack/compute/vm/VmAllocateNicForStartingVmFlow.java`:
- Line 192: Fix the typo in the inline comment inside the
VmAllocateNicForStartingVmFlow class: change "happend" to "happen" (search for
the comment string "this case should not happend" in
VmAllocateNicForStartingVmFlow.java and update it to "this case should not
happen"); keep the comment punctuation and spacing consistent with surrounding
comments and commit as a small cleanup.

In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeCacheVolumeFlow.java`:
- Line 32: The logger is instantiated with the wrong class; update the logger
declaration in VmExpungeCacheVolumeFlow by replacing
Utils.getLogger(VmExpungeRootVolumeFlow.class) with
Utils.getLogger(VmExpungeCacheVolumeFlow.class) so the CLogger references the
current class (update the private static final CLogger logger assignment
accordingly).
- Around line 64-66: The warning log in VmExpungeCacheVolumeFlow is using
spec.getVmInventory().getRootVolumeUuid() which is the root volume UUID instead
of the cache volume being processed; update the logger.warn call to log the
cache volume's UUID by using vol.getUuid() (keep the rest of the message the
same and continue to include the VM UUID/name via
spec.getVmInventory().getUuid()/getName() and reply.getError() for the error
details).
- Around line 40-42: 在 VmExpungeCacheVolumeFlow 中,当
spec.getVmInventory().getAllVolumes() 返回 null 时当前只调用了 trigger.next() 但没有
return,导致后续对 getAllVolumes().stream() 的调用抛出 NPE;修改方法是在调用 trigger.next()
后立即返回以中止当前 flow(或改为使用 if (getAllVolumes() == null) { trigger.next(); return;
}),确保在执行 stream() 前 getAllVolumes() 不为 null。

In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeMemoryVolumeFlow.java`:
- Around line 67-69: The log message in VmExpungeMemoryVolumeFlow incorrectly
references "root volume" and uses spec.getVmInventory().getRootVolumeUuid();
update the warning in the exception handling to reference the actual memory
volume by using vol.getUuid() and change the text to "memory volume" (e.g.,
"failed to expunge the memory volume[uuid:%s] of the vm[uuid:%s, name:%s], %s")
so that the flow (VmExpungeMemoryVolumeFlow) correctly records the volume UUID
and accurate operation context.
- Around line 40-42: In VmExpungeMemoryVolumeFlow, when
spec.getVmInventory().getAllVolumes() == null you call trigger.next() but do not
return, so execution continues and a subsequent call to
spec.getVmInventory().getAllVolumes().stream() can NPE; fix by returning
immediately after trigger.next() (add a return after the trigger.next() call) so
the method exits when getAllVolumes() is null.
- Line 32: The logger is instantiated with the wrong class reference: replace
the incorrect VmExpungeRootVolumeFlow.class passed to Utils.getLogger with
VmExpungeMemoryVolumeFlow.class so the static field logger in
VmExpungeMemoryVolumeFlow references the correct class; update the line that
declares private static final CLogger logger = Utils.getLogger(...) accordingly.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceManagerImpl.java`:
- Around line 1558-1580: The new NoRollbackFlow named "start-vm" can fail and
skip the DeleteVmGC submission path used by
instantiate-new-created-vmInstance.rollback(), leaving orphaned VMs/GPU
allocations; modify the failure handling so that when StartVmInstanceMsg in the
"start-vm" flow fails you always enqueue a DeleteVmGC for instantiateVm (use the
same DeleteVmGC submission code/path as in
instantiate-new-created-vmInstance.rollback()), and if DestroyVmInstanceMsg
returns a non-GC_ELIGIBLE error follow the same logic to emit the canonical
event instead of silently logging; update the failure branch in the flow's
CloudBusCallBack (and/or the rollback hook tied to instantiateVm/finalVo) to
reuse the DeleteVmGC submission and GC_ELIGIBLE handling so orphan cleanup is
consistent.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstantiateOtherDiskFlow.java`:
- Around line 150-195: In setupCreateVolumeFromDiskSizeFlows, the rollback
currently uses diskAO.getSize() which can be 0 when size came from diskSize;
change the ReleasePrimaryStorageSpaceMsg to setDiskSize(diskSize) (use the same
diskSize passed into AllocatePrimaryStorageSpaceMsg) and keep
allocatedInstallUrl and allocatedPrimaryStorageUuid as-is so the rollback
releases the exact amount allocated; update any references to diskAO.getSize()
in the rollback block accordingly.
- Around line 438-462: The flow uses NoRollbackFlow for attaching external media
via VmAttachOtherDiskExtensionPoint.attachOtherDiskToVm, leaving no compensation
path if later steps fail; change to a rollback-capable Flow (not NoRollbackFlow)
and invoke the extension point’s detach on rollback or add a symmetric detach
callback on VmAttachOtherDiskExtensionPoint that the Flow calls from its
rollback method, ensuring any state created by attachOtherDiskToVm is cleaned up
and any needed compensation data (e.g., disk/attachment identifiers) is saved in
the Flow’s data map for the rollback to use.
- Around line 120-123: The flow currently treats any VolumeInventory in
VmInstantiateOtherDiskFlow (set when isAttachDataVolume() and via
setupAttachVolumeFlows()) as if it was created by this flow and unconditionally
sends DeleteVolumeMsg during rollback; change this so the code distinguishes
volumes created by this flow from externally provided volumes: when you load a
volume by diskAO.getSourceUuid() in the isAttachDataVolume() branch, mark it as
external (e.g., set a boolean like volumeCreatedByFlow=false or store
externalVolumeUuid) and ensure setupAttachVolumeFlows() does not mark it as
newly created; when creating volumes in other branches mark them
volumeCreatedByFlow=true; then update rollback() (the block that currently sends
DeleteVolumeMsg for volumeInventory) to only send DeleteVolumeMsg if
volumeCreatedByFlow==true, and for external volumes perform the correct
compensation (run detach logic or skip deletion) instead of deleting; keep
references to VmInstantiateOtherDiskFlow, isAttachDataVolume(),
setupAttachVolumeFlows(), volumeInventory, rollback(), and DeleteVolumeMsg to
locate the changes.

In `@core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java`:
- Around line 145-163: The formatting of the anonymous NoRollbackFlow's name
field is inconsistent: update the declaration String __name__= "init-server";
inside the NoRollbackFlow (the anonymous class passed to flow) to match the
project's style by adding a space before the equals sign (String __name__ =
"init-server";), ensuring consistency with other occurrences like the __name__ =
usage earlier.

In
`@header/src/main/java/org/zstack/header/volume/APICreateDataVolumeFromVolumeSnapshotMsg.java`:
- Line 54: The class APICreateDataVolumeFromVolumeSnapshotMsg has its
`@DefaultTimeout` annotation changed to 3 hours (TimeUnit.HOURS, value = 3) but
the `@msg` example in the class Javadoc still shows the old timeout value
(1800000); either update the example value to the equivalent millisecond value
(10800000) to reflect the new default or revert/adjust the `@DefaultTimeout`
annotation to the intended timeout (e.g., value matching 1800000ms) so code and
example stay consistent—locate the `@DefaultTimeout` on
APICreateDataVolumeFromVolumeSnapshotMsg and the `@msg` example in the class
comment and make the matching change.

In `@image/src/main/java/org/zstack/image/ImageBase.java`:
- Around line 492-535: The current NoRollbackFlow that iterates
IsoOperator.getVmUuidByIsoUuid and sends DetachIsoFromVmInstanceMsg via bus.send
can leave partial state because it calls trigger.fail() on any error without
compensating for earlier successful detaches; change the flow to track
successfulVmUuids (List<String>) as you iterate, and if any detach fails use
those recorded UUIDs to send compensating attach messages (e.g., create
AttachIsoToVmInstanceMsg per UUID and bus.send them) before calling
trigger.fail(), or alternatively convert this step into a pure validation check
that only verifies all VMs can accept detach (no state change) and move actual
DetachIsoFromVmInstanceMsg logic into a rollback-capable flow; update references
in this block: NoRollbackFlow, IsoOperator.getVmUuidByIsoUuid,
DetachIsoFromVmInstanceMsg, bus.send, trigger.fail, trigger.next accordingly.

In `@network/src/main/java/org/zstack/network/l3/NormalIpRangeFactory.java`:
- Around line 132-185: The two NoRollbackFlow steps "enable-sdn-dhcp" and
"add-sdn-subnet" leave external SDN controller state if later steps fail; add
compensating cleanup or schedule async GC/reconcile on failure. Modify the flow
chain so that either these steps are implemented as RollbackFlow with rollback
logic that reverses SdnControllerUpdateDHCPMsg and SdnControllerL3.createIpRange
(e.g., send delete/disable messages to the SDN controller using the same service
IDs / resource UUIDs), or, in the enclosing error handler, detect whether
enable-sdn-dhcp and/or add-sdn-subnet succeeded and then invoke async cleanup
tasks (send delete DHCP msg and call SdnControllerL3.deleteIpRange or schedule
reconcile) for the resources created by SdnControllerUpdateDHCPMsg and
SdnControllerL3.createIpRange to ensure external state is removed or reconciled
on failure.

---

Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/host/HostBase.java`:
- Around line 630-631: The DEBT comment incorrectly references
stopFailedToMigrateVms while the flow being declared is a NoRollbackFlow with
__name__ "run-extension-point-after-maintenance"; update the comment to
accurately describe this flow/context (e.g., "DEBT: NoRollbackFlow — in
run-extension-point-after-maintenance") so it matches the flow's __name__ and
avoids confusion when inspecting HostBase, NoRollbackFlow, and the flow block
with __name__ "run-extension-point-after-maintenance".
- Around line 1391-1412: The DEBT comments incorrectly reference "in
getSyncLevel" even though these NoRollbackFlow instances in HostBase pertain to
the "check-conditions-of-connection" and "connect-host" flows; update the
comments to accurately describe the debt (e.g., "DEBT: NoRollbackFlow — consider
making this a rollback-capable flow" or mention the specific flow purpose)
adjacent to the flow declarations for "check-conditions-of-connection" and
"connect-host" (in HostBase) so they no longer incorrectly reference
getSyncLevel.
- Around line 576-577: Replace vague DEBT "reason TBD" comments on the flow(new
NoRollbackFlow()) usages in HostBase (and other DEBT entries in the same class)
with concrete guidance: either state the precise reason why rollback is
unnecessary (e.g., "no state to rollback" or "idempotent operation") or add a
reference to a tracking issue (e.g., "investigate rollback behavior — see
ISSUE-1234") so future maintainers have actionable context; update the comment
text adjacent to the NoRollbackFlow instantiation and any other DEBT comments in
this class accordingly.

In `@compute/src/main/java/org/zstack/compute/vm/VmDeleteVolumeFlow.java`:
- Line 90: 在 VmDeleteVolumeFlow 类中把现有“// DEBT: NoRollbackFlow — reason
TBD”注释替换为可执行的上下文说明:在注释中明确指出为何该 Flow 必须使用
NoRollbackFlow(例如资源已被物理删除无法回滚、幂等性保障或已在上层事务中处理回滚),列出相关风险边界(什么情况下会导致资源泄漏或不一致)并引用相关工单或设计文档编号;如果决定不是永久方案,请注明预期的整改时间点或条件。定位符:VmDeleteVolumeFlow
类和该包含 “NoRollbackFlow” 的注释位置,直接更新注释内容以包含原因、风险和工单/文档引用。

In `@core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java`:
- Around line 125-143: Replace the vague comment "// DEBT: NoRollbackFlow — in
url" with a clear TODO that states why this is technical debt and what to
change: mention that the anonymous NoRollbackFlow named "echo-server" uses
restf.echo(url(ECHO_PATH)) inside run(FlowTrigger, Map) and that the problem is
either using an anonymous flow, mixing URL construction with flow logic, or
missing rollback semantics; specify the desired fix (e.g., extract URL
construction to a helper, make a named class instead of anonymous
NoRollbackFlow, or add proper rollback behavior) and reference the affected
symbols (NoRollbackFlow, __name__ = "echo-server", run, restf.echo,
url(ECHO_PATH), Completion) so future maintainers know the exact location and
intended refactor.

In `@core/src/main/java/org/zstack/core/cascade/CascadeFacadeImpl.java`:
- Line 231: 把只是写“DEBT: NoRollbackFlow”的注释替换为有执行上下文的说明:在 CascadeFacadeImpl 的
asyncCascade 路径(标记为
NoRollbackFlow)的地方,补充一到两句说明为什么这里可以接受无回滚(例如幂等性、外部补偿机制或可接受的不一致窗口)、相关风险/约束(可能的数据不一致场景、重试/补偿触发条件)以及关联跟踪项(Issue/任务编号或链接)以便后续可追溯;定位符包括类名
CascadeFacadeImpl、方法/流程标识 asyncCascade 和注释中的 NoRollbackFlow。

In `@core/src/main/java/org/zstack/core/config/GlobalConfigFacadeImpl.java`:
- Around line 117-149: 在这段 chain.then(new NoRollbackFlow() { ... }) 的注释“DEBT:
NoRollbackFlow — reason TBD”中写出可执行的债务说明:说明为什么这里必须使用
NoRollbackFlow(例如该流程不可回滚因为它异步对所有 ManagementNode 发送 RefreshGuestOsMetadataMsg
并在任一失败时仍需继续或统一汇总错误),列出当前的约束条件(依赖的消息类型 RefreshGuestOsMetadataMsg、使用 While<> 并汇总
ErrorCodeList 的行为、对 bus.send/CloudBusCallBack
的异常处理方式),以及何种条件下可以安全替换为可回滚流程(比如当消息调用改为幂等且支持事务补偿、或改为逐节点事务化并能回滚时),并在注释末尾加入一个跟进入口(例如关联的
issue/PR 编号或负责人)以便将来跟进和验证替换。

In `@core/src/main/java/org/zstack/core/errorcode/ElaborationManagerImpl.java`:
- Around line 131-132: Replace the vague "DEBT: NoRollbackFlow — reason TBD"
comments around the NoRollbackFlow anonymous flows in ElaborationManagerImpl
with specific, actionable debt notes: state why using NoRollbackFlow is
potentially risky here (e.g., "skips rollback for X operation such as file
scanning/state mutation"), describe what should be evaluated later (e.g.,
"determine if rollback is required when Y fails or add compensating action"),
and include an owner or ticket reference for follow-up; update each occurrence
adjacent to the flow(...) / new NoRollbackFlow() blocks so future developers can
understand the debt and planned remediation.

In `@core/src/main/java/org/zstack/core/gc/GarbageCollectorVO.java`:
- Line 8: 在 GarbageCollectorVO 中把通配符导入 javax.persistence.*
替换为显式导入所用的注解/类型以保持导入风格一致;具体在该文件(类 GarbageCollectorVO)中移除 javax.persistence.*
并添加显式导入 Entity、Table、Column、Enumerated、EnumType 和 PreUpdate(仅导入实际使用到的类型)。

In `@network/src/main/java/org/zstack/network/l2/L2VlanNetworkFactory.java`:
- Around line 91-92: 将当前位于匿名 Flow 类体内、写作 “// DEBT: NoRollbackFlow — in rollback”
的债务注释移到 then(new NoRollbackFlow() { 这一行的正上方;具体来说:在包含匿名 Flow 的类体结束处删除该注释,并在调用
then(new NoRollbackFlow() { 之前插入该注释,以便该注释明确关联到 NoRollbackFlow 而非前一个 Flow 的
rollback() 方法(定位符:匿名 Flow 的 rollback() 与 NoRollbackFlow)。

In
`@network/src/main/java/org/zstack/network/l3/FirstAvailableIpAllocatorStrategy.java`:
- Line 3: Remove the unused import
org.apache.commons.math3.analysis.function.Add from
FirstAvailableIpAllocatorStrategy by deleting that import statement; confirm no
other references to Add exist in the class (FirstAvailableIpAllocatorStrategy)
and run a build/IDE cleanup to ensure no unused imports remain.

In `@network/src/main/java/org/zstack/network/l3/L3NetworkManagerImpl.java`:
- Around line 619-620: 在 L3NetworkManagerImpl 中对使用 NoRollbackFlow 的注释过于模糊(当前为
"reason TBD"),请替换该注释为明确说明为何选择 NoRollbackFlow(例如:该 flow 只做非持久化的本地状态变更或外部 SDN
操作无法回滚到数据库事务),并列出后续改进点(如在未来加入补偿逻辑、引入持久化回滚或对 SDN 控制器进行状态回滚方案)。确保注释引用
NoRollbackFlow 和相关上下文以便后续维护者理解设计权衡和待办项。

public class GlobalErrorCodeI18nServiceImpl implements GlobalErrorCodeI18nService, Component {
private static final CLogger logger = Utils.getLogger(GlobalErrorCodeI18nServiceImpl.class);

private static final String I18N_FOLDER = "i18n" + File.separator + "globalErrorCodeMapping";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

类路径资源路径应使用 / 而非 File.separator

File.separator 在 Windows 上返回 \,但类路径资源始终使用 / 作为分隔符,这可能导致在 Windows 环境下文件加载失败。

🐛 建议修复
-    private static final String I18N_FOLDER = "i18n" + File.separator + "globalErrorCodeMapping";
+    private static final String I18N_FOLDER = "i18n/globalErrorCodeMapping";
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
private static final String I18N_FOLDER = "i18n" + File.separator + "globalErrorCodeMapping";
private static final String I18N_FOLDER = "i18n/globalErrorCodeMapping";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@core/src/main/java/org/zstack/core/errorcode/GlobalErrorCodeI18nServiceImpl.java`
at line 24, The resource path constant I18N_FOLDER in
GlobalErrorCodeI18nServiceImpl currently uses File.separator which yields
platform-specific separators; change the constant to use the classpath separator
'/' instead (e.g., "i18n/globalErrorCodeMapping") so classpath resource loading
works on Windows; update any usages of I18N_FOLDER in the class to rely on the
new value.

Comment on lines +61 to +63
@SuppressWarnings("unchecked")
Map<String, String> messages = JSONObjectUtil.toObject(content, LinkedHashMap.class);
localeMessages.put(locale, messages);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

JSON 反序列化存在类型安全风险。

将 JSON 反序列化为 LinkedHashMap.class 时丢失了泛型类型信息。如果 JSON 文件中包含非字符串值(如嵌套对象或数字),在运行时访问 map 时会抛出 ClassCastException

🛡️ 建议添加类型校验或使用 TypeToken
                 `@SuppressWarnings`("unchecked")
                 Map<String, String> messages = JSONObjectUtil.toObject(content, LinkedHashMap.class);
+                // Validate all values are strings to fail fast on malformed JSON
+                for (Map.Entry<String, String> entry : messages.entrySet()) {
+                    if (!(entry.getValue() instanceof String)) {
+                        throw new IllegalArgumentException("Invalid value type for key: " + entry.getKey());
+                    }
+                }
                 localeMessages.put(locale, messages);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@core/src/main/java/org/zstack/core/errorcode/GlobalErrorCodeI18nServiceImpl.java`
around lines 61 - 63, The JSON deserialization into LinkedHashMap.class in
JSONObjectUtil.toObject(content, LinkedHashMap.class) loses generic type info
and can cause ClassCastException when values aren't strings; update the
deserialization to preserve generics (e.g., use a TypeReference/TypeToken for
LinkedHashMap<String,String> or the library's equivalent) when constructing
Map<String,String> messages, and add a runtime validation step before putting
into localeMessages to ensure every key and value are strings (or convert
non-string values to their string representation) so localeMessages.put(locale,
messages) is safe; target the JSONObjectUtil.toObject call, the
Map<String,String> messages variable and the localeMessages.put usage for the
change.

Comment on lines +70 to +72
} catch (Exception e) {
logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage()));
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

外层异常日志丢失了堆栈信息。

与内层 catch 块(Line 67)不同,外层 catch 块只记录了异常消息,未包含异常对象本身,这会导致调试时缺少堆栈跟踪信息。

📝 建议修复
         } catch (Exception e) {
-            logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage()));
+            logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage()), e);
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} catch (Exception e) {
logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage()));
}
} catch (Exception e) {
logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage()), e);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@core/src/main/java/org/zstack/core/errorcode/GlobalErrorCodeI18nServiceImpl.java`
around lines 70 - 72, The outer catch in GlobalErrorCodeI18nServiceImpl
currently logs only e.getMessage() losing the stacktrace; update the logger.warn
call in that catch block (the one currently doing
logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage()))) to
pass the exception object itself so the stack trace is recorded (e.g.,
logger.warn("failed to scan i18n folder", e)); keep the original message text
for context and only change the logger invocation to include the Exception
parameter.

Comment on lines +3 to +5
import org.zstack.header.vo.ResourceVO;
import org.zstack.header.vo.SoftDeletionCascade;
import org.zstack.header.vo.SoftDeletionCascades;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== AgentVersionVO: annotation and fields =="
rg -n -C2 'class AgentVersionVO|SoftDeletionCascade|private String uuid|resourceUuid' core/src/main/java/org/zstack/core/upgrade/AgentVersionVO.java

echo
echo "== Repo-wide ResourceVO soft-delete mappings =="
rg -n -C1 '@SoftDeletionCascade\(parent\s*=\s*ResourceVO\.class,\s*joinColumn\s*=\s*"[^"]+"\)' --glob '**/*.java'

echo
echo "== Repo-wide declarations/usages of resourceUuid =="
rg -n -C1 '\bresourceUuid\b' --glob '**/*.java'

Repository: MatheMatrix/zstack

Length of output: 50375


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== SoftDeletionCascade annotation definition =="
rg -n 'public `@interface` SoftDeletionCascade|joinColumn' --glob '**/*.java' -A5 -B2 | head -50

echo
echo "== KvmHypervisorInfoVO: field structure =="
sed -n '1,50p' plugin/kvm/src/main/java/org/zstack/kvm/hypervisor/datatype/KvmHypervisorInfoVO.java

echo
echo "== SshKeyPairRefVO: field structure =="
rg -n -A30 'public class SshKeyPairRefVO' plugin/sshKeyPair/src/main/java/org/zstack/header/sshkeypair/SshKeyPairRefVO.java | head -50

Repository: MatheMatrix/zstack

Length of output: 8415


SoftDeletionCascade 的关联列与实体字段不匹配。

AgentVersionVO 只声明了 uuid 字段,但在 Line 15 的软删注解中使用了 joinColumn = "resourceUuid"。查看仓库内其他例子(如 KvmHypervisorInfoVO 使用 joinColumn = "uuid" 对应其 uuid 字段、SshKeyPairRefVO 使用 joinColumn = "resourceUuid" 对应其 resourceUuid 字段),joinColumn 必须引用类中实际存在的字段。这个不匹配会导致软删级联无法找到对应的记录。

建议修正
 `@SoftDeletionCascades`({
-        `@SoftDeletionCascade`(parent = ResourceVO.class, joinColumn = "resourceUuid")
+        `@SoftDeletionCascade`(parent = ResourceVO.class, joinColumn = "uuid")
 })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/upgrade/AgentVersionVO.java` around lines
3 - 5, AgentVersionVO's SoftDeletionCascade references joinColumn =
"resourceUuid" which doesn't exist on the class (it only declares uuid); update
the annotation to reference the actual field or add the missing
field—specifically, in the AgentVersionVO class locate the SoftDeletionCascade
annotation (joinColumn = "resourceUuid") and change joinColumn to "uuid" to
match the existing uuid field (or alternatively add a resourceUuid property if
semantics require it) so the soft-delete cascade can correctly find related
records.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants