<fix>[vm]: ZSTAC-68874 always submit DeleteVmGC on create rollback#3464
<fix>[vm]: ZSTAC-68874 always submit DeleteVmGC on create rollback#3464MatheMatrix wants to merge 1 commit into5.5.12from
Conversation
Resolves: ZSTAC-68874 Change-Id: I49fd973a29ddea3299acfca2d9c70f88ea5e7906
Walkthrough回滚路径中对虚拟机删除的垃圾回收提交由有条件改为无条件提交;大量文件将通用的通配符导入替换为显式导入;在若干流程处加入“DEBT: NoRollbackFlow”注释;新增全局错误码国际化实现并替换 LocaleUtils;对若干 VO/消息添加注解/超时标注。 Changes
预估代码审查工作量🎯 3 (中等) | ⏱️ ~20 分钟 诗
✨ Finishing Touches
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/VmCreateOnHypervisorFlow.java`:
- Around line 85-96: In VmCreateOnHypervisorFlow, the DeleteVmGC is being
registered with the wrong host UUID: change gc.hostUuid from
spec.getVmInventory().getHostUuid() to spec.getDestHost().getUuid() so the GC
(DeleteVmGC) runs and listens for host events on the actual destination host
where the VM was created; update the assignment near gcName creation (refer to
DeleteVmGC, gc.hostUuid, gcName and the logger.debug call) to use
spec.getDestHost().getUuid().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: cc5753a6-8268-427c-b9cd-3e36cd1b39ce
📒 Files selected for processing (1)
compute/src/main/java/org/zstack/compute/vm/VmCreateOnHypervisorFlow.java
| String gcName = String.format("gc-vm-%s-on-host-%s", spec.getVmInventory().getUuid(), spec.getDestHost().getUuid()); | ||
|
|
||
| DeleteVmGC gc = new DeleteVmGC(); | ||
| gc.NAME = gcName; | ||
| gc.hostUuid = spec.getVmInventory().getHostUuid(); | ||
| gc.inventory = spec.getVmInventory(); | ||
| if (gc.existedAndNotCompleted()) { | ||
| logger.debug(String.format("There is already a DeleteVmGC of vm[uuid:%s] " + | ||
| "on host[uuid:%s], skip.", spec.getVmInventory().getUuid(), spec.getDestHost().getUuid())); | ||
| } else { | ||
| gc.submit(); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# 查找 DeleteVmGC 类的实现
ast-grep --pattern $'class DeleteVmGC {
$$$
}'
# 查找 existedAndNotCompleted 方法的实现
rg -n -A 10 'existedAndNotCompleted' --type javaRepository: MatheMatrix/zstack
Length of output: 6777
🏁 Script executed:
#!/bin/bash
# 查找 DeleteVmGC 类的定义和实现
find . -type f -name "*.java" | xargs grep -l "class DeleteVmGC"
# 查看 DeleteVmGC 的完整实现
rg -B 5 -A 30 "class DeleteVmGC" --type javaRepository: MatheMatrix/zstack
Length of output: 3642
🏁 Script executed:
#!/bin/bash
# 查看 VmCreateOnHypervisorFlow 完整代码段
rg -B 20 -A 10 "setHostUuid" compute/src/main/java/org/zstack/compute/vm/VmCreateOnHypervisorFlow.javaRepository: MatheMatrix/zstack
Length of output: 3295
gc.hostUuid 应使用 spec.getDestHost().getUuid() 而非 spec.getVmInventory().getHostUuid(),确保 GC 任务在正确的主机上执行。
在 VM 创建失败的回滚流程中,VM 被创建在 destHost 上。DeleteVmGC 的 hostUuid 字段被用于:
- 匹配主机事件(
HOST_DELETED_PATH、HOST_STATUS_CHANGED_PATH) - 查询主机状态以决定是否执行清理
第 85 行的 gcName 和第 93 行的日志都正确地使用了 spec.getDestHost().getUuid(),但第 89 行的 gc.hostUuid 使用了 spec.getVmInventory().getHostUuid(),导致 GC 可能在错误的主机上执行清理或无法正确监听主机事件。
建议修复
DeleteVmGC gc = new DeleteVmGC();
gc.NAME = gcName;
- gc.hostUuid = spec.getVmInventory().getHostUuid();
+ gc.hostUuid = spec.getDestHost().getUuid();
gc.inventory = spec.getVmInventory();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@compute/src/main/java/org/zstack/compute/vm/VmCreateOnHypervisorFlow.java`
around lines 85 - 96, In VmCreateOnHypervisorFlow, the DeleteVmGC is being
registered with the wrong host UUID: change gc.hostUuid from
spec.getVmInventory().getHostUuid() to spec.getDestHost().getUuid() so the GC
(DeleteVmGC) runs and listens for host events on the actual destination host
where the VM was created; update the assignment near gcName creation (refer to
DeleteVmGC, gc.hostUuid, gcName and the logger.debug call) to use
spec.getDestHost().getUuid().
e79c739 to
8797213
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (15)
compute/src/main/java/org/zstack/compute/vm/VmExpungeCacheVolumeFlow.java (3)
32-32:⚠️ Potential issue | 🟡 MinorLogger 使用了错误的类引用。
当前 logger 使用
VmExpungeRootVolumeFlow.class,应该使用VmExpungeCacheVolumeFlow.class。🐛 建议修复
- private static final CLogger logger = Utils.getLogger(VmExpungeRootVolumeFlow.class); + private static final CLogger logger = Utils.getLogger(VmExpungeCacheVolumeFlow.class);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeCacheVolumeFlow.java` at line 32, The logger is instantiated with the wrong class; update the logger declaration in VmExpungeCacheVolumeFlow by replacing Utils.getLogger(VmExpungeRootVolumeFlow.class) with Utils.getLogger(VmExpungeCacheVolumeFlow.class) so the CLogger references the current class (update the private static final CLogger logger assignment accordingly).
64-66:⚠️ Potential issue | 🟡 Minor日志中记录的卷 UUID 不正确。
日志记录的是
getRootVolumeUuid(),但实际正在处理的是 cache volume,应该使用vol.getUuid()记录当前处理的缓存卷 UUID。🐛 建议修复
if (!reply.isSuccess()) { logger.warn(String.format("failed to expunge the cache volume[uuid:%s] of the vm[uuid:%s, name:%s], %s", - spec.getVmInventory().getRootVolumeUuid(), spec.getVmInventory().getUuid(), + vol.getUuid(), spec.getVmInventory().getUuid(), spec.getVmInventory().getName(), reply.getError()));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeCacheVolumeFlow.java` around lines 64 - 66, The warning log in VmExpungeCacheVolumeFlow is using spec.getVmInventory().getRootVolumeUuid() which is the root volume UUID instead of the cache volume being processed; update the logger.warn call to log the cache volume's UUID by using vol.getUuid() (keep the rest of the message the same and continue to include the VM UUID/name via spec.getVmInventory().getUuid()/getName() and reply.getError() for the error details).
40-42:⚠️ Potential issue | 🔴 Critical缺少
return语句导致潜在的 NullPointerException。与
VmExpungeMemoryVolumeFlow相同的问题:当getAllVolumes()返回null时,流程会继续执行到第 44 行,对null调用stream()抛出 NPE。🐛 建议修复
if (spec.getVmInventory().getAllVolumes() == null) { trigger.next(); + return; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeCacheVolumeFlow.java` around lines 40 - 42, 在 VmExpungeCacheVolumeFlow 中,当 spec.getVmInventory().getAllVolumes() 返回 null 时当前只调用了 trigger.next() 但没有 return,导致后续对 getAllVolumes().stream() 的调用抛出 NPE;修改方法是在调用 trigger.next() 后立即返回以中止当前 flow(或改为使用 if (getAllVolumes() == null) { trigger.next(); return; }),确保在执行 stream() 前 getAllVolumes() 不为 null。compute/src/main/java/org/zstack/compute/vm/VmExpungeMemoryVolumeFlow.java (3)
67-69:⚠️ Potential issue | 🟡 Minor日志消息内容与实际操作不符。
日志中写的是 "root volume",但这个流程处理的是 memory volume。同时
getRootVolumeUuid()应该改为vol.getUuid()以正确记录实际处理的卷 UUID。🐛 建议修复
- logger.warn(String.format("failed to expunge the root volume[uuid:%s] of the vm[uuid:%s, name:%s], %s", - spec.getVmInventory().getRootVolumeUuid(), spec.getVmInventory().getUuid(), + logger.warn(String.format("failed to expunge the memory volume[uuid:%s] of the vm[uuid:%s, name:%s], %s", + vol.getUuid(), spec.getVmInventory().getUuid(), spec.getVmInventory().getName(), reply.getError()));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeMemoryVolumeFlow.java` around lines 67 - 69, The log message in VmExpungeMemoryVolumeFlow incorrectly references "root volume" and uses spec.getVmInventory().getRootVolumeUuid(); update the warning in the exception handling to reference the actual memory volume by using vol.getUuid() and change the text to "memory volume" (e.g., "failed to expunge the memory volume[uuid:%s] of the vm[uuid:%s, name:%s], %s") so that the flow (VmExpungeMemoryVolumeFlow) correctly records the volume UUID and accurate operation context.
40-42:⚠️ Potential issue | 🔴 Critical缺少
return语句导致潜在的 NullPointerException。当
getAllVolumes()返回null时,trigger.next()被调用后流程没有返回,代码会继续执行到第 44 行,对null调用stream()方法将抛出 NPE。🐛 建议修复
if (spec.getVmInventory().getAllVolumes() == null) { trigger.next(); + return; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeMemoryVolumeFlow.java` around lines 40 - 42, In VmExpungeMemoryVolumeFlow, when spec.getVmInventory().getAllVolumes() == null you call trigger.next() but do not return, so execution continues and a subsequent call to spec.getVmInventory().getAllVolumes().stream() can NPE; fix by returning immediately after trigger.next() (add a return after the trigger.next() call) so the method exits when getAllVolumes() is null.
32-32:⚠️ Potential issue | 🟡 MinorLogger 使用了错误的类引用。
当前 logger 使用
VmExpungeRootVolumeFlow.class,应该使用VmExpungeMemoryVolumeFlow.class。这是复制粘贴遗留的错误,会导致日志归类不正确。🐛 建议修复
- private static final CLogger logger = Utils.getLogger(VmExpungeRootVolumeFlow.class); + private static final CLogger logger = Utils.getLogger(VmExpungeMemoryVolumeFlow.class);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeMemoryVolumeFlow.java` at line 32, The logger is instantiated with the wrong class reference: replace the incorrect VmExpungeRootVolumeFlow.class passed to Utils.getLogger with VmExpungeMemoryVolumeFlow.class so the static field logger in VmExpungeMemoryVolumeFlow references the correct class; update the line that declares private static final CLogger logger = Utils.getLogger(...) accordingly.compute/src/main/java/org/zstack/compute/vm/VmAllocateNicForStartingVmFlow.java (1)
192-192:⚠️ Potential issue | 🟡 Minor注释拼写错误:
happend→happen根据编码规范,注释应使用正确的英文拼写。建议在此次导入清理时一并修复。
🔧 建议修复
- /* this case should not happend */ + /* this case should not happen */🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmAllocateNicForStartingVmFlow.java` at line 192, Fix the typo in the inline comment inside the VmAllocateNicForStartingVmFlow class: change "happend" to "happen" (search for the comment string "this case should not happend" in VmAllocateNicForStartingVmFlow.java and update it to "this case should not happen"); keep the comment punctuation and spacing consistent with surrounding comments and commit as a small cleanup.header/src/main/java/org/zstack/header/volume/APICreateDataVolumeFromVolumeSnapshotMsg.java (1)
54-54:⚠️ Potential issue | 🟡 Minor默认超时改成 3 小时后,示例里的
timeout已经过期。当前类注释里的
@msg示例仍是1800000,但 Line 54 的默认值已经变成了 3 小时,即10800000。如果 3 小时是预期行为,请同步更新示例文档;如果不是,这里的注解值需要再确认。📝 建议同步的最小修改
- * "timeout": 1800000, + * "timeout": 10800000,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@header/src/main/java/org/zstack/header/volume/APICreateDataVolumeFromVolumeSnapshotMsg.java` at line 54, The class APICreateDataVolumeFromVolumeSnapshotMsg has its `@DefaultTimeout` annotation changed to 3 hours (TimeUnit.HOURS, value = 3) but the `@msg` example in the class Javadoc still shows the old timeout value (1800000); either update the example value to the equivalent millisecond value (10800000) to reflect the new default or revert/adjust the `@DefaultTimeout` annotation to the intended timeout (e.g., value matching 1800000ms) so code and example stay consistent—locate the `@DefaultTimeout` on APICreateDataVolumeFromVolumeSnapshotMsg and the `@msg` example in the class comment and make the matching change.core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java (1)
145-163:⚠️ Potential issue | 🟡 Minor修正代码格式不一致:Line 147 缺少空格
Line 147 的
__name__=与 Line 131 的__name__ =格式不一致,应在=前添加空格。🛠️ 建议的修复方案
// DEBT: NoRollbackFlow — in url flow(new NoRollbackFlow() { - String __name__= "init-server"; + String __name__ = "init-server"; `@Override` public void run(FlowTrigger trigger, Map data) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java` around lines 145 - 163, The formatting of the anonymous NoRollbackFlow's name field is inconsistent: update the declaration String __name__= "init-server"; inside the NoRollbackFlow (the anonymous class passed to flow) to match the project's style by adding a space before the equals sign (String __name__ = "init-server";), ensuring consistency with other occurrences like the __name__ = usage earlier.network/src/main/java/org/zstack/network/l3/NormalIpRangeFactory.java (1)
132-185:⚠️ Potential issue | 🟠 Major请为 SDN 侧副作用补偿/GC,否则失败后会留下外部脏状态。
这里把
enable-sdn-dhcp和add-sdn-subnet都放进了NoRollbackFlow。如果前者已经成功、后者再失败,链路只会回滚本地save-db,但 SDN 控制器侧的 DHCP/子网状态不会被清理,最终会出现数据库与控制器状态不一致,后续重试也可能变得不确定。建议至少为这两个步骤补充补偿回滚,或者在失败时提交异步 GC / reconcile。🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@network/src/main/java/org/zstack/network/l3/NormalIpRangeFactory.java` around lines 132 - 185, The two NoRollbackFlow steps "enable-sdn-dhcp" and "add-sdn-subnet" leave external SDN controller state if later steps fail; add compensating cleanup or schedule async GC/reconcile on failure. Modify the flow chain so that either these steps are implemented as RollbackFlow with rollback logic that reverses SdnControllerUpdateDHCPMsg and SdnControllerL3.createIpRange (e.g., send delete/disable messages to the SDN controller using the same service IDs / resource UUIDs), or, in the enclosing error handler, detect whether enable-sdn-dhcp and/or add-sdn-subnet succeeded and then invoke async cleanup tasks (send delete DHCP msg and call SdnControllerL3.deleteIpRange or schedule reconcile) for the resources created by SdnControllerUpdateDHCPMsg and SdnControllerL3.createIpRange to ensure external state is removed or reconciled on failure.image/src/main/java/org/zstack/image/ImageBase.java (1)
492-535:⚠️ Potential issue | 🟠 Major这段删除前置卸载会在部分失败时留下半完成状态。
这里对多个 VM 逐个卸载 ISO,只要后面任意一个失败就会
trigger.fail(),但前面已经成功卸载的 VM 没有任何补偿。结果会变成:镜像删除失败,但部分 VM 已经丢失了 ISO 挂载。建议至少记录成功卸载的 VM,并在失败路径里补偿恢复;否则这一步更适合作为无副作用的前置校验,而不是直接改动 VM 状态。🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@image/src/main/java/org/zstack/image/ImageBase.java` around lines 492 - 535, The current NoRollbackFlow that iterates IsoOperator.getVmUuidByIsoUuid and sends DetachIsoFromVmInstanceMsg via bus.send can leave partial state because it calls trigger.fail() on any error without compensating for earlier successful detaches; change the flow to track successfulVmUuids (List<String>) as you iterate, and if any detach fails use those recorded UUIDs to send compensating attach messages (e.g., create AttachIsoToVmInstanceMsg per UUID and bus.send them) before calling trigger.fail(), or alternatively convert this step into a pure validation check that only verifies all VMs can accept detach (no state change) and move actual DetachIsoFromVmInstanceMsg logic into a rollback-capable flow; update references in this block: NoRollbackFlow, IsoOperator.getVmUuidByIsoUuid, DetachIsoFromVmInstanceMsg, bus.send, trigger.fail, trigger.next accordingly.compute/src/main/java/org/zstack/compute/vm/VmInstantiateOtherDiskFlow.java (3)
150-195:⚠️ Potential issue | 🟠 Major回滚释放容量时不要再读
diskAO.getSize()。Line 159 申请主存储用的是
diskSize,但 Line 189 回滚释放的是diskAO.getSize()。走diskOfferingUuid分支时,diskAO.getSize()很可能是0,失败回滚会少释放容量,主存储账面空间会被污染。建议修改
`@Override` public void rollback(FlowRollback chain, Map data) { if (!isSuccessAllocatePS) { chain.rollback(); return; } ReleasePrimaryStorageSpaceMsg msg = new ReleasePrimaryStorageSpaceMsg(); - msg.setDiskSize(diskAO.getSize()); + msg.setDiskSize(diskSize); msg.setAllocatedInstallUrl(allocatedInstallUrl); msg.setPrimaryStorageUuid(allocatedPrimaryStorageUuid); bus.makeTargetServiceIdByResourceUuid(msg, PrimaryStorageConstant.SERVICE_ID, allocatedPrimaryStorageUuid); bus.send(msg); chain.rollback();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmInstantiateOtherDiskFlow.java` around lines 150 - 195, In setupCreateVolumeFromDiskSizeFlows, the rollback currently uses diskAO.getSize() which can be 0 when size came from diskSize; change the ReleasePrimaryStorageSpaceMsg to setDiskSize(diskSize) (use the same diskSize passed into AllocatePrimaryStorageSpaceMsg) and keep allocatedInstallUrl and allocatedPrimaryStorageUuid as-is so the rollback releases the exact amount allocated; update any references to diskAO.getSize() in the rollback block accordingly.
438-462:⚠️ Potential issue | 🟠 Major
attachOtherDisk现在没有任何补偿路径。这段改成
NoRollbackFlow后,只要后续步骤失败,外层rollback()既不会回调扩展点做 detach,也没有保存任何可用于补偿的数据;当前外层回滚只会处理volumeInventory。对VmAttachOtherDiskExtensionPoint管理的磁盘/介质,这会遗留宿主机或控制面的关联状态。建议这里保留 rollback-capableFlow,或者给扩展点补一个对称的 detach 回调。Also applies to: 468-493
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmInstantiateOtherDiskFlow.java` around lines 438 - 462, The flow uses NoRollbackFlow for attaching external media via VmAttachOtherDiskExtensionPoint.attachOtherDiskToVm, leaving no compensation path if later steps fail; change to a rollback-capable Flow (not NoRollbackFlow) and invoke the extension point’s detach on rollback or add a symmetric detach callback on VmAttachOtherDiskExtensionPoint that the Flow calls from its rollback method, ensuring any state created by attachOtherDiskToVm is cleaned up and any needed compensation data (e.g., disk/attachment identifiers) is saved in the Flow’s data map for the rollback to use.
120-123:⚠️ Potential issue | 🔴 Critical现有数据盘在失败回滚时会被误删。
Line 121-123 把外部传入的现有卷直接放进了
volumeInventory。现在attach-volume-to-vm改成NoRollbackFlow后,后续任一步失败都会落到外层rollback(),而 Line 474-490 会无条件对volumeInventory发送DeleteVolumeMsg。这会把“用户原本要挂载的现有数据盘”当成“本流程新建的临时卷”删除,而不是执行 detach。至少要区分“本流程创建的卷”和“外部传入的卷”,对后者走补偿卸载或跳过删除。Also applies to: 413-434, 468-493
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmInstantiateOtherDiskFlow.java` around lines 120 - 123, The flow currently treats any VolumeInventory in VmInstantiateOtherDiskFlow (set when isAttachDataVolume() and via setupAttachVolumeFlows()) as if it was created by this flow and unconditionally sends DeleteVolumeMsg during rollback; change this so the code distinguishes volumes created by this flow from externally provided volumes: when you load a volume by diskAO.getSourceUuid() in the isAttachDataVolume() branch, mark it as external (e.g., set a boolean like volumeCreatedByFlow=false or store externalVolumeUuid) and ensure setupAttachVolumeFlows() does not mark it as newly created; when creating volumes in other branches mark them volumeCreatedByFlow=true; then update rollback() (the block that currently sends DeleteVolumeMsg for volumeInventory) to only send DeleteVolumeMsg if volumeCreatedByFlow==true, and for external volumes perform the correct compensation (run detach logic or skip deletion) instead of deleting; keep references to VmInstantiateOtherDiskFlow, isAttachDataVolume(), setupAttachVolumeFlows(), volumeInventory, rollback(), and DeleteVolumeMsg to locate the changes.compute/src/main/java/org/zstack/compute/vm/VmInstanceManagerImpl.java (1)
1558-1580:⚠️ Potential issue | 🟠 Major
start-vm失败后仍可能绕过 DeleteVmGC 补偿。Line 1560 这里新增的
NoRollbackFlow一旦失败,会回滚到前一个instantiate-new-created-vmInstance.rollback();但那个回滚在DestroyVmInstanceMsg失败时只是记录日志然后继续,没有统一提交 DeleteVmGC。这样“VM 已创建成功,但后续挂盘/启动失败”的分支里,仍可能留下宿主机上的孤儿 VM 和被占用的 GPU 资源,和本 PR 要修的问题是同一类泄漏。建议把这条回滚链也复用“无论错误类型都提交 DeleteVmGC,非 GC_ELIGIBLE 再补 canonical event”的处理。🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceManagerImpl.java` around lines 1558 - 1580, The new NoRollbackFlow named "start-vm" can fail and skip the DeleteVmGC submission path used by instantiate-new-created-vmInstance.rollback(), leaving orphaned VMs/GPU allocations; modify the failure handling so that when StartVmInstanceMsg in the "start-vm" flow fails you always enqueue a DeleteVmGC for instantiateVm (use the same DeleteVmGC submission code/path as in instantiate-new-created-vmInstance.rollback()), and if DestroyVmInstanceMsg returns a non-GC_ELIGIBLE error follow the same logic to emit the canonical event instead of silently logging; update the failure branch in the flow's CloudBusCallBack (and/or the rollback hook tied to instantiateVm/finalVo) to reuse the DeleteVmGC submission and GC_ELIGIBLE handling so orphan cleanup is consistent.
🧹 Nitpick comments (12)
core/src/main/java/org/zstack/core/gc/GarbageCollectorVO.java (1)
8-8: 考虑将javax.persistence.*通配符导入替换为显式导入以保持一致性。第 4-5 行已将
org.zstack.header.vo的通配符导入改为显式导入,但第 8 行的javax.persistence.*仍使用通配符。为保持导入风格一致,建议将其改为显式导入实际使用的类:Entity、Table、Column、Enumerated、EnumType、PreUpdate。♻️ 建议的重构
-import javax.persistence.*; +import javax.persistence.Column; +import javax.persistence.Entity; +import javax.persistence.EnumType; +import javax.persistence.Enumerated; +import javax.persistence.PreUpdate; +import javax.persistence.Table;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/zstack/core/gc/GarbageCollectorVO.java` at line 8, 在 GarbageCollectorVO 中把通配符导入 javax.persistence.* 替换为显式导入所用的注解/类型以保持导入风格一致;具体在该文件(类 GarbageCollectorVO)中移除 javax.persistence.* 并添加显式导入 Entity、Table、Column、Enumerated、EnumType 和 PreUpdate(仅导入实际使用到的类型)。core/src/main/java/org/zstack/core/errorcode/ElaborationManagerImpl.java (1)
131-132: 技术债务注释应更具描述性多处
DEBT: NoRollbackFlow — reason TBD注释(如第 131、147、180、203、237、254 行)缺乏实际意义。如果是在追踪技术债务,建议说明:
- 为什么这是技术债务
- 未来需要采取什么行动
例如:
// DEBT: NoRollbackFlow — should evaluate if rollback logic is needed for file scanning errors这样后续开发者能够理解并处理这些标记。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/zstack/core/errorcode/ElaborationManagerImpl.java` around lines 131 - 132, Replace the vague "DEBT: NoRollbackFlow — reason TBD" comments around the NoRollbackFlow anonymous flows in ElaborationManagerImpl with specific, actionable debt notes: state why using NoRollbackFlow is potentially risky here (e.g., "skips rollback for X operation such as file scanning/state mutation"), describe what should be evaluated later (e.g., "determine if rollback is required when Y fails or add compensating action"), and include an owner or ticket reference for follow-up; update each occurrence adjacent to the flow(...) / new NoRollbackFlow() blocks so future developers can understand the debt and planned remediation.compute/src/main/java/org/zstack/compute/vm/VmDeleteVolumeFlow.java (1)
90-90: 把TBD改成可执行的上下文。这条注释现在只说明“这里有债务”,但没有解释为什么这里必须保留
NoRollbackFlow,后续维护时很难判断这是有意设计还是临时绕过。建议补上具体原因、风险边界,或者直接引用对应工单号。As per coding guidelines,“对于较长的注释,需要仔细校对并随代码更新,确保内容正确。”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/vm/VmDeleteVolumeFlow.java` at line 90, 在 VmDeleteVolumeFlow 类中把现有“// DEBT: NoRollbackFlow — reason TBD”注释替换为可执行的上下文说明:在注释中明确指出为何该 Flow 必须使用 NoRollbackFlow(例如资源已被物理删除无法回滚、幂等性保障或已在上层事务中处理回滚),列出相关风险边界(什么情况下会导致资源泄漏或不一致)并引用相关工单或设计文档编号;如果决定不是永久方案,请注明预期的整改时间点或条件。定位符:VmDeleteVolumeFlow 类和该包含 “NoRollbackFlow” 的注释位置,直接更新注释内容以包含原因、风险和工单/文档引用。core/src/main/java/org/zstack/core/cascade/CascadeFacadeImpl.java (1)
231-231: 给DEBT注释补上可执行上下文。这几处现在只标明了
NoRollbackFlow是技术债,但没有说明这里为什么可以接受“无回滚”,也没有关联跟踪项;后续很容易变成长期悬空的注释。建议至少补一句风险/约束说明,或附上对应的 issue/ticket。Also applies to: 290-290, 313-313
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/zstack/core/cascade/CascadeFacadeImpl.java` at line 231, 把只是写“DEBT: NoRollbackFlow”的注释替换为有执行上下文的说明:在 CascadeFacadeImpl 的 asyncCascade 路径(标记为 NoRollbackFlow)的地方,补充一到两句说明为什么这里可以接受无回滚(例如幂等性、外部补偿机制或可接受的不一致窗口)、相关风险/约束(可能的数据不一致场景、重试/补偿触发条件)以及关联跟踪项(Issue/任务编号或链接)以便后续可追溯;定位符包括类名 CascadeFacadeImpl、方法/流程标识 asyncCascade 和注释中的 NoRollbackFlow。network/src/main/java/org/zstack/network/l3/L3NetworkManagerImpl.java (1)
619-620: 建议补充债务原因说明。当前注释
reason TBD含义模糊,建议明确说明为何此处使用NoRollbackFlow以及后续改进方向,例如:数据库持久化后的回滚策略、SDN 控制器状态回滚等。🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@network/src/main/java/org/zstack/network/l3/L3NetworkManagerImpl.java` around lines 619 - 620, 在 L3NetworkManagerImpl 中对使用 NoRollbackFlow 的注释过于模糊(当前为 "reason TBD"),请替换该注释为明确说明为何选择 NoRollbackFlow(例如:该 flow 只做非持久化的本地状态变更或外部 SDN 操作无法回滚到数据库事务),并列出后续改进点(如在未来加入补偿逻辑、引入持久化回滚或对 SDN 控制器进行状态回滚方案)。确保注释引用 NoRollbackFlow 和相关上下文以便后续维护者理解设计权衡和待办项。core/src/main/java/org/zstack/core/config/GlobalConfigFacadeImpl.java (1)
117-149: 把reason TBD补成可执行的债务说明。Line 148 这条注释太泛,后续维护时很难判断这里为什么必须保留
NoRollbackFlow,以及在什么条件下可以替换。建议至少补上当前约束或后续跟进入口。As per coding guidelines, “对于较长的注释,需要仔细校对并随代码更新,确保内容正确。”✏️ 可选修改
- // DEBT: NoRollbackFlow — reason TBD + // DEBT: `NoRollbackFlow` is intentional here because rollback semantics are not defined + // for partial failures across management nodes yet. Replace this after that path is designed.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/zstack/core/config/GlobalConfigFacadeImpl.java` around lines 117 - 149, 在这段 chain.then(new NoRollbackFlow() { ... }) 的注释“DEBT: NoRollbackFlow — reason TBD”中写出可执行的债务说明:说明为什么这里必须使用 NoRollbackFlow(例如该流程不可回滚因为它异步对所有 ManagementNode 发送 RefreshGuestOsMetadataMsg 并在任一失败时仍需继续或统一汇总错误),列出当前的约束条件(依赖的消息类型 RefreshGuestOsMetadataMsg、使用 While<> 并汇总 ErrorCodeList 的行为、对 bus.send/CloudBusCallBack 的异常处理方式),以及何种条件下可以安全替换为可回滚流程(比如当消息调用改为幂等且支持事务补偿、或改为逐节点事务化并能回滚时),并在注释末尾加入一个跟进入口(例如关联的 issue/PR 编号或负责人)以便将来跟进和验证替换。network/src/main/java/org/zstack/network/l2/L2VlanNetworkFactory.java (1)
91-92: 把这条债务注释放到NoRollbackFlow之前。这条注释当前还挂在前一个匿名
Flow的类体里,读起来更像是在说明rollback(),而不是后面的NoRollbackFlow。建议挪到then(new NoRollbackFlow() {的正上方,避免误导后续维护者。可选调整示例
- // DEBT: NoRollbackFlow — in rollback - }).then(new NoRollbackFlow() { + }) + // DEBT: This step intentionally uses NoRollbackFlow; cleanup relies on the previous flow's rollback. + .then(new NoRollbackFlow() {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@network/src/main/java/org/zstack/network/l2/L2VlanNetworkFactory.java` around lines 91 - 92, 将当前位于匿名 Flow 类体内、写作 “// DEBT: NoRollbackFlow — in rollback” 的债务注释移到 then(new NoRollbackFlow() { 这一行的正上方;具体来说:在包含匿名 Flow 的类体结束处删除该注释,并在调用 then(new NoRollbackFlow() { 之前插入该注释,以便该注释明确关联到 NoRollbackFlow 而非前一个 Flow 的 rollback() 方法(定位符:匿名 Flow 的 rollback() 与 NoRollbackFlow)。core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java (1)
125-143: 注释含义不够清晰。
// DEBT: NoRollbackFlow — in url这个注释的意图不够明确。如果是标记技术债务,建议说明具体需要改进的内容,例如:为什么是 debt?需要做什么改进?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java` around lines 125 - 143, Replace the vague comment "// DEBT: NoRollbackFlow — in url" with a clear TODO that states why this is technical debt and what to change: mention that the anonymous NoRollbackFlow named "echo-server" uses restf.echo(url(ECHO_PATH)) inside run(FlowTrigger, Map) and that the problem is either using an anonymous flow, mixing URL construction with flow logic, or missing rollback semantics; specify the desired fix (e.g., extract URL construction to a helper, make a named class instead of anonymous NoRollbackFlow, or add proper rollback behavior) and reference the affected symbols (NoRollbackFlow, __name__ = "echo-server", run, restf.echo, url(ECHO_PATH), Completion) so future maintainers know the exact location and intended refactor.network/src/main/java/org/zstack/network/l3/FirstAvailableIpAllocatorStrategy.java (1)
3-3: 移除未使用的导入。
org.apache.commons.math3.analysis.function.Add在此文件中未被使用,应当移除。♻️ 建议修复
-import org.apache.commons.math3.analysis.function.Add;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@network/src/main/java/org/zstack/network/l3/FirstAvailableIpAllocatorStrategy.java` at line 3, Remove the unused import org.apache.commons.math3.analysis.function.Add from FirstAvailableIpAllocatorStrategy by deleting that import statement; confirm no other references to Add exist in the class (FirstAvailableIpAllocatorStrategy) and run a build/IDE cleanup to ensure no unused imports remain.compute/src/main/java/org/zstack/compute/host/HostBase.java (3)
630-631: DEBT 注释引用的方法名不正确注释中标注
in stopFailedToMigrateVms,但此 flow 的__name__是"run-extension-point-after-maintenance"。建议修正注释以准确反映所在上下文:- // DEBT: NoRollbackFlow — in stopFailedToMigrateVms + // DEBT: NoRollbackFlow — in run-extension-point-after-maintenance🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/host/HostBase.java` around lines 630 - 631, The DEBT comment incorrectly references stopFailedToMigrateVms while the flow being declared is a NoRollbackFlow with __name__ "run-extension-point-after-maintenance"; update the comment to accurately describe this flow/context (e.g., "DEBT: NoRollbackFlow — in run-extension-point-after-maintenance") so it matches the flow's __name__ and avoids confusion when inspecting HostBase, NoRollbackFlow, and the flow block with __name__ "run-extension-point-after-maintenance".
1391-1412: DEBT 注释引用了不相关的方法名这两处注释都标注
in getSyncLevel,但实际 flow 名称分别是"check-conditions-of-connection"和"connect-host"。getSyncLevel是当前ChainTask的一个重写方法,与这些 flow 的用途无关。建议修正为更准确的描述:
- // DEBT: NoRollbackFlow — in getSyncLevel + // DEBT: NoRollbackFlow — check-conditions-of-connection flow(new NoRollbackFlow() { String __name__ = "check-conditions-of-connection";- // DEBT: NoRollbackFlow — in getSyncLevel + // DEBT: NoRollbackFlow — connect-host flow(new NoRollbackFlow() { String __name__ = "connect-host";🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/host/HostBase.java` around lines 1391 - 1412, The DEBT comments incorrectly reference "in getSyncLevel" even though these NoRollbackFlow instances in HostBase pertain to the "check-conditions-of-connection" and "connect-host" flows; update the comments to accurately describe the debt (e.g., "DEBT: NoRollbackFlow — consider making this a rollback-capable flow" or mention the specific flow purpose) adjacent to the flow declarations for "check-conditions-of-connection" and "connect-host" (in HostBase) so they no longer incorrectly reference getSyncLevel.
576-577: 建议为 "reason TBD" 注释添加更具体的说明多处 DEBT 注释使用
reason TBD(如此处、773、1432、1465、1498 行),这类标注对后续维护帮助有限。建议:
- 如果已知不需要回滚的原因,直接写明(如 "no state to rollback" 或 "idempotent operation")
- 如果需要后续调查,考虑创建 issue 追踪,并在注释中引用 issue 编号
这样能让技术债务标记更具可操作性。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compute/src/main/java/org/zstack/compute/host/HostBase.java` around lines 576 - 577, Replace vague DEBT "reason TBD" comments on the flow(new NoRollbackFlow()) usages in HostBase (and other DEBT entries in the same class) with concrete guidance: either state the precise reason why rollback is unnecessary (e.g., "no state to rollback" or "idempotent operation") or add a reference to a tracking issue (e.g., "investigate rollback behavior — see ISSUE-1234") so future maintainers have actionable context; update the comment text adjacent to the NoRollbackFlow instantiation and any other DEBT comments in this class accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@core/src/main/java/org/zstack/core/errorcode/GlobalErrorCodeI18nServiceImpl.java`:
- Line 24: The resource path constant I18N_FOLDER in
GlobalErrorCodeI18nServiceImpl currently uses File.separator which yields
platform-specific separators; change the constant to use the classpath separator
'/' instead (e.g., "i18n/globalErrorCodeMapping") so classpath resource loading
works on Windows; update any usages of I18N_FOLDER in the class to rely on the
new value.
- Around line 70-72: The outer catch in GlobalErrorCodeI18nServiceImpl currently
logs only e.getMessage() losing the stacktrace; update the logger.warn call in
that catch block (the one currently doing logger.warn(String.format("failed to
scan i18n folder: %s", e.getMessage()))) to pass the exception object itself so
the stack trace is recorded (e.g., logger.warn("failed to scan i18n folder",
e)); keep the original message text for context and only change the logger
invocation to include the Exception parameter.
- Around line 61-63: The JSON deserialization into LinkedHashMap.class in
JSONObjectUtil.toObject(content, LinkedHashMap.class) loses generic type info
and can cause ClassCastException when values aren't strings; update the
deserialization to preserve generics (e.g., use a TypeReference/TypeToken for
LinkedHashMap<String,String> or the library's equivalent) when constructing
Map<String,String> messages, and add a runtime validation step before putting
into localeMessages to ensure every key and value are strings (or convert
non-string values to their string representation) so localeMessages.put(locale,
messages) is safe; target the JSONObjectUtil.toObject call, the
Map<String,String> messages variable and the localeMessages.put usage for the
change.
In `@core/src/main/java/org/zstack/core/upgrade/AgentVersionVO.java`:
- Around line 3-5: AgentVersionVO's SoftDeletionCascade references joinColumn =
"resourceUuid" which doesn't exist on the class (it only declares uuid); update
the annotation to reference the actual field or add the missing
field—specifically, in the AgentVersionVO class locate the SoftDeletionCascade
annotation (joinColumn = "resourceUuid") and change joinColumn to "uuid" to
match the existing uuid field (or alternatively add a resourceUuid property if
semantics require it) so the soft-delete cascade can correctly find related
records.
---
Outside diff comments:
In
`@compute/src/main/java/org/zstack/compute/vm/VmAllocateNicForStartingVmFlow.java`:
- Line 192: Fix the typo in the inline comment inside the
VmAllocateNicForStartingVmFlow class: change "happend" to "happen" (search for
the comment string "this case should not happend" in
VmAllocateNicForStartingVmFlow.java and update it to "this case should not
happen"); keep the comment punctuation and spacing consistent with surrounding
comments and commit as a small cleanup.
In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeCacheVolumeFlow.java`:
- Line 32: The logger is instantiated with the wrong class; update the logger
declaration in VmExpungeCacheVolumeFlow by replacing
Utils.getLogger(VmExpungeRootVolumeFlow.class) with
Utils.getLogger(VmExpungeCacheVolumeFlow.class) so the CLogger references the
current class (update the private static final CLogger logger assignment
accordingly).
- Around line 64-66: The warning log in VmExpungeCacheVolumeFlow is using
spec.getVmInventory().getRootVolumeUuid() which is the root volume UUID instead
of the cache volume being processed; update the logger.warn call to log the
cache volume's UUID by using vol.getUuid() (keep the rest of the message the
same and continue to include the VM UUID/name via
spec.getVmInventory().getUuid()/getName() and reply.getError() for the error
details).
- Around line 40-42: 在 VmExpungeCacheVolumeFlow 中,当
spec.getVmInventory().getAllVolumes() 返回 null 时当前只调用了 trigger.next() 但没有
return,导致后续对 getAllVolumes().stream() 的调用抛出 NPE;修改方法是在调用 trigger.next()
后立即返回以中止当前 flow(或改为使用 if (getAllVolumes() == null) { trigger.next(); return;
}),确保在执行 stream() 前 getAllVolumes() 不为 null。
In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeMemoryVolumeFlow.java`:
- Around line 67-69: The log message in VmExpungeMemoryVolumeFlow incorrectly
references "root volume" and uses spec.getVmInventory().getRootVolumeUuid();
update the warning in the exception handling to reference the actual memory
volume by using vol.getUuid() and change the text to "memory volume" (e.g.,
"failed to expunge the memory volume[uuid:%s] of the vm[uuid:%s, name:%s], %s")
so that the flow (VmExpungeMemoryVolumeFlow) correctly records the volume UUID
and accurate operation context.
- Around line 40-42: In VmExpungeMemoryVolumeFlow, when
spec.getVmInventory().getAllVolumes() == null you call trigger.next() but do not
return, so execution continues and a subsequent call to
spec.getVmInventory().getAllVolumes().stream() can NPE; fix by returning
immediately after trigger.next() (add a return after the trigger.next() call) so
the method exits when getAllVolumes() is null.
- Line 32: The logger is instantiated with the wrong class reference: replace
the incorrect VmExpungeRootVolumeFlow.class passed to Utils.getLogger with
VmExpungeMemoryVolumeFlow.class so the static field logger in
VmExpungeMemoryVolumeFlow references the correct class; update the line that
declares private static final CLogger logger = Utils.getLogger(...) accordingly.
In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceManagerImpl.java`:
- Around line 1558-1580: The new NoRollbackFlow named "start-vm" can fail and
skip the DeleteVmGC submission path used by
instantiate-new-created-vmInstance.rollback(), leaving orphaned VMs/GPU
allocations; modify the failure handling so that when StartVmInstanceMsg in the
"start-vm" flow fails you always enqueue a DeleteVmGC for instantiateVm (use the
same DeleteVmGC submission code/path as in
instantiate-new-created-vmInstance.rollback()), and if DestroyVmInstanceMsg
returns a non-GC_ELIGIBLE error follow the same logic to emit the canonical
event instead of silently logging; update the failure branch in the flow's
CloudBusCallBack (and/or the rollback hook tied to instantiateVm/finalVo) to
reuse the DeleteVmGC submission and GC_ELIGIBLE handling so orphan cleanup is
consistent.
In `@compute/src/main/java/org/zstack/compute/vm/VmInstantiateOtherDiskFlow.java`:
- Around line 150-195: In setupCreateVolumeFromDiskSizeFlows, the rollback
currently uses diskAO.getSize() which can be 0 when size came from diskSize;
change the ReleasePrimaryStorageSpaceMsg to setDiskSize(diskSize) (use the same
diskSize passed into AllocatePrimaryStorageSpaceMsg) and keep
allocatedInstallUrl and allocatedPrimaryStorageUuid as-is so the rollback
releases the exact amount allocated; update any references to diskAO.getSize()
in the rollback block accordingly.
- Around line 438-462: The flow uses NoRollbackFlow for attaching external media
via VmAttachOtherDiskExtensionPoint.attachOtherDiskToVm, leaving no compensation
path if later steps fail; change to a rollback-capable Flow (not NoRollbackFlow)
and invoke the extension point’s detach on rollback or add a symmetric detach
callback on VmAttachOtherDiskExtensionPoint that the Flow calls from its
rollback method, ensuring any state created by attachOtherDiskToVm is cleaned up
and any needed compensation data (e.g., disk/attachment identifiers) is saved in
the Flow’s data map for the rollback to use.
- Around line 120-123: The flow currently treats any VolumeInventory in
VmInstantiateOtherDiskFlow (set when isAttachDataVolume() and via
setupAttachVolumeFlows()) as if it was created by this flow and unconditionally
sends DeleteVolumeMsg during rollback; change this so the code distinguishes
volumes created by this flow from externally provided volumes: when you load a
volume by diskAO.getSourceUuid() in the isAttachDataVolume() branch, mark it as
external (e.g., set a boolean like volumeCreatedByFlow=false or store
externalVolumeUuid) and ensure setupAttachVolumeFlows() does not mark it as
newly created; when creating volumes in other branches mark them
volumeCreatedByFlow=true; then update rollback() (the block that currently sends
DeleteVolumeMsg for volumeInventory) to only send DeleteVolumeMsg if
volumeCreatedByFlow==true, and for external volumes perform the correct
compensation (run detach logic or skip deletion) instead of deleting; keep
references to VmInstantiateOtherDiskFlow, isAttachDataVolume(),
setupAttachVolumeFlows(), volumeInventory, rollback(), and DeleteVolumeMsg to
locate the changes.
In `@core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java`:
- Around line 145-163: The formatting of the anonymous NoRollbackFlow's name
field is inconsistent: update the declaration String __name__= "init-server";
inside the NoRollbackFlow (the anonymous class passed to flow) to match the
project's style by adding a space before the equals sign (String __name__ =
"init-server";), ensuring consistency with other occurrences like the __name__ =
usage earlier.
In
`@header/src/main/java/org/zstack/header/volume/APICreateDataVolumeFromVolumeSnapshotMsg.java`:
- Line 54: The class APICreateDataVolumeFromVolumeSnapshotMsg has its
`@DefaultTimeout` annotation changed to 3 hours (TimeUnit.HOURS, value = 3) but
the `@msg` example in the class Javadoc still shows the old timeout value
(1800000); either update the example value to the equivalent millisecond value
(10800000) to reflect the new default or revert/adjust the `@DefaultTimeout`
annotation to the intended timeout (e.g., value matching 1800000ms) so code and
example stay consistent—locate the `@DefaultTimeout` on
APICreateDataVolumeFromVolumeSnapshotMsg and the `@msg` example in the class
comment and make the matching change.
In `@image/src/main/java/org/zstack/image/ImageBase.java`:
- Around line 492-535: The current NoRollbackFlow that iterates
IsoOperator.getVmUuidByIsoUuid and sends DetachIsoFromVmInstanceMsg via bus.send
can leave partial state because it calls trigger.fail() on any error without
compensating for earlier successful detaches; change the flow to track
successfulVmUuids (List<String>) as you iterate, and if any detach fails use
those recorded UUIDs to send compensating attach messages (e.g., create
AttachIsoToVmInstanceMsg per UUID and bus.send them) before calling
trigger.fail(), or alternatively convert this step into a pure validation check
that only verifies all VMs can accept detach (no state change) and move actual
DetachIsoFromVmInstanceMsg logic into a rollback-capable flow; update references
in this block: NoRollbackFlow, IsoOperator.getVmUuidByIsoUuid,
DetachIsoFromVmInstanceMsg, bus.send, trigger.fail, trigger.next accordingly.
In `@network/src/main/java/org/zstack/network/l3/NormalIpRangeFactory.java`:
- Around line 132-185: The two NoRollbackFlow steps "enable-sdn-dhcp" and
"add-sdn-subnet" leave external SDN controller state if later steps fail; add
compensating cleanup or schedule async GC/reconcile on failure. Modify the flow
chain so that either these steps are implemented as RollbackFlow with rollback
logic that reverses SdnControllerUpdateDHCPMsg and SdnControllerL3.createIpRange
(e.g., send delete/disable messages to the SDN controller using the same service
IDs / resource UUIDs), or, in the enclosing error handler, detect whether
enable-sdn-dhcp and/or add-sdn-subnet succeeded and then invoke async cleanup
tasks (send delete DHCP msg and call SdnControllerL3.deleteIpRange or schedule
reconcile) for the resources created by SdnControllerUpdateDHCPMsg and
SdnControllerL3.createIpRange to ensure external state is removed or reconciled
on failure.
---
Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/host/HostBase.java`:
- Around line 630-631: The DEBT comment incorrectly references
stopFailedToMigrateVms while the flow being declared is a NoRollbackFlow with
__name__ "run-extension-point-after-maintenance"; update the comment to
accurately describe this flow/context (e.g., "DEBT: NoRollbackFlow — in
run-extension-point-after-maintenance") so it matches the flow's __name__ and
avoids confusion when inspecting HostBase, NoRollbackFlow, and the flow block
with __name__ "run-extension-point-after-maintenance".
- Around line 1391-1412: The DEBT comments incorrectly reference "in
getSyncLevel" even though these NoRollbackFlow instances in HostBase pertain to
the "check-conditions-of-connection" and "connect-host" flows; update the
comments to accurately describe the debt (e.g., "DEBT: NoRollbackFlow — consider
making this a rollback-capable flow" or mention the specific flow purpose)
adjacent to the flow declarations for "check-conditions-of-connection" and
"connect-host" (in HostBase) so they no longer incorrectly reference
getSyncLevel.
- Around line 576-577: Replace vague DEBT "reason TBD" comments on the flow(new
NoRollbackFlow()) usages in HostBase (and other DEBT entries in the same class)
with concrete guidance: either state the precise reason why rollback is
unnecessary (e.g., "no state to rollback" or "idempotent operation") or add a
reference to a tracking issue (e.g., "investigate rollback behavior — see
ISSUE-1234") so future maintainers have actionable context; update the comment
text adjacent to the NoRollbackFlow instantiation and any other DEBT comments in
this class accordingly.
In `@compute/src/main/java/org/zstack/compute/vm/VmDeleteVolumeFlow.java`:
- Line 90: 在 VmDeleteVolumeFlow 类中把现有“// DEBT: NoRollbackFlow — reason
TBD”注释替换为可执行的上下文说明:在注释中明确指出为何该 Flow 必须使用
NoRollbackFlow(例如资源已被物理删除无法回滚、幂等性保障或已在上层事务中处理回滚),列出相关风险边界(什么情况下会导致资源泄漏或不一致)并引用相关工单或设计文档编号;如果决定不是永久方案,请注明预期的整改时间点或条件。定位符:VmDeleteVolumeFlow
类和该包含 “NoRollbackFlow” 的注释位置,直接更新注释内容以包含原因、风险和工单/文档引用。
In `@core/src/main/java/org/zstack/core/agent/AgentManagerImpl.java`:
- Around line 125-143: Replace the vague comment "// DEBT: NoRollbackFlow — in
url" with a clear TODO that states why this is technical debt and what to
change: mention that the anonymous NoRollbackFlow named "echo-server" uses
restf.echo(url(ECHO_PATH)) inside run(FlowTrigger, Map) and that the problem is
either using an anonymous flow, mixing URL construction with flow logic, or
missing rollback semantics; specify the desired fix (e.g., extract URL
construction to a helper, make a named class instead of anonymous
NoRollbackFlow, or add proper rollback behavior) and reference the affected
symbols (NoRollbackFlow, __name__ = "echo-server", run, restf.echo,
url(ECHO_PATH), Completion) so future maintainers know the exact location and
intended refactor.
In `@core/src/main/java/org/zstack/core/cascade/CascadeFacadeImpl.java`:
- Line 231: 把只是写“DEBT: NoRollbackFlow”的注释替换为有执行上下文的说明:在 CascadeFacadeImpl 的
asyncCascade 路径(标记为
NoRollbackFlow)的地方,补充一到两句说明为什么这里可以接受无回滚(例如幂等性、外部补偿机制或可接受的不一致窗口)、相关风险/约束(可能的数据不一致场景、重试/补偿触发条件)以及关联跟踪项(Issue/任务编号或链接)以便后续可追溯;定位符包括类名
CascadeFacadeImpl、方法/流程标识 asyncCascade 和注释中的 NoRollbackFlow。
In `@core/src/main/java/org/zstack/core/config/GlobalConfigFacadeImpl.java`:
- Around line 117-149: 在这段 chain.then(new NoRollbackFlow() { ... }) 的注释“DEBT:
NoRollbackFlow — reason TBD”中写出可执行的债务说明:说明为什么这里必须使用
NoRollbackFlow(例如该流程不可回滚因为它异步对所有 ManagementNode 发送 RefreshGuestOsMetadataMsg
并在任一失败时仍需继续或统一汇总错误),列出当前的约束条件(依赖的消息类型 RefreshGuestOsMetadataMsg、使用 While<> 并汇总
ErrorCodeList 的行为、对 bus.send/CloudBusCallBack
的异常处理方式),以及何种条件下可以安全替换为可回滚流程(比如当消息调用改为幂等且支持事务补偿、或改为逐节点事务化并能回滚时),并在注释末尾加入一个跟进入口(例如关联的
issue/PR 编号或负责人)以便将来跟进和验证替换。
In `@core/src/main/java/org/zstack/core/errorcode/ElaborationManagerImpl.java`:
- Around line 131-132: Replace the vague "DEBT: NoRollbackFlow — reason TBD"
comments around the NoRollbackFlow anonymous flows in ElaborationManagerImpl
with specific, actionable debt notes: state why using NoRollbackFlow is
potentially risky here (e.g., "skips rollback for X operation such as file
scanning/state mutation"), describe what should be evaluated later (e.g.,
"determine if rollback is required when Y fails or add compensating action"),
and include an owner or ticket reference for follow-up; update each occurrence
adjacent to the flow(...) / new NoRollbackFlow() blocks so future developers can
understand the debt and planned remediation.
In `@core/src/main/java/org/zstack/core/gc/GarbageCollectorVO.java`:
- Line 8: 在 GarbageCollectorVO 中把通配符导入 javax.persistence.*
替换为显式导入所用的注解/类型以保持导入风格一致;具体在该文件(类 GarbageCollectorVO)中移除 javax.persistence.*
并添加显式导入 Entity、Table、Column、Enumerated、EnumType 和 PreUpdate(仅导入实际使用到的类型)。
In `@network/src/main/java/org/zstack/network/l2/L2VlanNetworkFactory.java`:
- Around line 91-92: 将当前位于匿名 Flow 类体内、写作 “// DEBT: NoRollbackFlow — in rollback”
的债务注释移到 then(new NoRollbackFlow() { 这一行的正上方;具体来说:在包含匿名 Flow 的类体结束处删除该注释,并在调用
then(new NoRollbackFlow() { 之前插入该注释,以便该注释明确关联到 NoRollbackFlow 而非前一个 Flow 的
rollback() 方法(定位符:匿名 Flow 的 rollback() 与 NoRollbackFlow)。
In
`@network/src/main/java/org/zstack/network/l3/FirstAvailableIpAllocatorStrategy.java`:
- Line 3: Remove the unused import
org.apache.commons.math3.analysis.function.Add from
FirstAvailableIpAllocatorStrategy by deleting that import statement; confirm no
other references to Add exist in the class (FirstAvailableIpAllocatorStrategy)
and run a build/IDE cleanup to ensure no unused imports remain.
In `@network/src/main/java/org/zstack/network/l3/L3NetworkManagerImpl.java`:
- Around line 619-620: 在 L3NetworkManagerImpl 中对使用 NoRollbackFlow 的注释过于模糊(当前为
"reason TBD"),请替换该注释为明确说明为何选择 NoRollbackFlow(例如:该 flow 只做非持久化的本地状态变更或外部 SDN
操作无法回滚到数据库事务),并列出后续改进点(如在未来加入补偿逻辑、引入持久化回滚或对 SDN 控制器进行状态回滚方案)。确保注释引用
NoRollbackFlow 和相关上下文以便后续维护者理解设计权衡和待办项。
| public class GlobalErrorCodeI18nServiceImpl implements GlobalErrorCodeI18nService, Component { | ||
| private static final CLogger logger = Utils.getLogger(GlobalErrorCodeI18nServiceImpl.class); | ||
|
|
||
| private static final String I18N_FOLDER = "i18n" + File.separator + "globalErrorCodeMapping"; |
There was a problem hiding this comment.
类路径资源路径应使用 / 而非 File.separator。
File.separator 在 Windows 上返回 \,但类路径资源始终使用 / 作为分隔符,这可能导致在 Windows 环境下文件加载失败。
🐛 建议修复
- private static final String I18N_FOLDER = "i18n" + File.separator + "globalErrorCodeMapping";
+ private static final String I18N_FOLDER = "i18n/globalErrorCodeMapping";📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| private static final String I18N_FOLDER = "i18n" + File.separator + "globalErrorCodeMapping"; | |
| private static final String I18N_FOLDER = "i18n/globalErrorCodeMapping"; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@core/src/main/java/org/zstack/core/errorcode/GlobalErrorCodeI18nServiceImpl.java`
at line 24, The resource path constant I18N_FOLDER in
GlobalErrorCodeI18nServiceImpl currently uses File.separator which yields
platform-specific separators; change the constant to use the classpath separator
'/' instead (e.g., "i18n/globalErrorCodeMapping") so classpath resource loading
works on Windows; update any usages of I18N_FOLDER in the class to rely on the
new value.
| @SuppressWarnings("unchecked") | ||
| Map<String, String> messages = JSONObjectUtil.toObject(content, LinkedHashMap.class); | ||
| localeMessages.put(locale, messages); |
There was a problem hiding this comment.
JSON 反序列化存在类型安全风险。
将 JSON 反序列化为 LinkedHashMap.class 时丢失了泛型类型信息。如果 JSON 文件中包含非字符串值(如嵌套对象或数字),在运行时访问 map 时会抛出 ClassCastException。
🛡️ 建议添加类型校验或使用 TypeToken
`@SuppressWarnings`("unchecked")
Map<String, String> messages = JSONObjectUtil.toObject(content, LinkedHashMap.class);
+ // Validate all values are strings to fail fast on malformed JSON
+ for (Map.Entry<String, String> entry : messages.entrySet()) {
+ if (!(entry.getValue() instanceof String)) {
+ throw new IllegalArgumentException("Invalid value type for key: " + entry.getKey());
+ }
+ }
localeMessages.put(locale, messages);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@core/src/main/java/org/zstack/core/errorcode/GlobalErrorCodeI18nServiceImpl.java`
around lines 61 - 63, The JSON deserialization into LinkedHashMap.class in
JSONObjectUtil.toObject(content, LinkedHashMap.class) loses generic type info
and can cause ClassCastException when values aren't strings; update the
deserialization to preserve generics (e.g., use a TypeReference/TypeToken for
LinkedHashMap<String,String> or the library's equivalent) when constructing
Map<String,String> messages, and add a runtime validation step before putting
into localeMessages to ensure every key and value are strings (or convert
non-string values to their string representation) so localeMessages.put(locale,
messages) is safe; target the JSONObjectUtil.toObject call, the
Map<String,String> messages variable and the localeMessages.put usage for the
change.
| } catch (Exception e) { | ||
| logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage())); | ||
| } |
There was a problem hiding this comment.
外层异常日志丢失了堆栈信息。
与内层 catch 块(Line 67)不同,外层 catch 块只记录了异常消息,未包含异常对象本身,这会导致调试时缺少堆栈跟踪信息。
📝 建议修复
} catch (Exception e) {
- logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage()));
+ logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage()), e);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } catch (Exception e) { | |
| logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage())); | |
| } | |
| } catch (Exception e) { | |
| logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage()), e); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@core/src/main/java/org/zstack/core/errorcode/GlobalErrorCodeI18nServiceImpl.java`
around lines 70 - 72, The outer catch in GlobalErrorCodeI18nServiceImpl
currently logs only e.getMessage() losing the stacktrace; update the logger.warn
call in that catch block (the one currently doing
logger.warn(String.format("failed to scan i18n folder: %s", e.getMessage()))) to
pass the exception object itself so the stack trace is recorded (e.g.,
logger.warn("failed to scan i18n folder", e)); keep the original message text
for context and only change the logger invocation to include the Exception
parameter.
| import org.zstack.header.vo.ResourceVO; | ||
| import org.zstack.header.vo.SoftDeletionCascade; | ||
| import org.zstack.header.vo.SoftDeletionCascades; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== AgentVersionVO: annotation and fields =="
rg -n -C2 'class AgentVersionVO|SoftDeletionCascade|private String uuid|resourceUuid' core/src/main/java/org/zstack/core/upgrade/AgentVersionVO.java
echo
echo "== Repo-wide ResourceVO soft-delete mappings =="
rg -n -C1 '@SoftDeletionCascade\(parent\s*=\s*ResourceVO\.class,\s*joinColumn\s*=\s*"[^"]+"\)' --glob '**/*.java'
echo
echo "== Repo-wide declarations/usages of resourceUuid =="
rg -n -C1 '\bresourceUuid\b' --glob '**/*.java'Repository: MatheMatrix/zstack
Length of output: 50375
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== SoftDeletionCascade annotation definition =="
rg -n 'public `@interface` SoftDeletionCascade|joinColumn' --glob '**/*.java' -A5 -B2 | head -50
echo
echo "== KvmHypervisorInfoVO: field structure =="
sed -n '1,50p' plugin/kvm/src/main/java/org/zstack/kvm/hypervisor/datatype/KvmHypervisorInfoVO.java
echo
echo "== SshKeyPairRefVO: field structure =="
rg -n -A30 'public class SshKeyPairRefVO' plugin/sshKeyPair/src/main/java/org/zstack/header/sshkeypair/SshKeyPairRefVO.java | head -50Repository: MatheMatrix/zstack
Length of output: 8415
SoftDeletionCascade 的关联列与实体字段不匹配。
AgentVersionVO 只声明了 uuid 字段,但在 Line 15 的软删注解中使用了 joinColumn = "resourceUuid"。查看仓库内其他例子(如 KvmHypervisorInfoVO 使用 joinColumn = "uuid" 对应其 uuid 字段、SshKeyPairRefVO 使用 joinColumn = "resourceUuid" 对应其 resourceUuid 字段),joinColumn 必须引用类中实际存在的字段。这个不匹配会导致软删级联无法找到对应的记录。
建议修正
`@SoftDeletionCascades`({
- `@SoftDeletionCascade`(parent = ResourceVO.class, joinColumn = "resourceUuid")
+ `@SoftDeletionCascade`(parent = ResourceVO.class, joinColumn = "uuid")
})🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/src/main/java/org/zstack/core/upgrade/AgentVersionVO.java` around lines
3 - 5, AgentVersionVO's SoftDeletionCascade references joinColumn =
"resourceUuid" which doesn't exist on the class (it only declares uuid); update
the annotation to reference the actual field or add the missing
field—specifically, in the AgentVersionVO class locate the SoftDeletionCascade
annotation (joinColumn = "resourceUuid") and change joinColumn to "uuid" to
match the existing uuid field (or alternatively add a resourceUuid property if
semantics require it) so the soft-delete cascade can correctly find related
records.
ZSTAC-68874 推理服务断开vip导致GPU脏数据
Root Cause
VM创建失败rollback时,DestroyVm也失败的场景下,旧逻辑只在GC_ELIGIBLE错误时提交DeleteVmGC。
非GC_ELIGIBLE错误(如MN不可用)不创建GC,导致orphan VM残留在hypervisor上占用GPU资源。
Fix
rollback始终提交DeleteVmGC,无论错误类型。非GC_ELIGIBLE错误额外触发canonical event(保持告警行为不变)。
Cross-repo
sync from gitlab !9323