
<fix>[compute]: meta#3468

Open
MatheMatrix wants to merge 1 commit into feature-zsv-5.0.0-vm-registration from sync/tao.gan/res-vm-ZSV-10000@@4

Conversation

@MatheMatrix
Owner

Resolves: ZSV-10000

Change-Id: I746a6d7266686d637a6d76756861636a74766b74

sync from gitlab !9328

@coderabbitai

coderabbitai bot commented Mar 12, 2026

Walkthrough

Adds a VM instance metadata management system supporting metadata dirty marking, dirty polling, path-fingerprint tracking, orphan detection, and recovery. Covers database schema, API messages, interceptors, storage-backend integration, and global configuration.

Changes

Group / Files | Summary
Core metadata management
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java, VmMetadataBuilder.java, VmMetadataUpdateInterceptor.java
Implements metadata dirty marking, collection, and the interceptor chain. Supports declarative marking, asynchronous flush, concurrency control, and an upgrade-driven refresh mechanism.
Detection and recovery components
compute/src/main/java/org/zstack/compute/vm/metadata/MetadataOrphanDetector.java, MetadataPathDriftDetector.java, MetadataStaleRecoveryTask.java
Implements periodic tasks that detect orphaned metadata and path drift, and recover failed flush operations.
UUID resolver chain
compute/src/main/java/org/zstack/compute/vm/metadata/DefaultVmUuidFromApiResolver.java, VolumeBasedVmUuidFromApiResolver.java, SnapshotBasedVmUuidFromApiResolver.java, ResourceBasedVmUuidFromApiResolver.java, NicBasedVmUuidFromApiResolver.java, ReflectionBasedVmUuidFromApiResolver.java
Extracts the associated VM UUIDs from API messages via multiple strategies, with a reflection-based fallback.
Metadata field processing
compute/src/main/java/org/zstack/compute/vm/VmInstanceMetadataFieldProcessor.java, VmMetadataBuilder.java
Processes and sanitizes VM metadata JSON during registration, including field nulling, path normalization, and storage filtering.
Cascade extension
compute/src/main/java/org/zstack/compute/vm/metadata/MetadataCascadeExtension.java
Listens for volume-deletion events and marks the associated VM's metadata dirty, handling storage-topology changes.
Metadata serialization
header/src/main/java/org/zstack/header/vm/VmInstanceMetadataCodec.java, VmInstanceMetadataDTO.java, VmInstanceMetadataValidator.java
DTO encoding/decoding, schema versioning, and resource-UUID consistency validation.
API messages: registration
header/src/main/java/org/zstack/header/storage/primary/APIRegisterVmInstanceMsg.java, APIRegisterVmInstanceReply.java, APIRegisterVmInstanceFromMetadataMsg.java, APIRegisterVmInstanceFromMetadataEvent.java
REST APIs for registering VM instances from metadata, with pre-checks and local/remote paths.
API messages: read/write operations
header/src/main/java/org/zstack/header/storage/primary/ReadVmInstanceMetadataMsg.java, ReadVmInstanceMetadataReply.java, ReadVmInstanceMetadataOnHypervisorMsg.java, APIReadVmInstanceMetadataMsg.java, APIReadVmInstanceMetadataReply.java
Supports reading metadata at the primary-storage and hypervisor layers; REST API for VM metadata retrieval.
API messages: management operations
header/src/main/java/org/zstack/header/vm/APICleanupVmInstanceMetadataMsg.java, APICleanupVmInstanceMetadataEvent.java, APIScanVmInstanceMetadataMsg.java, APIScanVmInstanceMetadataReply.java, APICheckVmInstanceMetadataConsistencyMsg.java, APICheckVmInstanceMetadataConsistencyReply.java
REST APIs for cleaning up, scanning, and consistency-checking VM metadata.
Internal messages
header/src/main/java/org/zstack/header/vm/UpdateVmInstanceMetadataMsg.java, UpdateVmInstanceMetadataOnPrimaryStorageMsg.java, UpdateVmInstanceMetadataOnHypervisorMsg.java, BatchCheckMetadataStatusMsg.java, RepairMetadataMsg.java
Internal message protocol between the management plane and the storage layer, supporting metadata updates and repair.
Business-impact annotation
header/src/main/java/org/zstack/header/vm/MetadataImpact.java, applied to multiple API messages
A declarative annotation marking an API's impact on VM metadata (NONE/CONFIG/STORAGE), driving automatic dirty marking.
Storage backend integration
plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageBase.java, LocalStorageKvmBackend.java, plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java, NfsPrimaryStorageKVMBackend.java
Implements metadata-update handling in local and NFS storage, including concurrency control and hypervisor coordination.
Hypervisor support
plugin/kvm/src/main/java/org/zstack/kvm/KVMAgentCommands.java, KVMHost.java, KVMConstant.java
KVM hypervisor commands/responses and HTTP endpoints supporting metadata write and read operations.
Database schema
conf/db/upgrade/V4.10.29__schema.sql
Adds the VmMetadataDirtyVO (dirty-mark tracking) and VmMetadataPathFingerprintVO (path fingerprint) tables.
Configuration system
compute/src/main/java/org/zstack/compute/vm/VmGlobalConfig.java
Adds 27 configuration parameters controlling concurrency, GC, retries, batching, path checks, upgrade, and recovery behavior.
Data model
header/src/main/java/org/zstack/header/vm/VmMetadata.java, VmMetadataScanResult.java, ConsistencyCheckResult.java, MetadataStatusResult.java, PreCheckItem.java
DTOs and data-holder classes for serializing metadata operations and API responses.
Constants and utilities
header/src/main/java/org/zstack/header/vm/VmInstanceMetadataConstants.java, VmMetadataConstants.java, VmMetadataCanonicalEvents.java, VmMetadataCategory.java
Defines metadata size limits, path suffixes, event types, and VM category constants.
VM state and message routing
header/src/main/java/org/zstack/header/vm/VmInstanceState.java, compute/src/main/java/org/zstack/compute/vm/AbstractVmInstance.java, VmInstanceBase.java, VmInstanceApiInterceptor.java
Adds the Registering state, message routing, and API interception supporting metadata-update handling.
Deletion and cleanup
compute/src/main/java/org/zstack/compute/vm/VmExpungeMetadataFlow.java
New NoRollbackFlow performing best-effort metadata cleanup during VM expunge.
Path snapshot
compute/src/main/java/org/zstack/compute/vm/metadata/MetadataPathSnapshotBuilder.java
Builds deterministic JSON fingerprints for detecting VM storage-path drift.
SDK and tests
sdk/src/main/java/org/zstack/sdk/RegisterVmInstanceAction.java, RegisterVmInstanceResult.java, testlib/src/main/java/org/zstack/testlib/ApiHelper.groovy, KVMSimulator.groovy
SDK action classes and test support, including a simulated hypervisor metadata endpoint.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant API as API Gateway
    participant Interceptor as MetadataUpdateInterceptor
    participant Marker as VmMetadataDirtyMarker
    participant Storage as Primary Storage
    participant Hypervisor as KVM Hypervisor

    User->>API: Send API (e.g., AttachVolume)
    activate API
    API->>Interceptor: BeforeDeliveryMessageInterceptor<br/>Resolve VM UUIDs via resolver chain
    activate Interceptor
    Interceptor-->>API: Cached VM UUIDs
    deactivate Interceptor
    
    API->>API: Process API Request
    
    API->>Interceptor: BeforePublishEventInterceptor<br/>API succeeded
    activate Interceptor
    Interceptor->>Marker: markDirty(vmUuid, storageStructureChange)
    activate Marker
    Marker->>Marker: Insert/Update VmMetadataDirtyVO
    Marker-->>Interceptor: Dirty record created
    deactivate Marker
    Interceptor-->>API: Done
    deactivate Interceptor
    
    API-->>User: API Response
    deactivate API
    
    Note over Marker: Background polling
    Marker->>Marker: Polling task claims dirty row
    activate Marker
    Marker->>Marker: Build VmInstanceMetadataDTO<br/>via VmMetadataBuilder
    Marker->>Storage: UpdateVmInstanceMetadataOnPrimaryStorageMsg
    activate Storage
    Storage->>Hypervisor: UpdateVmInstanceMetadataOnHypervisorMsg
    activate Hypervisor
    Hypervisor->>Hypervisor: Write metadata to path
    Hypervisor-->>Storage: Success/Error reply
    deactivate Hypervisor
    Storage-->>Marker: Metadata updated
    deactivate Storage
    Marker->>Marker: Update VmMetadataDirtyVO<br/>Clear claim, save fingerprint
    deactivate Marker
sequenceDiagram
    actor User
    participant API as API Gateway
    participant Validator as VmInstanceMetadataValidator
    participant Processor as VmInstanceMetadataFieldProcessor
    participant Storage as Primary Storage
    participant DB as Database

    User->>API: APIRegisterVmInstanceFromMetadataMsg<br/>(metadataPath, primaryStorageUuid, ...)
    activate API
    
    API->>API: Pre-check validation
    activate API
    API->>Validator: validate(DTO, schemaVersion)
    activate Validator
    Validator->>Validator: Check schemaVersion match
    Validator->>Validator: Validate resource UUID consistency
    Validator->>Validator: Validate snapshot group integrity
    Validator-->>API: Validation result
    deactivate Validator
    deactivate API
    
    API->>Storage: ReadVmInstanceMetadataMsg<br/>from metadataPath
    activate Storage
    Storage-->>API: Metadata JSON payload
    deactivate Storage
    
    API->>Processor: processVmInstanceVO<br/>processVolumeVO<br/>processSystemTagVO<br/>etc.
    activate Processor
    Processor->>Processor: Field sanitization<br/>Null fields, set zone/account<br/>Normalize install paths<br/>Regenerate UUIDs
    Processor-->>API: Cleaned metadata
    deactivate Processor
    
    API->>DB: Create VM, volumes,<br/>system tags, configs
    activate DB
    DB-->>API: VM registered
    deactivate DB
    
    API->>API: Update path fingerprint<br/>Save to VmMetadataPathFingerprintVO
    
    API-->>User: APIRegisterVmInstanceFromMetadataEvent<br/>with VmInstanceInventory
    deactivate API

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 The metadata dance begins,
dirty marks whisper low,
from storage to hypervisor,
path fingerprints track each change,
orphaned data is found at last,
recovery tasks mend the wounds.
The rabbit cheers: metadata flows free! 🐇✨



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (18)
header/src/main/java/org/zstack/header/storage/primary/RegisterVmInstanceException.java-3-3 (1)

3-3: ⚠️ Potential issue | 🟠 Major

Misleading interface name: the Exception suffix should be reserved for exception classes.

This interface is named RegisterVmInstanceException, yet it is an interface rather than an exception class. Per the coding convention that "exception classes must end with Exception", only true exception classes should carry this suffix.

The current name misleads developers into reading it as an exception type. Rename it to something that reflects its responsibility, for example:

  • RegisterVmInstancePathUpdater
  • VmInstanceRegistrationHandler
  • RegisterVmInstanceExceptionHandler (if its job is handling registration errors)
♻️ Suggested rename
-public interface RegisterVmInstanceException {
+public interface RegisterVmInstancePathUpdater {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/storage/primary/RegisterVmInstanceException.java`
at line 3, The interface currently named RegisterVmInstanceException is
misleading because the Exception suffix should be reserved for throwable
classes; rename the interface to a clear responsibility-driven name (e.g.,
VmInstanceRegistrationHandler or RegisterVmInstancePathUpdater), update the
interface declaration (RegisterVmInstanceException ->
VmInstanceRegistrationHandler) and all references/usages across the codebase,
and adjust any Javadoc/comments and implementing classes to reflect the new name
so imports and builds remain consistent.
header/src/main/java/org/zstack/header/storage/snapshot/VolumeSnapshotTree.java-446-460 (1)

446-460: ⚠️ Potential issue | 🟠 Major

Traversal results no longer have a stable order.

Children are enqueued in their current insertion order, and that order in turn depends on the input order fromInventories() received. Whenever the upstream query returns rows in a different order, the traversal of the same snapshot tree changes; if the result later feeds metadata serialization or fingerprint computation, that shows up as spurious drift.

🛠️ One possible fix
     public List<VolumeSnapshotInventory> levelOrderTraversal() {
         List<VolumeSnapshotInventory> result = new ArrayList<>();
         if (this.root == null) {
             return result;
         }

         Queue<SnapshotLeaf> queue = new LinkedList<>();
         queue.offer(this.root);

         while (!queue.isEmpty()) {
             SnapshotLeaf currentLeaf = queue.poll();
             result.add(currentLeaf.getInventory());
-            for (SnapshotLeaf child : currentLeaf.getChildren()) {
+            List<SnapshotLeaf> orderedChildren = new ArrayList<>(currentLeaf.getChildren());
+            orderedChildren.sort(Comparator.comparing(SnapshotLeaf::getUuid));
+            for (SnapshotLeaf child : orderedChildren) {
                 queue.offer(child);
             }
         }

         return result;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/storage/snapshot/VolumeSnapshotTree.java`
around lines 446 - 460, The levelOrderTraversal() result is non-deterministic
because it enqueues SnapshotLeaf children in the order returned by
getChildren(), which depends on input ordering; to fix, make the enqueue order
stable by sorting the children collection before offering them to the queue
(inside levelOrderTraversal()). Use a stable key from each child’s inventory
(e.g. SnapshotLeaf.getInventory().getUuid() or a timestamp like
getInventory().getCreateDate() if present) and sort via Comparator before the
for-loop that calls queue.offer(child) so the BFS traversal order becomes
deterministic.
plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java-1966-1968 (1)

1966-1968: ⚠️ Potential issue | 🟠 Major

The backend should be selected by hostUuid here.

The target host has already been resolved above, yet this line falls back to getUsableBackend() and takes an arbitrary implementation. In a multi-cluster / multi-hypervisor setup that can return a backend that does not match hostUuid; this file already provides getBackendByHostUuid(hostUuid) for direct reuse.
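A minimal, self-contained sketch of the lookup the comment asks for; BackendSelector and its Backend interface are illustrative stand-ins, not the real NfsPrimaryStorage API:

```java
import java.util.HashMap;
import java.util.Map;

// Model of selecting a backend strictly by hostUuid instead of taking
// "any usable" backend, failing loudly when the host has none.
public class BackendSelector {
    public interface Backend { String id(); }

    private final Map<String, Backend> backendByHost = new HashMap<>();

    public void register(String hostUuid, Backend b) {
        backendByHost.put(hostUuid, b);
    }

    // Before: getUsableBackend() returned an arbitrary backend.
    // After: resolve strictly by hostUuid, so a multi-cluster setup
    // never routes a message to a mismatched backend.
    public Backend getBackendByHostUuid(String hostUuid) {
        Backend b = backendByHost.get(hostUuid);
        if (b == null) {
            throw new IllegalStateException(
                    String.format("no backend found for host[uuid:%s]", hostUuid));
        }
        return b;
    }
}
```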

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java`
around lines 1966 - 1968, the message handler currently picks an arbitrary
backend via getUsableBackend(); select the backend for the target host instead,
to avoid picking the wrong one in multi-cluster / multi-hypervisor setups.
Before calling backend.handle(msg, hostUuid, ...), obtain the backend with
getBackendByHostUuid(hostUuid) (i.e. replace getUsableBackend with
getBackendByHostUuid), and handle the not-found case properly by raising an
error or returning a failure reply; the identifiers to modify are
getUsableBackend(), getBackendByHostUuid(hostUuid), and backend.handle(...).
plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java-1959-1960 (1)

1959-1960: ⚠️ Potential issue | 🟠 Major

When the root volume is deleted while the task is queued, this throws instead of replying with a failure.

getHostUuidFromVolume() dereferences the result of dbf.findByUuid() directly. The metadata update runs as an async chain task, and it is not unusual for rootVolumeUuid to be cleaned up before execution; do a null check first and return a structured error. Based on learnings, dbf.findByUuid() may return null when no record is found, so a null check is required.
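The null-guard pattern can be sketched in isolation; HostUuidResolver and its in-memory map are hypothetical stand-ins for the real DAO (dbf.findByUuid()):

```java
import java.util.Map;
import java.util.Optional;

// Model of checking a possibly-null lookup before dereferencing it, so a
// row deleted while the async task was queued becomes a structured miss
// rather than an NPE.
public class HostUuidResolver {
    public static class VolumeVO {
        final String hostUuid;
        public VolumeVO(String hostUuid) { this.hostUuid = hostUuid; }
    }

    private final Map<String, VolumeVO> db;

    public HostUuidResolver(Map<String, VolumeVO> db) { this.db = db; }

    // Mirrors dbf.findByUuid(): returns null when the row is gone.
    private VolumeVO findByUuid(String uuid) { return db.get(uuid); }

    // Returns empty instead of throwing when the root volume no longer exists.
    public Optional<String> getHostUuidFromVolume(String rootVolumeUuid) {
        VolumeVO vo = findByUuid(rootVolumeUuid);
        return vo == null ? Optional.empty() : Optional.ofNullable(vo.hostUuid);
    }
}
```

The caller can then map the empty case to a failure reply instead of letting an exception escape the chain task.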

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java`
around lines 1959 - 1960, do not use the return value of
getHostUuidFromVolume(msg.getRootVolumeUuid()) directly; getHostUuidFromVolume
dereferences the result of dbf.findByUuid(), which may return null when the
root volume was deleted while queued, leading to an NPE. Changes: in
NfsPrimaryStorage's handling logic, null/empty-check hostUuid (verify
msg.getRootVolumeUuid() is present and check getHostUuidFromVolume's return
value); if null or empty, do not throw, but construct and return a structured
failure response (e.g. a FailureReply/ReturnValue with a clear error code and
description), logging rootVolumeUuid and context for troubleshooting; also
guard against dbf.findByUuid() returning null inside getHostUuidFromVolume
and return null instead of dereferencing.
plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageKvmBackend.java-3801-3823 (1)

3801-3823: ⚠️ Potential issue | 🟠 Major

Don't call bus.reply() directly in this handle()

This branch receives a ReturnValueCompletion yet bypasses it and replies on the bus directly, which breaks the convention of the other handle(..., completion) methods in this class and can leave the upstream dispatch chain without a completion signal. It also puts self.getUuid() into the error message as if it were the VM UUID, when what actually gets printed is the primary storage UUID.

🛠 Suggested fix
-        bus.send(umsg, new CloudBusCallBack(msg) {
+        bus.send(umsg, new CloudBusCallBack(completion) {
             @Override
             public void run(MessageReply r) {
                 UpdateVmInstanceMetadataOnPrimaryStorageReply reply = new UpdateVmInstanceMetadataOnPrimaryStorageReply();
                 if (!r.isSuccess()) {
-                    reply.setError(Platform.operr("failed to update vm[uuid=%s] on hypervisor.", self.getUuid())
+                    reply.setError(Platform.operr(
+                            "failed to update VM metadata for rootVolume[uuid=%s] on host[uuid=%s].",
+                            msg.getRootVolumeUuid(), hostUuid)
                             .withCause(r.getError()));
                 }
-                bus.reply(msg, reply);
+                completion.success(reply);
             }
         });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageKvmBackend.java`
around lines 3801 - 3823, The handler currently calls bus.reply(...) inside the
async callback and uses self.getUuid() in the error message; instead, follow the
method contract that accepts
ReturnValueCompletion<UpdateVmInstanceMetadataOnPrimaryStorageReply> by invoking
the completion callbacks instead of bus.reply: inside the CloudBusCallBack.run,
construct an UpdateVmInstanceMetadataOnPrimaryStorageReply, and on success call
completion.success(reply) (or completion.complete(reply) per local helper) and
on failure call completion.fail(reply.getError()) (or completion.fail(...) with
the constructed ErrorCode); also fix the error message to reference the VM UUID
from the incoming msg (e.g. msg.getVmInstanceUuid()) instead of self.getUuid()
so the logged UUID is the VM's.
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java-3198-3208 (1)

3198-3208: ⚠️ Potential issue | 🟠 Major

The Registering state blocks read-only APIs too

Line 3202 only lets APIDestroyVmInstanceMsg through, so the APIGet*/read-only requests that follow also fail outright while the VM is Registering, contradicting the comment's "allow Destroy/read". Polling and queries during registration get hit as collateral; either intercept only write operations or complete the read-only whitelist.
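The "block only mutating APIs" idea can be sketched as a small predicate; the name-prefix check below is a toy stand-in, since the real code would test message classes rather than class-name strings:

```java
// Sketch of a guard that allows destroy plus read-only APIs while the VM
// is in the Registering state. Message names here are illustrative.
public class RegisteringStateGuard {
    public static boolean allowedWhileRegistering(String apiMessageClassName) {
        // Always allow destroy, plus any read-only query/get message.
        return apiMessageClassName.equals("APIDestroyVmInstanceMsg")
                || apiMessageClassName.startsWith("APIQuery")
                || apiMessageClassName.startsWith("APIGet");
    }
}
```

Messages that fail this check would still be rejected via bus.replyErrorByMessageType, as the existing code does.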

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java` around lines
3198 - 3208, The guard in handleApiMessage currently only allows
APIDestroyVmInstanceMsg during VmInstanceState.Registering, causing read/query
APIs to be blocked contrary to the comment; update the check to allow read-only
API messages (e.g., APIQuery*/APIGet* classes) or invert the logic to block only
write/mutating messages: locate handleApiMessage and the Registering-state
branch in VmInstanceBase, and either extend the whitelist beyond
APIDestroyVmInstanceMsg to include the relevant read message classes
(APIGetVmInstanceMsg, APIQueryVmInstanceMsg, etc.) or change the condition to
reject only known mutating message types rather than permitting a single destroy
type. Ensure bus.replyErrorByMessageType remains used for blocked messages.
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java-9447-9450 (1)

9447-9450: ⚠️ Potential issue | 🟠 Major

The root-volume query has no fallback, so a failure becomes a timeout

The findTuple() at line 9447 can come back empty during early registration, under concurrent deletion, or after a rollback; lines 9449/9450 then dereference it directly and throw, so VmMetadataDirtyMarker observes a timeout/no-reply rather than a normal failure reply. At minimum, null-check here and bus.reply(...) explicitly.
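The guard can be sketched in a self-contained way; Reply and the String[] tuple below are hypothetical stand-ins for the real message reply and the JPA Tuple:

```java
// Model of turning a missing root-volume row into an explicit failure
// reply instead of an NPE that the caller only sees as a timeout.
public class RootVolumeLookup {
    public static class Reply {
        public final boolean success;
        public final String error;
        Reply(boolean success, String error) { this.success = success; this.error = error; }
    }

    // tuple == null models Q.New(...).findTuple() finding no root volume.
    public static Reply handle(String vmUuid, String[] tuple) {
        if (tuple == null) {
            return new Reply(false, String.format(
                    "cannot find root volume of vm[uuid:%s]; it may have been deleted concurrently", vmUuid));
        }
        String primaryStorageUuid = tuple[0];
        String rootVolumeUuid = tuple[1];
        // ...build and send the metadata-update message with these values...
        return new Reply(true, null);
    }
}
```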

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java` around lines
9447 - 9450, The current code calls Q.New(VolumeVO.class)...findTuple() and
immediately dereferences the returned Tuple into primaryStorageUuid and
rootVolumeUuid, which can be null when the root volume is gone/never created;
modify the logic around the Tuple returned by findTuple() (in VmInstanceBase
where Tuple is assigned and primaryStorageUuid/rootVolumeUuid are set) to check
for null (or empty) before accessing elements, and if missing construct and send
an explicit failure reply on the bus (using the existing message/reply
mechanism) instead of letting a NullPointerException propagate and cause a
timeout; ensure the failure reply includes sufficient context (vm UUID and
reason) so callers like VmMetadataDirtyMarker observe a proper failure not a
timeout.
plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorageKVMBackend.java-2058-2064 (1)

2058-2064: ⚠️ Potential issue | 🟠 Major

Potential NullPointerException: installPath may be null

When no VolumeVO exists for rootVolumeUuid, findValue() returns null, and the subsequent replaceFirst call at line 2062 throws a NullPointerException.

Additionally, if installPath does not match the expected regex ^(.+/vol-[^/]+/).*$, replaceFirst returns the original string, producing a wrong metadataPath.

🐛 Suggested fix
         String installPath = Q.New(VolumeVO.class)
                 .eq(VolumeVO_.uuid, msg.getRootVolumeUuid())
                 .select(VolumeVO_.installPath)
                 .findValue();
+        if (installPath == null) {
+            completion.success(new UpdateVmInstanceMetadataOnPrimaryStorageReply());
+            return;
+        }
         String path = installPath.replaceFirst("^(.+/vol-[^/]+/).*$", "$1");
+        if (path.equals(installPath)) {
+            logger.warn(String.format("failed to extract volume directory from installPath[%s] for vm[uuid:%s]",
+                    installPath, msg.getVmInstanceUuid()));
+            completion.success(new UpdateVmInstanceMetadataOnPrimaryStorageReply());
+            return;
+        }
         String metadataPath = String.format("%s%s", path, VM_META_SUFFIX);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorageKVMBackend.java`
around lines 2058 - 2064, The code in NfsPrimaryStorageKVMBackend that builds
metadataPath uses installPath from VolumeVO without null or format checks; guard
against a null installPath (from Q.New(...).findValue()) and handle it (return
an error response, throw a meaningful exception, or set a safe default) before
calling replaceFirst, and ensure the installPath matches the expected pattern
(e.g., via Pattern/Matcher or String.matches) so you only apply replaceFirst
when the regex will produce the intended prefix; update the code paths around
installPath, path, and umsg.setMetadataPath to handle the null/invalid-format
cases with a clear error or fallback.
compute/src/main/java/org/zstack/compute/vm/VmExpungeMetadataFlow.java-51-71 (1)

51-71: ⚠️ Potential issue | 🟠 Major

Re-querying VolumeVO here short-circuits the metadata cleanup

The class comment already says this Flow runs after the root volume is deleted; by that point, lines 60-63 will most likely no longer find a primaryStorageUuid, and lines 65-70 then skip the cleanup outright, leaving the expunged VM's metadata behind on primary storage. Put the root PS UUID into the flow data before the volume is deleted, or take it from the root-volume information already present in VmInstanceSpec, instead of relying on the post-deletion VolumeVO record.
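The "stash before delete" idea can be sketched with a plain map standing in for the flow's data bag; the class and key names below are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of capturing the root volume's primaryStorageUuid *before* the
// volume-delete flow runs, so the later metadata-cleanup flow does not
// depend on a VolumeVO row that no longer exists.
public class ExpungeFlowData {
    public static final String ROOT_PS_UUID = "vmExpunge.rootPrimaryStorageUuid";

    // Runs before deletion: stash the value while the row still exists.
    public static void captureRootPsUuid(Map<String, Object> flowData, String psUuid) {
        flowData.put(ROOT_PS_UUID, psUuid);
    }

    // Runs in the cleanup flow: read the stashed value instead of
    // re-querying the (now deleted) VolumeVO.
    public static String rootPsUuidForCleanup(Map<String, Object> flowData) {
        return (String) flowData.get(ROOT_PS_UUID);
    }
}
```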

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeMetadataFlow.java`
around lines 51 - 71, The flow currently queries VolumeVO for the root volume's
primaryStorageUuid at runtime (using Q.New(VolumeVO.class) with
VolumeVO_.primaryStorageUuid), which can return null after the root volume is
deleted and short-circuit metadata cleanup; instead obtain the root
primaryStorageUuid from the pre-deletion data: read it from VmInstanceSpec
(spec.getVmInventory() / spec.getRootVolume() or existing root volume info) or
store the root primaryStorageUuid into the flow data before the volume-delete
flow step, and then use that stored value in VmExpungeMetadataFlow to drive
metadata cleanup rather than querying VolumeVO after deletion.
compute/src/main/java/org/zstack/compute/vm/metadata/MetadataStaleRecoveryTask.java-75-78 (1)

75-78: ⚠️ Potential issue | 🟠 Major

Multiple MNs will repeatedly process the same batch of stale records.

managementNodeReady() starts this task on every MN, but the query here and the subsequent staleRecoveryCount = staleRecoveryCount + 1 have no claim/CAS. When two MNs scan the same row at the same time, both markDirty() and both bump the counter, so maxCycles is burned down early, in proportion to the number of MNs. Claim the row first with a conditional update; only the winner may requeue and increment the counter.

Also applies to: 133-174
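The claim step can be modeled in memory; StaleRowClaimer is a hypothetical stand-in, with putIfAbsent playing the role of the conditional SQL UPDATE (e.g. "UPDATE ... SET owner = :mnId WHERE id = :id AND owner IS NULL"):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Model of the claim-before-process step: a conditional update that only
// one management node can win for a given row.
public class StaleRowClaimer {
    private final ConcurrentMap<String, String> ownerByRow = new ConcurrentHashMap<>();

    // Returns true only for the node whose conditional update succeeded;
    // only that node may requeue the row and bump staleRecoveryCount.
    public boolean tryClaim(String rowId, String mnUuid) {
        return ownerByRow.putIfAbsent(rowId, mnUuid) == null;
    }
}
```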

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/MetadataStaleRecoveryTask.java`
around lines 75 - 78, the task currently starts concurrently on every
management node (managementNodeReady()/startTask()), but pending stale entries
are not atomically claimed after being queried, so multiple MNs process the
same entry, each calling markDirty() and incrementing staleRecoveryCount,
exhausting maxCycles several times over; implement a conditional update (CAS)
to claim the record before processing, for example a conditional UPDATE, or
add an owner/ownerUuid column and set owner to the current MN's id with WHERE
staleRecoveryCount = <expected> or owner IS NULL; only the node whose update
succeeds may continue to requeue, increment staleRecoveryCount, and call
markDirty(), and apply the same change to the code blocks with the same
behavior (the methods/logic corresponding to lines 133-174 mentioned in the
scope).
docs/design/vm-metadata-04c-sblk写入流程.md-74-79 (1)

74-79: ⚠️ Potential issue | 🟠 Major

The summary-region offsets are defined inconsistently.

Here the VM summary region is written as [96, 928), yet SummaryChecksum hashes bytes[96:896]. Whether 896-928 holds summary data or the checksum is now unclear; a byte-layout document like this is best split into two explicit ranges.

📝 Suggest writing the data region and the checksum region separately
-  VM 摘要区 [96, 928):保持旧值不变(Phase 1 不更新摘要)
+  VM 摘要数据区 [96, 896):保持旧值不变(Phase 1 不更新摘要)
+  SummaryChecksum 区 [896, 928):保持旧值不变
...
-  VM 摘要区 [96, 928):
+  VM 摘要数据区 [96, 896):
     VmCategory     = vm_category       ← 此时更新
     VmUuid         = vm_uuid           ← 此时更新(首次写入后不变)
     VmNameLen      = len(vm_name_utf8) ← 此时更新
     VmName         = vm_name_utf8      ← 此时更新
   SummaryChecksum  = SHA-256(bytes[96:896])

Also applies to: 122-129

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-04c-sblk写入流程.md` around lines 74 - 79, Document
splits the VM summary region ambiguously: the header says VM summary is [96,
928) but SummaryChecksum is defined as hash of bytes[96:896], leaving bytes
896–928 unclear; update the layout to explicitly separate "summary data" and
"checksum" ranges, e.g., define Summary Data = bytes[96:896) and SummaryChecksum
= bytes[896:928), and adjust the SummaryChecksum description to state it is the
hash of bytes[96:896]; apply the same explicit interval-splitting and consistent
range references where the doc repeats the layout (the section covering lines
122–129).
compute/src/main/java/org/zstack/compute/vm/metadata/ResourceBasedVmUuidFromApiResolver.java-58-64 (1)

58-64: ⚠️ Potential issue | 🟠 Major

UserTagVO must also be queried to fully handle tag-deletion messages.

The current implementation only queries SystemTagVO, but APIDeleteTagMsg can delete either a system tag or a user tag. When a UserTagVO is deleted, the query returns null and resourceType / resourceUuid cannot be resolved. Follow the approach of TagManagerImpl.handle(APIDeleteTagMsg) and use a cascade lookup: query SystemTagVO first, then UserTagVO if absent.
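The cascade lookup can be sketched with two maps standing in for the SystemTagVO and UserTagVO tables; TagLookup and its nested Tag type are illustrative, not the real resolver:

```java
import java.util.Map;

// Sketch of the cascade: try SystemTagVO first, fall back to UserTagVO,
// mirroring TagManagerImpl's handling of APIDeleteTagMsg. The two maps
// stand in for dbf.findByUuid() against each table.
public class TagLookup {
    public static class Tag {
        public final String resourceType, resourceUuid;
        public Tag(String type, String uuid) { resourceType = type; resourceUuid = uuid; }
    }

    public static Tag findTag(String tagUuid,
                              Map<String, Tag> systemTags,
                              Map<String, Tag> userTags) {
        Tag t = systemTags.get(tagUuid);              // first: SystemTagVO
        return t != null ? t : userTags.get(tagUuid); // fallback: UserTagVO
    }
}
```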

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/ResourceBasedVmUuidFromApiResolver.java`
around lines 58 - 64, The APIDeleteTagMsg handling in
ResourceBasedVmUuidFromApiResolver currently only queries SystemTagVO via
dbf.findByUuid and misses UserTagVO cases; update the APIDeleteTagMsg branch in
the resolver to first attempt to load the tag as SystemTagVO (using
dbf.findByUuid), and if that returns null then query UserTagVO the same way,
then extract resourceType and resourceUuid from whichever VO is found; follow
the same cascade logic as TagManagerImpl.handle(APIDeleteTagMsg) so both
SystemTagVO and UserTagVO deletions are handled.
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java-1833-1864 (1)

1833-1864: ⚠️ Potential issue | 🟠 Major

The error is not set on the response on failure.

When registerVmInstance fails, the fail callback does not set the errorCode on the event, so the caller cannot learn the actual cause of the failure.

🐛 Suggested fix
                     @Override
                     public void fail(ErrorCode errorCode) {
+                        event.setError(errorCode);
                         bus.publish(event);
                         chain.next();
                     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java`
around lines 1833 - 1864, The APIRegisterVmInstanceMsg handler does not populate
the error into the reply when registerVmInstance fails; in the fail callback of
registerVmInstance inside handle(APIRegisterVmInstanceMsg) set the error on the
APIRegisterVmInstanceReply (event) using the provided ErrorCode (e.g.,
event.setError(errorCode) or equivalent setter) before publishing the event so
callers receive the failure details.
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java-898-917 (1)

898-917: ⚠️ Potential issue | 🟠 Major

Batch initialization / upgrade refresh bypasses the "KVM UserVm only" constraint.

These two sets of SQL only filter on v.type = 'UserVm', with no hypervisor restriction. As a result, enabling the feature or upgrading also creates dirty rows for non-KVM UserVms, and the poller then sends them UpdateVmInstanceMetadataMsg as well. Since markDirty() already limits the scope to KVM UserVms, add the same predicate here.

💡 Suggested fix
-                "WHERE v.type = 'UserVm' AND v.uuid > :lastUuid " +
+                "WHERE v.type = 'UserVm' AND v.hypervisorType = :hvType AND v.uuid > :lastUuid " +

Both SQL statements in submitFullRefresh(), as well as the initialization SQL and paging SQL in submitBatchInitialization(), need this condition with the same bound parameter.

Also applies to: 995-1003

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 898 - 917, The batch INSERT/UPDATE SQL in submitFullRefresh() (and
the initialization/paging SQL in submitBatchInitialization()) only filter on
v.type = 'UserVm' and thus wrongly include non-KVM UserVm; update those SQL
statements to also filter by hypervisor (the same predicate used in markDirty(),
e.g., add AND v.hypervisorType = :hypervisor or equivalent) and bind the new
:hypervisor parameter in the SQL.New(...).param(...) calls so only KVM UserVm
rows are inserted/updated and the poller won’t target non-KVM VMs.
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java-739-748 (1)

739-748: ⚠️ Potential issue | 🟠 Major

Exhausting retries on the first flush deletes the recovery entry point too.

This updates VmMetadataPathFingerprintVO.lastFlushFailed first and then deletes the dirty row directly; but lines 833-846 show the fingerprint is only created after a successful flush. So when the very first sync fails up to the limit, this UPDATE most likely hits zero rows, the dirty row is then deleted, and the recovery task has no record left to take over. At minimum, upsert a fingerprint/stale record before deleting, or keep the dirty row and explicitly mark it stale.

Also applies to: 833-846
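The upsert-before-delete ordering can be sketched with maps standing in for the two tables; DirtyRowRetirement and its Fingerprint type are hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

// Model of the fix: before dropping the dirty row on retry exhaustion,
// upsert a fingerprint record and mark it failed, so the recovery task
// always keeps a handle to the VM (a bare UPDATE can hit zero rows).
public class DirtyRowRetirement {
    public static class Fingerprint { public boolean lastFlushFailed; }

    public static void retireDirtyRow(String vmUuid,
                                      Map<String, Fingerprint> fingerprints,
                                      Map<String, Object> dirtyRows) {
        // Upsert: create the fingerprint when it does not exist yet.
        Fingerprint fp = fingerprints.computeIfAbsent(vmUuid, k -> new Fingerprint());
        fp.lastFlushFailed = true;
        dirtyRows.remove(vmUuid); // now safe: recovery can find the fingerprint
    }
}
```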

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 739 - 748, The current sequence updates
VmMetadataPathFingerprintVO.lastFlushFailed then deletes the VmMetadataDirtyVO
row, but if no fingerprint row exists (first flush failure) the UPDATE affects 0
rows and deleting the dirty row removes any recovery handle; fix by ensuring a
fingerprint/stale record exists before deleting the dirty entry: either upsert a
VmMetadataPathFingerprintVO with lastFlushFailed=true (create if missing) or set
a “stale” flag on the VmMetadataDirtyVO and skip deletion; modify the code path
around the SQL.New(...) UPDATE and the subsequent
SQL.New(VmMetadataDirtyVO.class)...delete() calls so you perform an
upsert/create of VmMetadataPathFingerprintVO (or mark VmMetadataDirtyVO as
stale) prior to deleting the dirty row, ensuring recovery tasks can find the
record.
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java-490-496 (1)

490-496: ⚠️ Potential issue | 🟠 Major

The read-back condition in claimDirtyRows() is not precise enough.

Using lastClaimTime >= now-5s to read back the rows "claimed this cycle" also pulls in rows that triggerFlushForVm() grabbed on the same MN within the last 5 seconds. Combined with the dedup callback at lines 550-553, the duplicate submission releases a claim that is still mid-flush, after which other MNs can concurrently take over the same VM. Write this cycle's claimAt explicitly into lastClaimTime, then read back precisely by managementNodeUuid + claimAt.

💡 Suggested fix
-        int claimed = SQL.New("UPDATE VmMetadataDirtyVO " +
-                "SET managementNodeUuid = :myId, lastClaimTime = CURRENT_TIMESTAMP " +
+        Timestamp claimAt = Timestamp.from(Instant.now());
+        int claimed = SQL.New("UPDATE VmMetadataDirtyVO " +
+                "SET managementNodeUuid = :myId, lastClaimTime = :claimAt " +
                 "WHERE managementNodeUuid IS NULL " +
                 "AND (nextRetryTime IS NULL OR nextRetryTime <= CURRENT_TIMESTAMP) " +
                 "ORDER BY lastOpDate ASC, vmInstanceUuid ASC " +
                 "LIMIT :batchSize")
                 .param("myId", Platform.getManagementServerId())
+                .param("claimAt", claimAt)
                 .param("batchSize", VmGlobalConfig.VM_METADATA_DIRTY_BATCH_SIZE.value(Integer.class))
                 .execute();
@@
-        Timestamp thisCycleCutoff = Timestamp.from(Instant.now().minus(Duration.ofSeconds(5)));
         return Q.New(VmMetadataDirtyVO.class)
                 .eq(VmMetadataDirtyVO_.managementNodeUuid, Platform.getManagementServerId())
-                .gte(VmMetadataDirtyVO_.lastClaimTime, thisCycleCutoff)
+                .eq(VmMetadataDirtyVO_.lastClaimTime, claimAt)
                 .list();

Also applies to: 550-553

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 490 - 496, In claimDirtyRows(), instead of using a time range
filter on lastClaimTime, record an explicit claimId/timestamp into each row
(e.g. set VmMetadataDirtyVO.claimAt = now when claiming in the claimDirtyRows()
flow) and then query rows by managementNodeUuid AND that exact claimAt value
(use VmMetadataDirtyVO_.claimAt in the Q.New(...).eq(...) call) so only rows
claimed in this CAS cycle are returned; also update the deduplication/release
logic around triggerFlushForVm()/the callback (the code at the current dedupe
block referenced by lines ~550-553) to use managementNodeUuid + claimAt for
identifying owned rows instead of lastClaimTime.
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataUpdateInterceptor.java-149-157 (1)

149-157: ⚠️ Potential issue | 🟠 Major

Do not let resolver exceptions bleed back into the main API flow.

resolveVmUuids() invokes supports()/resolveVmUuids() directly without isolating any resolver exception. If a single resolver throws a RuntimeException, it propagates out of BeforeDeliveryMessageInterceptor and takes the main API down with it; metadata tracking here should be best-effort, not a hard dependency of the main flow. Catch and log per resolver, continue with the next one, and degrade to an empty result only when every resolver fails.

💡 Suggested fix
     private List<String> resolveVmUuids(APIMessage msg) {
         for (VmUuidFromApiResolver resolver : resolvers) {
-            if (resolver.supports(msg)) {
-                List<String> vmUuids = resolver.resolveVmUuids(msg);
-                if (vmUuids != null && !vmUuids.isEmpty()) {
-                    return vmUuids;
-                }
+            try {
+                if (!resolver.supports(msg)) {
+                    continue;
+                }
+
+                List<String> vmUuids = resolver.resolveVmUuids(msg);
+                if (vmUuids != null && !vmUuids.isEmpty()) {
+                    return vmUuids;
+                }
+            } catch (Exception e) {
+                logger.warn(String.format("failed to resolve vmUuids from %s via %s",
+                        msg.getClass().getName(), resolver.getClass().getName()), e);
             }
         }
         logger.debug(String.format("no resolver could extract vmUuids from %s", msg.getClass().getName()));
         return Collections.emptyList();
     }

Also applies to: 205-215

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataUpdateInterceptor.java`
around lines 149 - 157, The call to resolveVmUuids() must not let a resolver's
RuntimeException bubble up and fail the main API; update the implementation in
VmMetadataUpdateInterceptor so that when iterating resolver instances you wrap
each resolver.supports(...) and resolver.resolveVmUuids(...) call in a
try/catch, log the exception (including resolver identity and the exception) and
continue to the next resolver, collecting results; if all resolvers throw or
none return values, treat the result as empty and proceed (so pendingApis.put
still receives an empty vmUuids list). Apply the same per-resolver exception
isolation to the other resolveVmUuids usage around the 205-215 area to ensure
metadata resolution is best-effort and never aborts the API flow.
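The per-resolver isolation the review asks for can be sketched as a small, self-contained example. The `Resolver` interface, `BestEffortChain` class, and the String-typed message are illustrative stand-ins, not the PR's actual `VmUuidFromApiResolver`/`APIMessage` types:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for VmUuidFromApiResolver; only the isolation pattern matters here.
interface Resolver {
    boolean supports(String msg);
    List<String> resolveVmUuids(String msg);
}

public class BestEffortChain {
    // Walk the resolver chain, catching each resolver's failures so one broken
    // resolver cannot abort the others; degrade to an empty list if all fail.
    public static List<String> resolveVmUuids(List<Resolver> resolvers, String msg) {
        for (Resolver r : resolvers) {
            try {
                if (!r.supports(msg)) {
                    continue;
                }
                List<String> uuids = r.resolveVmUuids(msg);
                if (uuids != null && !uuids.isEmpty()) {
                    return uuids;
                }
            } catch (RuntimeException e) {
                // real code would log the resolver identity plus the exception and move on
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        Resolver broken = new Resolver() {
            public boolean supports(String msg) { return true; }
            public List<String> resolveVmUuids(String msg) { throw new IllegalStateException("boom"); }
        };
        Resolver working = new Resolver() {
            public boolean supports(String msg) { return true; }
            public List<String> resolveVmUuids(String msg) { return Arrays.asList("vm-1"); }
        };
        List<String> out = resolveVmUuids(Arrays.asList(broken, working), "msg");
        if (!out.equals(Arrays.asList("vm-1"))) {
            throw new AssertionError("expected [vm-1] but got " + out);
        }
        System.out.println("ok");
    }
}
```

A broken resolver early in the chain is skipped and the next one still gets a chance; with no resolvers, or all of them failing, the result degrades to an empty list instead of an exception.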
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java-518-526 (1)

518-526: ⚠️ Potential issue | 🟠 Major

The global concurrency limiter has a race window here.

The read at line 520 and the increment at line 525 are not atomic; concurrent submitFlushTask() calls can collectively cross VM_METADATA_GLOBAL_MAX_CONCURRENT, so the limit is not actually enforced. This needs a CAS loop, or an atomic "increment first, then check the limit and roll back" path.

💡 Suggested fix
-        if (globalFlushInFlight.get() >= maxConcurrent) {
-            // 全局并发已满,释放 claim,Poller 下轮重试
-            releaseClaim(vmUuid);
-            return;
-        }
-        globalFlushInFlight.incrementAndGet();
+        while (true) {
+            int inFlight = globalFlushInFlight.get();
+            if (inFlight >= maxConcurrent) {
+                // 全局并发已满,释放 claim,Poller 下轮重试
+                releaseClaim(vmUuid);
+                return;
+            }
+            if (globalFlushInFlight.compareAndSet(inFlight, inFlight + 1)) {
+                break;
+            }
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 518 - 526, The global concurrency check in VmMetadataDirtyMarker is
racy because you read globalFlushInFlight then call
globalFlushInFlight.incrementAndGet(), allowing multiple threads to exceed
VM_METADATA_GLOBAL_MAX_CONCURRENT; change to an atomic check-and-increment
pattern (e.g. use getAndIncrement() or incrementAndGet() then immediately test
the returned value against maxConcurrent and, if it exceeds the limit,
decrement/back out and call releaseClaim(vmUuid) before returning) so the path
in submitFlushTask()/wherever this block lives is atomic; ensure you reference
and update the same AtomicInteger (globalFlushInFlight) and call
releaseClaim(vmUuid) on the rollback path.
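The atomic check-and-increment pattern suggested above can be sketched in isolation. `FlushGate` and its method names are hypothetical, not classes from the PR; only the CAS loop mirrors the proposed fix:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of an admission gate with an atomic check-and-increment.
public class FlushGate {
    private final AtomicInteger inFlight = new AtomicInteger(0);
    private final int maxConcurrent;

    public FlushGate(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    // Claim a slot atomically: the read and the increment commit together in one
    // CAS loop, so concurrent callers can never push inFlight past maxConcurrent.
    public boolean tryAcquire() {
        while (true) {
            int current = inFlight.get();
            if (current >= maxConcurrent) {
                return false; // saturated; the caller releases its claim and retries later
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    public void release() {
        inFlight.decrementAndGet();
    }
}
```

Because the read and the increment commit together via compareAndSet, two threads racing at `inFlight == maxConcurrent - 1` cannot both acquire: one CAS fails and that thread re-reads the now-saturated counter.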
🟡 Minor comments (27)
conf/db/upgrade/V4.10.29__schema.sql-21-21 (1)

21-21: ⚠️ Potential issue | 🟡 Minor

Index creation lacks an idempotency check

The CREATE INDEX statement performs no existence check, so re-running the script raises an error. Add an existence check to keep the upgrade script idempotent.

🛠️ Suggested fix
-- Poller CAS claim query optimization: WHERE managementNodeUuid IS NULL AND lastClaimTime ... AND nextRetryTime <= NOW()
-CREATE INDEX `idxVmMetadataDirtyUnclaimed` ON `VmMetadataDirtyVO` (`managementNodeUuid`, `lastClaimTime`, `nextRetryTime`);
+DROP PROCEDURE IF EXISTS AddIndexIfNotExists;
+DELIMITER $$
+CREATE PROCEDURE AddIndexIfNotExists()
+BEGIN
+    IF NOT EXISTS (
+        SELECT 1 FROM INFORMATION_SCHEMA.STATISTICS
+        WHERE TABLE_SCHEMA = 'zstack'
+          AND TABLE_NAME = 'VmMetadataDirtyVO'
+          AND INDEX_NAME = 'idxVmMetadataDirtyUnclaimed'
+    ) THEN
+        CREATE INDEX `idxVmMetadataDirtyUnclaimed` ON `VmMetadataDirtyVO` (`managementNodeUuid`, `lastClaimTime`, `nextRetryTime`);
+    END IF;
+END$$
+DELIMITER ;
+CALL AddIndexIfNotExists();
+DROP PROCEDURE IF EXISTS AddIndexIfNotExists;

Based on learnings: "ZStack database upgrade scripts ... must check the existence of tables and constraints via information_schema so the scripts can be safely re-executed."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@conf/db/upgrade/V4.10.29__schema.sql` at line 21, The CREATE INDEX for
idxVmMetadataDirtyUnclaimed on table VmMetadataDirtyVO lacks an existence check
and will fail if the script is re-run; modify the upgrade SQL to first query
information_schema.statistics (filtering by table_schema,
table_name='VmMetadataDirtyVO' and index_name='idxVmMetadataDirtyUnclaimed') and
only execute CREATE INDEX `idxVmMetadataDirtyUnclaimed` ON `VmMetadataDirtyVO`
(...) when that query returns no rows, ensuring idempotency of the schema
change.
header/src/main/java/org/zstack/header/vm/APIRegisterVmInstanceFromMetadataEvent.java-8-11 (1)

8-11: ⚠️ Potential issue | 🟡 Minor

@RestResponse(allTo = "inventory") may drop the warnings field from the REST response

With allTo = "inventory", every response field is mapped onto inventory. But this class carries both inventory and warnings, so warnings may fail to serialize into the REST response.

Map both fields explicitly with fieldsTo instead:

💡 Suggested change
-@RestResponse(allTo = "inventory")
+@RestResponse(fieldsTo = {"inventory", "warnings"})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/vm/APIRegisterVmInstanceFromMetadataEvent.java`
around lines 8 - 11, The `@RestResponse`(allTo = "inventory") on class
APIRegisterVmInstanceFromMetadataEvent causes the warnings field to be omitted
from REST responses; change the annotation to explicitly map both fields (e.g.
use `@RestResponse`(fieldsTo = {"inventory", "warnings"}) or equivalent supported
syntax) so both the VmInstanceInventory inventory and List<String> warnings are
included in the serialized response.
docs/design/vm-metadata-07a-单元测试计划.md-222-222 (1)

222-222: ⚠️ Potential issue | 🟡 Minor

The interval notation here should be corrected.

[96:928) looks like a typo and can mislead implementers about the offset range; use the conventional half-open notation, e.g. [96, 928).

✏️ Suggested change
-| UT-BIN-20 | 摘要区序列化 | vmUuid(32) + vmName(256) + vmCategory(1) | 写入 [96:928) 区域 |
+| UT-BIN-20 | 摘要区序列化 | vmUuid(32) + vmName(256) + vmCategory(1) | 写入 [96, 928) 区域 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-07a-单元测试计划.md` at line 222, Update the interval
notation for UT-BIN-20's write range to use a standard half-open format: change
the current "[96:928)" to "[96, 928)" in the line describing "UT-BIN-20 | 摘要区序列化
| vmUuid(32) + vmName(256) + vmCategory(1)" so implementers clearly understand
the offset range.
compute/src/main/java/org/zstack/compute/vm/AbstractVmInstance.java-208-213 (1)

208-213: ⚠️ Potential issue | 🟡 Minor

The comment disagrees with the allow-list.

Lines 208-209 state that the Registering state permits "metadata-related reads", but only ChangeVmMetaDataMsg and the two destroy messages are actually registered here. If the registration flow really issues metadata read requests, validateOperationByState() will still block them; if it does not, the comment should drop that wording so it stops stating a contract that does not exist.

As per coding guidelines, "longer comments need careful proofreading and must be updated together with the code to stay correct."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/AbstractVmInstance.java` around
lines 208 - 213, The comment for VmInstanceState.Registering in
AbstractVmInstance is inconsistent with the allowedOperations.addState call: it
claims "metadata-related reads" are permitted but only ChangeVmMetaDataMsg,
APIDestroyVmInstanceMsg, and DestroyVmInstanceMsg are actually registered;
update one of the two to match the intent—either add the specific metadata-read
message classes to allowedOperations.addState (so validateOperationByState()
won’t block those reads) or remove/adjust the "metadata-related reads" wording
from the comment to reflect that only ChangeVmMetaDataMsg and destroy messages
are allowed; reference VmInstanceState.Registering, allowedOperations.addState,
ChangeVmMetaDataMsg, APIDestroyVmInstanceMsg, DestroyVmInstanceMsg and
validateOperationByState() when making the change.
docs/design/vm-metadata-07d-性能与补充测试.md-91-95 (1)

91-95: ⚠️ Potential issue | 🟡 Minor

Fix the column count in the Poller throughput table.

Line 95 has five columns while the header only has four; Markdown swallows the last cell when rendering, and the < 5ms baseline is lost. Fold "450 of the 500 dirty rows have nextRetryTime > now" back into the scenario column to match the header.

📝 Suggested change
-| PERF-32 | 大量退避行跳过 | 500 dirty 行中 450 行有 nextRetryTime > now | WHERE 过滤效率 | < 5ms |
+| PERF-32 | 大量退避行跳过(500 dirty 行中 450 行有 nextRetryTime > now) | WHERE 过滤效率 | < 5ms |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-07d-性能与补充测试.md` around lines 91 - 95, The table row
for PERF-32 has five columns causing Markdown to drop the last cell; edit the
PERF-32 row so it has exactly four pipe-separated columns to match the header by
moving "500 dirty 行中 450 行有 nextRetryTime > now" into the 场景 column (so the row
becomes: PERF-32 | 大量退避行跳过 — 500 dirty 行中 450 行有 nextRetryTime > now | WHERE
过滤效率 | < 5ms), ensuring the benchmark "< 5ms" remains in the last column.
docs/design/vm-metadata-04e-sblk运维与IO.md-7-20 (1)

7-20: ⚠️ Potential issue | 🟡 Minor

Add language tags to these fenced code blocks.

All of these blocks currently trigger MD040. Most are pseudocode, so tag them uniformly as text; keep the Python snippets tagged python.

Also applies to: 44-62, 87-96, 100-111, 115-125, 129-137, 214-218, 224-251

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-04e-sblk运维与IO.md` around lines 7 - 20, Update the
fenced code blocks in the document to include explicit language markers to avoid
MD040: mark pseudocode blocks (including the example VM metadata block and the
scan_metadata_lvs pseudocode) as "text" and keep any real Python snippets marked
as "python"; specifically search for and update the blocks containing the VM
metadata example, the scan_metadata_lvs pseudocode, and the ranges noted (44-62,
87-96, 100-111, 115-125, 129-137, 214-218, 224-251) so each fenced block starts
with ```text or ```python as appropriate.
docs/design/vm-metadata-04e-sblk运维与IO.md-148-152 (1)

148-152: ⚠️ Potential issue | 🟡 Minor

_io_sanity_check()'s signature and implementation do not match.

The call here passes only fd, yet the function body uses lv_path directly. Implemented as this pseudocode stands, a failed sanity check raises NameError first, and the real I/O root cause is lost.

🛠 Suggested change
 def initialize_metadata_lv(lv_path, lv_size):
     fd = os.open(lv_path, os.O_RDWR | os.O_DIRECT | os.O_SYNC)
     try:
         # Step 0: O_DIRECT sanity check
-        _io_sanity_check(fd)
+        _io_sanity_check(fd, lv_path)
@@
-def _io_sanity_check(fd):
+def _io_sanity_check(fd, lv_path):

Also applies to: 191-206

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-04e-sblk运维与IO.md` around lines 148 - 152, The
_io_sanity_check signature and usage are mismatched: initialize_metadata_lv
calls _io_sanity_check(fd) but the implementation uses lv_path and will raise
NameError, hiding the real I/O error; fix by updating the call sites (e.g., in
initialize_metadata_lv and the similar block at 191-206) to pass both fd and
lv_path (or else change _io_sanity_check to accept only fd and stop referencing
lv_path), and ensure the _io_sanity_check function signature and its
implementation consistently use the same parameters so real I/O errors are
propagated instead of NameError.
docs/design/vm-metadata-04d-sblk读取与恢复.md-5-23 (1)

5-23: ⚠️ Potential issue | 🟡 Minor

Add language tags to the code blocks to stop the recurring MD040 warnings.

None of these fenced code blocks declare a language. Most are pseudocode and can be tagged text; tag the genuine Python snippets python separately.

Also applies to: 31-44, 50-66, 72-92, 109-118, 128-153, 159-169, 173-192, 196-211, 231-259, 263-289, 293-302, 351-368, 378-390

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-04d-sblk读取与恢复.md` around lines 5 - 23, Fenced code
blocks in the file are missing language tags causing MD040 warnings; update each
block (including the one around read_metadata(lv_path, lv_size) and others
listed) to include an appropriate language token (use "text" for pseudocode
blocks and "python" for real Python snippets) so linters stop flagging them;
ensure you update blocks that contain identifiers like read_metadata, Header,
SlotAOffset, SlotACapacity, SlotBOffset, SlotBCapacity and the other ranges
called out (31-44, 50-66, etc.) so every fenced block has a language specifier.
docs/design/vm-metadata-04b-sblk二进制布局.md-283-288 (1)

283-288: ⚠️ Potential issue | 🟡 Minor

The heading uses the Japanese kanji "不変量" instead of Simplified Chinese "不变量".

"不変" is a Japanese form; Simplified Chinese writes "不变". Correct the heading to "约束与不变量".

📝 Suggested fix
-## 3. 约束与不変量
+## 3. 约束与不变量
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-04b-sblk二进制布局.md` around lines 283 - 288, The header
and occurrences use the Japanese kanji "不変量" instead of the Simplified Chinese
term "不变量"; update the section title and any in-text occurrences (e.g., the
header "3. 约束与不変量" and the labeled items mentioning "不変量") to use "不变量"
consistently throughout the document, preserving surrounding punctuation and
formatting.
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java-7382-7387 (1)

7382-7387: ⚠️ Potential issue | 🟡 Minor

Do not backfill metadata on failure

The current branch still runs reply.setMetadata(rsp.metadata) when rsp.isSuccess() is false, handing the caller both an error and data and making the response semantics inconsistent.

💡 Suggested change
         new Http<>(readVmInstanceMetadataPath, cmd, ReadVmInstanceMetadataRsp.class).call(new ReturnValueCompletion<ReadVmInstanceMetadataRsp>(msg) {
             `@Override`
             public void success(ReadVmInstanceMetadataRsp rsp) {
                 if (!rsp.isSuccess()) {
                     reply.setError(operr("operation error, because:%s", rsp.getError()));
+                } else {
+                    reply.setMetadata(rsp.metadata);
                 }
-                reply.setMetadata(rsp.metadata);
                 bus.reply(msg, reply);
                 completion.done();
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java` around lines 7382 -
7387, In the success(ReadVmInstanceMetadataRsp rsp) handler, avoid setting reply
metadata when rsp.isSuccess() is false: if rsp.isSuccess() is false, set the
error via reply.setError(...) and immediately reply (bus.reply(msg, reply)) or
return, and only call reply.setMetadata(rsp.metadata) in the successful branch;
update the logic around reply.setError, reply.setMetadata, and bus.reply(msg,
reply) so callers never receive both an error and metadata.
docs/design/vm-metadata-02-脏标记与Poller.md-331-331 (1)

331-331: ⚠️ Potential issue | 🟡 Minor

Typo in the document

On line 331, "兗底" should be "兜底".

📝 Suggested fix
-对注册场景,即使元数据暂时落后于 DB,Part 3 §3.4 的 installPath 存在性检查提供额外兗底。
+对注册场景,即使元数据暂时落后于 DB,Part 3 §3.4 的 installPath 存在性检查提供额外兜底。
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-02-脏标记与Poller.md` at line 331, Replace the typo "兗底"
with the correct word "兜底" in the sentence that reads "对注册场景,即使元数据暂时落后于 DB,Part
3 §3.4 的 installPath 存在性检查提供额外兗底。" (referencing the phrase "installPath 存在性检查"
and "兗底"); also scan the document for other occurrences of "兗底" and correct them
to "兜底" to ensure consistency.
docs/design/vm-metadata-01a-数据模型与序列化.md-315-319 (1)

315-319: ⚠️ Potential issue | 🟡 Minor

Duplicated list numbering

Lines 317 and 318 both use the number "4.", a list formatting error.

📝 Suggested fix
 3. 初始 ResourceConfig 白名单:`NESTED_VIRTUALIZATION`、`VM_CPU_QUOTA`、`VM_CLOCK_TRACK`、`LIBVIRT_CACHE_MODE` 等
 4. 可通过 `@NeedRestoreOnVmApplySnapshot` 注解自动扩展白名单
-4. 未命中白名单的 SystemTag/ResourceConfig 不写入元数据
+5. 未命中白名单的 SystemTag/ResourceConfig 不写入元数据
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-01a-数据模型与序列化.md` around lines 315 - 319, Fix the
duplicated list numbering by renumbering the repeated "4." entries so the
sequence is continuous: change the line containing
"`@NeedRestoreOnVmApplySnapshot` 注解自动扩展白名单" from "4." to "5." and change the
line "未命中白名单的 SystemTag/ResourceConfig 不写入元数据" from "4." to "6." so the ordered
list reads 2, 3, 4, 5, 6.
docs/design/vm-metadata-02-脏标记与Poller.md-147-191 (1)

147-191: ⚠️ Potential issue | 🟡 Minor

Missing try block in the pseudocode

In the markDirty pseudocode, line 187 has } catch (Exception e) { but the matching try { statement never appears before it. The example cannot compile as shown and may mislead implementers.

📝 Suggested fix
 public boolean markDirty(String vmInstanceUuid, boolean storageStructureChange) {
     // Q23 修复:返回 boolean 表示标脏是否成功(供 MetadataStaleRecoveryTask DP-03 使用)
     // 前置检查:功能开关
     if (!VmGlobalConfig.VM_METADATA_ENABLED.value(Boolean.class)) {
         return false;
     }

+    try {
     // Q2-2: Galera 集群兼容写法,避免 INSERT ON DUPLICATE KEY 在高并发下死锁
     // Step 1: INSERT IGNORE(新行)
     int inserted = SQL.New("INSERT IGNORE INTO VmMetadataDirtyVO (vmInstanceUuid, dirtyVersion, storageStructureChange) " +
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-02-脏标记与Poller.md` around lines 147 - 191, The
markDirty method's pseudo-code ends with a catch block but lacks the
corresponding try; wrap the DB operations and triggerFlushForVm call inside a
try { ... } so the existing catch (Exception e) in markDirty can catch
exceptions; locate the markDirty method (references: markDirty, SQL.New(...)
calls, VmMetadataDirtyVO, triggerFlushForVm) and add the try starting before the
INSERT IGNORE sequence and closing just before the catch to ensure proper
exception handling and compilation.
plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorageKVMBackend.java-2053-2053 (1)

2053-2053: ⚠️ Potential issue | 🟡 Minor

Missing @Override annotation

The method at line 2053 implements handle as defined on the NfsPrimaryStorageBackend interface and should carry @Override so the compiler verifies the signature. Other implementations such as LocalStorageKvmBackend already use the annotation correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorageKVMBackend.java`
at line 2053, Add the missing `@Override` annotation to the handle method in
NfsPrimaryStorageKVMBackend: locate the method public void
handle(UpdateVmInstanceMetadataOnPrimaryStorageMsg msg, String hostUuid,
ReturnValueCompletion<UpdateVmInstanceMetadataOnPrimaryStorageReply> completion)
and annotate it with `@Override` to indicate it implements the
NfsPrimaryStorageBackend.handle(...) contract and enable compile-time signature
checking, consistent with other implementations like LocalStorageKvmBackend.
docs/design/vm-metadata-01c-存储层与模板虚拟机.md-227-243 (1)

227-243: ⚠️ Potential issue | 🟡 Minor

These fenced code blocks need language tags

They keep triggering MD040 and also render without syntax highlighting. Tag each block text, java, sql, or whatever language actually applies.

Also applies to: 287-300, 372-383

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-01c-存储层与模板虚拟机.md` around lines 227 - 243, The fenced
code blocks (e.g., the block containing "INSERT IGNORE INTO VmMetadataDirtyVO
..." and the steps referencing markDirty(), initializeMetadataOnTargetPS(),
readMetadata(), and nextRetryTime) lack language labels causing MD040 and
missing syntax highlighting; fix by adding appropriate language identifiers
(e.g., sql, text, or java) to each fenced block mentioned (lines around the
INSERT/UPDATE block, and the other blocks at the reported ranges) so the
Markdown linter stops flagging MD040 and rendering shows correct highlighting.
docs/design/vm-metadata-01c-存储层与模板虚拟机.md-5-9 (1)

5-9: ⚠️ Potential issue | 🟡 Minor

Several TOC anchors no longer jump to their target headings

The fragment on line 6 does not cover the "(H3 修复)" suffix in the actual heading, and lines 8-9 point at section names that do not exist in the file, so clicking those TOC entries fails outright. Realign the TOC entries with the real headings.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-01c-存储层与模板虚拟机.md` around lines 5 - 9, The TOC anchors
don't match actual headings: update the TOC entries so each anchor text exactly
matches the target heading text (for example align "存储迁移 Poller 暂停的崩溃恢复" in the
TOC with the heading that includes the "(H3 修复)" suffix, and remove or rename
TOC items that currently point to non‑existent sections like the incorrect
"模板虚拟机与链式克隆元数据" variant); ensure any headings referenced in the TOC (e.g., the
H3 heading that contains "(H3 修复)") exist verbatim in the document or change the
TOC to the actual heading strings, and re-generate/verify the markdown anchors
so clicking each TOC entry navigates to the correct in-document heading.
header/src/main/java/org/zstack/header/storage/primary/APIGetVmInstanceMetadataFromPrimaryStorageMsg.java-9-13 (1)

9-13: ⚠️ Potential issue | 🟡 Minor

The path still uses a singular resource name

This new endpoint uses /primary-storage/..., which is inconsistent with the repository's API design conventions; change it to a plural resource path before the interface solidifies, so the SDK and docs do not keep entrenching this naming.

As per coding guidelines, "path: use the plural form for resources."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/storage/primary/APIGetVmInstanceMetadataFromPrimaryStorageMsg.java`
around lines 9 - 13, The `@RestRequest` path in class
APIGetVmInstanceMetadataFromPrimaryStorageMsg uses singular "primary-storage"
which violates the guideline to use plural resource names; update the path value
in the `@RestRequest` annotation on APIGetVmInstanceMetadataFromPrimaryStorageMsg
to use the plural form (e.g. "/primary-storages/vm-instances/metadata") so it
matches repository API naming conventions and prevents inconsistent SDK/docs
naming.
docs/design/vm-metadata-02b-高可用与运维.md-721-721 (1)

721-721: ⚠️ Potential issue | 🟡 Minor

Heading levels skip a step.

### 10.0 appears directly under # 10, and #### D1... directly under ## 12.4; this triggers MD001 and destabilizes the TOC hierarchy.

Also applies to: 829-829

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-02b-高可用与运维.md` at line 721, The document has skipped
heading levels (e.g., under "# 10" there is "### 10.0 容量公式与常量(QX-8)" and under
"## 12.4" there is "#### D1..."), triggering MD001 and breaking the TOC; fix by
normalizing heading levels so each heading increments by only one level from its
parent (for example change "### 10.0 容量公式与常量(QX-8)" to "## 10.0 容量公式与常量(QX-8)"
or adjust the parent "# 10" to a higher level as appropriate, and similarly
demote or promote "#### D1..." to "### D1..."), apply the same correction at the
other occurrence (line ~829) to ensure consistent, non-skipped heading hierarchy
throughout.
docs/design/vm-metadata-04c-sblk写入流程.md-43-53 (1)

43-53: ⚠️ Potential issue | 🟡 Minor

These fenced code blocks also need language tags.

They currently keep triggering markdownlint MD040. Tagging the sequence diagrams, state diagrams, and pseudocode uniformly as text is sufficient.

Also applies to: 57-80, 84-101, 105-133, 173-194, 232-267

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-04c-sblk写入流程.md` around lines 43 - 53, The fenced
code blocks containing pseudocode like the snippet with symbols target_slot,
new_seq, new_lv_size, calculate_extend_size, lvextend, and calculate_slot_layout
should include a language tag to avoid markdownlint MD040; update those
triple-backtick blocks to use ```text (e.g., replace ``` with ```text) for this
file and the other similar blocks (the other pseudocode/diagram fenced regions)
so the linter treats them as plain text.
docs/design/vm-metadata-02b-高可用与运维.md-28-40 (1)

28-40: ⚠️ Potential issue | 🟡 Minor

Add language tags to these fenced code blocks.

They currently keep triggering markdownlint MD040. Tagging sequence/pseudocode blocks text and Java snippets java is enough.

Also applies to: 82-87, 104-108, 112-119, 155-161, 843-858

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-02b-高可用与运维.md` around lines 28 - 40, The fenced code
blocks (e.g., the sequence block beginning with "T0:   MN-A Poller 认领
dirty(vm-1)" and other similar timing/psuedocode blocks at the noted ranges) are
missing language tags and trigger markdownlint MD040; update each such
timing/sequence/psuedocode fence to begin with ```text and update any Java
snippets to begin with ```java so the linter recognizes the language (apply the
change to the blocks referenced in the comment: the shown block plus the other
ranges called out).
docs/design/vm-metadata-02b-高可用与运维.md-677-683 (1)

677-683: ⚠️ Potential issue | 🟡 Minor

The SQL strategy description here is still the old version.

The §9.2 body has already switched to INSERT IGNORE + UPDATE, but this comparison table still says INSERT ... ON DUPLICATE KEY UPDATE. The design doc now contradicts itself, which will easily mislead later implementation and review.

📝 Suggested wording fix
-| SQL 策略 | `INSERT ... ON DUPLICATE KEY UPDATE`(保证已有行也递增版本) | `INSERT IGNORE ... LEFT JOIN`(仅初始化新行) |
+| SQL 策略 | `INSERT IGNORE + UPDATE`(先补新行,再对已有行递增版本) | `INSERT IGNORE ... LEFT JOIN`(仅初始化新行) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-02b-高可用与运维.md` around lines 677 - 683, The table's
SQL strategy cell is inconsistent with §9.2: change the entry under "SQL 策略" for
the "§9 升级全量刷新" column from "INSERT ... ON DUPLICATE KEY UPDATE" to the updated
"INSERT IGNORE + UPDATE" to match the §9.2 text and remove the contradiction so
the design doc is consistent.
docs/design/vm-metadata-01b-API拦截与VM解析.md-393-393 (1)

393-393: ⚠️ Potential issue | 🟡 Minor

Table formatting error: two rows merged into one.

The table row at line 393 wrongly merges two API entries into a single line separated by ||, breaking the Markdown table rendering.

📝 Suggested fix
-| `APIUpdateVmInstanceMsg` | CONFIG | `DirectVmUuidResolver` | false || `APIRecoverVmInstanceMsg` | CONFIG | `DirectVmUuidResolver` | false |
+| `APIUpdateVmInstanceMsg` | CONFIG | `DirectVmUuidResolver` | false |
+| `APIRecoverVmInstanceMsg` | CONFIG | `DirectVmUuidResolver` | false |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-01b-API拦截与VM解析.md` at line 393, The table row at line
containing "APIUpdateVmInstanceMsg | CONFIG | DirectVmUuidResolver | false ||
APIRecoverVmInstanceMsg | CONFIG | DirectVmUuidResolver | false" merged two
entries; split it into two separate markdown table rows so each API has its own
line: one row for APIUpdateVmInstanceMsg with columns CONFIG,
DirectVmUuidResolver, false, and a second row for APIRecoverVmInstanceMsg with
columns CONFIG, DirectVmUuidResolver, false, matching the table's column
structure used elsewhere.
docs/design/vm-metadata-03-注册与运维.md-561-561 (1)

561-561: ⚠️ Potential issue | 🟡 Minor

The section heading has the same Japanese-kanji problem

"不変量" here should be corrected to "不变量", consistent with the table of contents.

Suggested fix
-## 9. 约束与不変量
+## 9. 约束与不变量
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-03-注册与运维.md` at line 561, Update the section heading
that currently reads "## 9. 约束与不変量" to use the correct Chinese term "不变量" so it
matches the rest of the document and the table of contents; locate the heading
string "## 9. 约束与不変量" and replace "不変量" with "不变量".
docs/design/vm-metadata-03-注册与运维.md-8-8 (1)

8-8: ⚠️ Potential issue | 🟡 Minor

Anchor link format error

The TOC link #8-运维指南注册失败后的清理 does not match the actual heading ## 8. 运维指南:注册失败后的清理: the encoding for the Chinese colon is missing, so clicking the TOC link fails to jump to the right place.

Suggested fix
-8. [运维指南:注册失败后的清理](`#8-运维指南注册失败后的清理`)
+8. [运维指南:注册失败后的清理](`#8-运维指南注册失败后的清理`)

Note: Markdown anchor generation strips the Chinese colon (:) from headings; confirm the actually generated anchor ID, or consider using an ASCII colon in the heading.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-03-注册与运维.md` at line 8, the TOC anchor and the heading's generated ID are inconsistent, so the link is broken; align the TOC link
`#8-运维指南注册失败后的清理` with the heading `## 8. 运维指南:注册失败后的清理`: either change the TOC entry to the anchor the generator actually emits (i.e. the exact
form with the Chinese colon removed, `#8-运维指南注册失败后的清理`), or add an explicit custom anchor after the heading line (e.g. append `{#8-运维指南注册失败后的清理}`)
so the two IDs match exactly; the places to edit are the TOC entry and the heading `## 8. 运维指南:注册失败后的清理` in this Markdown file.
docs/design/vm-metadata-03-注册与运维.md-13-13 (1)

13-13: ⚠️ Potential issue | 🟡 Minor

Text error: Japanese kanji used

The "変" in "不変量" is a Japanese kanji and should be the Simplified Chinese "变".

Suggested fix
-9. [约束与不変量](`#9-约束与不変量`)
+9. [约束与不变量](`#9-约束与不变量`)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-03-注册与运维.md` at line 13, Replace the Japanese kanji
"変" with the Chinese simplified character "变" wherever it appears in the
document heading and links; specifically update the heading and anchor text "9.
[约束与不変量](`#9-约束与不変量`)" to "9. [约束与不变量](`#9-约束与不变量`)" so both the visible heading and
its anchor/reference match the corrected term.
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java-1886-1891 (1)

1886-1891: ⚠️ Potential issue | 🟡 Minor

The error message references the wrong UUID.

The message uses self.getUuid() (the primary storage UUID), but the actual operation is a VM metadata read; it should use more accurate contextual information.

🐛 Suggested fix
                             if (!r.isSuccess()) {
-                                trigger.fail(operr("failed to update vm[uuid=%s] on hypervisor.",
-                                        self.getUuid()).withCause(r.getError()));
+                                trigger.fail(operr("failed to read vm metadata from path[%s] on host[uuid=%s].",
+                                        msg.getMetadataPath(), msg.getHostUuid()).withCause(r.getError()));
                                 return;
                             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java`
around lines 1886 - 1891, The error message uses self.getUuid() (the primary
storage UUID) but the failing operation is about VM metadata; update the
callback in run(MessageReply r) so the operr() includes the VM's UUID instead of
self.getUuid() — capture and use the VM identifier available in scope (for
example vm.getUuid() or a vmUuid variable passed into the closure) when calling
trigger.fail(operr(...).withCause(r.getError())); leave r.getError() as the
cause.
compute/src/main/java/org/zstack/compute/vm/VmInstanceMetadataFieldProcessor.java-226-232 (1)

226-232: ⚠️ Potential issue | 🟡 Minor

The path replacement can produce unexpected results.

String.replace() substitutes every match. If oldPathIdentifier occurs multiple times in the path, or as a substring of some other component, unintended replacements follow.

🛡️ Use more precise replacement logic
     private static void replaceInstallPath(Map<String, Object> voMap, String fieldName,
                                            VmInstanceMetadataRegistrationSpec spec) {
         String path = (String) voMap.get(fieldName);
         if (path != null && spec.getOldPathIdentifier() != null && spec.getNewPathIdentifier() != null) {
-            voMap.put(fieldName, path.replace(spec.getOldPathIdentifier(), spec.getNewPathIdentifier()));
+            // replace only the first occurrence to avoid unintended replacements elsewhere in the path
+            voMap.put(fieldName, path.replaceFirst(
+                    java.util.regex.Pattern.quote(spec.getOldPathIdentifier()),
+                    java.util.regex.Matcher.quoteReplacement(spec.getNewPathIdentifier())));
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/VmInstanceMetadataFieldProcessor.java`
around lines 226 - 232, The current replaceInstallPath method uses
String.replace which blindly replaces all substrings and can corrupt paths;
update replaceInstallPath to perform a precise replacement using
spec.getOldPathIdentifier() and spec.getNewPathIdentifier() by treating the path
as path segments (split on file separator) or by matching only whole-segment or
prefix occurrences (not arbitrary substrings), replace only the exact segment(s)
that equal oldPathIdentifier (or the specific leading segment if intended), then
rejoin segments to form the new path; keep null checks and use the same method
name replaceInstallPath and parameters (voMap, fieldName,
VmInstanceMetadataRegistrationSpec spec) so callers remain unchanged.
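The whole-segment replacement the prompt describes can be sketched as follows; splitting on "/" and the helper name are illustrative assumptions, not the actual VmInstanceMetadataFieldProcessor code:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class SegmentReplaceSketch {
    // Replace only whole path segments equal to oldId, never arbitrary substrings,
    // so identifiers that merely contain oldId (e.g. "ps1-backup") are untouched.
    static String replaceSegment(String path, String oldId, String newId) {
        if (path == null || oldId == null || newId == null) {
            return path;
        }
        return Arrays.stream(path.split("/", -1))
                .map(seg -> seg.equals(oldId) ? newId : seg)
                .collect(Collectors.joining("/"));
    }

    public static void main(String[] args) {
        // the whole segment "ps1" is replaced; the "ps1-backup" segment is preserved
        System.out.println(replaceSegment("/storage/ps1/ps1-backup/vm.meta", "ps1", "ps2"));
    }
}
```

Either this segment-wise approach or the `replaceFirst`/`Pattern.quote` variant from the diff above avoids the all-occurrences pitfall; which is correct depends on whether oldPathIdentifier is a single segment or a path prefix.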

Comment on lines +38 to +60
public List<String> resolveVmUuids(APIMessage msg) {
String groupUuid = ((VolumeSnapshotGroupMessage) msg).getGroupUuid();
if (groupUuid == null) {
return Collections.emptyList();
}

// groupUuid → VolumeSnapshotGroupRefVO → list of volumeUuids
List<String> volumeUuids = SQL.New(
"SELECT DISTINCT ref.volumeUuid FROM VolumeSnapshotGroupRefVO ref " +
"WHERE ref.volumeSnapshotGroupUuid = :groupUuid",
String.class
).param("groupUuid", groupUuid).list();

if (volumeUuids.isEmpty()) {
return Collections.emptyList();
}

// volumeUuid list → VolumeVO.vmInstanceUuid (deduplicated, NULLs excluded)
return SQL.New(
"SELECT DISTINCT v.vmInstanceUuid FROM VolumeVO v " +
"WHERE v.uuid IN (:volumeUuids) AND v.vmInstanceUuid IS NOT NULL",
String.class
).param("volumeUuids", volumeUuids).list();

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

for f in $(fd 'APICreateVolumeSnapshotGroupMsg.java$|VolumeSnapshotGroupMessage.java$'); do
  printf '\n=== %s ===\n' "$f"
  sed -n '1,220p' "$f"
done

printf '\n=== related usages ===\n'
rg -n -C3 'class APICreateVolumeSnapshotGroupMsg|interface VolumeSnapshotGroupMessage|getGroupUuid\s*\(' --type java

Repository: MatheMatrix/zstack

Length of output: 34636


This resolver cannot handle the snapshot-group creation scenario; remove the Create declaration from the class-level comment.

SnapshotGroupBasedVmUuidFromApiResolver relies entirely on VolumeSnapshotGroupMessage.getGroupUuid(), but APICreateVolumeSnapshotGroupMsg does not implement the VolumeSnapshotGroupMessage interface (it only implements CreateVolumeSnapshotGroupMessage), so it cannot be cast to VolumeSnapshotGroupMessage and the create path would throw a ClassCastException.

Moreover, at API-call time a create operation has no stable groupUuid yet (the group has not been persisted). This resolver only applies to operations on existing snapshot groups: Delete, Revert, Update, Ungroup. Remove APICreateVolumeSnapshotGroupMsg from the class-level comment and keep only the message types confirmed to work.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/SnapshotGroupBasedVmUuidFromApiResolver.java`
around lines 38 - 60, The resolver SnapshotGroupBasedVmUuidFromApiResolver
relies on VolumeSnapshotGroupMessage.getGroupUuid() and thus cannot handle
APICreateVolumeSnapshotGroupMsg (which doesn't implement
VolumeSnapshotGroupMessage and leads to ClassCastException and missing groupUuid
at create time); update the class-level annotation to remove
APICreateVolumeSnapshotGroupMsg and only list message types that operate on
existing groups (e.g., DeleteVolumeSnapshotGroupMsg,
RevertVolumeSnapshotGroupMsg, UpdateVolumeSnapshotGroupMsg,
UngroupVolumeSnapshotGroupMsg or the concrete message classes used elsewhere),
and adjust any related comments to reflect that this resolver only applies to
existing-group operations.
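The safe-cast pattern behind this fix can be sketched with stand-in interfaces (the interface shapes and method names here are simplified assumptions, not the real ZStack types):

```java
public class ResolverGuardSketch {
    interface APIMessage {}

    // simplified stand-in: in this sketch the group message extends APIMessage
    interface VolumeSnapshotGroupMessage extends APIMessage {
        String getGroupUuid();
    }

    // Guard with instanceof before casting: messages that do not implement
    // VolumeSnapshotGroupMessage (e.g. a create-time message, which has no
    // stable groupUuid yet) are skipped instead of causing ClassCastException.
    static String resolveGroupUuid(APIMessage msg) {
        if (!(msg instanceof VolumeSnapshotGroupMessage)) {
            return null;
        }
        return ((VolumeSnapshotGroupMessage) msg).getGroupUuid();
    }

    public static void main(String[] args) {
        APIMessage create = new APIMessage() {};                          // no group uuid
        APIMessage delete = (VolumeSnapshotGroupMessage) () -> "group-1"; // existing group
        System.out.println(resolveGroupUuid(create)); // null
        System.out.println(resolveGroupUuid(delete)); // group-1
    }
}
```

Fixing the class-level comment is still the primary change; a guard like this merely makes the resolver fail soft if an unexpected message type is ever routed to it.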

Comment on lines +47 to +48
public class VmMetadataBuilder {
private static final CLogger logger = Utils.getLogger(VmMetadataBuilder.class);

⚠️ Potential issue | 🔴 Critical

Missing Spring bean annotation.

The class declaration lacks an @Component or @Service annotation, so the class is not managed by the Spring container and its @Autowired and @Transactional annotations will have no effect.

🐛 Suggested: add a Spring annotation
+import org.springframework.stereotype.Component;
+
 /**
  * Spring component that builds the VM metadata payload.
  * ...
  */
+@Component
 public class VmMetadataBuilder {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataBuilder.java`
around lines 47 - 48, The VmMetadataBuilder class is missing a Spring bean
annotation so @Autowired/@Transactional won't be applied; add a Spring
stereotype (e.g., annotate VmMetadataBuilder with @Component or @Service) and
import the corresponding org.springframework.stereotype.* annotation so the
class is registered in the Spring context and its dependency-injection and
transaction annotations become effective; keep the existing class name
VmMetadataBuilder and logger declaration unchanged.

Comment on lines +9406 to +9417
private void handle(UpdateVmInstanceMetadataMsg msg) {
thdf.chainSubmit(new ChainTask(msg) {
@Override
public String getSyncSignature() {
return String.format("handle-update-vm-%s-metadata", msg.getUuid());
}

@Override
public void run(SyncTaskChain chain) {
doHandleUpdateVmInstanceMetadata(msg);
chain.next();
}

⚠️ Potential issue | 🔴 Critical

Metadata flushes for the same VM are not actually serialized

Line 9416 calls chain.next() before bus.send() has returned, while the actual primary-storage write only finishes in the async callback. Later flushes for the same VM can therefore run concurrently with the in-flight write, and an older payload may land on disk after a newer one, which defeats the serialization intended here.

Suggested fix
 private void handle(UpdateVmInstanceMetadataMsg msg) {
     thdf.chainSubmit(new ChainTask(msg) {
         @Override
         public String getSyncSignature() {
             return String.format("handle-update-vm-%s-metadata", msg.getUuid());
         }

         @Override
         public void run(SyncTaskChain chain) {
-            doHandleUpdateVmInstanceMetadata(msg);
-            chain.next();
+            doHandleUpdateVmInstanceMetadata(msg, chain);
         }

         @Override
         public String getName() {
             return String.format("handle-update-vm-%s-metadata-task", msg.getUuid());
         }
     });
 }

-private void doHandleUpdateVmInstanceMetadata(UpdateVmInstanceMetadataMsg msg) {
+private void doHandleUpdateVmInstanceMetadata(UpdateVmInstanceMetadataMsg msg, SyncTaskChain chain) {
     String metadata = vmMetadataBuilder.buildVmInstanceMetadata(msg.getUuid());

     int payloadSize = metadata.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
     if (payloadSize > org.zstack.compute.vm.metadata.VmMetadataBuilder.REJECT_THRESHOLD) {
         logger.error(String.format("metadata payload too large: %d bytes for vm[uuid:%s], rejecting",
                 payloadSize, msg.getUuid()));
         MessageReply reply = new MessageReply();
         reply.setError(Platform.operr("metadata payload too large (%d bytes, limit %d) for vm[uuid=%s]",
                 payloadSize, org.zstack.compute.vm.metadata.VmMetadataBuilder.REJECT_THRESHOLD, msg.getUuid()));
         bus.reply(msg, reply);
+        chain.next();
         return;
     }

     bus.send(umsg, new CloudBusCallBack(msg) {
         @Override
         public void run(MessageReply r) {
             UpdateVmInstanceMetadataOnPrimaryStorageReply reply = new UpdateVmInstanceMetadataOnPrimaryStorageReply();

             if (!r.isSuccess()) {
                 reply.setError(Platform.operr("failed to update vm[uuid=%s] metadata on primary storage",
                         msg.getUuid()).withCause(r.getError()));
             }
             bus.reply(msg, reply);
+            chain.next();
         }
     });
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java` around lines
9406 - 9417, The current ChainTask (handle(UpdateVmInstanceMetadataMsg msg) ->
run) calls doHandleUpdateVmInstanceMetadata(msg) and immediately chain.next(),
but the actual main-storage flush completes asynchronously in the bus.send()
callback, so subsequent metadata flushes for the same VM can run concurrently;
fix by making the task wait for the async reply before calling chain.next():
either change doHandleUpdateVmInstanceMetadata to accept a callback/Consumer or
return a Future/CompletionStage and invoke chain.next() from the bus.send reply
handler (or when the Future completes), ensuring the chain.next() call occurs
only after the bus.send completion callback (the code paths around
doHandleUpdateVmInstanceMetadata, ChainTask.run, and the bus.send reply handler
must be updated accordingly).
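The fix pattern — releasing the per-VM queue slot only from the async completion — can be sketched with a minimal stand-in for SyncTaskChain (the interface and helper names here are simplified assumptions, not the actual ZStack thread-facade API):

```java
import java.util.concurrent.CompletableFuture;

public class ChainNextSketch {
    // minimal stand-in for ZStack's SyncTaskChain
    interface SyncTaskChain {
        void next();
    }

    // Hypothetical analogue of doHandleUpdateVmInstanceMetadata: the chain is
    // advanced only from the completion of the async write, never right after
    // dispatch, so flushes for the same VM stay serialized.
    static CompletableFuture<Void> flushMetadata(String vmUuid, SyncTaskChain chain) {
        return CompletableFuture
                .runAsync(() -> { /* simulated primary-storage write */ })
                .whenComplete((v, err) -> chain.next()); // single exit point, success or failure
    }

    public static void main(String[] args) {
        int[] released = {0};
        flushMetadata("vm-1", () -> released[0]++).join();
        System.out.println(released[0]); // slot released exactly once, after completion
    }
}
```

The same shape applies to every early-return branch (e.g. the payload-too-large rejection): each path must reach the single point that calls chain.next() exactly once.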

Comment on lines +33 to +34
Crashed(VmInstanceStateEvent.crashed),
Registering(null);

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find references to the Registering state and any code handling its transitions
rg -n "VmInstanceState\.Registering" --type java -C 3
rg -n "Registering" --type java | grep -i "state\|transition\|next"

Repository: MatheMatrix/zstack

Length of output: 2992


🏁 Script executed:

cat -n header/src/main/java/org/zstack/header/vm/VmInstanceState.java | head -100

Repository: MatheMatrix/zstack

Length of output: 6027


(Follow-up scripts inspected VmInstanceStateEvent.java, ChangeVmMetaDataMsg.java, and the changeMetaData/changeVmStateInDb code paths in VmInstanceBase.java, and searched for any transition definitions, direct setState calls, or tests involving Registering; none were found. changeMetaData calls changeVmStateInDb(s.getValue().getDrivenEvent()), which in turn invokes self.getState().nextState(event) on the current state, so a VM sitting in Registering has no defined way to leave the state.)

The Registering state lacks transition definitions.

The newly added Registering state is included in the offlineStates set, but the static block defines no transitions (.transactions(...)) for it.

When a VM is in the Registering state, ChangeVmMetaDataMsg will try to move it to another state by calling changeVmStateInDb(targetState.getDrivenEvent()). That triggers self.getState().nextState(event), i.e. Registering.nextState(event). Since Registering defines no transition rules, nextState() will throw a CloudRuntimeException.

Consider adding transition definitions for Registering, for example:

Suggested transition definitions
Registering.transactions(
        new Transaction(VmInstanceStateEvent.stopped, VmInstanceState.Stopped),
        new Transaction(VmInstanceStateEvent.running, VmInstanceState.Running),
        new Transaction(VmInstanceStateEvent.destroying, VmInstanceState.Destroying)
);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/VmInstanceState.java` around lines
33 - 34, Registering enum value in VmInstanceState is added to offlineStates but
lacks any .transactions(...) definitions, so calls like ChangeVmMetaDataMsg ->
changeVmStateInDb(...) -> self.getState().nextState(event) will throw
CloudRuntimeException; fix by adding transactions for Registering inside the
static initialization of VmInstanceState (the same place other states call
.transactions(...)) to map relevant events to target states (for example map
VmInstanceStateEvent.stopped -> VmInstanceState.Stopped,
VmInstanceStateEvent.running -> VmInstanceState.Running, and
VmInstanceStateEvent.destroying -> VmInstanceState.Destroying) so
nextState(event) can resolve transitions correctly.
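The failure mode can be reproduced with a minimal enum state machine in the same spirit as VmInstanceState (the names below are simplified assumptions; ZStack's real class uses Transaction objects and throws CloudRuntimeException rather than IllegalStateException):

```java
import java.util.EnumMap;
import java.util.Map;

public class StateSketch {
    enum Event { stopped, running }

    enum State {
        Registering, Stopped, Running;

        private final Map<Event, State> transitions = new EnumMap<>(Event.class);

        static {
            // Without these entries, Registering.nextState(...) throws —
            // which is exactly the failure mode flagged in the review.
            Registering.transitions.put(Event.stopped, Stopped);
            Registering.transitions.put(Event.running, Running);
        }

        State nextState(Event e) {
            State next = transitions.get(e);
            if (next == null) {
                throw new IllegalStateException(
                        String.format("no transition from %s on event %s", this, e));
            }
            return next;
        }
    }

    public static void main(String[] args) {
        System.out.println(State.Registering.nextState(Event.stopped)); // Stopped
    }
}
```

A state with an empty transition map (here, Stopped or Running) throws on any event, mirroring what would happen to a VM stuck in Registering today.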

Comment on lines +3350 to +3352
public void run(SyncTaskChain chain) {
doHandleUpdateMetadata(msg);
chain.next();

⚠️ Potential issue | 🔴 Critical

Do not call chain.next() before the async write completes

The chain slot is released as soon as bkd.handle(...) is dispatched, so VM_METADATA_PS_MAX_CONCURRENT only throttles dispatch rate, not the actual concurrency of PS metadata writes; and if doHandleUpdateMetadata() throws synchronously, the queue can stall. Move chain.next() into a single exit point shared by success, failure, and synchronous exceptions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageBase.java`
around lines 3350 - 3352, the current run(SyncTaskChain chain) calls chain.next()
immediately, releasing the chain slot before the async metadata write (e.g. the
bkd.handle(...) callback) completes, which breaks the VM_METADATA_PS_MAX_CONCURRENT
concurrency limit, and a synchronous exception would stall the queue; remove
chain.next() from run, call it instead from the success/failure callbacks of the
async write started by doHandleUpdateMetadata, and also call it in the catch branch
for synchronous exceptions thrown by doHandleUpdateMetadata, so the chain slot is
released only after the write finishes or handling ends, regardless of success,
failure, or synchronous exception (relevant symbols: run(SyncTaskChain),
doHandleUpdateMetadata, and bkd.handle which starts the async write).

Comment on lines +1944 to +1946
public void run(SyncTaskChain chain) {
doHandleUpdateMetadata(msg);
chain.next();

⚠️ Potential issue | 🔴 Critical

Do not call chain.next() before the async callback returns

As written, the sync queue only wraps message dispatch; the actual NFS metadata writes still run concurrently, and if doHandleUpdateMetadata() throws synchronously the queue may stop advancing. Move chain.next() into a unified completion exit point.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java`
around lines 1944 - 1946, the current run(SyncTaskChain chain) callback executes
chain.next() immediately after calling doHandleUpdateMetadata(msg), so the sync
queue only wraps message dispatch rather than the actual NFS metadata write; remove
the chain.next() call from run and trigger chain.next() at the unified completion
point of doHandleUpdateMetadata (on both success and failure), so the SyncTaskChain
only advances after the metadata write truly completes or fails; relevant symbols:
run, doHandleUpdateMetadata, SyncTaskChain, chain.next(); also ensure the exception
path calls chain.next() (or the corresponding failure callback) to avoid blocking
the task queue.

@MatheMatrix MatheMatrix force-pushed the sync/tao.gan/res-vm-ZSV-10000@@4 branch from a57104d to f12598a Compare March 12, 2026 05:08

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

♻️ Duplicate comments (3)
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java (1)

9413-9416: ⚠️ Potential issue | 🔴 Critical

Metadata flushes for the same VM are still not truly serialized

chain.next() is still executed before the async bus.send() finishes, so later flushes for the same VM can still enter concurrently, and an older payload can still land on disk after a newer one. chain.next() must move into the async callback and every early-return branch.

Suggested fix
 @Override
 public void run(SyncTaskChain chain) {
-    doHandleUpdateVmInstanceMetadata(msg);
-    chain.next();
+    doHandleUpdateVmInstanceMetadata(msg, chain);
 }

-private void doHandleUpdateVmInstanceMetadata(UpdateVmInstanceMetadataMsg msg) {
+private void doHandleUpdateVmInstanceMetadata(UpdateVmInstanceMetadataMsg msg, SyncTaskChain chain) {
     String metadata = vmMetadataBuilder.buildVmInstanceMetadata(msg.getUuid());

     int payloadSize = metadata.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
     if (payloadSize > org.zstack.compute.vm.metadata.VmMetadataBuilder.REJECT_THRESHOLD) {
         logger.error(String.format("metadata payload too large: %d bytes for vm[uuid:%s], rejecting",
                 payloadSize, msg.getUuid()));
         MessageReply reply = new MessageReply();
         reply.setError(Platform.operr("metadata payload too large (%d bytes, limit %d) for vm[uuid=%s]",
                 payloadSize, org.zstack.compute.vm.metadata.VmMetadataBuilder.REJECT_THRESHOLD, msg.getUuid()));
         bus.reply(msg, reply);
+        chain.next();
         return;
     }

     bus.send(umsg, new CloudBusCallBack(msg) {
        @Override
         public void run(MessageReply r) {
             UpdateVmInstanceMetadataOnPrimaryStorageReply reply = new UpdateVmInstanceMetadataOnPrimaryStorageReply();

             if (!r.isSuccess()) {
                 reply.setError(Platform.operr("failed to update vm[uuid=%s] metadata on primary storage",
                         msg.getUuid()).withCause(r.getError()));
             }
             bus.reply(msg, reply);
+            chain.next();
         }
     });
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java` around lines
9413 - 9416, in run(SyncTaskChain) the code currently calls chain.next()
immediately after doHandleUpdateVmInstanceMetadata(msg), which releases the next
task before the async bus.send() returns, so metadata flushes for the same VM are
not serialized. Change run: remove the synchronous chain.next() and call it inside
doHandleUpdateVmInstanceMetadata or its async callback (including every
early-return/error branch) only after the async write/send completes; ensure every
branch (success callback, failure callback, and any early-return path) calls
chain.next() exactly once so flushes for a given VM stay ordered (see
doHandleUpdateVmInstanceMetadata and the chain.next()/bus.send() call sites).
plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java (1)

1944-1946: ⚠️ Potential issue | 🔴 Critical

Do not call chain.next() before the async write completes

Right now the throttle only wraps message dispatch: the actual NFS metadata writes still run concurrently, and any synchronous exception from doHandleUpdateMetadata() stalls the chain outright. chain.next() needs to move into a unified completion/exception exit point.

💡 Suggested: hand chain advancement to the async exit point
             @Override
             public void run(SyncTaskChain chain) {
-                doHandleUpdateMetadata(msg);
-                chain.next();
+                doHandleUpdateMetadata(msg, chain);
             }

The success/fail/early-return/exception paths of doHandleUpdateMetadata(...) must all call chain.next() consistently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java`
around lines 1944 - 1946, the current run(SyncTaskChain) implementation executes
chain.next() immediately after calling doHandleUpdateMetadata(msg), which advances
the chain before the async write finishes and can leave the chain hung if
doHandleUpdateMetadata throws synchronously; move chain advancement to the unified
completion/exception exit of doHandleUpdateMetadata: ensure every success, failure,
early-return, or caught-exception branch eventually calls chain.next() (or advances
from the async callback/CompletionHandler), and remove the immediate chain.next()
call from run(...), so the throttle's lifetime matches the metadata write and the
chain cannot be left hanging.
plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageBase.java (1)

3349-3378: ⚠️ Potential issue | 🔴 Critical

Do not call chain.next() before the async write completes

doHandleUpdateMetadata() only dispatches the async bkd.handle(...) here. Because line 3352 releases the queue immediately, VM_METADATA_PS_MAX_CONCURRENT limits dispatch concurrency rather than write concurrency; and if a synchronous exception is thrown before lines 3363-3366, the chain has no unified exit either. Put chain.next() at a single exit shared by success, failure, and synchronous exceptions.

🔧 Suggested fix
     @Override
     public void run(SyncTaskChain chain) {
-        doHandleUpdateMetadata(msg);
-        chain.next();
+        doHandleUpdateMetadata(msg, chain);
     }
@@
-    private void doHandleUpdateMetadata(final UpdateVmInstanceMetadataOnPrimaryStorageMsg msg) {
-        final String hostUuid = getHostUuidByResourceUuid(msg.getRootVolumeUuid());
-        LocalStorageHypervisorFactory f = getHypervisorBackendFactoryByHostUuid(hostUuid);
-        LocalStorageHypervisorBackend bkd = f.getHypervisorBackend(self);
-        bkd.handle(msg, hostUuid, new ReturnValueCompletion<UpdateVmInstanceMetadataOnPrimaryStorageReply>(msg) {
-            @Override
-            public void success(UpdateVmInstanceMetadataOnPrimaryStorageReply returnValue) {
-                bus.reply(msg, returnValue);
-            }
-
-            @Override
-            public void fail(ErrorCode errorCode) {
-                UpdateVmInstanceMetadataOnPrimaryStorageReply reply = new UpdateVmInstanceMetadataOnPrimaryStorageReply();
-                reply.setError(errorCode);
-                bus.reply(msg, reply);
-            }
-        });
+    private void doHandleUpdateMetadata(final UpdateVmInstanceMetadataOnPrimaryStorageMsg msg, final SyncTaskChain chain) {
+        try {
+            final String hostUuid = getHostUuidByResourceUuid(msg.getRootVolumeUuid());
+            LocalStorageHypervisorFactory f = getHypervisorBackendFactoryByHostUuid(hostUuid);
+            LocalStorageHypervisorBackend bkd = f.getHypervisorBackend(self);
+            bkd.handle(msg, hostUuid, new ReturnValueCompletion<UpdateVmInstanceMetadataOnPrimaryStorageReply>(msg) {
+                @Override
+                public void success(UpdateVmInstanceMetadataOnPrimaryStorageReply returnValue) {
+                    bus.reply(msg, returnValue);
+                    chain.next();
+                }
+
+                @Override
+                public void fail(ErrorCode errorCode) {
+                    UpdateVmInstanceMetadataOnPrimaryStorageReply reply = new UpdateVmInstanceMetadataOnPrimaryStorageReply();
+                    reply.setError(errorCode);
+                    bus.reply(msg, reply);
+                    chain.next();
+                }
+            });
+        } catch (Exception e) {
+            UpdateVmInstanceMetadataOnPrimaryStorageReply reply = new UpdateVmInstanceMetadataOnPrimaryStorageReply();
+            reply.setError(operr("failed to update VM metadata on local primary storage[uuid:%s], %s", self.getUuid(), e.getMessage()));
+            bus.reply(msg, reply);
+            chain.next();
+        }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageBase.java`
around lines 3349 - 3378, calling chain.next() immediately in Runnable.run() releases the queue before the async write started by doHandleUpdateMetadata() has finished; move the immediate chain.next() call out of run() and release the queue at the unified exit of bkd.handle(...): call chain.next() in both success(...) and fail(...) of the ReturnValueCompletion, and wrap the synchronous call to bkd.handle(...) in try/catch so synchronously thrown exceptions also go through the same failure branch (reply to msg, then call chain.next()), ensuring chain.next() is called and msg is replied to correctly whether the failure is a synchronous exception or an async callback; symbols involved: run, doHandleUpdateMetadata, bkd.handle,
ReturnValueCompletion.success, ReturnValueCompletion.fail, chain.next.
🟠 Major comments (20)
conf/db/upgrade/V4.10.29__schema.sql-20-21 (1)

20-21: ⚠️ Potential issue | 🟠 Major

Make index creation idempotent.

Line 21 runs CREATE INDEX unconditionally. If the upgrade already succeeded through the first half and is rerun to this point, it fails because the index already exists, so the recovery path is unsafe. Check information_schema.STATISTICS first and create the index only when it does not exist.

Suggested fix
 -- Poller CAS claim query optimization: WHERE managementNodeUuid IS NULL AND lastClaimTime ... AND nextRetryTime <= NOW()
-CREATE INDEX `idxVmMetadataDirtyUnclaimed` ON `VmMetadataDirtyVO` (`managementNodeUuid`, `lastClaimTime`, `nextRetryTime`);
+DROP PROCEDURE IF EXISTS `zstack`.`sp_add_idx_vm_metadata_dirty_unclaimed`;
+DELIMITER $$
+CREATE PROCEDURE `zstack`.`sp_add_idx_vm_metadata_dirty_unclaimed`()
+BEGIN
+    IF NOT EXISTS (
+        SELECT 1
+        FROM information_schema.STATISTICS
+        WHERE TABLE_SCHEMA = 'zstack'
+          AND TABLE_NAME = 'VmMetadataDirtyVO'
+          AND INDEX_NAME = 'idxVmMetadataDirtyUnclaimed'
+    ) THEN
+        SET @sql = 'CREATE INDEX `idxVmMetadataDirtyUnclaimed` ON `zstack`.`VmMetadataDirtyVO` (`managementNodeUuid`, `lastClaimTime`, `nextRetryTime`)';
+        PREPARE stmt FROM @sql;
+        EXECUTE stmt;
+        DEALLOCATE PREPARE stmt;
+    END IF;
+END $$
+DELIMITER ;
+CALL `zstack`.`sp_add_idx_vm_metadata_dirty_unclaimed`();
+DROP PROCEDURE `zstack`.`sp_add_idx_vm_metadata_dirty_unclaimed`;

Based on learnings: upgrade scripts must check object existence via information_schema so they can be rerun safely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@conf/db/upgrade/V4.10.29__schema.sql` around lines 20 - 21, add an idempotency check for CREATE INDEX
`idxVmMetadataDirtyUnclaimed` ON `VmMetadataDirtyVO`: before creating the index, query
information_schema.STATISTICS (checking TABLE_NAME='VmMetadataDirtyVO' AND
INDEX_NAME='idxVmMetadataDirtyUnclaimed') and run CREATE INDEX only when the index
does not exist, so a rerun or a partially completed upgrade does not fail because the index already exists.
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java-3199-3207 (1)

3199-3207: ⚠️ Potential issue | 🟠 Major

The Registering state also blocks read-only APIs

The comment says "allow Destroy and query/read operations", but the implementation only lets APIDestroyVmInstanceMsg through. Read-only APIs such as APIGetVmDnsMsg, APIGetVmHostnameMsg, and APIGetVmCapabilitiesMsg are rejected outright during Registering, contradicting the stated intent. Either explicitly allow read-only messages, or push the check down into the mutating-API branches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java` around lines
3199 - 3207, The guard in VmInstanceBase rejects all messages except
APIDestroyVmInstanceMsg while VmInstanceState.Registering, but the intent per
comment is to allow destroy and read-only APIs; update the check so that
read-only messages (e.g. APIGetVmDnsMsg, APIGetVmHostnameMsg,
APIGetVmCapabilitiesMsg) are also permitted during Registering (or alternatively
invert the logic to only block mutating messages), i.e. modify the if-condition
around VmInstanceState.Registering in VmInstanceBase to allow
APIDestroyVmInstanceMsg and the read-only API message types instead of only
APIDestroyVmInstanceMsg.
plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java-1966-1968 (1)

1966-1968: ⚠️ Potential issue | 🟠 Major

Select the backend by the target host.

A concrete hostUuid has already been resolved here; calling getUsableBackend() degrades to "the backend of any usable cluster". When the PS is attached to multiple hypervisor types at once, the request can easily be handed to the wrong implementation; handle(BatchSyncVolumeSizeOnPrimaryStorageMsg) in this class already avoids the same problem with getBackendByHostUuid().

💡 Suggestion: look up the backend by host
-        final NfsPrimaryStorageBackend backend = getUsableBackend();
+        final NfsPrimaryStorageBackend backend = getBackendByHostUuid(hostUuid);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java`
around lines 1966 - 1968, The code calls getUsableBackend() even though hostUuid
is known, which can route the request to the wrong hypervisor backend; change
the lookup to use getBackendByHostUuid(hostUuid) (same approach used in
handle(BatchSyncVolumeSizeOnPrimaryStorageMsg)) so the backend for the specific
host is selected before invoking backend.handle(msg, hostUuid, ...); ensure null
handling if getBackendByHostUuid returns null and keep the existing
ReturnValueCompletion flow.
plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorageKVMBackend.java-2058-2064 (1)

2058-2064: ⚠️ Potential issue | 🟠 Major

Do not derive the metadata directory with a /vol-.../ regex.

The directory shape is hard-coded into replaceFirst(). As soon as installPath is null, or the path no longer matches this pattern, metadata is written to the wrong location, or this line throws an NPE outright. Truncating at the last / after a null check is more robust.

💡 A more robust approach
         String installPath = Q.New(VolumeVO.class)
                 .eq(VolumeVO_.uuid, msg.getRootVolumeUuid())
                 .select(VolumeVO_.installPath)
                 .findValue();
-        String path = installPath.replaceFirst("^(.+/vol-[^/]+/).*$", "$1");
+        if (installPath == null || !installPath.contains("/")) {
+            completion.fail(operr("cannot resolve metadata path from root volume[uuid:%s]", msg.getRootVolumeUuid()));
+            return;
+        }
+        String path = installPath.substring(0, installPath.lastIndexOf('/') + 1);
         String metadataPath = String.format("%s%s", path, VM_META_SUFFIX);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorageKVMBackend.java`
around lines 2058 - 2064, deriving the metadata directory with the hard-coded installPath.replaceFirst("^(.+/vol-[^/]+/).*$",
"$1") regex produces an NPE or writes to the wrong location when installPath is null or does not match; instead, after querying
VolumeVO.installPath (via Q.New(...).select(VolumeVO_.installPath)), null-check it first, then take the directory up to the last
'/' (falling back to a safe default such as empty or the root when there is no '/'), and set that directory plus VM_META_SUFFIX via
umsg.setMetadataPath(...), ensuring a safe degradation path and no NPE when the installPath for msg.getRootVolumeUuid()
is unusable.
header/src/main/java/org/zstack/header/vm/VmInstanceMetadataValidator.java-124-140 (1)

124-140: ⚠️ Potential issue | 🟠 Major

Do not silently accept snapshot-group entries missing a UUID.

Lines 127-130 and 135-136 only continue validating when the field is present; if snapshotGroups[*].uuid or snapshotGroupRefs[*].volumeSnapshotGroupUuid is missing, the current implementation passes validation outright. Bad data then travels into the later registration/recovery flows, where it is much harder to track down.

Suggested correction
-        Set<String> groupUuids = new HashSet<>();
-        for (String groupJson : dto.snapshotGroups) {
-            Map<String, Object> groupMap = JSONObjectUtil.toObject(groupJson, Map.class);
-            Object uuid = groupMap.get("uuid");
-            if (uuid != null) {
-                groupUuids.add(uuid.toString());
-            }
+        Set<String> groupUuids = new HashSet<>();
+        for (int i = 0; i < dto.snapshotGroups.size(); i++) {
+            Map<String, Object> groupMap = JSONObjectUtil.toObject(dto.snapshotGroups.get(i), Map.class);
+            Object uuid = groupMap.get("uuid");
+            if (uuid == null || uuid.toString().isEmpty()) {
+                throw new CloudRuntimeException(String.format(
+                        "metadata snapshotGroups[%s].uuid is missing", i));
+            }
+            groupUuids.add(uuid.toString());
         }
 
-        for (String refJson : dto.snapshotGroupRefs) {
-            Map<String, Object> refMap = JSONObjectUtil.toObject(refJson, Map.class);
-            Object groupUuid = refMap.get("volumeSnapshotGroupUuid");
-            if (groupUuid != null && !groupUuids.contains(groupUuid.toString())) {
+        for (int i = 0; i < dto.snapshotGroupRefs.size(); i++) {
+            Map<String, Object> refMap = JSONObjectUtil.toObject(dto.snapshotGroupRefs.get(i), Map.class);
+            Object groupUuid = refMap.get("volumeSnapshotGroupUuid");
+            if (groupUuid == null || groupUuid.toString().isEmpty()) {
+                throw new CloudRuntimeException(String.format(
+                        "metadata snapshotGroupRefs[%s].volumeSnapshotGroupUuid is missing", i));
+            }
+            if (!groupUuids.contains(groupUuid.toString())) {
                 throw new CloudRuntimeException(String.format(
                         "metadata snapshotGroupRef references non-existent group[uuid:%s]",
                         groupUuid));
             }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/VmInstanceMetadataValidator.java`
around lines 124 - 140, In VmInstanceMetadataValidator, stop silently accepting
entries missing UUIDs: when iterating dto.snapshotGroups (variable groupMap)
ensure the "uuid" key exists and is non-empty — if not, throw a
CloudRuntimeException with a clear message including the offending group JSON;
likewise when iterating dto.snapshotGroupRefs (variable refMap) ensure
"volumeSnapshotGroupUuid" exists and is non-empty — if missing, throw a
CloudRuntimeException describing the bad snapshotGroupRef and include the
offending ref JSON so the invalid metadata is rejected early.
header/src/main/java/org/zstack/header/vm/VmInstanceMetadataValidator.java-92-93 (1)

92-93: ⚠️ Potential issue | 🟠 Major

Convert JSON parse failures uniformly into CloudRuntimeException

Lines 92, 126, and 134 propagate the underlying JSON parse exception as-is, so callers cannot see the concrete failure path, which is also inconsistent with the other validation branches in this class that throw CloudRuntimeException. If the metadata registration chain maps error codes by exception type, a "malformed format" error here leaks out as a generic runtime exception.

Also applies to: 126-126, 134-134

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/VmInstanceMetadataValidator.java`
around lines 92 - 93, The JSON parsing calls using
JSONObjectUtil.toObject(rm.vo, Map.class) in VmInstanceMetadataValidator should
catch underlying JSON parsing exceptions and rethrow them as
CloudRuntimeException with a clear message (e.g., "failed to parse vm metadata
JSON for uuid field" or similar) so behavior matches other validation branches;
update each occurrence (the one that assigns voMap and the other two parsing
sites around the voUuid and related fields) to try/catch the parse, wrap the
original exception as the cause in a new CloudRuntimeException, and include
identifying context such as the affected rm.vo or vm identifier in the message.
compute/src/main/java/org/zstack/compute/vm/VmGlobalConfig.java-140-141 (1)

140-141: 🛠️ Refactor suggestion | 🟠 Major

The VM_METADATA config is missing the @GlobalConfigDef annotation

The VM_METADATA config item uses validValues = {"true", "false"} for validation but lacks a @GlobalConfigDef annotation specifying the default value and type. Add it for consistency with the other config items.

Suggested fix
+    @GlobalConfigDef(defaultValue = "false", type = Boolean.class,
+            description = "Enable VM metadata feature")
     @GlobalConfigValidation(validValues = {"true", "false"})
     public static GlobalConfig VM_METADATA = new GlobalConfig(CATEGORY, "vm.metadata");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmGlobalConfig.java` around lines
140 - 141, VM_METADATA is missing the `@GlobalConfigDef` annotation; update the
declaration for the GlobalConfig named VM_METADATA to add `@GlobalConfigDef` with
an explicit defaultValue and type (e.g. defaultValue = "true" and type =
"Boolean" or the project's canonical boolean type) while keeping the existing
`@GlobalConfigValidation`(validValues = {"true","false"}); ensure you annotate the
VM_METADATA field (the GlobalConfig VM_METADATA) so it matches other configs'
metadata declarations.
compute/src/main/java/org/zstack/compute/vm/metadata/MetadataStaleRecoveryTask.java-133-138 (1)

133-138: ⚠️ Potential issue | 🟠 Major

limit(batchSize) lets a small set of failing records monopolize the recovery batch indefinitely.

The current filter is just lastFlushFailed=true + limit(batchSize), with no claim/rotation/cursor semantics. As long as the previous batch contains VMs that cannot be advanced each cycle and keep lastFlushFailed=true, later stale VMs may never make it into this task's batch, causing long-term starvation. This needs an ordering/claim field that advances on every attempt, or an explicit claim mechanism, not just a fixed limited query.
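The rotation the comment asks for can be sketched with a keyset-style cursor. This is a hypothetical, in-memory stand-in (names like StaleRecoveryCursor and the persisted cursor field are illustrative, not the actual task's API): each cycle resumes after the last UUID it attempted, so a batch of permanently failing rows cannot starve the rest of the table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a keyset cursor that advances on every attempt.
class StaleRecoveryCursor {
    private String lastAttemptedUuid = ""; // would be persisted between cycles

    // Stands in for: SELECT uuid FROM VmMetadataPathFingerprintVO
    //   WHERE lastFlushFailed = true AND uuid > :cursor
    //   ORDER BY uuid LIMIT :batchSize
    List<String> nextBatch(List<String> failedUuidsSorted, int batchSize) {
        List<String> batch = new ArrayList<>();
        for (String uuid : failedUuidsSorted) {
            if (uuid.compareTo(lastAttemptedUuid) > 0 && batch.size() < batchSize) {
                batch.add(uuid);
            }
        }
        if (batch.isEmpty()) {
            lastAttemptedUuid = ""; // pass finished: wrap around next cycle
        } else {
            lastAttemptedUuid = batch.get(batch.size() - 1);
        }
        return batch;
    }
}
```

An empty batch marks the end of a pass and resets the cursor, so the next cycle starts a fresh sweep from the beginning.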

compute/src/main/java/org/zstack/compute/vm/metadata/MetadataStaleRecoveryTask.java-135-156 (1)

135-156: ⚠️ Potential issue | 🟠 Major

Change the lastFlushFailed query and update conditions to true/false instead of 1/0

VmMetadataPathFingerprintVO.lastFlushFailed is a boolean field, but the code uses the numeric literals 1 and 0 in the HQL queries and updates. Per the JPQL/HQL specification, a boolean entity attribute must be compared with the boolean literals true or false, not numeric literals. Using 0/1 can fail type checking or cause portability problems across database dialects.

Locations to change:

  • Line 135: WHERE fp.lastFlushFailed = 1 → WHERE fp.lastFlushFailed = true
  • Line 154: SET lastFlushFailed = 0 → SET lastFlushFailed = false
  • Line 171: SET lastFlushFailed = 0 → SET lastFlushFailed = false
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/MetadataStaleRecoveryTask.java`
around lines 135 - 156, The HQL/JPQL uses numeric literals for a boolean entity
field causing potential type/portability issues; in MetadataStaleRecoveryTask
locate the query that selects VmMetadataPathFingerprintVO where
fp.lastFlushFailed = 1 and change the predicate to fp.lastFlushFailed = true,
and update the SQL.New(...) update statements that set lastFlushFailed = 0 to
set lastFlushFailed = false (these refer to the
VmMetadataPathFingerprintVO.lastFlushFailed updates used when staleRecoveryCount
>= maxCycles and the later reset), making all boolean comparisons/assignments
use true/false instead of 1/0.
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java-7342-7345 (1)

7342-7345: ⚠️ Potential issue | 🟠 Major

On metadata read failure, the metadata field must not be set.

The logic at lines 7383-7386 is flawed: even when !rsp.isSuccess() is true, the code still executes reply.setMetadata(rsp.metadata). Set metadata only on success:

if (rsp.isSuccess()) {
    reply.setMetadata(rsp.metadata);
} else {
    reply.setError(operr("operation error, because:%s", rsp.getError()));
}

Also confirm that metadataPath is path-normalized and prefix-validated on the agent side, especially for user-supplied paths coming from APIRegisterVmInstanceMsg.

Also applies to: 7342-7345

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java` around lines 7342 -
7345, The current handler uses WriteVmInstanceMetadataCmd /
Http<WriteVmInstanceMetadataRsp> to call the agent but sets
reply.setMetadata(rsp.metadata) even when rsp.isSuccess() is false; change the
response-handling logic so you only call reply.setMetadata(...) when
rsp.isSuccess() is true and otherwise call reply.setError(operr("operation
error, because:%s", rsp.getError())); also ensure metadataPath normalization and
prefix validation is performed on the agent side (and validate inputs coming
from APIRegisterVmInstanceMsg) before using metadataPath to prevent path
traversal or invalid prefixes.
compute/src/main/java/org/zstack/compute/vm/metadata/MetadataOrphanDetector.java-171-184 (1)

171-184: ⚠️ Potential issue | 🟠 Major

This scan path still aggregates a synchronous int, which no longer matches the new handler's async contract.

detectOrphans() accumulates orphanCount from synchronous return values, but MetadataStorageHandler.scanMetadataVmUuids() in this PR is already defined in callback style. Once the handler is wired in, either the periodic thread must block waiting for results, or the summary is printed before the scan actually finishes; and returning 0 directly, as it does now, turns the task into a silent no-op.

Also applies to: 196-224
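One way to reconcile the two contracts is to fan out the callback-style scans and only aggregate after all of them report. A minimal hedged sketch, assuming an illustrative scan type (the real handler takes a richer callback than Consumer<Integer>):

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical bridge: wait (with a timeout) before logging a summary,
// so the periodic task never reports before the scans complete.
class OrphanScanBridge {
    static int countOrphans(List<Consumer<Consumer<Integer>>> scans, long timeoutSec) {
        AtomicInteger orphanCount = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(scans.size());
        for (Consumer<Consumer<Integer>> scan : scans) {
            scan.accept(found -> {          // invoked when one backend finishes
                orphanCount.addAndGet(found);
                done.countDown();
            });
        }
        try {
            // the summary is only accurate after every backend reported (or timed out)
            done.await(timeoutSec, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return orphanCount.get();
    }
}
```

The timeout keeps the periodic thread bounded even when one backend never calls back; a production version would also log which backends timed out.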

header/src/main/java/org/zstack/header/vm/VmInstanceMetadataCodec.java-54-58 (1)

54-58: ⚠️ Potential issue | 🟠 Major

The decode path lacks resourceUuid / vo.uuid consistency validation.

VmInstanceMetadataDTO.ResourceMetadata already states this constraint as part of the DTO contract, but these entry points currently just call toObject directly. Corrupted or forged metadata can make the outer UUID and the inner VO point at different resources, and the subsequent registration/recovery flow then keeps operating on the wrong object.

Also applies to: 67-70, 93-94

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/VmInstanceMetadataCodec.java`
around lines 54 - 58, The decode path (VmInstanceMetadataCodec.decode and the
other similar decode methods at the referenced locations) must validate DTO
contract consistency: after deserializing with JSONObjectUtil.toObject into
VmInstanceMetadataDTO, iterate its ResourceMetadata entries and ensure each
ResourceMetadata.resourceUuid equals the contained vo.uuid (e.g.,
resource.getVo().getUuid()); if any mismatch throw a clear exception
(IllegalArgumentException or a domain-specific ValidationException) instead of
returning the object so corrupted/forged metadata cannot proceed. Update
VmInstanceMetadataCodec.decode and the other decode entry points to perform this
check and fail-fast on inconsistency.
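The fail-fast check the prompt describes can be sketched as follows, assuming the decoded VO is available as a map (ResourceUuidCheck and the map shape are illustrative, not the actual codec types):

```java
import java.util.Map;

// Hypothetical sketch: the outer resourceUuid must match the uuid
// embedded in the decoded VO, otherwise decoding fails fast.
class ResourceUuidCheck {
    static void validate(String resourceUuid, Map<String, Object> voMap) {
        Object inner = voMap.get("uuid");
        if (inner == null || !inner.toString().equals(resourceUuid)) {
            throw new IllegalArgumentException(String.format(
                    "metadata resourceUuid[%s] does not match vo.uuid[%s]",
                    resourceUuid, inner));
        }
    }
}
```

Throwing here keeps corrupted metadata out of the registration flow instead of letting it surface later on the wrong object.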
header/src/main/java/org/zstack/header/vm/MetadataStorageHandler.java-79-92 (1)

79-92: 🛠️ Refactor suggestion | 🟠 Major

A boolean parameter encodes the write mode; call sites will easily pass the wrong value.

storageStructureChange decides both the tmp suffix and the underlying OP type, and true/false is not self-explanatory at call sites. Since this interface is a newly introduced public boundary, it is best to change it now to an explicit enum, or split it into two methods. As per coding guidelines, "avoid boolean parameters that make the meaning unclear."

♻️ Suggestion: use an explicit mode
-    void writeMetadata(String psUuid, String vmUuid, String payloadJson,
-                       boolean storageStructureChange, Completion completion);
+    void writeMetadata(String psUuid, String vmUuid, String payloadJson,
+                       MetadataWriteMode writeMode, Completion completion);
enum MetadataWriteMode {
    NORMAL_UPDATE,
    STORAGE_STRUCTURE_CHANGE
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/MetadataStorageHandler.java` around
lines 79 - 92, The writeMetadata API on MetadataStorageHandler currently uses a
boolean storageStructureChange which is unclear at call sites; replace this
boolean with an explicit enum (e.g., MetadataWriteMode with values NORMAL_UPDATE
and STORAGE_STRUCTURE_CHANGE) or split into two intent-named methods (e.g.,
writeMetadataNormal / writeMetadataForStorageChange), update the signature of
writeMetadata accordingly, and change all callers to use the enum constant or
the new method names so the tmp suffix and OP type decision is explicit (ensure
downstream implementations and tests that reference
writeMetadata/storageStructureChange are updated to handle the new enum or new
method variants).
compute/src/main/java/org/zstack/compute/vm/metadata/MetadataPathDriftDetector.java-153-156 (1)

153-156: ⚠️ Potential issue | 🟠 Major

这里会把全量巡检放大成 N+1 查询。

每条指纹都单独查一次 VolumeVO,有卷时再查一次 VolumeSnapshotVO。VM 数量一大,这个周期任务会稳定地产生数万次 SQL。建议先按 batch 收集 vmUuid,再批量拉取 volume/snapshot 并按 VM 分组构造快照。

Also applies to: 193-212

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/MetadataPathDriftDetector.java`
around lines 153 - 156, The loop in MetadataPathDriftDetector (iterating
VmMetadataPathFingerprintVO batch and calling buildCurrentPathSnapshot per vm)
causes an N+1 query pattern because each fingerprint triggers separate VolumeVO
and VolumeSnapshotVO lookups; instead, collect all vmUuid values from the batch
up front, perform two bulk queries to fetch all related VolumeVOs and
VolumeSnapshotVOs (by vmUuid or volumeUuid as appropriate), group those results
by vmUuid, and then iterate the batch to buildCurrentPathSnapshot using the
preloaded, grouped volume/snapshot data so no per-fingerprint DB calls are made
(apply same fix for the code region around lines 193-212).
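The bulk-fetch-then-group pattern can be sketched like this. The Volume record and pathsByVm are illustrative stand-ins: the input list represents the result of a single "WHERE vmInstanceUuid IN (:vmUuids)" query rather than per-VM lookups.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one bulk query loads every volume of the batch,
// and grouping happens in memory, so building the per-VM path snapshot
// needs no further database round-trips.
class BatchGrouper {
    record Volume(String vmUuid, String installPath) {}

    static Map<String, List<String>> pathsByVm(List<Volume> volumesOfBatch) {
        Map<String, List<String>> byVm = new HashMap<>();
        for (Volume v : volumesOfBatch) {
            byVm.computeIfAbsent(v.vmUuid(), k -> new ArrayList<>()).add(v.installPath());
        }
        return byVm;
    }
}
```

The same two-step shape applies to the snapshot lookup: one IN-query keyed by the collected volume UUIDs, grouped back to VMs in memory.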
header/src/main/java/org/zstack/header/vm/VmInstanceMetadataDTO.java-53-68 (1)

53-68: ⚠️ Potential issue | 🟠 Major

Base64 cannot protect the sensitive data declared here.

If systemTags / resourceConfigs can really end up holding passwords, keys, and the like, Base64-encoding the JSON is just encoding, not encryption; anyone with access to the primary storage contents can recover it directly. At minimum, exclude or redact sensitive entries at build time, or encrypt with a managed key before persisting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/VmInstanceMetadataDTO.java` around
lines 53 - 68, The fields systemTags and resourceConfigs on
VmInstanceMetadataDTO are documented as Base64-encoded JSON but Base64 does not
protect secrets; update the construction/persistence flow to avoid storing raw
secrets: either filter/sanitize sensitive keys during the build that produces
the SystemTagVO/ResourceConfig JSON (exclude or mask password/secret/privateKey
fields) or encrypt the JSON payload with a managed key before assigning to
systemTags/resourceConfigs; locate the producers that serialize SystemTagVO list
and ResourceConfig list (the code paths that create the Base64 JSON for
VmInstanceMetadataDTO) and implement either redaction of sensitive fields or
integration with the existing key management/encryption utility to store
encrypted payloads instead of plain Base64.
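The redaction option can be sketched as a build-time filter run before serialization and Base64 encoding. TagRedactor and the key names are illustrative assumptions; real producers would match the project's actual tag keys:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: mask sensitive keys before the payload is
// serialized, since Base64 offers no confidentiality.
class TagRedactor {
    static final Set<String> SENSITIVE_KEYS = Set.of("password", "secret", "privateKey");

    static Map<String, String> redact(Map<String, String> tags) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : tags.entrySet()) {
            out.put(e.getKey(), SENSITIVE_KEYS.contains(e.getKey()) ? "***" : e.getValue());
        }
        return out;
    }
}
```

Redaction is simpler but lossy; if the values must survive a restore, the managed-key encryption route is the one that preserves them.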
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java-833-846 (1)

833-846: ⚠️ Potential issue | 🟠 Major

The failure flag is not cleared after a successful flush.

onFlushFailure() sets VmMetadataPathFingerprintVO.lastFlushFailed to true when retries are exhausted, but the branch that updates an existing fingerprint here only touches pathSnapshot and lastFlushTime. Even after later flushes succeed, the same VM keeps carrying the stale "last flush failed" state.

💡 Suggested fix
         } else {
             fp.setPathSnapshot(pathJson);
             fp.setLastFlushTime(new Timestamp(System.currentTimeMillis()));
+            fp.setLastFlushFailed(false);
             dbf.update(fp);
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 833 - 846, The existing update branch for
VmMetadataPathFingerprintVO (in the block that finds by vmUuid and updates fp)
doesn't clear the stale failure flag set by onFlushFailure(); update the code
that handles the existing VmMetadataPathFingerprintVO (the else branch that
calls fp.setPathSnapshot and fp.setLastFlushTime and dbf.update(fp)) to also
reset fp.setLastFlushFailed(false) and reset fp.setStaleRecoveryCount(0) (or
appropriate recovery fields) so a successful flush clears the prior failure
state for that VM.
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java-1866-2131 (1)

1866-2131: ⚠️ Potential issue | 🟠 Major

The VM obtained after a successful registration is still missing key metadata.

Only VolumeVO, VolumeSnapshot*, and VmInstanceVO are currently restored. The vmNicVOs, system tag/resource config, and volumeSnapshotReference* fields inside VmMetadata are not persisted at all, and the tag/config logic below is still sitting in comments. After the API returns success, the VM state in the database does not match the metadata file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java`
around lines 1866 - 2131, registerVmInstance currently persists only VolumeVO,
VolumeSnapshot(*/Tree) and VmInstanceVO; you must also persist
vmMetadata.vmNicVOs, vmMetadata.volumeSnapshotReference* (refs),
vmMetadata.vmSystemTags and vmMetadata.vmResourceConfigs so DB matches metadata.
In register-volume/register-snapshot/register-vmInstance flows: convert vmNicVOs
to VmNicVO objects, set their vmInstanceUuid/owner fields and
dbf.persistCollection them; convert snapshot reference entries to the
appropriate VolumeSnapshotReferenceVO/VolumeSnapshotReferenceTreeVO (use the
same pattern as VolumeSnapshotVO/TreeVO), set
primaryStorageUuid/volumeUuid/treeUuid and dbf.persist or persistCollection
them; restore and fix the commented system tag/resource config logic
(SystemTagVO and ResourceConfigVO creation) to correctly match system tags via
getResourceSystemTagFromSystem and resource configs via
getResourceConfigFromSystem, set resourceUuid to newVm.getUuid(), then
dbf.persistCollection for tags and configs; ensure
accountUuid/primaryStorageUuid fields are set where applicable and that all
dbf.persist/persistCollection calls occur before
completion.success(vmInstanceInventory).
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java-897-925 (1)

897-925: ⚠️ Potential issue | 🟠 Major

Full refresh and initialization bypass the earlier KVM filter.

markDirty() explicitly supports only UserVm + KVM, but these SQL statements create dirty rows / advance pagination by type='UserVm' alone. Once the feature flag is turned on or after an upgrade, non-KVM UserVm instances are also picked up by the Poller and sent UpdateVmInstanceMetadataMsg, producing nothing but useless flushes and retry noise.

💡 Suggested fix
-                "SELECT v.uuid, 1, 1 FROM VmInstanceVO v " +
-                "WHERE v.type = 'UserVm' AND v.uuid > :lastUuid " +
+                "SELECT v.uuid, 1, 1 FROM VmInstanceVO v " +
+                "WHERE v.type = 'UserVm' AND v.hypervisorType = 'KVM' AND v.uuid > :lastUuid " +
                 "ORDER BY v.uuid ASC LIMIT :batchSize")
@@
-                "WHERE v.type = 'UserVm' AND v.uuid > :lastUuid " +
+                "WHERE v.type = 'UserVm' AND v.hypervisorType = 'KVM' AND v.uuid > :lastUuid " +
                 "ORDER BY v.uuid ASC LIMIT :batchSize")
@@
-            List<String> batch = SQL.New("SELECT v.uuid FROM VmInstanceVO v " +
-                "WHERE v.type = 'UserVm' AND v.uuid > :lastUuid " +
+            List<String> batch = SQL.New("SELECT v.uuid FROM VmInstanceVO v " +
+                "WHERE v.type = 'UserVm' AND v.hypervisorType = 'KVM' AND v.uuid > :lastUuid " +
                 "ORDER BY v.uuid ASC LIMIT :batchSize", String.class)
@@
-                        "SELECT v.uuid, 1, 0 FROM VmInstanceVO v " +
-                        "LEFT JOIN VmMetadataDirtyVO d ON v.uuid = d.vmInstanceUuid " +
-                        "WHERE v.type = 'UserVm' AND v.uuid > :lastUuid AND d.vmInstanceUuid IS NULL " +
+                        "SELECT v.uuid, 1, 0 FROM VmInstanceVO v " +
+                        "LEFT JOIN VmMetadataDirtyVO d ON v.uuid = d.vmInstanceUuid " +
+                        "WHERE v.type = 'UserVm' AND v.hypervisorType = 'KVM' AND v.uuid > :lastUuid AND d.vmInstanceUuid IS NULL " +
                         "ORDER BY v.uuid ASC LIMIT :batchSize")
@@
-                    List<String> batchUuids = SQL.New("SELECT v.uuid FROM VmInstanceVO v " +
-                        "WHERE v.type = 'UserVm' AND v.uuid > :lastUuid " +
+                    List<String> batchUuids = SQL.New("SELECT v.uuid FROM VmInstanceVO v " +
+                        "WHERE v.type = 'UserVm' AND v.hypervisorType = 'KVM' AND v.uuid > :lastUuid " +
                         "ORDER BY v.uuid ASC", String.class)

Also applies to: 995-1015

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 897 - 925, The SQL batches in VmMetadataDirtyMarker (used by
markDirty()) currently only filter on v.type='UserVm' and thus include non-KVM
UserVm rows; update each SQL.New(...) statement that inserts, updates, or
selects VmInstanceVO rows to also filter by the KVM hypervisor (e.g. add "AND
v.hypervisorType = 'KVM'" or a parameterized "AND v.hypervisorType =
:hypervisor" to the WHERE clauses), pass the hypervisor param where used, and
ensure both the insert-ignore, update, and subsequent SELECT for pagination use
this additional constraint so only UserVm + KVM instances are processed.
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java-518-526 (1)

518-526: ⚠️ Potential issue | 🟠 Major

The global concurrency cap can be bypassed under contention.

get() >= maxConcurrent followed by incrementAndGet() is not atomic. When triggerFlushForVm() and the Poller submit concurrently, multiple threads can all pass the check first and then bump globalFlushInFlight together, so the actual number of in-flight flushes exceeds the configured value.

💡 Suggested fix
         int maxConcurrent = VmGlobalConfig.VM_METADATA_GLOBAL_MAX_CONCURRENT.value(Integer.class);
-        if (globalFlushInFlight.get() >= maxConcurrent) {
-            // 全局并发已满,释放 claim,Poller 下轮重试
-            releaseClaim(vmUuid);
-            return;
+        while (true) {
+            int current = globalFlushInFlight.get();
+            if (current >= maxConcurrent) {
+                // 全局并发已满,释放 claim,Poller 下轮重试
+                releaseClaim(vmUuid);
+                return;
+            }
+            if (globalFlushInFlight.compareAndSet(current, current + 1)) {
+                break;
+            }
         }
-        globalFlushInFlight.incrementAndGet();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 518 - 526, the current check-then-increment (globalFlushInFlight.get() >=
maxConcurrent followed by globalFlushInFlight.incrementAndGet()) is not atomic and can be bypassed concurrently; in the
trigger paths of VmMetadataDirtyMarker (triggerFlushForVm() / Poller), use an atomic CAS
loop or another atomic operation to "check and claim a slot": read globalFlushInFlight, and if it is >=
VmGlobalConfig.VM_METADATA_GLOBAL_MAX_CONCURRENT.value(...), call
releaseClaim(vmUuid) and return; otherwise proceed only after globalFlushInFlight.compareAndSet(old, old+1)
succeeds; make sure all early-return and failure paths still release/decrement correctly (matching the original releaseClaim behavior) to avoid counter leaks.
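The CAS pattern the fix describes can be shown as a small self-contained limiter (FlushSlotLimiter is an illustrative name, not the actual class): compareAndSet turns "check the limit and take a slot" into one atomic step, so concurrent callers cannot all pass the check and then over-increment together.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Demo of the atomic check-and-claim pattern from the suggested fix.
class FlushSlotLimiter {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final int maxConcurrent;

    FlushSlotLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    boolean tryAcquire() {
        while (true) {
            int current = inFlight.get();
            if (current >= maxConcurrent) {
                return false;       // caller releases its claim and retries later
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;        // slot claimed atomically
            }
            // lost the race to another thread; re-read and retry
        }
    }

    void release() {
        inFlight.decrementAndGet();
    }
}
```

Every code path that acquired a slot must pair it with release(), mirroring the counter-leak warning in the prompt above.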
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java-1851-1854 (1)

1851-1854: ⚠️ Potential issue | 🟠 Major

The failure branch does not return the error to the API caller.

It calls publish(event) directly without event.setError(errorCode). The caller receives a "successful" response with neither inventory nor error; the failure is swallowed.

💡 Suggested fix
                     @Override
                     public void fail(ErrorCode errorCode) {
+                        event.setError(errorCode);
                         bus.publish(event);
                         chain.next();
                     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java`
around lines 1851 - 1854, the overridden fail(ErrorCode errorCode)
branch in PrimaryStorageBase does not return the error to the API; call event.setError(errorCode) before publishing the event (setting the event's success/failure fields as needed), then call
bus.publish(event) and chain.next(), so the caller receives the error instead of an empty "success" response.
🟡 Minor comments (15)
conf/db/upgrade/V4.10.29__schema.sql-14-17 (1)

14-17: ⚠️ Potential issue | 🟡 Minor

Qualify everything with the zstack schema consistently.

The new table names here are already explicitly qualified with zstack, but the foreign-key references and the cleanup statement still use unqualified table names. That is inconsistent with the conventions of the upgrade scripts in this directory, and it makes execution depend on the current default schema. Change these references uniformly to zstack....

Suggested fix
     CONSTRAINT `fkVmMetadataDirtyVOVmInstanceEO` FOREIGN KEY (`vmInstanceUuid`)
-        REFERENCES `VmInstanceEO` (`uuid`) ON DELETE CASCADE,
+        REFERENCES `zstack`.`VmInstanceEO` (`uuid`) ON DELETE CASCADE,
     CONSTRAINT `fkVmMetadataDirtyVOManagementNodeVO` FOREIGN KEY (`managementNodeUuid`)
-        REFERENCES `ManagementNodeVO` (`uuid`) ON DELETE SET NULL
+        REFERENCES `zstack`.`ManagementNodeVO` (`uuid`) ON DELETE SET NULL
@@
     CONSTRAINT `fkVmMetadataPathFingerprintVOVmInstanceEO` FOREIGN KEY (`vmInstanceUuid`)
-        REFERENCES `VmInstanceEO` (`uuid`) ON DELETE CASCADE
+        REFERENCES `zstack`.`VmInstanceEO` (`uuid`) ON DELETE CASCADE
@@
-DELETE FROM `GarbageCollectorVO` WHERE `name` LIKE 'update-vm-%-metadata-gc';
+DELETE FROM `zstack`.`GarbageCollectorVO` WHERE `name` LIKE 'update-vm-%-metadata-gc';

As per coding guidelines: In ZStack, the database schema name is fixed as 'zstack'. For SQL upgrade scripts under conf/db/upgrade, continue using TABLE_SCHEMA = 'zstack' and reference tables as zstack.UsedIpVO.

Also applies to: 31-32, 36-36

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@conf/db/upgrade/V4.10.29__schema.sql` around lines 14 - 17, The foreign key
constraints fkVmMetadataDirtyVOVmInstanceEO and
fkVmMetadataDirtyVOManagementNodeVO currently reference VmInstanceEO and
ManagementNodeVO without the fixed schema; update these references to use the
zstack schema prefix (e.g. zstack.`VmInstanceEO`, zstack.`ManagementNodeVO`) and
similarly update any related cleanup/drop statements in this script that
reference tables (e.g. UsedIpVO and other tables noted) so all table references
consistently use zstack.`...` as per upgrade script conventions.
docs/design/vm-metadata-04d-sblk读取与恢复.md-5-5 (1)

5-5: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to these fenced code blocks.

All of these locations trigger markdownlint MD040. Most are flow descriptions or pseudocode, so tagging them `text` is enough; keep a specific language such as `python` or `bash` for the few blocks with real code.

Also applies to: 31-31, 50-50, 72-72, 109-109, 128-128, 159-159, 173-173, 196-196, 231-231, 263-263, 293-293, 351-351, 378-378

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-04d-sblk读取与恢复.md` at line 5, several fenced code
blocks in the document lack language identifiers, triggering markdownlint MD040; add a
language identifier to every triple-backtick code block (tag flow/pseudocode blocks as
text by default, keep real code blocks as python, bash, etc.) to clear the MD040 warnings and keep syntax highlighting consistent.
docs/design/vm-metadata-04b-sblk二进制布局.md-15-15 (1)

15-15: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to these fenced code blocks.

All of these locations trigger markdownlint MD040. They are mostly layout diagrams and pseudocode, so tagging them `text` is sufficient; use a specific language for real code samples.

Also applies to: 155-155, 200-200, 224-224

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-04b-sblk二进制布局.md` at line 15, The fenced code blocks
in docs/design/vm-metadata-04b-sblk二进制布局.md are missing language identifiers and
trigger markdownlint MD040; update each triple-backtick fence at the indicated
locations (including the blocks around lines 15, 155, 200, 224) to use a
language tag (use "text" for layout/pseudocode blocks) by changing ``` to
```text so the linter recognizes the block language.
docs/design/vm-metadata-00-设计约束索引.md-33-45 (1)

33-45: ⚠️ Potential issue | 🟡 Minor

The constraint total does not match the Part 2 count.

Counting the current table at lines 37-45 row by row, Part 2 actually lists 9 constraints, but line 81 still says 8 and a total of 47. This index document is the entry point for later cross-references; mismatched numbers will skew reconciliation between tests and the design.

📝 Suggested change
-**总计**:6(Part 1b)+ 11(Part 1c)+ 8(Part 2)+ 14(Part 2b)+ 8(Part 3)= **47 条约束**
+**总计**:6(Part 1b)+ 11(Part 1c)+ 9(Part 2)+ 14(Part 2b)+ 8(Part 3)= **48 条约束**

Also applies to: 81-81

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-00-设计约束索引.md` around lines 33 - 45, The Part 2
constraints table lists nine items (C-DM-01 through C-TF-09) but the document
still shows "8 条 / 总计 47 条" on Line 81; update that summary to "9 条 / 总计 48 条"
(and any other occurrences of the old counts) so the Part 2 count and overall
total match the table entries; locate the table under the "Part 2 — Dirty Mark &
Poller" heading and the summary text near Line 81 to apply the change.
docs/design/vm-metadata-00-设计约束索引.md-73-73 (1)

73-73: ⚠️ Potential issue | 🟡 Minor

The quotation marks on line 73 are unpaired.

The cell 回滚"由外到内"+ 空树清理 SQL hurts the readability of the table; use paired quotes and fix the spacing while at it.

📝 Suggested change
-| C-03-4 | 回滚"由外到内"+ 空树清理 SQL | §9 |
+| C-03-4 | 回滚“由外到内” + 空树清理 SQL | §9 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-00-设计约束索引.md` at line 73, The table entry for row
C-03-4 contains an unmatched quote and poor spacing ("回滚"由外到内"+ 空树清理 SQL);
update the cell text (identify by the symbol C-03-4) to use paired quotes and
consistent spacing, e.g. change to 回滚“由外到内”+ 空树清理 SQL or 回滚 "由外到内" + 空树清理 SQL so
quotes are balanced and spaces around the + are normalized.
docs/design/vm-metadata-07d-性能与补充测试.md-91-95 (1)

91-95: ⚠️ Potential issue | 🟡 Minor

The PERF-32 row has one more column than the header.

The PERF-32 row on line 95 has 5 cells, but the header defines only 4 columns. Markdown renders the table incorrectly and truncates the last column. Merge the condition into the scenario column:

📝 Suggested change
-| PERF-32 | 大量退避行跳过 | 500 dirty 行中 450 行有 nextRetryTime > now | WHERE 过滤效率 | < 5ms |
+| PERF-32 | 大量退避行跳过(500 dirty 行中 450 行有 nextRetryTime > now) | WHERE 过滤效率 | < 5ms |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-07d-性能与补充测试.md` around lines 91 - 95, The PERF-32
table row has one extra cell causing a 5th column mismatch with the 4-column
header; update the PERF-32 row so it only has four pipe-separated cells by
merging the "500 dirty 行中 450 行有 nextRetryTime > now" condition into the "场景"
column (i.e., make PERF-32's 场景 include the nextRetryTime condition) and keep
the 度量指标 and 基准 cells as-is so the table columns align.
docs/design/vm-metadata-01b-API拦截与VM解析.md-391-393 (1)

391-393: ⚠️ Potential issue | 🟡 Minor

Malformed table row: two API records on one line.

Line 393 joins APIUpdateVmInstanceMsg and APIRecoverVmInstanceMsg into a single table row, separated by ||. Markdown cannot parse this, so the second record renders incompletely or is dropped.

📝 Suggested change
-| `APIUpdateVmInstanceMsg` | CONFIG | `DirectVmUuidResolver` | false || `APIRecoverVmInstanceMsg` | CONFIG | `DirectVmUuidResolver` | false |
+| `APIUpdateVmInstanceMsg` | CONFIG | `DirectVmUuidResolver` | false |
+| `APIRecoverVmInstanceMsg` | CONFIG | `DirectVmUuidResolver` | false |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-01b-API拦截与VM解析.md` around lines 391 - 393, The table
row mistakenly combines two API records into one using "||"; split the single
line into two separate Markdown table rows so `APIUpdateVmInstanceMsg` and
`APIRecoverVmInstanceMsg` each have their own row with the same columns (Level:
CONFIG, Resolver: DirectVmUuidResolver, updateOnFailure: false), ensuring each
row begins and ends with proper pipe delimiters and matches the table header
format.
docs/design/vm-metadata-01b-API拦截与VM解析.md-79-88 (1)

79-88: ⚠️ Potential issue | 🟡 Minor

The code blocks need language identifiers.

The fenced code blocks at lines 79-88 and 269-273 lack language identifiers, violating MD040. Use the `text` identifier.

Also, the table at lines 391-393 is malformed: two table rows are merged into one line and must be split into separate rows.

📝 Suggested changes

Lines 79-88:

-```
+```text

Lines 269-273:

-```
+```text

Lines 391-393: the second table row must start on its own line instead of sharing a line with the first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-01b-API拦截与VM解析.md` around lines 79 - 88, Add the
missing language identifiers for the fenced code blocks around the described
flow and the other block by changing their opening fences to use ```text (refer
to the fenced code blocks currently showing the timing flow and the other block
at the second location) and fix the malformed table by splitting the merged row
into two separate lines so the second table row begins on its own line (the
problematic table appears where the lines around 391-393 are merged).
docs/design/vm-metadata-01c-存储层与模板虚拟机.md-227-243 (1)

227-243: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to these three fenced code blocks

The three code blocks at lines 227-243, 287-300, and 372-383 all lack a language identifier, which triggers markdownlint MD040 and loses syntax highlighting. They show SQL statements, a timing diagram, and an ASCII structure diagram; tagging them uniformly as `text` is sufficient.

📝 Example change
-```
+```text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-01c-存储层与模板虚拟机.md` around lines 227 - 243, Add
explicit language identifiers ("text") to the three fenced code blocks that lack
them: the block beginning with "Step 1: 暂停该 VM 的 Poller flush" (which contains
the SQL INSERT/UPDATE), the subsequent block that shows the time-series diagram,
and the block containing the ASCII storage/structure diagram; replace each
opening ``` with ```text so markdownlint MD040 is satisfied and syntax
highlighting is applied.
docs/design/vm-metadata-01c-存储层与模板虚拟机.md-5-9 (1)

5-9: ⚠️ Potential issue | 🟡 Minor

Fix the broken anchors in the table of contents

  1. The entry on line 8, "新增/修改代码文件清单", has no matching section in the body; the document jumps from section 2 straight to section 3.
  2. Line 9 numbers "约束与不変量" as section 4, but in the body that heading is section 3.
  3. The anchor on line 6, #16-存储迁移-poller-暂停的崩溃恢复, does not match the actual heading, whose "(H3 修复)" suffix changes the auto-generated anchor.
Suggested change
-3. [新增/修改代码文件清单](`#3-新增修改代码文件清单`)
-4. [约束与不変量](`#4-约束与不変量`)
+3. [约束与不変量](`#3-约束与不変量`)

The line-6 anchor must be adjusted to match the actual heading (including its suffix):

-   1.6. [存储迁移 Poller 暂停的崩溃恢复](`#16-存储迁移-poller-暂停的崩溃恢复`)
+   1.6. [存储迁移 Poller 暂停的崩溃恢复](`#16-存储迁移-poller-暂停的崩溃恢复h3-修复`)

Alternatively, change the heading itself: drop the parenthesized suffix and turn it into a plain note.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-01c-存储层与模板虚拟机.md` around lines 5 - 9, fix the broken
TOC anchors by making every TOC entry match its body heading exactly: change the line-6 anchor
`#16-存储迁移-poller-暂停的崩溃恢复` to include the actual heading suffix (the "(H3 修复)" part)
or remove that suffix from the body heading to keep the anchor simple; make the line-8 entry
"新增/修改代码文件清单" point to a real section (or add that section to the body); and renumber the
line-9 entry "约束与不変量" to match the body, where that heading is currently section 3, so the TOC
text and anchor identifiers agree exactly with the corresponding headings and the auto-generated anchors resolve.
docs/design/vm-metadata-02b-高可用与运维.md-675-683 (1)

675-683: ⚠️ Potential issue | 🟡 Minor

The SQL strategy in the comparison table still uses the old wording.

§9.2 has already been changed to the two-step INSERT IGNORE + UPDATE semantics, but this table still says INSERT ... ON DUPLICATE KEY UPDATE; later implementation or review can easily be misled by it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-02b-高可用与运维.md` around lines 675 - 683, the "SQL
策略" column in the table still lists `INSERT ... ON DUPLICATE KEY UPDATE`, which contradicts the
two-step semantics agreed in §9.2; change that cell to match §9.2 (e.g. "INSERT IGNORE + UPDATE
(two-step semantics)" or an equivalent wording), and add a parenthetical cross-reference to §9.2
to avoid confusion and keep the terminology consistent.
docs/design/vm-metadata-02-脏标记与Poller.md-198-228 (1)

198-228: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to these fenced code blocks.

These blocks currently keep triggering markdownlint MD040 and lose syntax highlighting. Tag each with the closest language, e.g. `java`, `sql`, `json`, or `text`.

Also applies to: 647-647, 795-795, 847-847, 878-878

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-02-脏标记与Poller.md` around lines 198 - 228, Add
language identifiers to the fenced code blocks in
docs/design/vm-metadata-02-脏标记与Poller.md so markdownlint MD040 stops firing and
syntax highlighting works; for the SQL/DB flow blocks (showing INSERT IGNORE /
DELETE / UPDATE sequences) use "sql", for the Java method overload snippet that
includes "public boolean markDirty(String vmInstanceUuid) { return
markDirty(vmInstanceUuid, false); }" use "java", and choose appropriate tags
(e.g., "text" or "json") for the other examples mentioned (lines around 647,
795, 847, 878) to match their content. Ensure each triple-backtick fence is
immediately followed by the language token.
docs/design/vm-metadata-02b-高可用与运维.md-171-199 (1)

171-199: ⚠️ Potential issue | 🟡 Minor

The recovery-strategy description here no longer matches later sections.

This passage says "after retries are exhausted, rely only on the next API-triggered retry", but §8.2.3 of the same document and Part 2 §4.8 have since introduced the lastFlushFailed + MetadataStaleRecoveryTask background requeue path. Unify the wording, otherwise the implementation notes and the operations notes reach opposite conclusions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design/vm-metadata-02b-高可用与运维.md` around lines 171 - 199, The document
is inconsistent: §8.1 states "no background recovery, rely on API-triggered
retries", but §8.2.3 and Part 2 §4.8 introduce lastFlushFailed +
MetadataStaleRecoveryTask (background requeue), so reconcile them by choosing
one canonical recovery flow and updating all sections and cross-references
accordingly; either (A) remove/replace references to lastFlushFailed and
MetadataStaleRecoveryTask in §8.2.3 and Part 2 §4.8 and adjust examples to match
the API-triggered retry behavior described in §8.1, or (B) augment §8.1 to
describe how retryCount>=maxRetry interacts with lastFlushFailed +
MetadataStaleRecoveryTask (what sets lastFlushFailed, when
MetadataStaleRecoveryTask will requeue or escalate, and how
markDirty/nextRetryTime are affected), and then update the tables and alerting
language to include the background recovery path; update all mentions of
MetadataStaleEvent/MetadataStaleRecoveryTask/lastFlushFailed to the chosen
single behavior and ensure cross-section references point to the authoritative
description.
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java-7382-7387 (1)

7382-7387: ⚠️ Potential issue | 🟡 Minor

Do not backfill metadata on the failure branch.

Line 3386 aside, the issue is that rsp.metadata is still put into the reply when rsp.isSuccess() is false (line 7386), which can pass partial or stale content upstream. Set metadata only on the success branch.

Suggested change
             public void success(ReadVmInstanceMetadataRsp rsp) {
                 if (!rsp.isSuccess()) {
                     reply.setError(operr("operation error, because:%s", rsp.getError()));
+                } else {
+                    reply.setMetadata(rsp.metadata);
                 }
-                reply.setMetadata(rsp.metadata);
                 bus.reply(msg, reply);
                 completion.done();
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java` around lines 7382 -
7387, The success callback for ReadVmInstanceMetadataRsp improperly sets reply
metadata even when rsp.isSuccess() is false; update the anonymous
success(ReadVmInstanceMetadataRsp rsp) handler so that
reply.setMetadata(rsp.metadata) is only executed inside the successful branch
(i.e., when rsp.isSuccess() returns true) and ensure that when rsp.isSuccess()
is false you set reply.setError(...) and immediately bus.reply(msg, reply)
without touching metadata; locate the code by the
success(ReadVmInstanceMetadataRsp) method, the ReadVmInstanceMetadataRsp type,
and the reply.setMetadata / reply.setError calls to make the change.
header/src/main/java/org/zstack/header/storage/primary/APIRegisterVmInstanceMsg.java-59-61 (1)

59-61: ⚠️ Potential issue | 🟡 Minor

__example__() currently returns an empty message, so the generated API example is unusable.

At minimum, fill in the required fields metadataPath, primaryStorageUuid, and clusterUuid; otherwise the Groovy/Markdown docs cannot show a minimal callable request. As per coding guidelines, "API classes must implement the example method so API documentation can be generated, and must produce the corresponding Groovy API Template and API Markdown files."

📝 Suggested example
 public static APIRegisterVmInstanceMsg __example__() {
     APIRegisterVmInstanceMsg msg = new APIRegisterVmInstanceMsg();
+    msg.setMetadataPath("/path/to/vm-metadata.json");
+    msg.setPrimaryStorageUuid(uuid());
+    msg.setClusterUuid(uuid());
     return msg;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/storage/primary/APIRegisterVmInstanceMsg.java`
around lines 59 - 61, The __example__ method in APIRegisterVmInstanceMsg returns
an empty message; update APIRegisterVmInstanceMsg.__example__() to construct and
return a minimal valid example by setting required fields such as metadataPath,
primaryStorageUuid and clusterUuid (and any other mandatory properties used by
validation) on the APIRegisterVmInstanceMsg instance so the generated
Groovy/Markdown API template and documentation show a runnable minimal request.

Comment on lines +73 to +77
private static final List<String> SUPPORTED_PS_TYPES = List.of(
"SharedBlock",
"LocalStorage",
"NFS"
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== build files =="
fd 'pom\.xml$|build\.gradle$|build\.gradle\.kts$' -a

echo
echo "== Maven compiler settings =="
rg -n --glob 'pom.xml' '<maven.compiler.source>|<maven.compiler.target>|<release>|<source>|<target>' .

echo
echo "== Gradle Java compatibility settings =="
rg -n --glob 'build.gradle' --glob 'build.gradle.kts' 'sourceCompatibility|targetCompatibility|JavaVersion' .

Repository: MatheMatrix/zstack

Length of output: 21864


Switch to Arrays.asList(...) wrapped in Collections.unmodifiableList(...).

The repository targets Java 8 (the root pom.xml sets maven.compiler.source and maven.compiler.target to 8 at lines 21-22), but List.of(...) is only available since Java 9, so this will fail to compile.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/MetadataOrphanDetector.java`
around lines 73 - 77, The static field SUPPORTED_PS_TYPES in
MetadataOrphanDetector uses List.of(...) which requires Java 9+; replace it with
a Java 8 compatible immutable list, e.g., create the list with
Arrays.asList("SharedBlock", "LocalStorage", "NFS") and wrap with
Collections.unmodifiableList(...) (or use Collections.unmodifiableList(new
ArrayList<>(Arrays.asList(...)))) to maintain immutability; update the
SUPPORTED_PS_TYPES declaration in class MetadataOrphanDetector accordingly.
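The suggested replacement can be sketched in isolation. `SupportedPsTypes` below is a hypothetical stand-in class, not the actual `MetadataOrphanDetector`; it only demonstrates the Java-8-compatible immutable-list idiom:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SupportedPsTypes {
    // Java-8-compatible immutable list; List.of(...) would require Java 9+.
    private static final List<String> SUPPORTED_PS_TYPES =
            Collections.unmodifiableList(Arrays.asList("SharedBlock", "LocalStorage", "NFS"));

    public static boolean isSupported(String psType) {
        return SUPPORTED_PS_TYPES.contains(psType);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("SharedBlock")); // true
        try {
            SUPPORTED_PS_TYPES.add("Ceph");
        } catch (UnsupportedOperationException e) {
            System.out.println("list is immutable"); // the wrapper rejects mutation
        }
    }
}
```

Unlike a bare Arrays.asList result (fixed-size but element-replaceable via set), the unmodifiable wrapper rejects all mutation, matching the immutability List.of would have provided.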

Comment on lines +490 to +496
// Step 2: 查询刚认领到的行(DP-01 修复:增加 lastClaimTime 过滤,
// 仅返回本轮 CAS 认领的行,避免与 triggerFlushForVm 并发认领的行混入)
Timestamp thisCycleCutoff = Timestamp.from(Instant.now().minus(Duration.ofSeconds(5)));
return Q.New(VmMetadataDirtyVO.class)
.eq(VmMetadataDirtyVO_.managementNodeUuid, Platform.getManagementServerId())
.gte(VmMetadataDirtyVO_.lastClaimTime, thisCycleCutoff)
.list();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

This treats rows just claimed by other code paths as results of the current Poller cycle.

lastClaimTime >= now-5s cannot distinguish "just claimed by this cycle's claimDirtyRows()" from "claimed by triggerFlushForVm() a few seconds ago". Once the same vmUuid is submitted twice via submitFlushTask(), exceedMaxPendingCallback() runs releaseClaim(vmUuid) and prematurely releases a claim that is still flushing, so another management node can grab the row and write concurrently. Re-query by a unique claim token/batch marker instead of approximating with a time window.

Also applies to: 550-554

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 490 - 496, The query using lastClaimTime
(VmMetadataDirtyVO_.lastClaimTime >= thisCycleCutoff) cannot reliably
distinguish rows claimed by thisCycle's claimDirtyRows() from rows claimed
earlier by triggerFlushForVm(); change the claiming flow to generate a unique
claim token/batch id in claimDirtyRows(), persist it on VmMetadataDirtyVO (e.g.
add claimToken field), set that token when claiming rows, and then replace the
time-based filter in the retrieval code with an equality filter on
VmMetadataDirtyVO_.claimToken to return only rows belonging to the current claim
batch; also update submitFlushTask(), exceedMaxPendingCallback(), and
releaseClaim(vmUuid) handling to set/clear or check the claimToken
appropriately.
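The token-based fix described above can be sketched with plain collections. `DirtyRow`, `claimDirtyRows`, and `rowsOfBatch` are simplified stand-ins for `VmMetadataDirtyVO` and the real CAS queries, not the actual implementation; the point is that a per-batch token excludes a row claimed seconds earlier by another path even though it would pass the 5-second time-window filter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

public class ClaimTokenSketch {
    public static class DirtyRow {
        public final String vmUuid;
        public String claimToken;          // set when a cycle or triggerFlushForVm claims the row
        public long lastClaimTimeMillis;
        public DirtyRow(String vmUuid) { this.vmUuid = vmUuid; }
    }

    // Claim all unclaimed rows, stamping them with a fresh batch token.
    public static String claimDirtyRows(List<DirtyRow> rows, long now) {
        String token = UUID.randomUUID().toString();
        for (DirtyRow r : rows) {
            if (r.claimToken == null) {
                r.claimToken = token;
                r.lastClaimTimeMillis = now;
            }
        }
        return token;
    }

    // Token-based retrieval: only rows of this batch, regardless of claim time.
    public static List<String> rowsOfBatch(List<DirtyRow> rows, String token) {
        return rows.stream()
                .filter(r -> token.equals(r.claimToken))
                .map(r -> r.vmUuid)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        DirtyRow a = new DirtyRow("vm-a");            // unclaimed
        DirtyRow b = new DirtyRow("vm-b");            // claimed 2s ago by triggerFlushForVm
        b.claimToken = "trigger-flush";
        b.lastClaimTimeMillis = now - 2_000;
        List<DirtyRow> rows = Arrays.asList(a, b);

        String batch = claimDirtyRows(rows, now);
        // A 5s time-window filter would return both rows; the token returns only vm-a.
        System.out.println(rowsOfBatch(rows, batch)); // [vm-a]
    }
}
```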

Comment on lines +8 to +11
@RestResponse(allTo = "inventory")
public class APIRegisterVmInstanceFromMetadataEvent extends APIEvent {
private VmInstanceInventory inventory;
private List<String> warnings;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# 检查项目中使用 allTo = "inventory" 注解且包含多个字段的 API 事件类
rg -l '@RestResponse.*allTo.*inventory' --type java | xargs -I {} sh -c 'echo "=== {} ===" && rg -A 5 "class.*extends.*API" {}'

Repository: MatheMatrix/zstack

Length of output: 50376


🏁 Script executed:

# Get the full content of APIRegisterVmInstanceFromMetadataEvent.java
find . -name "APIRegisterVmInstanceFromMetadataEvent.java" -type f | head -1 | xargs wc -l

Repository: MatheMatrix/zstack

Length of output: 152


🏁 Script executed:

# Read the file content
find . -name "APIRegisterVmInstanceFromMetadataEvent.java" -type f | head -1 | xargs cat -n

Repository: MatheMatrix/zstack

Length of output: 1587


🏁 Script executed:

# Find similar classes with warnings field that might have `@RestResponse`
rg -l "warnings" --type java | xargs -I {} rg -B 2 "private List<String> warnings" {} | head -50

Repository: MatheMatrix/zstack

Length of output: 456


🏁 Script executed:

# Check APIReadVmInstanceMetadataReply specifically
fd "APIReadVmInstanceMetadataReply.java" | xargs cat -n

Repository: MatheMatrix/zstack

Length of output: 2245


Fix the @RestResponse annotation to expose all fields.

The class has two fields, inventory and warnings, but is annotated @RestResponse(allTo = "inventory"). That exposes only the inventory field in the REST response, so warnings is never serialized.

Change it to @RestResponse(fieldsTo = {"all"}) so both fields appear in the REST response. Also have __example__() set the warnings field (e.g. warnings = java.util.Collections.emptyList()) so the API docs show a complete example.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/vm/APIRegisterVmInstanceFromMetadataEvent.java`
around lines 8 - 11, Replace the `@RestResponse`(allTo = "inventory") on class
APIRegisterVmInstanceFromMetadataEvent with `@RestResponse`(fieldsTo = {"all"}) so
both fields (inventory and warnings) are exposed in the REST response, and
update the class' __example__() method to populate the warnings field (e.g.,
warnings = java.util.Collections.emptyList()) along with the existing inventory
example to ensure the API documentation includes both fields.

responseClass = APIReimageVmInstanceEvent.class,
category = "vmInstance"
)
@MetadataImpact(MetadataImpact.Impact.STORAGE)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# 查找 MetadataImpact 注解的定义位置
echo "=== 查找 MetadataImpact 定义 ==="
fd -e java MetadataImpact --exec grep -l "^public.*@interface\|^@interface\|^public `@interface`" {} \; 2>/dev/null || \
  ast-grep --pattern '@interface MetadataImpact { $$$ }'

echo ""
echo "=== 查找 MetadataImpact 类/注解声明 ==="
rg -n "(`@interface`|class|enum)\s+MetadataImpact" --type java

echo ""
echo "=== 检查其他文件如何 import MetadataImpact ==="
rg -n "import.*MetadataImpact" --type java | head -10

Repository: MatheMatrix/zstack

Length of output: 1732


🏁 Script executed:

cat -n header/src/main/java/org/zstack/header/vm/APIReimageVmInstanceMsg.java | head -30

Repository: MatheMatrix/zstack

Length of output: 1147


Missing import for the MetadataImpact annotation

The @MetadataImpact(MetadataImpact.Impact.STORAGE) annotation was added without a corresponding import statement. If MetadataImpact is defined in a different package, the missing import breaks compilation; note, however, that if it lives in the same package as this file (org.zstack.header.vm), Java resolves it without an import and this finding does not apply.

🔧 Suggested import
 import org.springframework.http.HttpMethod;
 import org.zstack.header.message.APIMessage;
 import org.zstack.header.message.APIParam;
 import org.zstack.header.rest.RestRequest;
+import org.zstack.header.vm.MetadataImpact;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/APIReimageVmInstanceMsg.java` at
line 18, Add an explicit import for the MetadataImpact annotation so the
`@MetadataImpact`(MetadataImpact.Impact.STORAGE) usage in APIReimageVmInstanceMsg
compiles; specifically import the MetadataImpact type (e.g., MetadataImpact) at
the top of the file before the class declaration to resolve the annotation
reference used on APIReimageVmInstanceMsg.

Comment on lines +3814 to +3823
bus.send(umsg, new CloudBusCallBack(msg) {
@Override
public void run(MessageReply r) {
UpdateVmInstanceMetadataOnPrimaryStorageReply reply = new UpdateVmInstanceMetadataOnPrimaryStorageReply();
if (!r.isSuccess()) {
reply.setError(Platform.operr("failed to update vm[uuid=%s] on hypervisor.", self.getUuid())
.withCause(r.getError()));
}
bus.reply(msg, reply);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

请通过 completion 返回结果,不要在这里直接 bus.reply()

这个 override 已经拿到了 ReturnValueCompletion,但当前分支完全绕过了它;这和本类其他 handle(..., completion) 的契约不一致,外层调用方很容易卡住,或者出现重复 reply。顺手也把失败文案里的资源标识修掉:这里现在把 self.getUuid() 当成了 VM UUID,实际它是主存储 UUID。

🛠️ 建议修改
         bus.send(umsg, new CloudBusCallBack(msg) {
             `@Override`
             public void run(MessageReply r) {
                 UpdateVmInstanceMetadataOnPrimaryStorageReply reply = new UpdateVmInstanceMetadataOnPrimaryStorageReply();
                 if (!r.isSuccess()) {
-                    reply.setError(Platform.operr("failed to update vm[uuid=%s] on hypervisor.", self.getUuid())
-                            .withCause(r.getError()));
+                    completion.fail(Platform.operr(
+                            "failed to update vm metadata on hypervisor for root volume[uuid:%s]",
+                            msg.getRootVolumeUuid()
+                    ).withCause(r.getError()));
+                    return;
                 }
-                bus.reply(msg, reply);
+                completion.success(reply);
             }
         });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageKvmBackend.java`
around lines 3814 - 3823, in LocalStorageKvmBackend replace the direct
bus.reply(msg, reply) inside the bus.send callback with the ReturnValueCompletion
(completion) passed in from the caller: in the callback, call
completion.success(reply) or completion.fail(err) depending on r.isSuccess(); do not call
bus.reply on msg directly, to avoid double replies or hanging the caller; also fix the
resource identifier in the error message, describing self.getUuid() as the primary
storage UUID rather than a VM UUID, and wrap r.getError() into an appropriate operr passed to
completion.fail.

Comment on lines +1907 to +2062
List<VolumeVO> volumes = new ArrayList<>();
volumesString.forEach(v -> volumes.add(JSONObjectUtil.toObject(v, VolumeVO.class)));

List<VolumeVO> newVolumes = new ArrayList<>();
volumes.forEach(v -> {
VolumeVO vo = new VolumeVO();
// vo.setRootImageUuid(vo.getRootImageUuid());
vo.setAccountUuid(msg.getSession().getAccountUuid());
vo.setPrimaryStorageUuid(msg.getPrimaryStorageUuid());
vo.setInstallPath(v.getInstallPath());

vo.setCreateDate(v.getCreateDate());
vo.setDescription(v.getDescription());
vo.setName(v.getName());
vo.setSize(v.getSize());
vo.setActualSize(v.getActualSize());
vo.setState(v.getState());
vo.setUuid(v.getUuid());
vo.setVmInstanceUuid(v.getVmInstanceUuid());
vo.setType(v.getType());
vo.setCreateDate(v.getCreateDate());
vo.setLastOpDate(v.getLastOpDate());
vo.setDeviceId(v.getDeviceId());
vo.setStatus(v.getStatus());
vo.setFormat(v.getFormat());
vo.setShareable(v.isShareable());
vo.setVolumeQos(v.getVolumeQos());
vo.setLastDetachDate(v.getLastDetachDate());
vo.setLastVmInstanceUuid(v.getLastVmInstanceUuid());
vo.setLastAttachDate(v.getLastAttachDate());
vo.setProtocol(v.getProtocol());
newVolumes.add(vo);
});
dbf.persistCollection(newVolumes);
trigger.next();
}
});

flow(new NoRollbackFlow() {
String __name__ = "register-snapshot";

@Override
public void run(FlowTrigger trigger, Map data) {
// snapshots
vmMetadata.volumeSnapshots.forEach((volumeUuid, snapshotList) -> {
// one volume can have multiple snapshot trees
// key = treeUuid
// value = snapshots
Map<String, List<VolumeSnapshotInventory>> snapshotsByTreeUuid = new HashMap<>();
snapshotList.forEach(snapshot -> {
VolumeSnapshotInventory inv = JSONObjectUtil.toObject(snapshot, VolumeSnapshotInventory.class);
if (snapshotsByTreeUuid.containsKey(inv.getTreeUuid())) {
snapshotsByTreeUuid.get(inv.getTreeUuid()).add(inv);
} else {
snapshotsByTreeUuid.put(inv.getTreeUuid(), new ArrayList<>());
snapshotsByTreeUuid.get(inv.getTreeUuid()).add(inv);
}
});

// iterate over each tree
snapshotsByTreeUuid.forEach((treeUuid, snapshots) -> {
// build the snapshot tree
VolumeSnapshotTree tree = VolumeSnapshotTree.fromInventories(snapshots);
// level-order traversal of the snapshots
List<VolumeSnapshotInventory> levelOrderTraversals = tree.levelOrderTraversal();
// check whether this tree has a latest node
boolean treeIsCurrent = levelOrderTraversals.stream().anyMatch(VolumeSnapshotInventory::isLatest);

// create the tree first: VolumeSnapshotVO has a foreign key on VolumeSnapshotTreeVO
VolumeSnapshotTreeVO newTree = new VolumeSnapshotTreeVO();
newTree.setCurrent(treeIsCurrent);
newTree.setVolumeUuid(volumeUuid);
newTree.setUuid(treeUuid);
newTree.setStatus(VolumeSnapshotTreeStatus.Completed);
dbf.persist(newTree);

// build VolumeSnapshotVO rows from the level-order traversal
levelOrderTraversals.forEach(snapshot -> {
VolumeSnapshotVO vo = new VolumeSnapshotVO();
vo.setPrimaryStorageUuid(msg.getPrimaryStorageUuid());
vo.setPrimaryStorageInstallPath(snapshot.getPrimaryStorageInstallPath());

vo.setName(snapshot.getName());
vo.setCreateDate(snapshot.getCreateDate());
vo.setDescription(snapshot.getDescription());
vo.setLastOpDate(snapshot.getLastOpDate());
vo.setParentUuid(snapshot.getParentUuid());
vo.setState(VolumeSnapshotState.valueOf(snapshot.getState()));
vo.setType(snapshot.getType());
vo.setVolumeUuid(snapshot.getVolumeUuid());
vo.setFormat(snapshot.getFormat());
vo.setUuid(snapshot.getUuid());
vo.setStatus(VolumeSnapshotStatus.valueOf(snapshot.getStatus()));
vo.setLatest(snapshot.isLatest());
vo.setSize(snapshot.getSize());
vo.setVolumeType(snapshot.getVolumeType());
vo.setTreeUuid(snapshot.getTreeUuid());
vo.setDistance(snapshot.getDistance());
dbf.persist(vo);
});
});
});

// snapshot groups
List<VolumeSnapshotGroupVO> newGroups = new ArrayList<>();
vmMetadata.volumeSnapshotGroupVO.forEach(group -> {
VolumeSnapshotGroupVO vo = JSONObjectUtil.toObject(group, VolumeSnapshotGroupVO.class);
vo.setAccountUuid(msg.getSession().getAccountUuid());
newGroups.add(vo);
});
dbf.persistCollection(newGroups);

// snapshot group refs
List<VolumeSnapshotGroupRefVO> newGroupRefs = new ArrayList<>();
vmMetadata.volumeSnapshotGroupRefVO.forEach(group -> {
VolumeSnapshotGroupRefVO vo = JSONObjectUtil.toObject(group, VolumeSnapshotGroupRefVO.class);
newGroupRefs.add(vo);
});
dbf.persistCollection(newGroupRefs);

trigger.next();
}
});

flow(new NoRollbackFlow() {
String __name__ = "register-vmInstance";

@Override
public void run(FlowTrigger trigger, Map data) {
VmInstanceVO metaVm = JSONObjectUtil.toObject(vmMetadata.vmInstanceVO, VmInstanceVO.class);
VmInstanceVO newVm = new VmInstanceVO();

newVm.setClusterUuid(msg.getClusterUuid());
newVm.setHostUuid(msg.getHostUuid());
// look for a cached tag LV to build the imageCache
// newVm.setImageUuid();

newVm.setUuid(metaVm.getUuid());
newVm.setName(metaVm.getName());
newVm.setDescription(metaVm.getDescription());
newVm.setType(metaVm.getType());
newVm.setHypervisorType(metaVm.getHypervisorType());
newVm.setCreateDate(metaVm.getCreateDate());
newVm.setLastOpDate(metaVm.getLastOpDate());
newVm.setState(metaVm.getState());
newVm.setRootVolumeUuid(metaVm.getRootVolumeUuid());
newVm.setInternalId(metaVm.getInternalId());
newVm.setCpuNum(metaVm.getCpuNum());
newVm.setCpuSpeed(metaVm.getCpuSpeed());
newVm.setMemorySize(metaVm.getMemorySize());
newVm.setReservedMemorySize(metaVm.getReservedMemorySize());
newVm.setAllocatorStrategy(metaVm.getAllocatorStrategy());
newVm.setPlatform(metaVm.getPlatform());
newVm.setArchitecture(metaVm.getArchitecture());
newVm.setGuestOsType(metaVm.getGuestOsType());
dbf.persist(newVm);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

This registration flow has no transaction/rollback protection.

It persists to multiple tables in sequence via persist/persistCollection, but every step is a NoRollbackFlow and error() only calls completion.fail(). If any of the snapshot/group/vm steps fails, the volume/tree/snapshot rows already written remain in the database, and the next retry will likely hit duplicate primary keys or leave orphan records.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java`
around lines 1907 - 2062, The flows "register-snapshot" and
"register-vmInstance" (and the earlier volume persist block) perform multiple
dbf.persist / dbf.persistCollection calls inside NoRollbackFlow, so partial
writes will remain on error; convert these NoRollbackFlow blocks into
transactional flows (or wrap the body in a single DB transaction) so that on
failure the transaction is rolled back: replace NoRollbackFlow with a
Flow/SyncFlow implementation that starts a DB transaction (via
dbf.beginTransaction()/commit()/rollback() or the project's transactional
helper), perform all dbf.persist/dbf.persistCollection inside that transaction,
and on exception call rollback and trigger.fail()/completion.fail(); ensure the
flow names ("register-snapshot", "register-vmInstance") and the volume persist
section use the same transaction so any error undoes prior writes to avoid
duplicate PKs or orphaned records.
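The all-or-nothing semantics requested here can be illustrated with a toy store. `FakeStore` and `runInTransaction` are illustrative stand-ins for the real dbf transaction helpers, not ZStack APIs; the point is that staged writes are either committed together or discarded together:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RegisterTxSketch {
    public static class FakeStore {
        public final List<String> committed = new ArrayList<>();

        // Buffer all writes, run the final step, commit only if everything succeeded.
        public void runInTransaction(List<String> writes, Runnable finalStep) {
            List<String> staged = new ArrayList<>();
            try {
                staged.addAll(writes);                  // e.g. persist volumes/trees/snapshots
                if (finalStep != null) finalStep.run(); // e.g. register-vmInstance
                committed.addAll(staged);               // commit
            } catch (RuntimeException e) {
                // rollback: staged writes are discarded, nothing leaks into the store
            }
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.runInTransaction(Arrays.asList("volume", "snapshot-tree"), () -> {
            throw new RuntimeException("register-vmInstance failed");
        });
        System.out.println(store.committed); // []  -- no orphaned rows after failure

        store.runInTransaction(Arrays.asList("volume", "snapshot-tree"), null);
        System.out.println(store.committed); // [volume, snapshot-tree]
    }
}
```

With the current NoRollbackFlow steps, the failing case above would instead leave "volume" and "snapshot-tree" committed, which is exactly the duplicate-PK/orphan-row hazard the review describes.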

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
header/src/main/java/org/zstack/header/network/l3/APIGetL3NetworkRouterInterfaceIpMsg.java (1)

28-33: ⚠️ Potential issue | 🟡 Minor

The __example__ method has the wrong return type (pre-existing issue)

Although this snippet is not changed by this PR, it contains a pre-existing copy-paste error: __example__ is declared to return APIGetL3NetworkMtuMsg, but the class is APIGetL3NetworkRouterInterfaceIpMsg. This produces incorrect API documentation.

🐛 Suggested fix
-    public static APIGetL3NetworkMtuMsg __example__() {
-        APIGetL3NetworkMtuMsg msg = new APIGetL3NetworkMtuMsg();
+    public static APIGetL3NetworkRouterInterfaceIpMsg __example__() {
+        APIGetL3NetworkRouterInterfaceIpMsg msg = new APIGetL3NetworkRouterInterfaceIpMsg();
         msg.setL3NetworkUuid(uuid());
 
         return msg;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/network/l3/APIGetL3NetworkRouterInterfaceIpMsg.java`
around lines 28 - 33, The __example__ method in class
APIGetL3NetworkRouterInterfaceIpMsg incorrectly declares and constructs
APIGetL3NetworkMtuMsg (copy/paste error); update the method signature return
type and the instantiated object to APIGetL3NetworkRouterInterfaceIpMsg and keep
setting the L3 network UUID via setL3NetworkUuid(uuid()) so the example returns
the correct message type.
🧹 Nitpick comments (6)
core/src/main/java/org/zstack/core/debug/APICleanQueueMsg.java (1)

33-33: Pre-existing naming issue: managementiUuid is misspelled.

The field name managementiUuid contains a stray letter i and should be managementUuid. This is a pre-existing issue outside the scope of this PR; consider fixing it in a follow-up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/src/main/java/org/zstack/core/debug/APICleanQueueMsg.java` at line 33,
the field name managementiUuid is misspelled and should be renamed to
managementUuid; rename the field in APICleanQueueMsg and update all related
accessors and references (getters/setters, constructors,
serialization/deserialization, JSON mappings, database column mappings, and
references from other classes) to avoid compile or runtime errors; if external
serialization formats must stay compatible, keep a backward-compatibility shim
(e.g., `@JsonProperty` or a temporary bridge method) until external consumers
have migrated.
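The backward-compatible rename suggested above can be sketched as follows. The class name and field are illustrative stand-ins, not the real APICleanQueueMsg.

```java
// Hypothetical sketch: rename the misspelled field but keep a deprecated
// bridge accessor so existing callers compile until they migrate.
public class CleanQueueMsgSketch {
    private String managementUuid; // corrected spelling

    public String getManagementUuid() { return managementUuid; }
    public void setManagementUuid(String uuid) { this.managementUuid = uuid; }

    /** Temporary bridge for callers still using the misspelled accessor. */
    @Deprecated
    public String getManagementiUuid() { return getManagementUuid(); }

    public static void main(String[] args) {
        CleanQueueMsgSketch msg = new CleanQueueMsgSketch();
        msg.setManagementUuid("36fca7ccf9544f5e9153e9b04c254e42");
        // Old and new accessors resolve to the same value during migration.
        System.out.println(msg.getManagementiUuid().equals(msg.getManagementUuid()));
    }
}
```

For JSON compatibility the bridge would additionally need a serialization-level alias (e.g., Jackson's `@JsonProperty`), which is omitted here to keep the sketch dependency-free.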
header/src/main/java/org/zstack/header/storage/backup/APIScanBackupStorageMsg.java (1)

21-24: __example__ method does not set example values.

The existing __example__ method returns a message instance without backupStorageUuid set, which leaves the API documentation example incomplete. Consider populating a sample UUID value.

♻️ Suggested improvement
 public static APIScanBackupStorageMsg __example__() {
     APIScanBackupStorageMsg msg = new APIScanBackupStorageMsg();
+    msg.setBackupStorageUuid(uuid());
     return msg;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/storage/backup/APIScanBackupStorageMsg.java`
around lines 21 - 24, The __example__ method in APIScanBackupStorageMsg returns
an instance without setting example data; update
APIScanBackupStorageMsg.__example__ to populate the backupStorageUuid field with
a sample UUID (e.g., a valid UUID string) on the msg instance before returning
so the API example includes a concrete backupStorageUuid value.
header/src/main/java/org/zstack/header/storage/primary/APIGetPrimaryStorageUsageReportMsg.java (1)

8-9: Import order could be improved

The newly added org.zstack.header.vm.MetadataImpact import is placed after java.util.List; it should be grouped with the other org.zstack imports (lines 4-6) to keep the import grouping consistent.

♻️ Suggested change
 import org.zstack.header.message.APIParam;
 import org.zstack.header.message.APISyncCallMessage;
 import org.zstack.header.rest.RestRequest;
+import org.zstack.header.vm.MetadataImpact;

 import java.util.List;
-import org.zstack.header.vm.MetadataImpact;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/storage/primary/APIGetPrimaryStorageUsageReportMsg.java`
around lines 8 - 9, The import order is inconsistent: move the new
org.zstack.header.vm.MetadataImpact import to the group with the other
org.zstack imports in APIGetPrimaryStorageUsageReportMsg so all org.zstack.*
imports (e.g., existing imports on lines 4-6) are together and java.util.List
remains in the standard java.* group; update the import section to follow the
project's grouping (java.*, javax.*, third-party, org.zstack.*) and keep
MetadataImpact adjacent to the other org.zstack imports.
header/src/main/java/org/zstack/header/network/service/APIDetachNetworkServiceProviderFromL2NetworkMsg.java (1)

7-11: Formatting inconsistency: extra blank lines around the annotation.

The blank lines at lines 8 and 10 make the formatting inconsistent with other files in this PR. Remove the extra blank lines so the annotation sits directly between the imports and the class declaration.

♻️ Suggested fix
 import org.zstack.header.network.l2.L2NetworkVO;
 import org.zstack.header.vm.MetadataImpact;
-
 `@MetadataImpact`(MetadataImpact.Impact.NONE)
-
 public class APIDetachNetworkServiceProviderFromL2NetworkMsg extends APIMessage {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/network/service/APIDetachNetworkServiceProviderFromL2NetworkMsg.java`
around lines 7 - 11, Remove the extra blank lines around the `@MetadataImpact`
annotation: in APIDetachNetworkServiceProviderFromL2NetworkMsg, place the
import statement import org.zstack.header.vm.MetadataImpact;, the annotation
`@MetadataImpact`(MetadataImpact.Impact.NONE), and the class declaration public
class APIDetachNetworkServiceProviderFromL2NetworkMsg extends APIMessage on
consecutive lines, so the annotation sits directly against the import and the
class declaration, matching the formatting of other files in the repository.
header/src/main/java/org/zstack/header/network/l2/APICreateL2NetworkMsg.java (1)

47-49: Inconsistent blank line after the annotation.

The blank line at line 48 leaves an extra blank line between the @MetadataImpact annotation and the class declaration, inconsistent with other files in this PR (e.g., APIDeleteL2NetworkMsg.java). Delete the blank line for consistency.

♻️ Suggested fix
 `@MetadataImpact`(MetadataImpact.Impact.NONE)
-
 public abstract class APICreateL2NetworkMsg extends APICreateMessage implements APIAuditor {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/network/l2/APICreateL2NetworkMsg.java`
around lines 47 - 49, Remove the extra blank line between the `@MetadataImpact`
annotation and the class declaration in APICreateL2NetworkMsg so the annotation
sits immediately above the class definition (i.e., ensure `@MetadataImpact`(...)
is directly followed by "public abstract class APICreateL2NetworkMsg ..."),
matching the formatting used in files like APIDeleteL2NetworkMsg.java.
header/src/main/java/org/zstack/header/storage/primary/APIGetVmInstanceMetadataFromPrimaryStorageMsg.java (1)

10-18: Consider including the primary storage UUID in the REST path

The current API path /primary-storage/vm-instances/metadata does not include a {uuid} path variable to identify the target primary storage. By contrast, APIReadVmInstanceMetadataMsg in the same module uses the pattern /vm-instances/{vmUuid}/metadata.

Per REST conventions and the project coding guidelines, resource identifiers for GET requests should normally be passed as path variables rather than in the request body. Consider adjusting to:

♻️ Suggested path change
 `@RestRequest`(
-        path = "/primary-storage/vm-instances/metadata",
+        path = "/primary-storage/{uuid}/vm-instances/metadata",
         method = HttpMethod.GET,
         responseClass = APIGetVmInstanceMetadataFromPrimaryStorageReply.class
 )

Please confirm whether the current path design is intentional (e.g., to support batch queries of metadata across multiple primary storages). If this queries a single primary storage, prefer the path-variable approach.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/storage/primary/APIGetVmInstanceMetadataFromPrimaryStorageMsg.java`
around lines 10 - 18, The REST path currently on class
APIGetVmInstanceMetadataFromPrimaryStorageMsg lacks a path variable for the
target primary storage UUID; update the `@RestRequest` path to include a path
variable (e.g. "/primary-storage/{uuid}/vm-instances/metadata") and ensure the
existing `@APIParam` field uuid is used as that path variable (adjust any
annotations or parameter binding if your framework requires explicit
path-to-field binding). If the API is intended for batch queries, leave the path
and document the behavior; otherwise switch to the single-resource path variable
approach and update related clients/tests that call this endpoint.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@header/src/main/java/org/zstack/header/image/APICreateRootVolumeTemplateFromRootVolumeMsg.java`:
- Line 24: The MetadataImpact on APICreateRootVolumeTemplateFromRootVolumeMsg is
incorrectly set to NONE; update the annotation on class
APICreateRootVolumeTemplateFromRootVolumeMsg to the correct impact (e.g.,
MetadataImpact.Impact.WRITE or the enum constant used for
storage/template-affecting operations) so metadata dirty/refresh flows run;
mirror the value used by other template/storage-related messages to ensure
consistency.

In
`@header/src/main/java/org/zstack/header/storage/snapshot/APIBackupVolumeSnapshotMsg.java`:
- Line 36: The MetadataImpact annotation on APIBackupVolumeSnapshotMsg is
mis-classified: change `@MetadataImpact`(MetadataImpact.Impact.NONE) to
`@MetadataImpact`(MetadataImpact.Impact.STORAGE); locate the APIBackupVolumeSnapshotMsg
class and replace the annotation value to stay consistent with other
snapshot-related APIs, so the metadata refresh/interception logic correctly
treats it as a STORAGE impact.

In
`@header/src/main/java/org/zstack/header/storage/snapshot/APIBatchDeleteVolumeSnapshotMsg.java`:
- Line 29: The `@MetadataImpact`(MetadataImpact.Impact.NONE) annotation currently
on APIBatchDeleteVolumeSnapshotMsg is mis-classified; it should be STORAGE,
reflecting that the operation triggers storage metadata changes. Change the
MetadataImpact annotation on APIBatchDeleteVolumeSnapshotMsg to
MetadataImpact.Impact.STORAGE (locators: APIBatchDeleteVolumeSnapshotMsg, the
MetadataImpact annotation), then save and run the relevant unit/integration
tests to confirm the snapshot-deletion flow triggers storage metadata updates.

---

Outside diff comments:
In
`@header/src/main/java/org/zstack/header/network/l3/APIGetL3NetworkRouterInterfaceIpMsg.java`:
- Around line 28-33: The __example__ method in class
APIGetL3NetworkRouterInterfaceIpMsg incorrectly declares and constructs
APIGetL3NetworkMtuMsg (copy/paste error); update the method signature return
type and the instantiated object to APIGetL3NetworkRouterInterfaceIpMsg and keep
setting the L3 network UUID via setL3NetworkUuid(uuid()) so the example returns
the correct message type.

---

Nitpick comments:
In `@core/src/main/java/org/zstack/core/debug/APICleanQueueMsg.java`:
- Line 33: The field name managementiUuid is misspelled and should be renamed to
managementUuid; rename the field in APICleanQueueMsg and update all related
accessors and references (getters/setters, constructors,
serialization/deserialization, JSON mappings, database column mappings, and
references from other classes) to avoid compile or runtime errors; if external
serialization formats must stay compatible, keep a backward-compatibility shim
(e.g., `@JsonProperty` or a temporary bridge method) until external consumers
have migrated.

In
`@header/src/main/java/org/zstack/header/network/l2/APICreateL2NetworkMsg.java`:
- Around line 47-49: Remove the extra blank line between the `@MetadataImpact`
annotation and the class declaration in APICreateL2NetworkMsg so the annotation
sits immediately above the class definition (i.e., ensure `@MetadataImpact`(...)
is directly followed by "public abstract class APICreateL2NetworkMsg ..."),
matching the formatting used in files like APIDeleteL2NetworkMsg.java.

In
`@header/src/main/java/org/zstack/header/network/service/APIDetachNetworkServiceProviderFromL2NetworkMsg.java`:
- Around line 7-11: Remove the extra blank lines around the `@MetadataImpact`
annotation: in APIDetachNetworkServiceProviderFromL2NetworkMsg, place the
import statement import org.zstack.header.vm.MetadataImpact;, the annotation
`@MetadataImpact`(MetadataImpact.Impact.NONE), and the class declaration public
class APIDetachNetworkServiceProviderFromL2NetworkMsg extends APIMessage on
consecutive lines, ensuring the annotation sits directly against the import and
class declaration to match the formatting of other files in the repository.

In
`@header/src/main/java/org/zstack/header/storage/backup/APIScanBackupStorageMsg.java`:
- Around line 21-24: The __example__ method in APIScanBackupStorageMsg returns
an instance without setting example data; update
APIScanBackupStorageMsg.__example__ to populate the backupStorageUuid field with
a sample UUID (e.g., a valid UUID string) on the msg instance before returning
so the API example includes a concrete backupStorageUuid value.

In
`@header/src/main/java/org/zstack/header/storage/primary/APIGetPrimaryStorageUsageReportMsg.java`:
- Around line 8-9: The import order is inconsistent: move the new
org.zstack.header.vm.MetadataImpact import to the group with the other
org.zstack imports in APIGetPrimaryStorageUsageReportMsg so all org.zstack.*
imports (e.g., existing imports on lines 4-6) are together and java.util.List
remains in the standard java.* group; update the import section to follow the
project's grouping (java.*, javax.*, third-party, org.zstack.*) and keep
MetadataImpact adjacent to the other org.zstack imports.

In
`@header/src/main/java/org/zstack/header/storage/primary/APIGetVmInstanceMetadataFromPrimaryStorageMsg.java`:
- Around line 10-18: The REST path currently on class
APIGetVmInstanceMetadataFromPrimaryStorageMsg lacks a path variable for the
target primary storage UUID; update the `@RestRequest` path to include a path
variable (e.g. "/primary-storage/{uuid}/vm-instances/metadata") and ensure the
existing `@APIParam` field uuid is used as that path variable (adjust any
annotations or parameter binding if your framework requires explicit
path-to-field binding). If the API is intended for batch queries, leave the path
and document the behavior; otherwise switch to the single-resource path variable
approach and update related clients/tests that call this endpoint.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f0d2ca25-1916-4b1b-a8bd-b02fe287e308

📥 Commits

Reviewing files that changed from the base of the PR and between f12598a and 1cd28ae.

📒 Files selected for processing (242)
  • core/src/main/java/org/zstack/core/captcha/APIRefreshCaptchaMsg.java
  • core/src/main/java/org/zstack/core/config/APIGetGlobalConfigOptionsMsg.java
  • core/src/main/java/org/zstack/core/config/APIQueryGlobalConfigMsg.java
  • core/src/main/java/org/zstack/core/config/APIResetGlobalConfigMsg.java
  • core/src/main/java/org/zstack/core/config/APIUpdateGlobalConfigMsg.java
  • core/src/main/java/org/zstack/core/debug/APICleanQueueMsg.java
  • core/src/main/java/org/zstack/core/debug/APIDebugSignalMsg.java
  • core/src/main/java/org/zstack/core/debug/APIGetDebugSignalMsg.java
  • core/src/main/java/org/zstack/core/errorcode/APICheckElaborationContentMsg.java
  • core/src/main/java/org/zstack/core/errorcode/APIGetElaborationCategoriesMsg.java
  • core/src/main/java/org/zstack/core/errorcode/APIGetElaborationsMsg.java
  • core/src/main/java/org/zstack/core/errorcode/APIReloadElaborationMsg.java
  • core/src/main/java/org/zstack/core/eventlog/APIQueryEventLogMsg.java
  • core/src/main/java/org/zstack/core/gc/APIDeleteGCJobMsg.java
  • core/src/main/java/org/zstack/core/gc/APIQueryGCJobMsg.java
  • core/src/main/java/org/zstack/core/gc/APITriggerGCJobMsg.java
  • header/src/main/java/org/zstack/header/APIIsOpensourceVersionMsg.java
  • header/src/main/java/org/zstack/header/allocator/APIGetCpuMemoryCapacityMsg.java
  • header/src/main/java/org/zstack/header/allocator/APIGetHostAllocatorStrategiesMsg.java
  • header/src/main/java/org/zstack/header/apimediator/APIIsReadyToGoMsg.java
  • header/src/main/java/org/zstack/header/cluster/APIChangeClusterStateMsg.java
  • header/src/main/java/org/zstack/header/cluster/APICreateClusterMsg.java
  • header/src/main/java/org/zstack/header/cluster/APIDeleteClusterMsg.java
  • header/src/main/java/org/zstack/header/cluster/APIQueryClusterMsg.java
  • header/src/main/java/org/zstack/header/cluster/APIUpdateClusterMsg.java
  • header/src/main/java/org/zstack/header/cluster/APIUpdateClusterOSMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIChangeDiskOfferingStateMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIChangeInstanceOfferingStateMsg.java
  • header/src/main/java/org/zstack/header/configuration/APICreateDiskOfferingMsg.java
  • header/src/main/java/org/zstack/header/configuration/APICreateInstanceOfferingMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIDeleteDiskOfferingMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIDeleteInstanceOfferingMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIGenerateApiJsonTemplateMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIGenerateApiTypeScriptDefinitionMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIGenerateGroovyClassMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIGenerateSqlForeignKeyMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIGenerateSqlIndexMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIGenerateSqlVOViewMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIGenerateTestLinkDocumentMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIGetGlobalPropertyMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIQueryDiskOfferingMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIQueryInstanceOfferingMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIUpdateDiskOfferingMsg.java
  • header/src/main/java/org/zstack/header/configuration/APIUpdateInstanceOfferingMsg.java
  • header/src/main/java/org/zstack/header/console/APIQueryConsoleProxyAgentMsg.java
  • header/src/main/java/org/zstack/header/console/APIReconnectConsoleProxyAgentMsg.java
  • header/src/main/java/org/zstack/header/console/APIRequestConsoleAccessMsg.java
  • header/src/main/java/org/zstack/header/console/APIUpdateConsoleProxyAgentMsg.java
  • header/src/main/java/org/zstack/header/core/APIGetChainTaskMsg.java
  • header/src/main/java/org/zstack/header/core/encrypt/APIGetEncryptedFieldMsg.java
  • header/src/main/java/org/zstack/header/core/encrypt/APIUpdateEncryptKeyMsg.java
  • header/src/main/java/org/zstack/header/core/external/service/APIGetExternalServicesMsg.java
  • header/src/main/java/org/zstack/header/core/external/service/APIReloadExternalServiceMsg.java
  • header/src/main/java/org/zstack/header/core/progress/APIGetTaskProgressMsg.java
  • header/src/main/java/org/zstack/header/core/webhooks/APICreateWebhookMsg.java
  • header/src/main/java/org/zstack/header/core/webhooks/APIDeleteWebhookMsg.java
  • header/src/main/java/org/zstack/header/core/webhooks/APIQueryWebhookMsg.java
  • header/src/main/java/org/zstack/header/core/webhooks/APIUpdateWebhookMsg.java
  • header/src/main/java/org/zstack/header/host/APIAddHostMsg.java
  • header/src/main/java/org/zstack/header/host/APIChangeHostStateMsg.java
  • header/src/main/java/org/zstack/header/host/APIDeleteHostMsg.java
  • header/src/main/java/org/zstack/header/host/APIGetHostBlockDevicesMsg.java
  • header/src/main/java/org/zstack/header/host/APIGetHostPowerStatusMsg.java
  • header/src/main/java/org/zstack/header/host/APIGetHostSensorsMsg.java
  • header/src/main/java/org/zstack/header/host/APIGetHostTaskMsg.java
  • header/src/main/java/org/zstack/header/host/APIGetHostWebSshUrlMsg.java
  • header/src/main/java/org/zstack/header/host/APIGetHypervisorTypesMsg.java
  • header/src/main/java/org/zstack/header/host/APIGetPhysicalMachineBlockDevicesMsg.java
  • header/src/main/java/org/zstack/header/host/APIMountBlockDeviceMsg.java
  • header/src/main/java/org/zstack/header/host/APIPowerOnHostMsg.java
  • header/src/main/java/org/zstack/header/host/APIPowerResetHostMsg.java
  • header/src/main/java/org/zstack/header/host/APIQueryHostMsg.java
  • header/src/main/java/org/zstack/header/host/APIReconnectHostMsg.java
  • header/src/main/java/org/zstack/header/host/APIShutdownHostMsg.java
  • header/src/main/java/org/zstack/header/host/APIUpdateHostIpmiMsg.java
  • header/src/main/java/org/zstack/header/host/APIUpdateHostMsg.java
  • header/src/main/java/org/zstack/header/host/APIUpdateHostNqnMsg.java
  • header/src/main/java/org/zstack/header/host/APIUpdateHostnameMsg.java
  • header/src/main/java/org/zstack/header/identity/APIChangeResourceOwnerMsg.java
  • header/src/main/java/org/zstack/header/identity/APICreateAccountMsg.java
  • header/src/main/java/org/zstack/header/identity/APIDeleteAccountMsg.java
  • header/src/main/java/org/zstack/header/identity/APIGetAccountQuotaUsageMsg.java
  • header/src/main/java/org/zstack/header/identity/APIGetResourceAccountMsg.java
  • header/src/main/java/org/zstack/header/identity/APILogInByAccountMsg.java
  • header/src/main/java/org/zstack/header/identity/APILogOutMsg.java
  • header/src/main/java/org/zstack/header/identity/APIQueryAccountMsg.java
  • header/src/main/java/org/zstack/header/identity/APIQueryAccountResourceRefMsg.java
  • header/src/main/java/org/zstack/header/identity/APIQueryQuotaMsg.java
  • header/src/main/java/org/zstack/header/identity/APIRenewSessionMsg.java
  • header/src/main/java/org/zstack/header/identity/APIRevokeResourceSharingMsg.java
  • header/src/main/java/org/zstack/header/identity/APIShareResourceMsg.java
  • header/src/main/java/org/zstack/header/identity/APIUpdateAccountMsg.java
  • header/src/main/java/org/zstack/header/identity/APIUpdateQuotaMsg.java
  • header/src/main/java/org/zstack/header/identity/APIValidateSessionMsg.java
  • header/src/main/java/org/zstack/header/identity/login/APIGetLoginProceduresMsg.java
  • header/src/main/java/org/zstack/header/identity/login/APILogInMsg.java
  • header/src/main/java/org/zstack/header/identity/role/api/APIAttachRoleToAccountMsg.java
  • header/src/main/java/org/zstack/header/identity/role/api/APICreateRoleMsg.java
  • header/src/main/java/org/zstack/header/identity/role/api/APIDeleteRoleMsg.java
  • header/src/main/java/org/zstack/header/identity/role/api/APIDetachRoleFromAccountMsg.java
  • header/src/main/java/org/zstack/header/identity/role/api/APIGetRolePolicyActionsMsg.java
  • header/src/main/java/org/zstack/header/identity/role/api/APIQueryRoleAccountRefMsg.java
  • header/src/main/java/org/zstack/header/identity/role/api/APIQueryRoleMsg.java
  • header/src/main/java/org/zstack/header/identity/role/api/APIUpdateRoleMsg.java
  • header/src/main/java/org/zstack/header/image/APIAddImageMsg.java
  • header/src/main/java/org/zstack/header/image/APICalculateImageHashMsg.java
  • header/src/main/java/org/zstack/header/image/APIChangeImageStateMsg.java
  • header/src/main/java/org/zstack/header/image/APICreateDataVolumeTemplateFromVolumeMsg.java
  • header/src/main/java/org/zstack/header/image/APICreateDataVolumeTemplateFromVolumeSnapshotMsg.java
  • header/src/main/java/org/zstack/header/image/APICreateRootVolumeTemplateFromRootVolumeMsg.java
  • header/src/main/java/org/zstack/header/image/APICreateRootVolumeTemplateFromVolumeSnapshotMsg.java
  • header/src/main/java/org/zstack/header/image/APIDeleteImageMsg.java
  • header/src/main/java/org/zstack/header/image/APIExpungeImageMsg.java
  • header/src/main/java/org/zstack/header/image/APIGetCandidateBackupStorageForCreatingImageMsg.java
  • header/src/main/java/org/zstack/header/image/APIGetCandidateImagesForCreatingVmMsg.java
  • header/src/main/java/org/zstack/header/image/APIGetUploadImageJobDetailsMsg.java
  • header/src/main/java/org/zstack/header/image/APIQueryImageMsg.java
  • header/src/main/java/org/zstack/header/image/APIRecoverImageMsg.java
  • header/src/main/java/org/zstack/header/image/APISetImageBootModeMsg.java
  • header/src/main/java/org/zstack/header/image/APISyncImageSizeMsg.java
  • header/src/main/java/org/zstack/header/image/APIUpdateImageMsg.java
  • header/src/main/java/org/zstack/header/longjob/APICancelLongJobMsg.java
  • header/src/main/java/org/zstack/header/longjob/APICleanLongJobMsg.java
  • header/src/main/java/org/zstack/header/longjob/APIDeleteLongJobMsg.java
  • header/src/main/java/org/zstack/header/longjob/APIQueryLongJobMsg.java
  • header/src/main/java/org/zstack/header/longjob/APIRerunLongJobMsg.java
  • header/src/main/java/org/zstack/header/longjob/APIResumeLongJobMsg.java
  • header/src/main/java/org/zstack/header/longjob/APISubmitLongJobMsg.java
  • header/src/main/java/org/zstack/header/longjob/APIUpdateLongJobMsg.java
  • header/src/main/java/org/zstack/header/managementnode/APIGetCurrentTimeMsg.java
  • header/src/main/java/org/zstack/header/managementnode/APIGetManagementNodeArchMsg.java
  • header/src/main/java/org/zstack/header/managementnode/APIGetManagementNodeOSMsg.java
  • header/src/main/java/org/zstack/header/managementnode/APIGetPlatformTimeZoneMsg.java
  • header/src/main/java/org/zstack/header/managementnode/APIGetSupportAPIsMsg.java
  • header/src/main/java/org/zstack/header/managementnode/APIGetVersionMsg.java
  • header/src/main/java/org/zstack/header/managementnode/APIQueryManagementNodeMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIAttachL2NetworkToClusterMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIAttachL2NetworkToHostMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APICreateL2NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APICreateL2NoVlanNetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APICreateL2VlanNetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIDeleteL2NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIDetachL2NetworkFromClusterMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIDetachL2NetworkFromHostMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIGetCandidateClustersForAttachingL2NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIGetCandidateL2NetworksForAttachingClusterMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIGetL2NetworkTypesMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIGetVSwitchTypesMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIQueryL2NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIQueryL2VlanNetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIUpdateL2NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l2/APIUpdateL2NetworkVirtualNetworkIdMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIAddDnsToL3NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIAddHostRouteToL3NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIAddIpRangeByNetworkCidrMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIAddIpRangeMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIAddIpv6RangeByNetworkCidrMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIAddIpv6RangeMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIAddReservedIpRangeMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIChangeL3NetworkStateMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APICheckIpAvailabilityMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APICreateL3NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIDeleteIpAddressMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIDeleteIpRangeMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIDeleteL3NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIDeleteReservedIpRangeMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIGetFreeIpMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIGetIpAddressCapacityMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIGetL3NetworkMtuMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIGetL3NetworkRouterInterfaceIpMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIGetL3NetworkTypesMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIQueryAddressPoolMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIQueryIpAddressMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIQueryIpRangeMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIQueryL3NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIRemoveDnsFromL3NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIRemoveHostRouteFromL3NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APISetL3NetworkMtuMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APISetL3NetworkRouterInterfaceIpMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIUpdateIpRangeMsg.java
  • header/src/main/java/org/zstack/header/network/l3/APIUpdateL3NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/service/APIAddNetworkServiceProviderMsg.java
  • header/src/main/java/org/zstack/header/network/service/APIAttachNetworkServiceProviderToL2NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/service/APIAttachNetworkServiceToL3NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/service/APIDetachNetworkServiceFromL3NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/service/APIDetachNetworkServiceProviderFromL2NetworkMsg.java
  • header/src/main/java/org/zstack/header/network/service/APIGetNetworkServiceTypesMsg.java
  • header/src/main/java/org/zstack/header/network/service/APIQueryNetworkServiceL3NetworkRefMsg.java
  • header/src/main/java/org/zstack/header/network/service/APIQueryNetworkServiceProviderMsg.java
  • header/src/main/java/org/zstack/header/query/APIGenerateInventoryQueryDetailsMsg.java
  • header/src/main/java/org/zstack/header/query/APIGenerateQueryableFieldsMsg.java
  • header/src/main/java/org/zstack/header/resourceattribute/api/APICreateResourceAttributeKeyMsg.java
  • header/src/main/java/org/zstack/header/resourceattribute/api/APICreateResourceAttributeValueMsg.java
  • header/src/main/java/org/zstack/header/resourceattribute/api/APIDeleteResourceAttributeKeyMsg.java
  • header/src/main/java/org/zstack/header/resourceattribute/api/APIDeleteResourceAttributeValueMsg.java
  • header/src/main/java/org/zstack/header/resourceattribute/api/APIQueryResourceAttributeKeyMsg.java
  • header/src/main/java/org/zstack/header/resourceattribute/api/APIQueryResourceAttributeValueMsg.java
  • header/src/main/java/org/zstack/header/resourceattribute/api/APIUpdateResourceAttributeKeyMsg.java
  • header/src/main/java/org/zstack/header/search/APICreateSearchIndexMsg.java
  • header/src/main/java/org/zstack/header/search/APIDeleteSearchIndexMsg.java
  • header/src/main/java/org/zstack/header/storage/addon/backup/APIAddExternalBackupStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/addon/primary/APIAddExternalPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/addon/primary/APIDiscoverExternalPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/addon/primary/APIUpdateExternalPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIAddBackupStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIAttachBackupStorageToZoneMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIChangeBackupStorageStateMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APICleanUpTrashOnBackupStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIDeleteBackupStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIDeleteExportedImageFromBackupStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIDetachBackupStorageFromZoneMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIGetBackupStorageCapacityMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIGetBackupStorageTypesMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIGetTrashOnBackupStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIQueryBackupStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIReconnectBackupStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIScanBackupStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/backup/APIUpdateBackupStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIAddPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIAddStorageProtocolMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIAttachPrimaryStorageToClusterMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIChangePrimaryStorageStateMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APICleanUpImageCacheOnPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APICleanUpStorageTrashOnPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APICleanUpTrashOnPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIDeletePrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIDetachPrimaryStorageFromClusterMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIGetPrimaryStorageAllocatorStrategiesMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIGetPrimaryStorageCapacityMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIGetPrimaryStorageLicenseInfoMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIGetPrimaryStorageTypesMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIGetPrimaryStorageUsageReportMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIGetTrashOnPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIGetVmInstanceMetadataFromPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIQueryImageCacheMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIQueryPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIReconnectPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIRegisterVmInstanceMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APISyncPrimaryStorageCapacityMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIUpdatePrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/snapshot/APIBackupVolumeSnapshotMsg.java
  • header/src/main/java/org/zstack/header/storage/snapshot/APIBatchDeleteVolumeSnapshotMsg.java
🚧 Files skipped from review as they are similar to previous changes (1)
  • header/src/main/java/org/zstack/header/storage/primary/APIRegisterVmInstanceMsg.java

)
@TagResourceType(ImageVO.class)
@DefaultTimeout(timeunit = TimeUnit.HOURS, value = 72)
@MetadataImpact(MetadataImpact.Impact.NONE)

⚠️ Potential issue | 🟠 Major

The MetadataImpact classification looks wrong.

APICreateRootVolumeTemplateFromRootVolumeMsg creates a template from a root volume and involves writes to backup storage; under this PR's classification scheme, template/storage-related APIs like this one should not be marked NONE. If it stays NONE, any metadata dirty-marking or flush chain keyed off this annotation will be skipped downstream.

Suggested fix
-@MetadataImpact(MetadataImpact.Impact.NONE)
+@MetadataImpact(MetadataImpact.Impact.STORAGE)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@MetadataImpact(MetadataImpact.Impact.NONE)
@MetadataImpact(MetadataImpact.Impact.STORAGE)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/image/APICreateRootVolumeTemplateFromRootVolumeMsg.java`
at line 24, The MetadataImpact on APICreateRootVolumeTemplateFromRootVolumeMsg
is incorrectly set to NONE; update the annotation on class
APICreateRootVolumeTemplateFromRootVolumeMsg to the correct impact (e.g.,
MetadataImpact.Impact.WRITE or the enum constant used for
storage/template-affecting operations) so metadata dirty/refresh flows run;
mirror the value used by other template/storage-related messages to ensure
consistency.
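The annotation-driven classification discussed above boils down to a runtime-retained annotation plus a reflective lookup in the interceptor. A minimal sketch (not the ZStack implementation; the names here are simplified stand-ins):

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Minimal sketch of a MetadataImpact-style annotation and the reflective
// lookup an interceptor could use to decide whether an API message should
// mark VM metadata dirty. Unannotated messages default to NONE, which is
// why a wrong NONE silently disables the dirty/flush chain.
public class MetadataImpactSketch {
    public enum Impact { NONE, CONFIG, STORAGE }

    @Retention(RetentionPolicy.RUNTIME)
    public @interface MetadataImpact {
        Impact value();
    }

    @MetadataImpact(Impact.STORAGE)
    static class APICreateTemplateMsg {}

    @MetadataImpact(Impact.NONE)
    static class APIQueryMsg {}

    // Returns the declared impact, defaulting to NONE for unannotated messages.
    public static Impact impactOf(Class<?> msgClass) {
        MetadataImpact a = msgClass.getAnnotation(MetadataImpact.class);
        return a == null ? Impact.NONE : a.value();
    }
}
```

With this shape, a CI check can walk all APIMessage subclasses and fail the build when the annotation is absent, as the PR description requires.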

* @since 0.1.0
*/

@MetadataImpact(MetadataImpact.Impact.NONE)

⚠️ Potential issue | 🟠 Major

The MetadataImpact classification appears mislabeled.

This API is a storage operation on volume snapshots, and the PR goals already classify snapshot CRUD as STORAGE. Marking it Impact.NONE makes any annotation-driven metadata refresh/interception logic treat it as having no effect. Change it to @MetadataImpact(MetadataImpact.Impact.STORAGE) to stay consistent with the other snapshot APIs.

Suggested fix
-@MetadataImpact(MetadataImpact.Impact.NONE)
+@MetadataImpact(MetadataImpact.Impact.STORAGE)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/storage/snapshot/APIBackupVolumeSnapshotMsg.java`
at line 36, the class-level MetadataImpact annotation is wrong: on
APIBackupVolumeSnapshotMsg change @MetadataImpact(MetadataImpact.Impact.NONE) to
@MetadataImpact(MetadataImpact.Impact.STORAGE); locate the APIBackupVolumeSnapshotMsg
class and replace the annotation value so it stays consistent with the other
snapshot-related APIs and the metadata refresh/interception logic correctly
recognizes the STORAGE impact.

responseClass = APIBatchDeleteVolumeSnapshotEvent.class
)
@DefaultTimeout(timeunit = TimeUnit.HOURS, value = 6)
@MetadataImpact(MetadataImpact.Impact.NONE)

⚠️ Potential issue | 🟠 Major

The MetadataImpact classification is likely wrong.

Per this PR's commit message, snapshot CRUD is classified as STORAGE. Marking batch volume-snapshot deletion as Impact.NONE lets this delete path skip the storage-metadata dirty marking, so VM metadata may never be refreshed after snapshots are deleted.

Suggested fix
-@MetadataImpact(MetadataImpact.Impact.NONE)
+@MetadataImpact(MetadataImpact.Impact.STORAGE)
📝 Committable suggestion


Suggested change
@MetadataImpact(MetadataImpact.Impact.NONE)
@MetadataImpact(MetadataImpact.Impact.STORAGE)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/storage/snapshot/APIBatchDeleteVolumeSnapshotMsg.java`
at line 29, the annotation @MetadataImpact(MetadataImpact.Impact.NONE) on
APIBatchDeleteVolumeSnapshotMsg is misclassified and should reflect that the
operation triggers storage metadata changes; change the MetadataImpact
annotation on APIBatchDeleteVolumeSnapshotMsg to MetadataImpact.Impact.STORAGE
(locators: APIBatchDeleteVolumeSnapshotMsg, the MetadataImpact annotation),
then save and run the relevant unit/integration tests to confirm the
snapshot-deletion flow triggers a storage metadata update.

Resolves: ZSV-10000

Change-Id: I746a6d7266686d637a6d76756861636a74766b74
@MatheMatrix MatheMatrix force-pushed the sync/tao.gan/res-vm-ZSV-10000@@4 branch from 1cd28ae to 8345c56 Compare March 12, 2026 07:52

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
header/src/main/java/org/zstack/header/vm/APIAttachVmNicToVmMsg.java (1)

8-13: ⚠️ Potential issue | 🟡 Minor

Typo in the REST path: "nices" should be "nics"

The path in this API definition uses "nices", but every other VM NIC API in the project consistently uses "nics" (e.g. /vm-instances/nics/{vmNicUuid}/security-groups/{securityGroupUuid} and /vm-instances/nics). The typo has already propagated into the generated SDK code and the Groovy docs. Use "nics" to conform to the project's API naming convention.

Suggested fix
 @RestRequest(
-        path = "/vm-instances/{vmInstanceUuid}/nices/{vmNicUuid}",
+        path = "/vm-instances/{vmInstanceUuid}/nics/{vmNicUuid}",
         method = HttpMethod.POST,
         responseClass = APIAttachVmNicToVmEvent.class,
         parameterName = "params"
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/APIAttachVmNicToVmMsg.java` around
lines 8 - 13, The REST path in the RestRequest annotation on
APIAttachVmNicToVmMsg uses the misspelled segment "nices" instead of "nics";
update the path value in the @RestRequest annotation on class APIAttachVmNicToVmMsg to use
"/vm-instances/{vmInstanceUuid}/nics/{vmNicUuid}" so it matches other VM NIC
APIs and regenerates SDK/docs consistently, then run any codegen/tests that
validate REST paths.
♻️ Duplicate comments (10)
header/src/main/java/org/zstack/header/vm/VmInstanceState.java (1)

33-34: ⚠️ Potential issue | 🔴 Critical

Registering needs its state-transition rules filled in.

Registering has only been added to the enum and to offlineStates; there is no corresponding .transactions(...). As soon as any flow calls nextState() while in that state, this will throw a CloudRuntimeException.

🔧 Suggested fix
         Crashed.transactions(
                 new Transaction(VmInstanceStateEvent.running, VmInstanceState.Running),
                 new Transaction(VmInstanceStateEvent.unknown, VmInstanceState.Unknown),
                 new Transaction(VmInstanceStateEvent.stopping, VmInstanceState.Stopping),
                 new Transaction(VmInstanceStateEvent.stopped, VmInstanceState.Stopped),
                 new Transaction(VmInstanceStateEvent.rebooting, VmInstanceState.Rebooting),
                 new Transaction(VmInstanceStateEvent.crashed, VmInstanceState.Crashed),
                 new Transaction(VmInstanceStateEvent.destroyed, VmInstanceState.Destroyed),
                 new Transaction(VmInstanceStateEvent.noState, VmInstanceState.NoState),
                 new Transaction(VmInstanceStateEvent.destroying, VmInstanceState.Destroying)
         );
+        Registering.transactions(
+                new Transaction(VmInstanceStateEvent.stopped, VmInstanceState.Stopped),
+                new Transaction(VmInstanceStateEvent.running, VmInstanceState.Running),
+                new Transaction(VmInstanceStateEvent.destroying, VmInstanceState.Destroying),
+                new Transaction(VmInstanceStateEvent.unknown, VmInstanceState.Unknown)
+        );
         NoState.transactions(
                 new Transaction(VmInstanceStateEvent.running, VmInstanceState.Running),

Also applies to: 56-56

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/VmInstanceState.java` around lines
33 - 34, The enum VmInstanceState defines a Registering value but lacks a
.transactions(...) entry, so calls to nextState() from Registering throw
CloudRuntimeException; add a transactions(...) block for Registering in
VmInstanceState (matching the pattern used for other states) that lists the
allowed VmInstanceStateEvent -> target state transitions (and mirror any
offlineStates inclusion) so nextState() can resolve valid transitions from
Registering; update any related offlineStates usage if necessary to keep
consistency with the new transactions entry.
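The failure mode above, an enum value left without wired transitions, can be reproduced with a minimal declarative state machine; the sketch below is simplified and does not use ZStack's actual Transaction type:

```java
import java.util.EnumMap;
import java.util.Map;

// Minimal sketch of a declarative state machine in the style of
// VmInstanceState: each state registers its allowed event -> next-state
// transitions up front, and nextState() throws for any state left unwired,
// which is exactly what happens to Registering without .transactions(...).
public class VmStateSketch {
    public enum Event { stopped, running, destroying }
    public enum State { Running, Stopped, Registering, Destroying }

    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    static void transactions(State from, Event e, State to) {
        TRANSITIONS.computeIfAbsent(from, k -> new EnumMap<>(Event.class)).put(e, to);
    }

    static {
        transactions(State.Stopped, Event.running, State.Running);
        transactions(State.Running, Event.stopped, State.Stopped);
        // The fix the review suggests: wire Registering explicitly.
        transactions(State.Registering, Event.stopped, State.Stopped);
        transactions(State.Registering, Event.destroying, State.Destroying);
    }

    public static State nextState(State current, Event e) {
        Map<Event, State> m = TRANSITIONS.get(current);
        if (m == null || !m.containsKey(e)) {
            throw new IllegalStateException("no transition from " + current + " on " + e);
        }
        return m.get(e);
    }
}
```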
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java (1)

1866-2130: ⚠️ Potential issue | 🔴 Critical

The VM registration flow still lacks atomicity.

The register-volume, register-snapshot, and register-vmInstance steps persist rows into multiple tables across several NoRollbackFlow stages; if any step fails, the data already written is not rolled back. A retry can easily leave orphaned snapshot/tree/group/volume records, or run straight into duplicate primary keys.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java`
around lines 1866 - 2130, The registerVmInstance flow writes to multiple tables
across the NoRollbackFlow steps ("register-volume", "register-snapshot",
"register-vmInstance") so failures leave partial data and orphaned records; fix
by making the DB writes atomic: either wrap all persistence into a single
transactional/rollbackable flow or add compensating rollback logic for each
step. Concretely, modify registerVmInstance to (a) replace the NoRollbackFlow
steps that call dbf.persist/dbf.persistCollection with a single Flow (or
transactional block) that uses dbf.startTransaction/commit/rollback (or a
RollbackFlow implementation) to persist volumes, snapshot trees, snapshots,
groups, refs and the VM together, or (b) implement rollback handlers for each
NoRollbackFlow that delete the entities created via dbf on failure; ensure the
FlowChain error handler triggers the transaction rollback or executes the
compensating deletes so no orphaned
VolumeVO/VolumeSnapshotVO/VolumeSnapshotTreeVO/VolumeSnapshotGroupVO/VolumeSnapshotGroupRefVO
or VmInstanceVO remain.
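Option (b) in the prompt above, compensating rollback per step, can be sketched without ZStack's flow framework; the Step interface below is a simplified stand-in for a rollbackable Flow:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Simplified compensating-rollback pattern: each persist step pushes an
// undo action as it completes; on failure all completed steps are undone
// in reverse order, so no orphaned rows survive a partial registration.
public class CompensatingFlow {
    public interface Step {
        void run() throws Exception;   // e.g. dbf.persist(volumeVO)
        void rollback();               // e.g. dbf.remove(volumeVO)
    }

    // Returns true if all steps committed; false after rolling back.
    public static boolean run(List<Step> steps) {
        Deque<Step> done = new ArrayDeque<>();
        for (Step s : steps) {
            try {
                s.run();
                done.push(s);
            } catch (Exception e) {
                while (!done.isEmpty()) {
                    done.pop().rollback();   // reverse order: vm -> snapshot -> volume
                }
                return false;
            }
        }
        return true;
    }
}
```

A single DB transaction (option (a)) is the cleaner fix when all writes go to one database; the compensating pattern matters when some steps touch external systems that cannot join the transaction.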
compute/src/main/java/org/zstack/compute/vm/metadata/SnapshotGroupBasedVmUuidFromApiResolver.java (1)

11-29: ⚠️ Potential issue | 🟡 Minor

The class Javadoc doesn't match the message types actually supported.

APICreateVolumeSnapshotGroupMsg, listed in the Javadoc, actually implements the CreateVolumeSnapshotGroupMessage interface, not VolumeSnapshotGroupMessage. So supports() never matches the create message, and the Javadoc is misleading.

Drop the Create case from the doc and list only the message types actually supported (Delete, Revert, Update, Ungroup, and other operations on an already-existing snapshot group).

📝 Suggested doc fix
 /**
  * Snapshot-group VM UUID resolver: reads the groupUuid from API messages implementing {@link VolumeSnapshotGroupMessage},
  * queries VolumeSnapshotGroupRefVO for all associated volumeUuids, then queries VolumeVO for the vmInstanceUuid.
  *
  * <p>Covered APIs:</p>
  * <ul>
  *   <li>{@code APIDeleteVolumeSnapshotGroupMsg} (STORAGE, updateOnFailure=true)</li>
- *   <li>{@code APICreateVolumeSnapshotGroupMsg} (STORAGE, updateOnFailure=true)</li>
+ *   <li>{@code APIRevertVmFromSnapshotGroupMsg} (STORAGE)</li>
+ *   <li>{@code APIUpdateVolumeSnapshotGroupMsg} (CONFIG)</li>
+ *   <li>other APIs implementing VolumeSnapshotGroupMessage</li>
  * </ul>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/SnapshotGroupBasedVmUuidFromApiResolver.java`
around lines 11 - 29, Update the class Javadoc of
SnapshotGroupBasedVmUuidFromApiResolver to reflect the actual supported API
message types used by its supports() method: remove
APICreateVolumeSnapshotGroupMsg (which implements
CreateVolumeSnapshotGroupMessage) from the listed scenarios and explicitly list
only the messages that implement VolumeSnapshotGroupMessage (e.g.,
APIDeleteVolumeSnapshotGroupMsg and other delete/revert/update/ungroup messages
that the supports() method matches), so the documentation is consistent with the
supports() logic in the class.
plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java (1)

1943-1946: ⚠️ Potential issue | 🔴 Critical

Don't advance the SyncTaskChain before the async write completes

Line 1945 only brings message dispatch under the PS-level concurrency control; the real backend.handle(...) still runs concurrently after chain.next(). And if doHandleUpdateMetadata() throws synchronously, the queue can get stuck. Move chain.next() into a single exit point covering success, failure, and synchronous exceptions.

🛠️ Suggested fix
@Override
public void run(SyncTaskChain chain) {
-    doHandleUpdateMetadata(msg);
-    chain.next();
+    try {
+        doHandleUpdateMetadata(msg, new NoErrorCompletion(msg) {
+            @Override
+            public void done() {
+                chain.next();
+            }
+        });
+    } catch (Exception e) {
+        chain.next();
+        throw e;
+    }
}
private void doHandleUpdateMetadata(UpdateVmInstanceMetadataOnPrimaryStorageMsg msg, NoErrorCompletion completion) {
    ...
    backend.handle(msg, hostUuid, new ReturnValueCompletion<UpdateVmInstanceMetadataOnPrimaryStorageReply>(msg) {
        @Override
        public void success(UpdateVmInstanceMetadataOnPrimaryStorageReply r) {
            bus.reply(msg, r);
            completion.done();
        }

        @Override
        public void fail(ErrorCode errorCode) {
            reply.setError(errorCode);
            bus.reply(msg, reply);
            completion.done();
        }
    });
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java`
around lines 1943 - 1946, The current SyncTaskChain task calls
doHandleUpdateMetadata(msg) and immediately calls chain.next(), which allows
backend.handle(...) to run concurrently and can leave the chain stuck if
doHandleUpdateMetadata throws synchronously; modify the task to not call
chain.next() immediately but instead wrap doHandleUpdateMetadata to accept a
NoErrorCompletion (or similar) and call chain.next() only from the unified
completion/failure path (i.e., after backend.handle(...) invokes its
ReturnValueCompletion.success or .fail, and also in the catch block for any
synchronous exception), so move the chain.next() into that single
success/fail/exception handler and wire doHandleUpdateMetadata to reply on
success/fail before calling completion.done()/chain.next().
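The "single exit point" pattern the reviewer asks for (here and in the two sibling findings below) can be sketched without ZStack's types; SyncTaskChain and Completion below are simplified stand-ins, and the synchronous call to done() stands in for the async backend callback:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the fix: chain.next() is called exactly once, from a
// unified completion path, whether the backend reports success, failure, or
// the handler throws synchronously before the async call is even issued.
public class SerializedFlushSketch {
    public interface SyncTaskChain { void next(); }
    public interface Completion { void done(); }

    public static void runTask(SyncTaskChain chain, Runnable asyncWork) {
        Completion once = new Completion() {
            final AtomicInteger calls = new AtomicInteger();
            public void done() {
                if (calls.incrementAndGet() == 1) {
                    chain.next();   // release the per-VM serial slot only now
                }
            }
        };
        try {
            asyncWork.run();        // stands in for backend.handle(msg, ..., completion)
            once.done();            // here: pretend the async callback fired
        } catch (RuntimeException e) {
            once.done();            // a synchronous failure must also release the chain
            throw e;
        }
    }
}
```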
plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageBase.java (1)

3349-3352: ⚠️ Potential issue | 🔴 Critical

Don't release the concurrency slot before the async write completes.

Line 3351 only rate-limits message dispatch; the actual metadata write in bkd.handle(...) keeps running concurrently after chain.next(). And if doHandleUpdateMetadata() throws synchronously, this chain gets stuck. Move chain.next() into a single exit point covering success, failure, and synchronous exceptions.

🛠️ Suggested fix
@Override
public void run(SyncTaskChain chain) {
-    doHandleUpdateMetadata(msg);
-    chain.next();
+    try {
+        doHandleUpdateMetadata(msg, new NoErrorCompletion(msg) {
+            @Override
+            public void done() {
+                chain.next();
+            }
+        });
+    } catch (Exception e) {
+        chain.next();
+        throw e;
+    }
}
private void doHandleUpdateMetadata(UpdateVmInstanceMetadataOnPrimaryStorageMsg msg, NoErrorCompletion completion) {
    ...
    bkd.handle(msg, hostUuid, new ReturnValueCompletion<UpdateVmInstanceMetadataOnPrimaryStorageReply>(msg) {
        @Override
        public void success(UpdateVmInstanceMetadataOnPrimaryStorageReply returnValue) {
            bus.reply(msg, returnValue);
            completion.done();
        }

        @Override
        public void fail(ErrorCode errorCode) {
            UpdateVmInstanceMetadataOnPrimaryStorageReply reply = new UpdateVmInstanceMetadataOnPrimaryStorageReply();
            reply.setError(errorCode);
            bus.reply(msg, reply);
            completion.done();
        }
    });
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageBase.java`
around lines 3349 - 3352, run(SyncTaskChain) currently calls chain.next() right
after doHandleUpdateMetadata(...), releasing the concurrency slot before the
async write completes, and a synchronous exception from doHandleUpdateMetadata
would wedge the chain; move chain.next() into the async callbacks inside
doHandleUpdateMetadata (the success and fail paths of bkd.handle(...)) plus a
unified handler for synchronous exceptions (e.g. a finally block or the
done/complete path of a NoErrorCompletion/ReturnValueCompletion), so the slot
is released only after processing truly finishes, whether it succeeds, fails,
or throws; reference method names: run(SyncTaskChain),
doHandleUpdateMetadata(...), bkd.handle(...).
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java (1)

9413-9416: ⚠️ Potential issue | 🔴 Critical

Don't call chain.next() before the primary-storage callback

Line 9416 now releases the serial chain before bus.send() completes; subsequent flushes for the same VM can still enter concurrently, giving an old payload a chance to land after a newer one.

🐛 Suggested fix
             @Override
             public void run(SyncTaskChain chain) {
-                doHandleUpdateVmInstanceMetadata(msg);
-                chain.next();
+                doHandleUpdateVmInstanceMetadata(msg, chain);
             }
@@
-    private void doHandleUpdateVmInstanceMetadata(UpdateVmInstanceMetadataMsg msg) {
+    private void doHandleUpdateVmInstanceMetadata(UpdateVmInstanceMetadataMsg msg, SyncTaskChain chain) {
@@
         if (payloadSize > org.zstack.compute.vm.metadata.VmMetadataBuilder.REJECT_THRESHOLD) {
@@
             reply.setError(Platform.operr("metadata payload too large (%d bytes, limit %d) for vm[uuid=%s]",
                     payloadSize, org.zstack.compute.vm.metadata.VmMetadataBuilder.REJECT_THRESHOLD, msg.getUuid()));
             bus.reply(msg, reply);
+            chain.next();
             return;
         }
@@
         bus.send(umsg, new CloudBusCallBack(msg) {
             @Override
             public void run(MessageReply r) {
                 UpdateVmInstanceMetadataOnPrimaryStorageReply reply = new UpdateVmInstanceMetadataOnPrimaryStorageReply();
@@
                 }
                 bus.reply(msg, reply);
+                chain.next();
             }
         });
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java` around lines
9413 - 9416, The run(SyncTaskChain) handler currently calls chain.next()
immediately after doHandleUpdateVmInstanceMetadata(msg), releasing the serial
chain before the async main-storage/bus.send() work completes; remove the
premature chain.next() and instead invoke chain.next() only after the
asynchronous operation completes (e.g. in the callback/response handler inside
doHandleUpdateVmInstanceMetadata or the bus.send completion/main-storage
callback) so subsequent flushes for the same VM cannot run concurrently and old
payloads cannot overtake newer ones.
header/src/main/java/org/zstack/header/vm/APIRegisterVmInstanceFromMetadataEvent.java (1)

8-11: ⚠️ Potential issue | 🟠 Major

The warnings field no longer appears in the REST response.

Line 8 restricts the response to inventory, so the warnings field on Line 11 is dropped from both REST serialization and the example docs; the current __example__() doesn't populate it either.

Drop-in fix
-@RestResponse(allTo = "inventory")
+@RestResponse(fieldsTo = {"all"})
 public class APIRegisterVmInstanceFromMetadataEvent extends APIEvent {
@@
     public static APIRegisterVmInstanceFromMetadataEvent __example__() {
         APIRegisterVmInstanceFromMetadataEvent evt = new APIRegisterVmInstanceFromMetadataEvent();
         VmInstanceInventory vm = new VmInstanceInventory();
         vm.setUuid(uuid());
         vm.setName("recovered-vm");
         evt.setInventory(vm);
+        evt.setWarnings(java.util.Collections.emptyList());
         return evt;
     }
 }

Also applies to: 37-43

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/vm/APIRegisterVmInstanceFromMetadataEvent.java`
around lines 8 - 11, The @RestResponse(allTo = "inventory") annotation on class
APIRegisterVmInstanceFromMetadataEvent excludes the warnings field from REST
serialization and the generated example; update the annotation to include
warnings (e.g., add "warnings" to allTo or remove the allTo restriction) so
warnings are returned, and also populate the warnings list inside the class's
__example__() method; apply the same fix for the duplicate class variant
referenced around lines 37-43 to ensure both the REST response and example
include the warnings field.
compute/src/main/java/org/zstack/compute/vm/metadata/MetadataOrphanDetector.java (1)

73-77: ⚠️ Potential issue | 🔴 Critical

List.of() requires Java 9+, incompatible with the project's Java 8 target.

This was flagged in a previous review. The project's pom.xml sets maven.compiler.source and maven.compiler.target to 8, but List.of(...) is only available on Java 9+, so this fails to compile.

🔧 Java 8-compatible fix
-    private static final List<String> SUPPORTED_PS_TYPES = List.of(
-            "SharedBlock",
-            "LocalStorage",
-            "NFS"
-    );
+    private static final List<String> SUPPORTED_PS_TYPES = Collections.unmodifiableList(
+            Arrays.asList("SharedBlock", "LocalStorage", "NFS")
+    );

Required imports:

import java.util.Arrays;
import java.util.Collections;


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/MetadataOrphanDetector.java`
around lines 73 - 77, The constant SUPPORTED_PS_TYPES in MetadataOrphanDetector
uses List.of(...) which requires Java 9+; replace it with a Java 8-compatible
construction such as Collections.unmodifiableList(Arrays.asList("SharedBlock",
"LocalStorage", "NFS")) (or Arrays.asList(...) if mutability is acceptable) and
add the required imports (java.util.Arrays and java.util.Collections) so the
code compiles under Java 8.
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java (2)

490-496: ⚠️ Potential issue | 🔴 Critical

This query batch still picks up rows just claimed by other paths.

lastClaimTime >= now-5s cannot distinguish rows claimed by this round of claimDirtyRows() from rows claimed within the last 5 seconds by triggerFlushForVm() or the previous poller round. The subsequent submitFlushTask() will treat those foreign claims as this round's results, and the races around duplicate flushes and premature claim release persist. A batch token is still needed here, not a time-window approximation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 490 - 496, The current filter in VmMetadataDirtyMarker that uses
lastClaimTime (Timestamp.thisCycleCutoff) still picks up rows claimed by other
paths; change the claim semantics to use a per-claim batch token instead of a
time window: when claimDirtyRows() performs the CAS/claim step, generate and
persist a unique claimBatchToken (or claimId) into the VmMetadataDirtyVO rows it
claims, then in the subsequent query replace the time-based filter
(.gte(VmMetadataDirtyVO_.lastClaimTime, thisCycleCutoff)) with an equality on
that token (e.g. .eq(VmMetadataDirtyVO_.claimBatchToken, myBatchToken)); ensure
triggerFlushForVm(), submitFlushTask() and any code that releases or touches
claims use the same token field when matching/releasing so only rows from this
claim batch are processed.

641-645: ⚠️ Potential issue | 🔴 Critical

The target-VM field is still wrong.

UpdateVmInstanceMetadataMsg needs vmInstanceUuid here, not uuid. As written, the handler side cannot locate the target VM.

💡 Suggested fix
         UpdateVmInstanceMetadataMsg msg = new UpdateVmInstanceMetadataMsg();
-        msg.setUuid(vmUuid);
+        msg.setVmInstanceUuid(vmUuid);
         msg.setStorageStructureChange(latestDirty.isStorageStructureChange());
         msg.setTimeout(TimeUnit.MINUTES.toMillis(5));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 641 - 645, The code is setting the wrong field on
UpdateVmInstanceMetadataMsg: replace the call to msg.setUuid(vmUuid) with the
correct setter for the target VM (e.g. msg.setVmInstanceUuid(vmUuid)) so the
consumer can locate the VM; keep the rest (setStorageStructureChange,
setTimeout, bus.makeLocalServiceId(..., VmInstanceConstant.SERVICE_ID))
unchanged.
🟠 Major comments (17)
header/src/main/java/org/zstack/header/vm/BatchCheckMetadataStatusMsg.java-22-37 (1)

22-37: ⚠️ Potential issue | 🟠 Major

Don't stash the caller's vmUuids list into the message body by reference.

A NeedReplyMessage travels along async paths; keeping the caller's original List reference means the request content mutates if the sender later reuses, clears, or appends to that list. Make the setter copy defensively and normalize null to an empty list to avoid TOCTOU-style issues.

Suggested fix
+import java.util.ArrayList;
+import java.util.Collections;
 import java.util.List;
@@
-    private List<String> vmUuids;
+    private List<String> vmUuids = Collections.emptyList();
@@
     public List<String> getVmUuids() {
         return vmUuids;
     }
@@
     public void setVmUuids(List<String> vmUuids) {
-        this.vmUuids = vmUuids;
+        this.vmUuids = vmUuids == null
+                ? Collections.emptyList()
+                : Collections.unmodifiableList(new ArrayList<>(vmUuids));
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/BatchCheckMetadataStatusMsg.java`
around lines 22 - 37, BatchCheckMetadataStatusMsg currently stores the caller's
vmUuids list by reference which can lead to TOCTOU mutations when the message is
sent asynchronously; modify setVmUuids to defensively copy and normalize null to
an empty list (e.g., this.vmUuids = (vmUuids == null ? Collections.emptyList() :
new ArrayList<>(vmUuids))) and consider making getVmUuids return an unmodifiable
view or a copy to prevent external mutation (refer to class
BatchCheckMetadataStatusMsg, field vmUuids, methods setVmUuids and getVmUuids).
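The defensive-copy setter is easy to verify in isolation; the class below is a stand-in, not the actual BatchCheckMetadataStatusMsg:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in message showing the defensive-copy pattern the review suggests:
// the setter copies the incoming list and normalizes null, so later
// mutation of the caller's list cannot change what the message carries.
public class DefensiveCopyMsgSketch {
    private List<String> vmUuids = Collections.emptyList();

    public List<String> getVmUuids() {
        return vmUuids;
    }

    public void setVmUuids(List<String> vmUuids) {
        this.vmUuids = vmUuids == null
                ? Collections.emptyList()
                : Collections.unmodifiableList(new ArrayList<>(vmUuids));
    }
}
```

The same pattern applies to the Map-valued results field flagged in the next finding, with Collections.unmodifiableMap and a LinkedHashMap copy.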
header/src/main/java/org/zstack/header/vm/BatchCheckMetadataStatusReply.java-17-24 (1)

17-24: ⚠️ Potential issue | 🟠 Major

Avoid exposing a mutable Map across the reply boundary.

The reply currently stores and returns the caller's results map as-is. If the code that builds the reply keeps reusing or modifying the same Map after sending, the reply content gets polluted retroactively; and null pushes every downstream consumer into extra null checks. Make the setter take a defensive copy and normalize null to an empty map.

Suggested fix
+import java.util.Collections;
+import java.util.LinkedHashMap;
 import java.util.Map;
@@
-    private Map<String, MetadataStatusResult> results;
+    private Map<String, MetadataStatusResult> results = Collections.emptyMap();
@@
     public Map<String, MetadataStatusResult> getResults() {
         return results;
     }
@@
     public void setResults(Map<String, MetadataStatusResult> results) {
-        this.results = results;
+        this.results = results == null
+                ? Collections.emptyMap()
+                : Collections.unmodifiableMap(new LinkedHashMap<>(results));
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/BatchCheckMetadataStatusReply.java`
around lines 17 - 24, do not expose the mutable Map field results directly on
BatchCheckMetadataStatusReply: in setResults(Map<String, MetadataStatusResult>
results) take a defensive copy of the incoming map and normalize null to
Collections.emptyMap() (or new HashMap<>()), and have getResults() return an
unmodifiable view or defensive copy to block external mutation (reference class
BatchCheckMetadataStatusReply and methods getResults/setResults), ensuring the
reply is immutable or an independent copy from the outside.
testlib/src/main/java/org/zstack/testlib/KVMSimulator.groovy-684-691 (1)

684-691: ⚠️ Potential issue | 🟠 Major

Make the READ/WRITE simulators stateful.

Right now WRITE returns success unconditionally and READ always returns a fixed payload, so the write path, path routing, and read-after-write verification are never actually exercised; tests would pass even if production code wrote the wrong metadata. At minimum, cache written content keyed by the metadata path or VM identifier in the request, and return that same data on READ.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testlib/src/main/java/org/zstack/testlib/KVMSimulator.groovy` around lines
684 - 691, The READ/WRITE KVM metadata simulators are stateless: spec.simulator
handlers for KVMConstant.WRITE_VM_INSTANCE_METADATA_PATH and
KVMConstant.READ_VM_INSTANCE_METADATA_PATH always return fixed responses
(WriteVmInstanceMetadataRsp / ReadVmInstanceMetadataRsp) so tests cannot detect
incorrect writes; change the spec.simulator for WRITE_VM_INSTANCE_METADATA_PATH
to store the incoming metadata keyed by VM id or metadata path (extract from the
HttpEntity payload or request headers) into an in-memory map and have the
READ_VM_INSTANCE_METADATA_PATH handler look up and return the stored value
(falling back to the existing fixed payload only if no write exists), ensuring
the simulator functions (the two spec.simulator closures and the
ReadVmInstanceMetadataRsp/WriteVmInstanceMetadataRsp flow) share that map and
keying logic.
header/src/main/java/org/zstack/header/vm/APICleanupVmInstanceMetadataMsg.java-11-17 (1)

11-17: ⚠️ Potential issue | 🟠 Major

Add @MetadataImpact to this API

The PR goals explicitly state that CI validates that every APIMessage subclass carries this annotation. APICleanupVmInstanceMetadataMsg is now missing it, directly violating that constraint; at minimum add an impact consistent with the "clean up primary-storage metadata" semantics.

🩹 Suggested fix
+@MetadataImpact(MetadataImpact.Impact.STORAGE)
 @RestRequest(
         path = "/vm-instances/metadata/cleanup",
         method = HttpMethod.PUT,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/vm/APICleanupVmInstanceMetadataMsg.java`
around lines 11 - 17, Add the @MetadataImpact annotation to the
APICleanupVmInstanceMetadataMsg class to reflect that this API modifies VM
metadata (it performs cleanup/removal); update the class declaration for
APICleanupVmInstanceMetadataMsg to include @MetadataImpact with the appropriate
impact type/flags that indicate metadata deletion/cleanup so it satisfies the CI
rule requiring all APIMessage subclasses to carry this annotation.
conf/db/upgrade/V4.10.29__schema.sql-20-21 (1)

20-21: ⚠️ Potential issue | 🟠 Major

Make index creation re-entrant.

The CREATE INDEX on Line 21 has no existence guard. If the upgrade partially fails after this point and is re-run, the earlier CREATE TABLE IF NOT EXISTS is skipped, but this statement aborts because the index already exists, and recovery stalls. Check information_schema.STATISTICS first and create the index conditionally. Based on learnings: "the UPGRADE_VM_METADATA_TABLES_DIRECT stored procedure in ZStack upgrade script V4.10.16__schema.sql uses a direct RENAME TABLE and is not idempotent; existence checks for tables and constraints via information_schema are needed so the script can be safely re-run."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@conf/db/upgrade/V4.10.29__schema.sql` around lines 20 - 21, CREATE INDEX
`idxVmMetadataDirtyUnclaimed` on VmMetadataDirtyVO is not idempotent; change the
upgrade step to first query information_schema.STATISTICS for
table_name='VmMetadataDirtyVO' and index_name='idxVmMetadataDirtyUnclaimed' and
only run CREATE INDEX `idxVmMetadataDirtyUnclaimed` (...) when that query
returns no rows so the migration can be safely re-run; update the
V4.10.29__schema.sql migration to perform this conditional check (use the same
index name `idxVmMetadataDirtyUnclaimed` and table `VmMetadataDirtyVO`) rather
than unconditionally issuing CREATE INDEX.
plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java-1966-1968 (1)

1966-1968: ⚠️ Potential issue | 🟠 Major

The backend should be selected by target host here.

hostUuid has already been resolved above Line 1966; calling getUsableBackend() binds the request to "any usable backend" rather than the implementation for the target host. That can pick the wrong backend, and it returns null outright when no other connected host exists. Select the backend by hostUuid instead.

🛠️ Suggested fix
-        final NfsPrimaryStorageBackend backend = getUsableBackend();
+        final NfsPrimaryStorageBackend backend = getBackendByHostUuid(hostUuid);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java`
around lines 1966 - 1968, before the NfsPrimaryStorage.handle call, stop using
getUsableBackend() (which returns an arbitrary usable backend) and instead
select the backend based on the already-resolved hostUuid; replace it with a
by-host lookup (e.g. add or call something like getBackendByHostUuid(hostUuid)
/ findBackendForHost(hostUuid)), handle the not-found case with an explicit
null check / error return, and make sure the subsequent backend.handle(msg,
hostUuid, ...) uses that host-selected backend instance.
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java-3198-3208 (1)

3198-3208: ⚠️ Potential issue | 🟠 Major

Registering 态把只读 API 也一并拦掉了。

The current whitelist contains only APIDestroyVmInstanceMsg, so read-only requests handled later in the same method, such as APIGetVmBootOrderMsg, APIGetVmDnsMsg, and APIGetVmCapabilitiesMsg, also fail immediately. This contradicts the comment's "allow Destroy and query/read operations" and breaks querying during registration. Extend the allow list, or extract a unified isAllowedWhileRegistering() check.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java` around lines
3198 - 3208, The current guard in handleApiMessage treats
VmInstanceState.Registering as allowing only APIDestroyVmInstanceMsg, which
incorrectly blocks read/query messages like APIGetVmBootOrderMsg,
APIGetVmDnsMsg, APIGetVmCapabilitiesMsg; update the check to allow all read-only
APIs (either by extending the whitelist to include those API*Msg classes or by
adding a helper isAllowedWhileRegistering(Message) used in handleApiMessage) so
that queries are permitted during Registering while destructive operations
remain blocked.
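A minimal sketch of the unified check (the class name `RegisteringGuard` and the exact message list are illustrative assumptions, not the real ZStack types; the real check would live in VmInstanceBase.handleApiMessage):

```java
import java.util.Set;

// Hypothetical sketch: gate APIs during the Registering state via an explicit
// allow-list instead of a single hard-coded class comparison.
public class RegisteringGuard {
    // Destroy plus the read-only APIs handled later in the same method.
    private static final Set<String> ALLOWED_WHILE_REGISTERING = Set.of(
            "APIDestroyVmInstanceMsg",
            "APIGetVmBootOrderMsg",
            "APIGetVmDnsMsg",
            "APIGetVmCapabilitiesMsg");

    public static boolean isAllowedWhileRegistering(String apiSimpleName) {
        return ALLOWED_WHILE_REGISTERING.contains(apiSimpleName);
    }
}
```

In the real code the set would hold Class objects and the check would use `msg.getClass()`, so adding a new read-only API is a one-line change.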
plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorageKVMBackend.java-2058-2064 (1)

2058-2064: ⚠️ Potential issue | 🟠 Major

The metadata path derivation is fragile and can easily write to the wrong location.

This neither handles installPath == null nor avoids hard-coding directory extraction as a vol-* regex. If the root volume's install path cannot be found, or the path deviates even slightly from that pattern, metadataPath becomes an NPE or a bogus address like *.qcow2<suffix>, and subsequent updates silently land in the wrong place.

A more robust fix
 String installPath = Q.New(VolumeVO.class)
         .eq(VolumeVO_.uuid, msg.getRootVolumeUuid())
         .select(VolumeVO_.installPath)
         .findValue();
-String path = installPath.replaceFirst("^(.+/vol-[^/]+/).*$", "$1");
-String metadataPath = String.format("%s%s", path, VM_META_SUFFIX);
+if (installPath == null) {
+    completion.fail(operr("cannot find install path for root volume[uuid:%s]", msg.getRootVolumeUuid()));
+    return;
+}
+
+int dirEnd = installPath.lastIndexOf('/') + 1;
+if (dirEnd <= 0) {
+    completion.fail(operr("invalid root volume install path[%s]", installPath));
+    return;
+}
+
+String metadataPath = String.format("%s%s", installPath.substring(0, dirEnd), VM_META_SUFFIX);
 umsg.setMetadataPath(metadataPath);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorageKVMBackend.java`
around lines 2058 - 2064, The code in NfsPrimaryStorageKVMBackend derives
metadataPath from installPath unsafely: the query using Q.New(VolumeVO.class)
may return null and the replaceFirst with a hardcoded "vol-*" regex can produce
NPEs or incorrect paths; update the logic to null-check the installPath returned
by the VolumeVO query, validate its expected directory structure (avoid relying
solely on "vol-*" regex), and compute metadataPath by using safe path operations
(e.g., locate the parent directory containing the volume folder or match a
configurable segment) before calling umsg.setMetadataPath(metadataPath); ensure
fallback handling/logging when installPath is missing or malformed so no silent
writes occur.
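The safer derivation can be sketched as follows (class name, exception handling, and the `VM_META_SUFFIX` value are assumptions for illustration; the real code would call completion.fail instead of throwing):

```java
// Hypothetical sketch: null-check the install path and take the parent
// directory via lastIndexOf('/') instead of a vol-* regex.
public class MetadataPathDeriver {
    static final String VM_META_SUFFIX = "vm-metadata.json"; // assumed file name

    public static String deriveMetadataPath(String installPath) {
        if (installPath == null) {
            throw new IllegalStateException("cannot find install path for root volume");
        }
        int dirEnd = installPath.lastIndexOf('/') + 1;
        if (dirEnd <= 0) {
            throw new IllegalStateException("invalid root volume install path: " + installPath);
        }
        // keep everything up to and including the last '/', then append the suffix
        return installPath.substring(0, dirEnd) + VM_META_SUFFIX;
    }
}
```

This works for any directory layout, not just paths containing a `vol-` segment.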
compute/src/main/java/org/zstack/compute/vm/VmExpungeMetadataFlow.java-51-71 (1)

51-71: ⚠️ Potential issue | 🟠 Major

A timing conflict in the root volume query can silently skip metadata cleanup.

The flow documentation (line 27) states explicitly that VmExpungeMetadataFlow runs after the Root/Memory/Cache volumes are deleted, yet lines 60-63 still query VolumeVO for primaryStorageUuid. If the actual orchestration follows the documented order, this query consistently returns null, metadata cleanup always takes the skip branch, and only a debug log remains.

Fix: read the cached PS information directly via spec.getVmInventory().getRootVolume().getPrimaryStorageUuid() instead of relying on an already-deleted VolumeVO record. This approach is well established in other flows such as VmImageSelectBackupStorageFlow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmExpungeMetadataFlow.java`
around lines 51 - 71, The code currently queries VolumeVO for primaryStorageUuid
(Q.New(...).select(...)) which races with document-described ordering and can
return null after volumes are deleted; replace that DB lookup in
VmExpungeMetadataFlow with reading the cached value via
spec.getVmInventory().getRootVolume().getPrimaryStorageUuid() (use that as
psUuid) and only fall back to the VolumeVO query if the cached rootVolume or its
primaryStorageUuid is null; update logic around psUuid null-handling accordingly
to avoid silent skips.
compute/src/main/java/org/zstack/compute/vm/metadata/MetadataStaleRecoveryTask.java-133-138 (1)

133-138: ⚠️ Potential issue | 🟠 Major

The batch scan has no stable progress guarantee; once the backlog exceeds one batch, starvation sets in.

This uses limit(batchSize) directly, with no ORDER BY or cursor-style advancement. Once there are more pending records than one batch, the periodic task may repeatedly hit the same small subset while other VMs never get a retry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/MetadataStaleRecoveryTask.java`
around lines 133 - 138, The query in MetadataStaleRecoveryTask uses
SQL.New(...).limit(batchSize) on VmMetadataPathFingerprintVO and can repeatedly
hit the same subset when >batchSize records exist; make selection deterministic
and page through results by adding an ORDER BY (e.g., ORDER BY fp.id or ORDER BY
fp.lastUpdate) and implement cursor-style pagination (e.g., WHERE fp.id >
:lastProcessedId or WHERE fp.lastUpdate <= :lastProcessedTimestamp) persisted
between runs (or advanced each invocation) instead of just limit(batchSize);
update the MetadataStaleRecoveryTask to supply the ordering parameter and
maintain/advance the cursor so all stale records get fair retry opportunities.
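The cursor idea can be shown with an in-memory sketch (not the real SQL layer; ids and the cursor are plain Java values here, mirroring `WHERE id > :lastProcessedId ORDER BY id LIMIT :batchSize`):

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of cursor-style paging over stale-recovery records.
public class StaleRecoveryCursor {
    // Return the next batch of ids strictly greater than the cursor, in order.
    // The caller advances the cursor to the last id of the returned batch.
    public static List<Long> nextBatch(List<Long> ids, long cursor, int batchSize) {
        return ids.stream()
                .filter(id -> id > cursor)
                .sorted()
                .limit(batchSize)
                .collect(Collectors.toList());
    }
}
```

Because the selection is ordered and strictly advances past the cursor, every record is visited before the scan wraps around, which removes the starvation.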
header/src/main/java/org/zstack/header/vm/APIRegisterVmInstanceFromMetadataMsg.java-23-25 (1)

23-25: ⚠️ Potential issue | 🟠 Major

There are two writable VM UUID sources here, which makes the API contract ambiguous.

This message extends APICreateMessage and therefore automatically exposes resourceUuid (see header/src/main/java/org/zstack/header/message/APICreateMessage.java:8-42), while metadataContent itself also carries vmUuid. If the two disagree, neither the caller nor the server can tell which value is authoritative. Either forbid one of them at the message level, or explicitly validate that they match.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@header/src/main/java/org/zstack/header/vm/APIRegisterVmInstanceFromMetadataMsg.java`
around lines 23 - 25, APIRegisterVmInstanceFromMetadataMsg currently exposes two
writable VM UUID sources (inherited resourceUuid from APICreateMessage and
vmUuid inside metadataContent), causing ambiguity; update
APIRegisterVmInstanceFromMetadataMsg to validate and reconcile them: in the
message validation (e.g. override a validate/validateMessage method or add a
custom check before processing) parse vmUuid from metadataContent and if
resourceUuid is non-null ensure it equals the parsed vmUuid (reject the request
on mismatch), otherwise set resourceUuid to the parsed vmUuid so only one
authoritative UUID is used; reference APIRegisterVmInstanceFromMetadataMsg,
APICreateMessage, metadataContent and resourceUuid when implementing this check.
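One possible reconciliation rule, as a hedged sketch (class and method names are hypothetical; the real check would live in the message's validate path):

```java
// Hypothetical sketch: the metadata's vmUuid is authoritative; an explicit
// resourceUuid is accepted only when it agrees with it.
public class VmUuidReconciler {
    public static String reconcile(String resourceUuid, String metadataVmUuid) {
        if (metadataVmUuid == null) {
            throw new IllegalArgumentException("metadataContent carries no vmUuid");
        }
        if (resourceUuid != null && !resourceUuid.equals(metadataVmUuid)) {
            throw new IllegalArgumentException(String.format(
                    "resourceUuid[%s] conflicts with vmUuid[%s] in metadataContent",
                    resourceUuid, metadataVmUuid));
        }
        return metadataVmUuid;
    }
}
```

With this rule the request is rejected on mismatch, and callers that set neither value still get a deterministic result.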
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java-675-706 (1)

675-706: ⚠️ Potential issue | 🟠 Major

The saved path fingerprint is not equal to the payload that was just written out.

The deleted == 0 branch means the metadata that was just written successfully is already behind the current DB version, yet savePathFingerprint() is still called; meanwhile savePathFingerprint() queries the volume set only by vmInstanceUuid, omitting detached volumes carrying lastVmInstanceUuid and not excluding shareable volumes. The fingerprint can therefore run "ahead" of the actual stored content or diverge from the payload's resource set, and MetadataPathDriftDetector will easily misjudge.

Also applies to: 813-829

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 675 - 706, The current onFlushSuccess() always calls
savePathFingerprint(vmUuid) even when deleted==0 (meaning the DB has newer
changes), and savePathFingerprint() queries volumes only by vmInstanceUuid and
does not exclude shareable volumes or include volumes with lastVmInstanceUuid,
leading to fingerprints that don't match the just-flushed payload; fix by: (1)
only invoke savePathFingerprint(vmUuid) when the conditional delete succeeded
(deleted != 0) inside onFlushSuccess() in VmMetadataDirtyMarker so the
fingerprint reflects the flushed snapshotVersion, and (2) update
savePathFingerprint(...) to build its volume query to include volumes where
vmInstanceUuid == vmUuid OR lastVmInstanceUuid == vmUuid while excluding
shareable volumes (and any other detached-but-relevant volumes), ensuring the
recorded path fingerprint matches the actual resources in the flushed payload.
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataUpdateInterceptor.java-149-157 (1)

149-157: ⚠️ Potential issue | 🟠 Major

Impact.NONE must be skipped outright.

The PR already marks 500+ APIs as NONE, but they are still cached here and trigger markDirty(false) on success. Every "no-impact API" that can resolve a vmUuid then causes a metadata flush, amplifying load considerably. Return early in beforeDeliveryMessage(), and add a defensive check in submitMarkDirty() as well.

💡 Suggested fix
-                APIMessage apiMsg = (APIMessage) msg;
-                List<String> vmUuids = resolveVmUuids(apiMsg);
+                MetadataImpact impact = msg.getClass().getAnnotation(MetadataImpact.class);
+                if (impact == null || impact.value() == MetadataImpact.Impact.NONE) {
+                    return;
+                }
+
+                APIMessage apiMsg = (APIMessage) msg;
+                List<String> vmUuids = resolveVmUuids(apiMsg);
                 if (vmUuids.isEmpty()) {
                     return;
                 }
-
-                MetadataImpact impact = msg.getClass().getAnnotation(MetadataImpact.class);
                 pendingApis.put(apiMsg.getId(), new MetadataImpactInfo(
                         vmUuids, impact.value(), impact.updateOnFailure()));
@@
     void submitMarkDirty(String vmInstanceUuid, MetadataImpact.Impact impact) {
+        if (impact == MetadataImpact.Impact.NONE) {
+            return;
+        }
         boolean storageStructureChange = (impact == MetadataImpact.Impact.STORAGE);

Also applies to: 230-235

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataUpdateInterceptor.java`
around lines 149 - 157, In VmMetadataUpdateInterceptor, when resolving
MetadataImpact for a message (the code that builds pendingApis using
apiMsg.getId() and new MetadataImpactInfo(...)), skip and return immediately if
impact.value() == MetadataImpact.Impact.NONE so NONE-marked APIs are not cached;
additionally add a defensive check in submitMarkDirty (the method that processes
pendingApis and calls markDirty) to ignore entries whose MetadataImpactInfo
indicates Impact.NONE before submitting any markDirty work. Ensure you reference
VmMetadataUpdateInterceptor, the pendingApis map, MetadataImpact annotation
usage, and submitMarkDirty/markDirty when making the changes.
compute/src/main/java/org/zstack/compute/vm/VmInstanceMetadataFieldProcessor.java-196-199 (1)

196-199: ⚠️ Potential issue | 🟠 Major

Tighten the match/replace boundaries for path identifiers.

This uses both contains() and a full replace(). In the spec, oldPathIdentifier is either a "path prefix" or a "VG UUID"; the current code treats an incidental hit of the same fragment inside some other path segment as the same primary storage and rewrites every occurrence. Both cross-storage filtering and installPath rewriting can be misjudged or miswritten.

Also applies to: 226-230

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/VmInstanceMetadataFieldProcessor.java`
around lines 196 - 199, The belongsToPrimaryStorage check and the related
installPath rewrite use naive contains()/replace(), causing false positives when
oldPathIdentifier appears inside other path segments; update
belongsToPrimaryStorage to only match when installPath has the identifier as the
intended boundary (e.g., path prefix or whole VG UUID) by using startsWith for
prefix matches or exact-equality for UUID matches, or use a regex that enforces
segment boundaries (e.g., (^|/)<identifier>(/|$)); likewise change the
corresponding replacement code (the code around lines 226-230 that does full
replace) to only replace the matched prefix/segment (or use Path-based
normalization) instead of replacing all occurrences so only the intended path
portion is rewritten.
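A segment-boundary match could look like this sketch (class name hypothetical; `Pattern.quote` guards against regex metacharacters inside the identifier):

```java
import java.util.regex.Pattern;

// Hypothetical sketch: match the identifier only as a full path segment,
// so "vg-123" inside another segment like "backup-vg-1234" does not match.
public class PathIdentifierMatcher {
    public static boolean belongsTo(String installPath, String identifier) {
        Pattern p = Pattern.compile("(^|/)" + Pattern.quote(identifier) + "(/|$)");
        return p.matcher(installPath).find();
    }
}
```

The same anchored pattern can drive the rewrite, so only the matched segment is replaced rather than every occurrence of the substring.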
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java-899-917 (1)

899-917: ⚠️ Potential issue | 🟠 Major

Batch initialization/upgrade refresh bypasses the KVM filter.

The regular markDirty() entry point explicitly handles only KVM UserVms, but these two SQL statements filter only on type = 'UserVm'. When the feature flag is enabled or an upgrade refresh runs, non-KVM UserVms are also pushed into the dirty table, diverging from the target set of the day-to-day incremental path.

Also applies to: 995-1003

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 899 - 917, The batch INSERT/UPDATE for VmMetadataDirtyVO is only
filtering on v.type = 'UserVm' and thus includes non-KVM User VMs; update the
SQL in the methods that perform the batch initialize/upgrade (the SQL.New(...)
statements operating on VmInstanceVO -> VmMetadataDirtyVO and referenced by
markDirty()) to also filter by the KVM hypervisor (e.g., add a condition like
v.hypervisorType = 'KVM' or the equivalent column used in VmInstanceVO), and
apply the same fix to the other instance of the SQL block (the similar
statements around the 995-1003 region) so the batch path matches the normal
markDirty() KVM-only behavior.
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java-842-845 (1)

842-845: ⚠️ Potential issue | 🟠 Major

The failure marker must be cleared after a successful flush.

The existing-fingerprint-row branch only updates the timestamp and pathSnapshot without resetting lastFlushFailed. A single historical failure lingers forever, and subsequent recovery/inspection logic keeps treating this VM as "last flush failed".

💡 Suggested fix
         } else {
             fp.setPathSnapshot(pathJson);
             fp.setLastFlushTime(new Timestamp(System.currentTimeMillis()));
+            fp.setLastFlushFailed(false);
             dbf.update(fp);
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 842 - 845, The current success branch in VmMetadataDirtyMarker
updates fp.setPathSnapshot(pathJson) and fp.setLastFlushTime(...) but does not
clear the failure marker, so a prior failure remains forever; after a successful
flush, reset the failure flag on the fingerprint object (e.g. call
fp.setLastFlushFailed(false) or the appropriate setter) before calling
dbf.update(fp) so the record no longer indicates a lastFlushFailed state. Ensure
you reference VmMetadataDirtyMarker and the fp instance when making this change.
compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java-519-526 (1)

519-526: ⚠️ Potential issue | 🟠 Major

The global concurrency cap check is not atomic.

get()incrementAndGet() 分开执行,多个线程可以同时穿过这里,把 globalFlushInFlight 冲到 maxConcurrent 之上。这个限流对 triggerFlushForVm() 和 poller 的并发提交都不可靠。

💡 Suggested fix
         int maxConcurrent = VmGlobalConfig.VM_METADATA_GLOBAL_MAX_CONCURRENT.value(Integer.class);
-        if (globalFlushInFlight.get() >= maxConcurrent) {
-            // global concurrency limit reached; release the claim, poller retries next round
-            releaseClaim(vmUuid);
-            return;
-        }
-        globalFlushInFlight.incrementAndGet();
+        int inFlight = globalFlushInFlight.incrementAndGet();
+        if (inFlight > maxConcurrent) {
+            globalFlushInFlight.decrementAndGet();
+            // global concurrency limit reached; release the claim, poller retries next round
+            releaseClaim(vmUuid);
+            return;
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 519 - 526, The current check uses separate
globalFlushInFlight.get() and incrementAndGet(), which is not atomic and lets
multiple threads exceed maxConcurrent; change it to an atomic
compare-and-increment (e.g., a CAS loop or
AtomicInteger.updateAndGet/getAndUpdate with a lambda) inside
VmMetadataDirtyMarker where you currently check maxConcurrent so that you only
increment when the previous value is < maxConcurrent; if the atomic update fails
because value >= maxConcurrent, call releaseClaim(vmUuid) and return. Also
ensure the corresponding decrement happens reliably (e.g., in the existing
completion/finally path) so globalFlushInFlight is balanced. Reference symbols:
globalFlushInFlight, triggerFlushForVm(), releaseClaim(vmUuid).
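The compare-and-increment idea can be sketched with `AtomicInteger.getAndUpdate` (class name hypothetical; in the real code the counter is globalFlushInFlight inside VmMetadataDirtyMarker):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: atomically claim a flush slot so concurrent callers
// can never push the counter past the limit.
public class FlushLimiter {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final int maxConcurrent;

    public FlushLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    // Only increments when the current value is below the limit; returns
    // whether a slot was acquired. getAndUpdate retries the CAS internally.
    public boolean tryAcquire() {
        int prev = inFlight.getAndUpdate(v -> v < maxConcurrent ? v + 1 : v);
        return prev < maxConcurrent;
    }

    // Must be called on every completion path (e.g. in a finally block).
    public void release() {
        inFlight.decrementAndGet();
    }
}
```

When tryAcquire() returns false, the caller would release the claim and let the poller retry next round, matching the suggested fix above.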
🟡 Minor comments (2)
conf/db/upgrade/V4.10.29__schema.sql-35-36 (1)

35-36: ⚠️ Potential issue | 🟡 Minor

Schema-qualify the cleanup statement as well.

The DELETE FROM `GarbageCollectorVO` here still depends on the default database, which is inconsistent with the fixed-schema convention of the upgrade scripts in this directory; change it to DELETE FROM `zstack`.`GarbageCollectorVO`. Based on learnings, "In ZStack, the database schema name is fixed as 'zstack'. For SQL upgrade scripts under conf/db/upgrade, continue using TABLE_SCHEMA = 'zstack' and reference tables as `zstack`.`UsedIpVO`."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@conf/db/upgrade/V4.10.29__schema.sql` around lines 35 - 36, Update the DELETE
statement so it explicitly targets the fixed zstack schema instead of relying on
the default database; change the statement that references the table
GarbageCollectorVO in V4.10.29__schema.sql to DELETE FROM
`zstack`.`GarbageCollectorVO` WHERE `name` LIKE 'update-vm-%-metadata-gc';
ensuring the schema-qualified table name (`zstack`.`GarbageCollectorVO`) is used
anywhere this table is referenced in the upgrade script.
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java-7382-7386 (1)

7382-7386: ⚠️ Potential issue | 🟡 Minor

Do not keep backfilling metadata when the read fails.

When the agent returns a failure, rsp.metadata is still written into the reply. A caller that only consumes the response body fields without first checking the error can easily treat a failed result as valid metadata.

💡 Suggested fix
             public void success(ReadVmInstanceMetadataRsp rsp) {
                 if (!rsp.isSuccess()) {
                     reply.setError(operr("operation error, because:%s", rsp.getError()));
+                } else {
+                    reply.setMetadata(rsp.metadata);
                 }
-                reply.setMetadata(rsp.metadata);
                 bus.reply(msg, reply);
                 completion.done();
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java` around lines 7382 -
7386, In success(ReadVmInstanceMetadataRsp rsp) ensure you do not populate
reply.metadata when the response indicates failure: if !rsp.isSuccess() set
reply.setError(operr(...)) and return (or otherwise skip setting
reply.setMetadata), otherwise set reply.setMetadata(rsp.metadata); this prevents
failed responses from writing rsp.metadata into reply.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 382725ba-4cc9-4e3b-904a-6aa152d2bb90

📥 Commits

Reviewing files that changed from the base of the PR and between 1cd28ae and 8345c56.

⛔ Files ignored due to path filters (4)
  • conf/globalConfig/vm.xml is excluded by !**/*.xml
  • conf/persistence.xml is excluded by !**/*.xml
  • conf/serviceConfig/primaryStorage.xml is excluded by !**/*.xml
  • conf/springConfigXml/VmInstanceManager.xml is excluded by !**/*.xml
📒 Files selected for processing (136)
  • compute/src/main/java/org/zstack/compute/vm/AbstractVmInstance.java
  • compute/src/main/java/org/zstack/compute/vm/VmExpungeMetadataFlow.java
  • compute/src/main/java/org/zstack/compute/vm/VmGlobalConfig.java
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceApiInterceptor.java
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceMetadataFieldProcessor.java
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceUtils.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/DefaultVmUuidFromApiResolver.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/MetadataCascadeExtension.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/MetadataOrphanDetector.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/MetadataPathDriftDetector.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/MetadataPathSnapshotBuilder.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/MetadataStaleRecoveryTask.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/NicBasedVmUuidFromApiResolver.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/ReflectionBasedVmUuidFromApiResolver.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/ResourceBasedVmUuidFromApiResolver.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/SnapshotBasedVmUuidFromApiResolver.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/SnapshotGroupBasedVmUuidFromApiResolver.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataBuilder.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataUpdateInterceptor.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/VolumeBasedVmUuidFromApiResolver.java
  • conf/db/upgrade/V4.10.29__schema.sql
  • header/src/main/java/org/zstack/header/storage/backup/APIExportImageFromBackupStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIGetVmInstanceMetadataFromPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIGetVmInstanceMetadataFromPrimaryStorageReply.java
  • header/src/main/java/org/zstack/header/storage/primary/APIRegisterVmInstanceEventDoc_zh_cn.groovy
  • header/src/main/java/org/zstack/header/storage/primary/APIRegisterVmInstanceMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/APIRegisterVmInstanceMsgDoc_zh_cn.groovy
  • header/src/main/java/org/zstack/header/storage/primary/APIRegisterVmInstanceReply.java
  • header/src/main/java/org/zstack/header/storage/primary/CleanupVmInstanceMetadataOnPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/CleanupVmInstanceMetadataOnPrimaryStorageReply.java
  • header/src/main/java/org/zstack/header/storage/primary/GetVmInstanceMetadataFromPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/GetVmInstanceMetadataFromPrimaryStorageReply.java
  • header/src/main/java/org/zstack/header/storage/primary/ReadVmInstanceMetadataMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/ReadVmInstanceMetadataOnHypervisorMsg.java
  • header/src/main/java/org/zstack/header/storage/primary/ReadVmInstanceMetadataOnHypervisorReply.java
  • header/src/main/java/org/zstack/header/storage/primary/ReadVmInstanceMetadataReply.java
  • header/src/main/java/org/zstack/header/storage/primary/RegisterVmInstanceException.java
  • header/src/main/java/org/zstack/header/storage/snapshot/APIDeleteVolumeSnapshotMsg.java
  • header/src/main/java/org/zstack/header/storage/snapshot/APIRevertVolumeFromSnapshotMsg.java
  • header/src/main/java/org/zstack/header/storage/snapshot/VolumeSnapshotTree.java
  • header/src/main/java/org/zstack/header/storage/snapshot/VolumeSnapshotVO_.java
  • header/src/main/java/org/zstack/header/storage/snapshot/group/APIDeleteVolumeSnapshotGroupMsg.java
  • header/src/main/java/org/zstack/header/tag/APICreateSystemTagMsg.java
  • header/src/main/java/org/zstack/header/tag/APIDeleteTagMsg.java
  • header/src/main/java/org/zstack/header/tag/APIUpdateSystemTagMsg.java
  • header/src/main/java/org/zstack/header/vm/APIAttachVmNicToVmMsg.java
  • header/src/main/java/org/zstack/header/vm/APIChangeVmNicNetworkMsg.java
  • header/src/main/java/org/zstack/header/vm/APIChangeVmNicStateMsg.java
  • header/src/main/java/org/zstack/header/vm/APICheckVmInstanceMetadataConsistencyMsg.java
  • header/src/main/java/org/zstack/header/vm/APICheckVmInstanceMetadataConsistencyReply.java
  • header/src/main/java/org/zstack/header/vm/APICleanupVmInstanceMetadataEvent.java
  • header/src/main/java/org/zstack/header/vm/APICleanupVmInstanceMetadataMsg.java
  • header/src/main/java/org/zstack/header/vm/APIConvertTemplatedVmInstanceToVmInstanceMsg.java
  • header/src/main/java/org/zstack/header/vm/APIConvertVmInstanceToTemplatedVmInstanceMsg.java
  • header/src/main/java/org/zstack/header/vm/APIDeleteVmBootModeMsg.java
  • header/src/main/java/org/zstack/header/vm/APIDeleteVmHostnameMsg.java
  • header/src/main/java/org/zstack/header/vm/APIDeleteVmNicMsg.java
  • header/src/main/java/org/zstack/header/vm/APIDeleteVmSshKeyMsg.java
  • header/src/main/java/org/zstack/header/vm/APIPreCheckVmMetadataRegistrationMsg.java
  • header/src/main/java/org/zstack/header/vm/APIPreCheckVmMetadataRegistrationReply.java
  • header/src/main/java/org/zstack/header/vm/APIReadVmInstanceMetadataMsg.java
  • header/src/main/java/org/zstack/header/vm/APIReadVmInstanceMetadataReply.java
  • header/src/main/java/org/zstack/header/vm/APIRecoverVmInstanceMsg.java
  • header/src/main/java/org/zstack/header/vm/APIRegisterVmInstanceFromMetadataEvent.java
  • header/src/main/java/org/zstack/header/vm/APIRegisterVmInstanceFromMetadataMsg.java
  • header/src/main/java/org/zstack/header/vm/APIReimageVmInstanceEvent.java
  • header/src/main/java/org/zstack/header/vm/APIReimageVmInstanceEventDoc_zh_cn.groovy
  • header/src/main/java/org/zstack/header/vm/APIReimageVmInstanceMsg.java
  • header/src/main/java/org/zstack/header/vm/APIScanVmInstanceMetadataMsg.java
  • header/src/main/java/org/zstack/header/vm/APIScanVmInstanceMetadataReply.java
  • header/src/main/java/org/zstack/header/vm/APISetVmBootOrderMsg.java
  • header/src/main/java/org/zstack/header/vm/APIUpdateVmInstanceMsg.java
  • header/src/main/java/org/zstack/header/vm/APIUpdateVmMetadataEvent.java
  • header/src/main/java/org/zstack/header/vm/APIUpdateVmMetadataMsg.java
  • header/src/main/java/org/zstack/header/vm/BatchCheckMetadataStatusMsg.java
  • header/src/main/java/org/zstack/header/vm/BatchCheckMetadataStatusReply.java
  • header/src/main/java/org/zstack/header/vm/ConsistencyCheckResult.java
  • header/src/main/java/org/zstack/header/vm/DiffEntry.java
  • header/src/main/java/org/zstack/header/vm/MetadataImpact.java
  • header/src/main/java/org/zstack/header/vm/MetadataStatusResult.java
  • header/src/main/java/org/zstack/header/vm/MetadataStorageHandler.java
  • header/src/main/java/org/zstack/header/vm/PreCheckItem.java
  • header/src/main/java/org/zstack/header/vm/RegisterVmFromMetadataInnerMsg.java
  • header/src/main/java/org/zstack/header/vm/RegisterVmFromMetadataInnerReply.java
  • header/src/main/java/org/zstack/header/vm/RepairMetadataMsg.java
  • header/src/main/java/org/zstack/header/vm/RepairMetadataReply.java
  • header/src/main/java/org/zstack/header/vm/UpdateVmInstanceMetadataMsg.java
  • header/src/main/java/org/zstack/header/vm/UpdateVmInstanceMetadataOnHypervisorMsg.java
  • header/src/main/java/org/zstack/header/vm/UpdateVmInstanceMetadataOnHypervisorReply.java
  • header/src/main/java/org/zstack/header/vm/UpdateVmInstanceMetadataOnPrimaryStorageMsg.java
  • header/src/main/java/org/zstack/header/vm/UpdateVmInstanceMetadataOnPrimaryStorageReply.java
  • header/src/main/java/org/zstack/header/vm/UpdateVmInstanceMetadataReply.java
  • header/src/main/java/org/zstack/header/vm/VmInstanceConstant.java
  • header/src/main/java/org/zstack/header/vm/VmInstanceMetadataCodec.java
  • header/src/main/java/org/zstack/header/vm/VmInstanceMetadataConstants.java
  • header/src/main/java/org/zstack/header/vm/VmInstanceMetadataDTO.java
  • header/src/main/java/org/zstack/header/vm/VmInstanceMetadataRegistrationSpec.java
  • header/src/main/java/org/zstack/header/vm/VmInstanceMetadataValidator.java
  • header/src/main/java/org/zstack/header/vm/VmInstanceState.java
  • header/src/main/java/org/zstack/header/vm/VmMetadata.java
  • header/src/main/java/org/zstack/header/vm/VmMetadataCanonicalEvents.java
  • header/src/main/java/org/zstack/header/vm/VmMetadataCategory.java
  • header/src/main/java/org/zstack/header/vm/VmMetadataConstants.java
  • header/src/main/java/org/zstack/header/vm/VmMetadataDirtyVO.java
  • header/src/main/java/org/zstack/header/vm/VmMetadataDirtyVO_.java
  • header/src/main/java/org/zstack/header/vm/VmMetadataErrors.java
  • header/src/main/java/org/zstack/header/vm/VmMetadataPathFingerprintVO.java
  • header/src/main/java/org/zstack/header/vm/VmMetadataScanResult.java
  • header/src/main/java/org/zstack/header/vm/VmUuidFromApiResolver.java
  • header/src/main/java/org/zstack/header/vm/VolumeResourceMetadata.java
  • header/src/main/java/org/zstack/header/vm/cdrom/APIDeleteVmCdRomMsg.java
  • header/src/main/java/org/zstack/header/volume/APIAttachDataVolumeToVmMsg.java
  • header/src/main/java/org/zstack/header/volume/APICreateVolumeSnapshotGroupMsg.java
  • header/src/main/java/org/zstack/header/volume/APIDeleteDataVolumeMsg.java
  • header/src/main/java/org/zstack/header/volume/APIDetachDataVolumeFromVmMsg.java
  • header/src/main/java/org/zstack/header/volume/APIFlattenVolumeMsg.java
  • header/src/main/java/org/zstack/header/volume/APIRecoverDataVolumeMsg.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMAgentCommands.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMConstant.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
  • plugin/localstorage/src/main/java/org/zstack/storage/primary/local/APILocalStorageMigrateVolumeMsg.java
  • plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageBase.java
  • plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageHypervisorBackend.java
  • plugin/localstorage/src/main/java/org/zstack/storage/primary/local/LocalStorageKvmBackend.java
  • plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorage.java
  • plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorageBackend.java
  • plugin/nfsPrimaryStorage/src/main/java/org/zstack/storage/primary/nfs/NfsPrimaryStorageKVMBackend.java
  • resourceconfig/src/main/java/org/zstack/resourceconfig/APIDeleteResourceConfigMsg.java
  • resourceconfig/src/main/java/org/zstack/resourceconfig/APIUpdateResourceConfigMsg.java
  • sdk/src/main/java/org/zstack/sdk/RegisterVmInstanceAction.java
  • sdk/src/main/java/org/zstack/sdk/RegisterVmInstanceResult.java
  • storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java
  • testlib/src/main/java/org/zstack/testlib/ApiHelper.groovy
  • testlib/src/main/java/org/zstack/testlib/KVMSimulator.groovy
🚧 Files skipped from review as they are similar to previous changes (9)
  • header/src/main/java/org/zstack/header/storage/primary/ReadVmInstanceMetadataMsg.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/DefaultVmUuidFromApiResolver.java
  • header/src/main/java/org/zstack/header/storage/primary/GetVmInstanceMetadataFromPrimaryStorageMsg.java
  • compute/src/main/java/org/zstack/compute/vm/metadata/SnapshotBasedVmUuidFromApiResolver.java
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceApiInterceptor.java
  • header/src/main/java/org/zstack/header/storage/primary/RegisterVmInstanceException.java
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceUtils.java
  • header/src/main/java/org/zstack/header/storage/primary/APIRegisterVmInstanceMsgDoc_zh_cn.groovy
  • header/src/main/java/org/zstack/header/storage/primary/APIGetVmInstanceMetadataFromPrimaryStorageReply.java

Comment on lines +75 to +111
@Override
public void managementNodeReady() {
    startTask();
}

// =====================================================================
// Task management
// =====================================================================

private synchronized void startTask() {
    if (taskFuture != null) {
        taskFuture.cancel(false);
    }
    taskFuture = thdf.submitPeriodicTask(new PeriodicTask() {
        @Override
        public TimeUnit getTimeUnit() {
            return TimeUnit.SECONDS;
        }

        @Override
        public long getInterval() {
            return VmGlobalConfig.VM_METADATA_STALE_RECOVERY_INTERVAL.value(Long.class);
        }

        @Override
        public String getName() {
            return "vm-metadata-stale-recovery";
        }

        @Override
        public void run() {
            recoverStaleVms();
        }
    });
    logger.info("[MetadataStaleRecovery] task started (interval={}s)",
            VmGlobalConfig.VM_METADATA_STALE_RECOVERY_INTERVAL.value(Long.class));
}

⚠️ Potential issue | 🔴 Critical

This periodic task now runs redundantly on every management node.

managementNodeReady() unconditionally starts a local periodic task, and the subsequent scan has no owner/sharding condition. In a cluster, multiple management nodes will concurrently process the same batch of lastFlushFailed=1 records: at best duplicate alarms and duplicate enqueues, at worst staleRecoveryCount fills up prematurely and the circuit breaker trips too early.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/MetadataStaleRecoveryTask.java`
around lines 75 - 111, The periodic recovery task is started unconditionally in
managementNodeReady() causing every management node to run recoverStaleVms()
concurrently; modify startTask()/managementNodeReady() so only the elected/owner
management node runs the periodic task (or keep the task scheduled but add an
ownership/leader check at the start of the PeriodicTask.run), e.g. before
submitting via thdf.submitPeriodicTask or inside the run() wrapper verify node
is the cluster leader/owner and skip calling recoverStaleVms() when not,
ensuring taskFuture is started/cancelled only on the owner and the
VmGlobalConfig.VM_METADATA_STALE_RECOVERY_INTERVAL behavior remains unchanged.
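A minimal, self-contained sketch of the ownership guard the prompt describes — `isLeader` is a placeholder for whatever leader-election facility the deployment provides, and the ZStack scheduler types (`PeriodicTask`, `thdf`) are replaced with plain Java, so this illustrates the control flow only:

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: keep the periodic schedule on every node, but gate the
// body on a leadership check so only one management node does the recovery scan.
public class LeaderGatedTask {
    private final BooleanSupplier isLeader;     // stands in for a cluster-leader check
    private final Runnable recoverStaleVms;     // the real scan body
    private int skipped = 0;
    private int executed = 0;

    public LeaderGatedTask(BooleanSupplier isLeader, Runnable recoverStaleVms) {
        this.isLeader = isLeader;
        this.recoverStaleVms = recoverStaleVms;
    }

    // Called on every tick of the periodic scheduler.
    public void runOnce() {
        if (!isLeader.getAsBoolean()) {
            skipped++;   // non-owner nodes keep the schedule but do no work
            return;
        }
        executed++;
        recoverStaleVms.run();
    }

    public int getSkipped() { return skipped; }
    public int getExecuted() { return executed; }
}
```

The alternative in the prompt — starting/cancelling taskFuture only on the owner — avoids the idle wake-ups but requires reacting to leadership changes; the in-body check above is the simpler variant.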

Comment on lines +647 to +655
bus.send(msg, new CloudBusCallBack(null) {
    @Override
    public void run(MessageReply reply) {
        if (reply.isSuccess()) {
            onFlushSuccess(vmUuid, snapshotVersion);
        } else {
            onFlushFailure(vmUuid, reply.getError());
        }
        chainNext.run();

⚠️ Potential issue | 🔴 Critical

Make sure chainNext.run() always executes.

If onFlushSuccess() / onFlushFailure() throws any runtime exception, chainNext.run() never runs: the per-VM chain stalls and globalFlushInFlight leaks. At minimum, wrap the hand-off in try/finally.

💡 Suggested change
         bus.send(msg, new CloudBusCallBack(null) {
             @Override
             public void run(MessageReply reply) {
-                if (reply.isSuccess()) {
-                    onFlushSuccess(vmUuid, snapshotVersion);
-                } else {
-                    onFlushFailure(vmUuid, reply.getError());
-                }
-                chainNext.run();
+                try {
+                    if (reply.isSuccess()) {
+                        onFlushSuccess(vmUuid, snapshotVersion);
+                    } else {
+                        onFlushFailure(vmUuid, reply.getError());
+                    }
+                } finally {
+                    chainNext.run();
+                }
             }
         });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 647 - 655, The callback in bus.send (CloudBusCallBack.run) must
guarantee chainNext.run() executes even if onFlushSuccess(vmUuid,
snapshotVersion) or onFlushFailure(vmUuid, reply.getError()) throws; update the
run(...) method to invoke onFlushSuccess/onFlushFailure inside a try block and
call chainNext.run() in a finally block (optionally log any exception thrown),
so the per-VM chain and globalFlushInFlight cannot get stuck; locate the run
method in the anonymous CloudBusCallBack and change the structure to try { if
(reply.isSuccess()) onFlushSuccess(vmUuid, snapshotVersion); else
onFlushFailure(vmUuid, reply.getError()); } finally { chainNext.run(); }.

Comment on lines +739 to +748
// C-SR-05: mark lastFlushFailed=true on the PathFingerprintVO
SQL.New("UPDATE VmMetadataPathFingerprintVO " +
        "SET lastFlushFailed = 1 WHERE vmInstanceUuid = :vmUuid")
        .param("vmUuid", vmUuid)
        .execute();

// delete the dirty row (frees poller resources); stale recovery is taken over by a separate task
SQL.New(VmMetadataDirtyVO.class)
        .eq(VmMetadataDirtyVO_.vmInstanceUuid, vmUuid)
        .delete();

⚠️ Potential issue | 🔴 Critical

A first-time flush failure deletes the recovery clue along with the dirty row.

The code UPDATEs the fingerprint table first, then unconditionally deletes the dirty row. For a VM that has never had a successful flush, no VmMetadataPathFingerprintVO row exists yet, so lastFlushFailed is never persisted; once the dirty row is deleted, stale-recovery has no entry point left.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@compute/src/main/java/org/zstack/compute/vm/metadata/VmMetadataDirtyMarker.java`
around lines 739 - 748, The current flow updates VmMetadataPathFingerprintVO
(setting lastFlushFailed) then unconditionally deletes VmMetadataDirtyVO rows;
if the fingerprint row does not exist (first-ever flush) the UPDATE is a no-op
and the delete removes the only recovery clue. Change the logic in the code
around the SQL.New("UPDATE VmMetadataPathFingerprintVO ...").param("vmUuid",
vmUuid).execute() and the
SQL.New(VmMetadataDirtyVO.class).eq(VmMetadataDirtyVO_.vmInstanceUuid,
vmUuid).delete() call so that you either (a) ensure a
VmMetadataPathFingerprintVO is created/inserted with lastFlushFailed=true when
none exists before deleting dirty rows, or (b) skip deleting the
VmMetadataDirtyVO row unless the fingerprint row existed/was successfully
updated (i.e., check existence of VmMetadataPathFingerprintVO by vmInstanceUuid
or use the UPDATE result to decide). Ensure you reference
VmMetadataPathFingerprintVO, VmMetadataDirtyVO, lastFlushFailed and
vmInstanceUuid/vmUuid when making the change.
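The "insert the fingerprint if the UPDATE touched nothing, then delete the dirty row" flow from the options above can be illustrated with an in-memory stand-in for the two tables. The field names mirror VmMetadataPathFingerprintVO and VmMetadataDirtyVO, but the SQL layer is simulated with plain collections, so this shows the control flow only:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: guarantee a lastFlushFailed=true fingerprint row exists
// before the dirty row is released, so stale-recovery always has an entry point.
public class FlushFailureBookkeeping {
    // vmUuid -> lastFlushFailed flag (stands in for VmMetadataPathFingerprintVO)
    final Map<String, Boolean> fingerprints = new HashMap<>();
    // vmUuids with a pending dirty row (stands in for VmMetadataDirtyVO)
    final Set<String> dirtyRows = new HashSet<>();

    public void onFlushFailure(String vmUuid) {
        // UPDATE ... SET lastFlushFailed = 1 — detect whether any row was touched
        boolean updated = fingerprints.computeIfPresent(vmUuid, (k, v) -> true) != null;
        if (!updated) {
            // first-ever flush: no fingerprint row yet, insert one so the
            // stale-recovery task can still find this VM
            fingerprints.put(vmUuid, true);
        }
        // only now is it safe to release the dirty row
        dirtyRows.remove(vmUuid);
    }
}
```

In the real code the same effect comes from checking the UPDATE's affected-row count (or an existence query on vmInstanceUuid) and inserting the fingerprint row in the zero-rows case before the delete.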

Comment on lines +9427 to +9450
// 1. Build the payload (done by VmMetadataBuilder inside a @Transactional(readOnly=true) transaction)
String metadata = vmMetadataBuilder.buildVmInstanceMetadata(msg.getUuid());

// 2. Payload size protection
int payloadSize = metadata.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
if (payloadSize > org.zstack.compute.vm.metadata.VmMetadataBuilder.REJECT_THRESHOLD) {
    logger.error(String.format("metadata payload too large: %d bytes for vm[uuid:%s], rejecting",
            payloadSize, msg.getUuid()));
    MessageReply reply = new MessageReply();
    reply.setError(Platform.operr("metadata payload too large (%d bytes, limit %d) for vm[uuid=%s]",
            payloadSize, org.zstack.compute.vm.metadata.VmMetadataBuilder.REJECT_THRESHOLD, msg.getUuid()));
    bus.reply(msg, reply);
    return;
}
if (payloadSize > org.zstack.compute.vm.metadata.VmMetadataBuilder.WARN_THRESHOLD) {
    logger.warn(String.format("metadata payload large: %d bytes for vm[uuid:%s]",
            payloadSize, msg.getUuid()));
}

// 3. Send to the primary storage
Tuple tuple = Q.New(VolumeVO.class).select(VolumeVO_.primaryStorageUuid, VolumeVO_.uuid)
        .eq(VolumeVO_.vmInstanceUuid, msg.getUuid()).eq(VolumeVO_.type, VolumeType.Root).findTuple();
String primaryStorageUuid = tuple.get(0, String.class);
String rootVolumeUuid = tuple.get(1, String.class);

⚠️ Potential issue | 🔴 Critical

Neither the metadata nor the Root volume can be assumed to exist here.

VmMetadataBuilder.buildVmInstanceMetadata() can return null when the VM has been deleted, no longer qualifies, or the payload was rejected by the builder's own threshold; the getBytes() call at line 9431 will NPE first, so the reject branch below never runs. Likewise, lines 9449 and 9450 assume findTuple() always hits; if a flush races with a delete, the code throws and exits without ever sending a reply. At minimum, add explicit no-op / error branches for null metadata and a null tuple.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java` around lines
9427 - 9450, The code assumes VmMetadataBuilder.buildVmInstanceMetadata(...)
always returns a non-null metadata and that Q.New(...).findTuple() always finds
a root volume; add null/empty checks: after calling
VmMetadataBuilder.buildVmInstanceMetadata(msg.getUuid()) verify metadata != null
and if null either reply with an error via bus.reply(msg, reply) (using
Platform.operr(...) like the existing reject branch) or treat it as a no-op and
reply success, depending on intended semantics; guard the payloadSize
calculation (metadata.getBytes(...)) behind that check and preserve the
REJECT_THRESHOLD/WARN_THRESHOLD logic. Likewise, after Tuple tuple =
Q.New(VolumeVO.class)...findTuple() verify tuple is not null and contains
expected elements before calling tuple.get(0, ...) / get(1, ...); if missing,
log and reply with an error (or no-op reply) so the code does not NPE or throw
when the root volume was concurrently deleted.
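The guard ordering the comment asks for can be sketched as a pure decision function. `Outcome` and `decide` are illustrative names, not the real ZStack reply API; the point is that the null checks come before any dereference, and each missing precondition maps to an explicit reply path:

```java
// Hypothetical sketch: treat a null metadata payload or a missing root-volume
// tuple as explicit reply outcomes instead of letting an NPE escape the handler.
public class FlushGuards {
    public enum Outcome { REPLIED_NOOP, REPLIED_ERROR, SENT }

    public static Outcome decide(String metadata, Object rootVolumeTuple, int rejectThreshold) {
        if (metadata == null) {
            // VM deleted or builder declined: reply (no-op) instead of NPE on getBytes()
            return Outcome.REPLIED_NOOP;
        }
        int payloadSize = metadata.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
        if (payloadSize > rejectThreshold) {
            // oversized payload: reply with an error, matching the existing reject branch
            return Outcome.REPLIED_ERROR;
        }
        if (rootVolumeTuple == null) {
            // root volume deleted concurrently with the flush: reply with an error
            return Outcome.REPLIED_ERROR;
        }
        return Outcome.SENT;   // safe to send to the primary storage
    }
}
```

Whether null metadata should be a success no-op or an error depends on the intended semantics noted in the prompt; the structure is the same either way.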

Comment on lines +9 to +14
@RestRequest(
path = "/vm-instances/{vmUuid}/metadata",
method = HttpMethod.GET,
responseClass = APIReadVmInstanceMetadataReply.class
)
public class APIReadVmInstanceMetadataMsg extends APISyncCallMessage implements VmInstanceMessage {

⚠️ Potential issue | 🔴 Critical

Missing @MetadataImpact, inconsistent with this PR's goal.

This is also a newly added APISyncCallMessage, yet the class carries no @MetadataImpact annotation. Since this change makes CI enforce the check on all APIMessage subclasses, leaving the annotation off creates an immediate gap; a read-only API should be classified as non-mutating.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@header/src/main/java/org/zstack/header/vm/APIReadVmInstanceMetadataMsg.java`
around lines 9 - 14, add the missing `@MetadataImpact` annotation on
APIReadVmInstanceMetadataMsg to mark it as a read-only/non-mutating metadata API
(classifying this APISyncCallMessage/APIMessage subclass as non-mutating), in line
with this change's CI check on all APIMessage subclasses. Locate the class
APIReadVmInstanceMetadataMsg and add `@MetadataImpact(...)` before the class
definition, using the enum value that denotes "read-only/non-mutating" (i.e.,
classify the read API as non-mutating).

Comment on lines +951 to +954
} else if (msg instanceof APIRegisterVmInstanceMsg) {
    handle((APIRegisterVmInstanceMsg) msg);
} else if (msg instanceof APIGetVmInstanceMetadataFromPrimaryStorageMsg) {
    handle((APIGetVmInstanceMetadataFromPrimaryStorageMsg) msg);

⚠️ Potential issue | 🔴 Critical

Add the missing local dispatch for GetVmInstanceMetadataFromPrimaryStorageMsg.

The API entry point already wires up APIGetVmInstanceMetadataFromPrimaryStorageMsg, but handleLocalMessage() has no branch for the corresponding GetVmInstanceMetadataFromPrimaryStorageMsg. As it stands, once this message reaches the PrimaryStorage service it falls straight through to bus.dealWithUnknownMessage(msg), and the backend implementation is never invoked.

🔧 Suggested fix
@@
         } else if (msg instanceof GetVolumeSnapshotEncryptedOnPrimaryStorageMsg) {
             handle((GetVolumeSnapshotEncryptedOnPrimaryStorageMsg) msg);
+        } else if (msg instanceof GetVmInstanceMetadataFromPrimaryStorageMsg) {
+            handle((GetVmInstanceMetadataFromPrimaryStorageMsg) msg);
         } else if (msg instanceof DeleteVolumeChainOnPrimaryStorageMsg) {
             handle((DeleteVolumeChainOnPrimaryStorageMsg) msg);
         } else if (msg instanceof CleanUpStorageTrashOnPrimaryStorageMsg) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java`
around lines 951 - 954, PrimaryStorageBase's local dispatcher misses handling of
the internal GetVmInstanceMetadataFromPrimaryStorageMsg, so messages of that
type fall through to bus.dealWithUnknownMessage and the backend handler is never
invoked; in handleLocalMessage() add a branch like else if (msg instanceof
GetVmInstanceMetadataFromPrimaryStorageMsg) {
handle((GetVmInstanceMetadataFromPrimaryStorageMsg) msg); } (matching the
existing APIGetVmInstanceMetadataFromPrimaryStorageMsg -> handle(...) pattern)
so the internal message is routed to the existing
handle(GetVmInstanceMetadataFromPrimaryStorageMsg) method.

Comment on lines +1851 to +1855
@Override
public void fail(ErrorCode errorCode) {
    bus.publish(event);
    chain.next();
}

⚠️ Potential issue | 🔴 Critical

The failure branch never propagates the error back to the API caller.

fail() only publishes the event and never calls event.setError(errorCode). When registration fails, the API layer may receive a response that looks successful but carries no inventory, and the error is silently swallowed.

🔧 Suggested fix
                     @Override
                     public void fail(ErrorCode errorCode) {
+                        event.setError(errorCode);
                         bus.publish(event);
                         chain.next();
                     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@Override
public void fail(ErrorCode errorCode) {
    bus.publish(event);
    chain.next();
}
@Override
public void fail(ErrorCode errorCode) {
    event.setError(errorCode);
    bus.publish(event);
    chain.next();
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@storage/src/main/java/org/zstack/storage/primary/PrimaryStorageBase.java`
around lines 1851 - 1855, 在 PrimaryStorageBase.java 的该匿名 Callback 的
fail(ErrorCode errorCode) 分支中目前只调用了 bus.publish(event) 并继续
chain.next(),但没有把错误回传给 API。修改 fail 方法:在发布事件前调用
event.setError(errorCode)(并根据项目约定设置 event.setSuccess(false) 如果存在此方法),然后再
bus.publish(event) 并 chain.next(),以确保调用方能收到失败信息。
