Skip to content

<fix>[core]: cap client keep-alive below agent socket timeout#4393

Open
MatheMatrix wants to merge 1 commit into
4.8.0from
sync/moyu/TIC-5970
Open

<fix>[core]: cap client keep-alive below agent socket timeout#4393
MatheMatrix wants to merge 1 commit into
4.8.0from
sync/moyu/TIC-5970

Conversation

@MatheMatrix

Copy link
Copy Markdown
Owner

问题 (TIC-5970)

MN 通过 RESTFacadeImpl 的异步 HTTP 客户端(Apache HttpAsyncClient)向 agent 发命令时,从未设置 keep-alive 策略,连接池里的空闲连接寿命是"无限"(DefaultConnectionKeepAliveStrategy 在 server 不回 Keep-Alive 头时返回 -1)。

而 agent 的 cherrypy HTTP server 会在其 socket_timeout(默认 10s,agent 侧 fix 后 15s)关闭空闲 keep-alive 连接。当连接池复用一条 agent 已经关闭的连接时,复用请求撞上 RST,表现为 Connection reset by peer,被包进错误码 1015 Cannot make a HTTP request,导致云主机/物理机操作偶发失败。

根因:客户端 keep-alive 空闲时间 > 服务端 keep-alive 空闲时间,客户端成了"被动关闭方",会复用到服务端已关的连接。

修复

createAsyncRestTemplateHttpAsyncClients 补上 keep-alive 策略,把客户端连接的可复用寿命封顶到小于 agent 的 socket_timeout(新增全局属性 RESTFacade.keepAliveTimeMillis,默认 5000ms),并加后台 evictor 主动清理过期/空闲连接。这样客户端总是先于 agent 回收空闲连接,永远用新鲜连接,从机制上消除复用死连接导致的 RST。

  • core/.../rest/RESTFacadeImpl.java: setKeepAliveStrategy + 定期 closeExpiredConnections/closeIdleConnections
  • core/.../CoreGlobalProperty.java: 新增 RESTFacade.keepAliveTimeMillis(默认 5000)

验证

  • 全量编译通过(mvn 3.8.8,主仓单 reactor clean install BUILD SUCCESS)
  • 现有集成用例 RestFacadeCase 通过(Tests run: 1, Failures: 0, Errors: 0),覆盖 asyncJsonPost 失败记录路径
  • 抓包实证:加策略后客户端在 keep-alive 上限(远早于服务端超时)主动发 FIN 关闭超龄连接并换新连接复用,全程 0 RST

GlobalPropertyImpact: adds RESTFacade.keepAliveTimeMillis (default 5000)

Resolves: TIC-5970

sync from gitlab !10353

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 5 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2a164963-b627-4272-b78c-924a5d6f366c

📥 Commits

Reviewing files that changed from the base of the PR and between 12bb2b7 and 633cd3f.

📒 Files selected for processing (3)
  • core/src/main/java/org/zstack/core/CoreGlobalProperty.java
  • core/src/main/java/org/zstack/core/rest/RESTFacadeImpl.java
  • test/src/test/groovy/org/zstack/test/integration/core/rest/RestFacadeKeepAliveCase.groovy
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/moyu/TIC-5970

Comment @coderabbitai help to get the list of available commands.

@MatheMatrix MatheMatrix force-pushed the sync/moyu/TIC-5970 branch from f03d31a to 8768bd8 Compare July 1, 2026 05:24
@ZStack-Robot

Copy link
Copy Markdown
Collaborator

Comment from moyu:

感谢 review,已在 d1b7c121 修复,逐条回应:

🟡 Warning 1(生命周期泄漏)— 已修复

  • connectionEvictor 提升为 RESTFacadeImpl 字段;createAsyncRestTemplate 改为实例方法,直接给字段赋值。
  • 新增 void destroy()(bean 已配 default-destroy-method="destroy"):先 connectionEvictor.shutdownNow(),再对 asyncRestTemplate.getAsyncRequestFactory()HttpComponentsAsyncClientHttpRequestFactorydestroy(),连带关闭 async client 与 connection manager。这样容器/测试重建不再残留 daemon evictor 线程和旧 connectionManager。

🟢 Suggestion(注释去 ticket 前缀)— 已修复

  • 删除了 RESTFacadeImpl.java / CoreGlobalProperty.java / RestFacadeKeepAliveCase.groovy 里注释的 TIC-5970: 前缀,只保留技术约束描述。

🟡 Warning 2(测试防回归)— 已补强(部分)

  • 把 cap 决策抽成可测方法 cappedKeepAlive(serverDuration, capMs),并在 RestFacadeKeepAliveCase 加了边界断言(agent 不回头/回更大值 → 封顶;回更小值 → 保留)。cap 逻辑被改坏会被这个用例挡住。
  • 关于"复现复用已被服务端关闭的 idle connection"的完整回归:需要同时把客户端 cap 降到很小 + 起一个按短 idle 主动关连接的可控 server + 数秒等待,在当前 simulator(共享 Jetty,无法按用例定制 idle-close)里既慢又易 flaky,故未纳入。该运行时行为已用抓包实证:加策略后客户端在 cap(远早于 agent 超时)主动发 FIN 关闭超龄连接、下次请求换新连接、全程 0 RST。如果希望我用裸 ServerSocket 起一个自定义短超时 server 补一个端到端回归(接受多秒耗时),可以再加,请示下。

已本地验证:core 全量编译通过;RestFacadeCaseRestFacadeKeepAliveCaseTests run: 1, Failures: 0, Errors: 0

@MatheMatrix MatheMatrix force-pushed the sync/moyu/TIC-5970 branch from 8768bd8 to 633cd3f Compare July 1, 2026 05:52
MN's async RESTFacade HTTP client (Apache HttpAsyncClient) never set a
keep-alive strategy, so idle connections in the pool live forever: the
default strategy returns -1 when the agent sends no Keep-Alive header.
The agent cherrypy HTTP server closes idle keep-alive connections at its
socket_timeout (10s default, 15s after the agent-side fix). When the
pool reuses a connection the agent has already closed, the reused request
hits a RST and surfaces as "Connection reset by peer" wrapped in error
1015 "Cannot make a HTTP request", causing intermittent VM/host
operation failures.

Cap the client keep-alive duration via a new global property
RESTFacade.keepAliveTimeMillis (default 5000ms, below the agent
socket_timeout) and proactively evict expired/idle connections, so the
client always recycles an idle connection before the agent closes it.

GlobalPropertyImpact: adds RESTFacade.keepAliveTimeMillis (default 5000)
Resolves: TIC-5970

Change-Id: I7a656a6b776876726667696c657a7669657a6664
@MatheMatrix MatheMatrix force-pushed the sync/moyu/TIC-5970 branch from 633cd3f to 22540b8 Compare July 1, 2026 06:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants