DAOS-15993 rebuild: for manual rebuilds do not eval self_heal #17345

Merged
daltonbohning merged 7 commits into master from kccain/daos_15993
Jan 29, 2026

@kccain
Contributor

@kccain kccain commented Jan 6, 2026

Consider a quick maintenance scenario in which a daos_engine is stopped briefly, and the administrator does not wish to have the DAOS automatic recovery / rebuild mechanism occur. That is, a pool map update (targets from UP_IN to DOWN) is to occur, the pool to enter a degraded mode (still allowing ongoing I/O), and NO rebuild to be triggered during this brief time window.

The above can be arranged by modifying the system or pool-specific self_heal property value (to not set the rebuild bit), and then stopping the engine.

Now also consider the conclusion of the maintenance, which involves restarting the engine and reintegrating that rank back into the pool. It is most convenient to directly issue a dmg pool reintegrate command from the maintenance state.

Before this change, manual administration commands such as dmg pool exclude/reintegrate were prevented from triggering rebuilds due to the pool self_heal property setting. However, the intention of the self_heal (aka auto recovery) feature is to only apply to automatic rebuilds.

With this change, the is_pool_rebuild_allowed() function is updated to accept an indication of whether the self_heal checks are applicable. Manual pool map update and rebuild cases supply false for this argument (allowing those cases to result in a rebuild being scheduled).
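
A minimal sketch of the gating change described above, not the actual DAOS code. The flag values, `struct pool_model`, and `rebuild_allowed()` are hypothetical stand-ins for the `DAOS_SELF_HEAL_*` bits, `struct ds_pool`, and `is_pool_rebuild_allowed()`; the parameter is called `auto_recovery` here, the name the review later settles on. It only illustrates that the self_heal bits are consulted when the caller says the rebuild is automatic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the real DAOS_SELF_HEAL_* definitions may differ. */
#define SH_AUTO_EXCLUDE  (1ULL << 0)
#define SH_AUTO_REBUILD  (1ULL << 1)
#define SH_DELAY_REBUILD (1ULL << 2)

/* Hypothetical stand-in for the fields of struct ds_pool used here. */
struct pool_model {
	bool     disable_rebuild; /* rebuild disabled outright for this pool */
	uint64_t self_heal;       /* pool self_heal property bits */
};

/*
 * Sketch of the updated check: self_heal is consulted only when the caller
 * says the rebuild would be automatic (auto_recovery == true).
 */
static bool
rebuild_allowed(const struct pool_model *pool, bool auto_recovery)
{
	if (pool->disable_rebuild)
		return false;
	if (auto_recovery &&
	    !(pool->self_heal & (SH_AUTO_REBUILD | SH_DELAY_REBUILD)))
		return false;
	return true;
}

/* Maintenance scenario: self_heal is "exclude" only (rebuild bit not set). */
static void
maintenance_scenario(void)
{
	struct pool_model pool = {.disable_rebuild = false, .self_heal = SH_AUTO_EXCLUDE};

	assert(!rebuild_allowed(&pool, true)); /* automatic exclude: no rebuild */
	assert(rebuild_allowed(&pool, false)); /* manual reintegrate: rebuild */
}
```

In this model, stopping the engine with self_heal set to exclude-only leaves the pool degraded without a rebuild, while a later manual reintegrate is still allowed to schedule one.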

Features: rebuild pool

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Consider a quick maintenance scenario in which a daos_engine
is stopped briefly, and the administrator does not wish to have
the DAOS automatic recovery / rebuild mechanism occur. That is,
a pool map update (targets from UP_IN to DOWN) is to occur, the
pool to enter a degraded mode (still allowing ongoing I/O), and
NO rebuild to be triggered during this brief time window.

The above can be arranged by modifying the system or pool-specific
self_heal property value (to not set the rebuild bit), and then
stopping the engine.

Now also consider the conclusion of the maintenance, which involves
restarting the engine and reintegrating that rank back into the pool.
It is most convenient to directly issue a dmg pool reintegrate command
from the maintenance state.

Before this change, manual administration commands such as
dmg pool exclude/reintegrate were prevented from triggering rebuilds
due to the pool self_heal property setting. However, the intention
of the self_heal (aka auto recovery) feature is to only apply
to automatic rebuilds.

With this change, the is_pool_rebuild_allowed() function is updated
to accept an indication of whether the self_heal checks are applicable.
Manual pool map update and rebuild cases supply false for this argument
(allowing those cases to result in a rebuild being scheduled).

Features: rebuild pool

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@github-actions

github-actions Bot commented Jan 6, 2026

Ticket title is 'pool reintegrate issue when pool property self_heal:exclude (no rebuild)'
Status is 'In Review'
Labels: 'scrubbed_2.8,triaged'
https://daosio.atlassian.net/browse/DAOS-15993

@daosbuild3
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17345/1/display/redirect

Contributor

@liw liw left a comment

Looks good in general; one "[question]" needs an answer before I approve this PR.

Comment thread src/include/daos_srv/pool.h Outdated

 static inline bool
-is_pool_rebuild_allowed(struct ds_pool *pool, bool check_delayed_rebuild)
+is_pool_rebuild_allowed(struct ds_pool *pool, uint64_t self_heal, bool self_heal_applicable,
Contributor

@liw liw Jan 7, 2026

[Nit] Would it be any clearer to name self_heal_applicable auto?

Contributor Author

Yes I think so. Changed to auto_recovery.

Comment thread src/include/daos_srv/rebuild.h Outdated
 int
 ds_rebuild_regenerate_task(struct ds_pool *pool, daos_prop_t *prop, uint64_t sys_self_heal,
-			   uint64_t delay_sec);
+			   bool self_heal_applicable, uint64_t delay_sec);
Contributor

@liw liw Jan 7, 2026

[Nit] self_heal_applicable or perhaps auto? No strong opinion though.

Contributor Author

Changed to auto_recovery.

Comment thread src/pool/srv_pool.c Outdated

-rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal, 0);
+rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal,
+				true /* self_heal_applicable */, 0 /* delay_sec*/);
Contributor

[Nit] A missing space between delay_sec and */.

Contributor Author

Thanks, fixed.

Comment thread src/pool/srv_pool.c Outdated
self_heal_applicable = (opc == MAP_EXCLUDE && src == MUS_SWIM);

if (sys_self_heal_applicable) {
/* do not update pool map if system.self_heal is applicable but does not enable exclude */
Contributor

[Nit] The comment isn't that helpful; if one must be added, I think a more concise one like "If applicable, check the system self-heal policy." might be more helpful.

Contributor Author

Thanks, simplified.

Comment thread src/pool/srv_pool.c Outdated
}
}

/* Update pool map if pool.self_heal is applicable and enables exclude. */
Contributor

[Nit] Could we say "The pool self-heal policy is checked by the following call." instead? Considering that the call performs many other checks too...

Contributor Author

Agree, fixed.

Comment thread src/pool/srv_pool.c Outdated
d_freeenv_str(&env);

if (sys_self_heal_applicable && !(sys_self_heal & DS_MGMT_SELF_HEAL_POOL_REBUILD)) {
/* Do not trigger rebuild if system.self_heal is applicable but does not enable rebuild. */
Contributor

[Nit] The log message has already said it.

Contributor Author

Removed to de-duplicate the information in the source code.

Comment thread src/pool/srv_pool.c Outdated
}

if (!is_pool_rebuild_allowed(svc->ps_pool, true)) {
/* Do not trigger rebuild if pool.self_heal is applicable but does not enable rebuild. */
Contributor

[Nit] The log message has already said it.

Contributor Author

Removed to de-duplicate the information in the source code.

Comment thread src/include/daos_srv/pool.h Outdated
 if (pool->sp_disable_rebuild)
 	return false;
-if (!(pool->sp_self_heal & flags))
+if (self_heal_applicable && !(self_heal & flags))
Contributor

[Question] Does this change affect the delayed rebuild case? I don't know the answer; just making sure this has been considered.

Contributor Author

I don't think so. What I've done is remove that argument, since no callers currently specify anything other than true.

Also I did experiment with some manual testing of a pool whose self_heal property value was "exclude;delay_rebuild" and it seemed to work as expected (no exclude rebuilds occur, deferring until a subsequent reintegrate).

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/2/execution/node/1100/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/2/execution/node/1141/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/3/testReport/

@daosbuild3
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17345/5/display/redirect

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/7/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/7/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/8/testReport/

to prevent rebuild from starting (and finishing) that
affects the verification logic of this test case.

Features: rebuild pool

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@kccain
Contributor Author

kccain commented Jan 16, 2026

@liuxuezhao could you help evaluate this aspect of the code change / impact on tests?

daos_degrade_ec.c test cases configure pool self_heal:"exclude" (no rebuild), and use "dmg pool exclude" to cause the pool map update. However, with this PR, we will actually be performing both the pool map update and triggering a rebuild. This affected the DEGRADE24 degrade_ec_partial_update_agg test in CI testing with the first version of this patch.

In general, it may be a concern that rebuild is triggered in so many test cases that (ideally) may not want a rebuild running (and potentially finishing too early? -- taking the pool out of "degraded mode" that it needs to be in for testing).

@daosbuild3
Collaborator

daosbuild3 commented Jan 19, 2026

@daosbuild3
Collaborator

daosbuild3 commented Jan 20, 2026

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/10/testReport/

rebuild/widely_striped.py failure (pool query timed out with 5 minute deadline) seems to be an instance of existing issue https://daosio.atlassian.net/browse/DAOS-18302

@kccain kccain marked this pull request as ready for review January 20, 2026 19:12
@kccain kccain requested review from a team as code owners January 20, 2026 19:12
Comment thread src/include/daos_srv/pool.h Outdated
-if (!(pool->sp_self_heal & flags))

+if (auto_recovery &&
+    !(self_heal & (DAOS_SELF_HEAL_AUTO_REBUILD | DAOS_SELF_HEAL_DELAY_REBUILD)))
Contributor

[Nit] It takes some time to think this condition through; perhaps add some comments and restructure the code as:

/* If auto recovery is requested, only allow if self_heal enables auto or delay rebuild */
if (auto_recovery &&
!(self_heal & DAOS_SELF_HEAL_AUTO_REBUILD || self_heal & DAOS_SELF_HEAL_DELAY_REBUILD))
return false;

/* Otherwise, rebuild is allowed */
return true;

Contributor Author

Good idea, should be addressed in the latest version of the patch

wangshilong previously approved these changes Jan 21, 2026
@kccain kccain requested a review from liw January 21, 2026 11:36
liw previously approved these changes Jan 22, 2026
Comment thread src/pool/srv_pool.c

-rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal, 0);
+rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal, true /* auto_recovery */,
+				0 /* delay_sec */);
Contributor

This may need some discussion.
The pool map may contain UP ranks that were put there by an admin's "dmg pool reint" command, which is expected to trigger a rebuild. If the system restarts or the PS leader changes, could the new PS leader, with this change, fail to trigger the rebuild for that reintegration after it steps up?
We probably need to check a few details and discuss a good way to handle it.
I'll leave a -1 for now and come back to it once this is clear.

Contributor

@liuxuezhao liuxuezhao Jan 22, 2026

Sorry, I checked some details and it looks like my comment above is incorrect; please ignore it.
We do have some issues in ds_rebuild_regenerate_task(), but they are unrelated to this PR; I can consider refining that later.
Could you address the daos_degrade_ec.c test issue? See the other comment in the test code.

Contributor Author

Right, it seems if pool self_heal=exclude (no rebuild) it will prevent targets in down or draining state from rebuilding during pool service step up. But the reintegrations will proceed.

I do see a possible problem at the top of the function, where it checks the system self_heal policy: if that policy does not include "pool_rebuild", it could prevent an interrupted reintegration rebuild from restarting.

Contributor

@liuxuezhao liuxuezhao Jan 22, 2026

Right, preventing REINT's rebuild is probably not the intention; we may want to confirm with @liw.

The complication is that triggering a rebuild for an UP target will actually also rebuild the DOWN targets,
and after a system restart there is no way to know whether a DOWN target was excluded by SWIM or by the admin.

Contributor Author

On system restart, for the first pool_svc_step_up_cb() can the engine know?

  • if targets in the DOWN state are that way because of a manual exclude, or an automatic exclude (SWIM, or NVMe fault) before the restart
  • if targets in the DRAINING state are that way because of a manual drain, or automatic (CSUM scrubber) before the restart
  • if targets in the UP state are that way because of a manual reintegrate, or automatic (NVMe hotplug reintegration) before the restart

Is it any different for a PS leader change (as opposed to engine/system restart)? I forget, but maybe the SWIM / cart event history can be used in some cases @liw ? Sorry if I have terminology wrong here.

The easy assumption I guess is to assume that step_up is always an "automatic_recovery" case and do not start any rebuilds if the system self_heal and pool self_heal properties do not enable rebuild. In theory, that would apply to targets in DOWN, DRAINING, and UP states - so it would prevent all rebuilding - even if they were manually started before.

Contributor

@liuxuezhao, @kccain, I remember we've considered this point earlier last year. It would require recording the source info persistently, either inside or along with the pool map---sounds like an overkill. So my understanding is that we always treat pool_svc_step_up_cb as "auto".

Currently, however, there happens to be a theorem that we only have automatic exclusion (i.e., no automatic reint, drain, etc.). It might be possible to improve the decisions on whether a map comp status is due to an "auto" or "manual" change. That said, if we do that, we will need to undo it once we introduce auto reint, extend, etc.

I'd vote for just treating pool_svc_step_up_cb as "auto".
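
The outcome of this thread can be sketched as a small classification helper. This is an illustrative model only: `MAP_EXCLUDE` and `MUS_SWIM` are named in the review, while `MAP_REINT`, `MAP_DRAIN`, `MUS_DMG`, `enum caller`, and `is_auto_recovery()` are hypothetical names, and the enum values are not the ones DAOS uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical enums; only MAP_EXCLUDE and MUS_SWIM appear in the review. */
enum map_op { MAP_EXCLUDE, MAP_REINT, MAP_DRAIN };
enum map_update_source { MUS_SWIM, MUS_DMG };

/* Which code path is asking whether self_heal applies (hypothetical). */
enum caller { CALLER_UPDATE_MAP, CALLER_STEP_UP };

static bool
is_auto_recovery(enum caller who, enum map_op opc, enum map_update_source src)
{
	/*
	 * The pool map does not record who caused a status change, so the
	 * pool service step-up path is always treated as automatic.
	 */
	if (who == CALLER_STEP_UP)
		return true;

	/* On a live map update, only SWIM-driven exclusion is automatic. */
	return opc == MAP_EXCLUDE && src == MUS_SWIM;
}
```

Under this model a manual reintegrate bypasses self_heal while the request is live, but a rebuild re-evaluated at step-up is subject to self_heal again, which is the trade-off accepted in the comments above.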

Comment thread src/tests/suite/daos_degrade_ec.c Outdated
rebuild_pools_ranks(&arg, 1, &rank, 1, false);
arg->no_rebuild = 1;
rebuild_pools_ranks(&arg, 1, &rank, 1, true /* kill */);
dmg_system_start_rank(dmg_config_file, rank);
Contributor

@liuxuezhao liuxuezhao Jan 22, 2026

This looks like it changes the original intention of the test.
Since this PR changes how self_heal controls rebuild, for daos_degrade_ec.c's use of self_heal:exclude we could provide another FAIL_LOC, or some other approach, to disable rebuild so that the test cases in daos_degrade_ec.c behave basically as before. What do you think?

If there is no better idea, one option is to add a FAIL_LOC and disable the rebuild in pool_svc_update_map(), at the same place that checks the "REBUILD_ENV_DISABLED" env, and then set the FAIL_LOC in the daos_degrade_ec.c test cases. Please also check whether other similar test cases use self_heal:exclude.

Contributor Author

this particular change here keeps the intention of the test - it causes exclude to occur, and no rebuild to run. So the pool stays in degraded mode while the test is performing its subsequent verifications.

yes, probably need to take a look at how to preserve the intention of the test cases in this file. I can think about it some more.

I think the problem now is with many of the other test cases that perform a "dmg pool exclude" (by specifying rebuild_pools_ranks(..., kill=false)). With this patch, those test cases actually perform both the exclude and kick off a rebuild, so the pool is in degraded mode only for a limited time window. If the tests are lucky, I guess the verifications will work OK because the pool doesn't finish its rebuild during the verification?

Contributor

If a test case was originally configured with self_heal:exclude, let's keep the original behavior of not triggering a rebuild for it, if possible.
That can be achieved by calling daos_debug_set_params() with a newly defined fault-injection code that disables rebuild, if there is no better idea. FYI.
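
A self-contained sketch of the fault-injection idea suggested here, under stated assumptions: `FI_REBUILD_DISABLE`, `fail_loc`, `test_set_fail_loc()`, and `should_schedule_rebuild()` are hypothetical stand-ins, and the real engine-side check and the `daos_debug_set_params()` call are not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fault-injection code; the real encoding in DAOS differs. */
#define FI_REBUILD_DISABLE 0x1ULL

/* Stand-in for engine-side fault-injection state that a test can set. */
static uint64_t fail_loc;

/* Hypothetical test helper standing in for a daos_debug_set_params() call. */
static void
test_set_fail_loc(uint64_t value)
{
	fail_loc = value;
}

/* Sketch of the decision point in a pool map update. */
static bool
should_schedule_rebuild(bool allowed_by_policy)
{
	/* The test asked for the pool to stay degraded: never rebuild. */
	if (fail_loc == FI_REBUILD_DISABLE)
		return false;
	return allowed_by_policy;
}
```

A degraded-mode test would set the code in its setup function and clear it in teardown, so a manual exclude updates the pool map without a rebuild taking the pool out of degraded mode during verification.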

@liuxuezhao
Contributor

@liuxuezhao could you help evaluate this aspect of the code change / impact on tests?

daos_degrade_ec.c test cases configure pool self_heal:"exclude" (no rebuild), and use "dmg pool exclude" to cause the pool map update. However, with this PR, we will actually be performing both the pool map update and triggering a rebuild. This affected the DEGRADE24 degrade_ec_partial_update_agg test in CI testing with the first version of this patch.

In general, it may be a concern that rebuild is triggered in so many test cases that (ideally) may not want a rebuild running (and potentially finishing too early? -- taking the pool out of "degraded mode" that it needs to be in for testing).

Sorry, I did not see this before; I replied in the test code just now.

- improve clarity of is_pool_rebuild_allowed().
- in setup functions, use fault injection DAOS_REBUILD_DISABLE
  to prevent manual rebuild from occurring in degraded mode tests.

Features: pool rebuild

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@kccain kccain dismissed stale reviews from liw and wangshilong via 265d26d January 22, 2026 22:16
@daosbuild3
Collaborator

daosbuild3 commented Jan 23, 2026

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/11/execution/node/1364/log

  • container/boundary.py failures (pool create -DER_INVAL due to vos_pmemobj_create / vos_pool_create_ex failure): an instance of DAOS-18477 or possibly a cluster-specific issue hdr-[110,112-119] that has less memory than other HW Large clusters according to @phender
  • erasurecode/multiple_target_failure.py failure: an instance of DAOS-16766

@daosbuild3
Collaborator

daosbuild3 commented Jan 23, 2026

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/11/execution/node/1282/log

  • pool/list_verbose.py failure: is a known regression being worked in DAOS-18347
  • container/fill_destroy_loop.py is an instance of DAOS-18454
  • fault_injection/pool.py and nvme/health.py failures are pool create -DER_INVAL (due to vos_pmemobj_create / vos_pool_create_ex failure) and are suspiciously similar looking to the container/boundary.py failures seen in the Func_HW_Large_MD_on_SSD (see separate comment). A new ticket has been filed for several tests seeing the issue DAOS-18519

Contributor

@liuxuezhao liuxuezhao left a comment

There are quite a few failed tests; we may need to check whether they are related to this PR.

@kccain
Contributor Author

kccain commented Jan 26, 2026

There are quite a few failed tests; we may need to check whether they are related to this PR.

Thanks for the review. I think the failures are unrelated, and are associated with the PR using Features: rebuild pool in the commit message (running some nightly/weekly regression tests, some of which have known failures).

@kccain
Contributor Author

kccain commented Jan 26, 2026

Proposal - let's do final review in parallel to my merge of latest master, and (presumed) final test build (build 13 without Features pragma). Build 11 did not reveal any new regressions caused by this patch.

@kccain kccain requested review from liw and wangshilong January 26, 2026 21:47
Comment on lines +572 to +573
bool auto_rebuild_enabled = self_heal & DAOS_SELF_HEAL_AUTO_REBUILD;
bool delay_rebuild_enabled = self_heal & DAOS_SELF_HEAL_DELAY_REBUILD;
Contributor

Strictly speaking, these require !! to make sure the cast to bool retains "nonzero-ness". (Of course, the current flags will work because they are not defined with higher bits. But I would normally consider this a defect.)

Contributor

Wait, I may be wrong...

Contributor

Yes, I'm wrong: We use standard bool now, which is special and does not require !!. Sorry for the noise.
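
The point being retracted and confirmed here can be checked with a small example. It assumes an 8-bit `unsigned char`; `HIGH_FLAG`, `legacy_bool`, and the three helper functions are illustrative names, not DAOS definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HIGH_FLAG (1ULL << 40) /* illustrative bit above the low 8 bits */

/* Older code bases sometimes defined a boolean as a small integer type. */
typedef unsigned char legacy_bool;

static bool
flag_set_std(uint64_t flags)
{
	/* Conversion to the standard bool yields 1 for any nonzero value. */
	return flags & HIGH_FLAG;
}

static legacy_bool
flag_set_legacy(uint64_t flags)
{
	/* Conversion to unsigned char keeps only the low bits, losing bit 40. */
	return (legacy_bool)(flags & HIGH_FLAG);
}

static legacy_bool
flag_set_legacy_fixed(uint64_t flags)
{
	/* !! normalizes to 0 or 1 before the narrowing conversion. */
	return (legacy_bool)!!(flags & HIGH_FLAG);
}
```

With the standard `bool` the plain assignment in the PR is safe for any flag bit; `!!` only matters when the destination is an integer type narrower than the flag word.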

@daosbuild3
Collaborator

daosbuild3 commented Jan 28, 2026

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/13/execution/node/1141/log

daos_test/suite.py test_daos_rebulid_interactive failure is an intermittent failure, known issue DAOS-18501

I'll re-trigger functional hw testing to try to get a cleaner test result (with just NLT failures that are affecting many PRs currently)

@kccain kccain requested a review from a team January 29, 2026 18:12
@kccain kccain added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jan 29, 2026
@kccain
Contributor Author

kccain commented Jan 29, 2026

@daos-stack/daos-gatekeeper requesting forced landing for the NLT failures that are being experienced by most PRs. Build 11 included Features: rebuild pool and encountered only known failures, with no new regressions seen. Latest builds after that ran per-PR tests only.

@daltonbohning daltonbohning merged commit c2d1129 into master Jan 29, 2026
39 of 41 checks passed
@daltonbohning daltonbohning deleted the kccain/daos_15993 branch January 29, 2026 18:22
Labels

forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.

6 participants