DAOS-15993 rebuild: for manual rebuilds do not eval self_heal#17345
DAOS-15993 rebuild: for manual rebuilds do not eval self_heal#17345daltonbohning merged 7 commits intomasterfrom
Conversation
Consider a quick maintenance scenario in which a daos_engine is stopped briefly, and the administrator does not wish to have the DAOS automatic recovery / rebuild mechanism occur. That is, a pool map update (targets from UP_IN to DOWN) is to occur, the pool to enter a degraded mode (still allowing ongoing I/O), and NO rebuild to be triggered during this brief time window. The above can be arranged by modifying the system or pool-specific self_heal property value (to not set the rebuild bit), and then stopping the engine. Now also consider the conclusion of the maintenance that involes re-starting the engine, and reintegrating that rank back into the pool. It is most convenient to directly issue a dmg pool reintegrate command from the maintenance state. Before this change, manual administration commands such as dmg pool exclude/reintegrate were prevented from triggering rebuilds due to the pool self_heal property setting. However, the intention of the self_heal (aka auto recovery) feature is to only apply to automatic rebuilds. With this change, the is_pool_rebuild_allowed() function is updated to accept an indication of whether the self_heal checks are applicable. Manual pool map update and rebuild cases supply false for this argument (allowing those cases to result in a rebuild being scheduled). Features: rebuild pool Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
|
Ticket title is 'pool reintegrate issue when pool property self_heal:exclude (no rebuild)' |
|
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17345/1/display/redirect |
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/1/testReport/ |
liw
left a comment
There was a problem hiding this comment.
Looks good in general; one "[question]" needs an answer before I approve this PR.
|
|
||
| static inline bool | ||
| is_pool_rebuild_allowed(struct ds_pool *pool, bool check_delayed_rebuild) | ||
| is_pool_rebuild_allowed(struct ds_pool *pool, uint64_t self_heal, bool self_heal_applicable, |
There was a problem hiding this comment.
[Nit] Would it be any clearer to name self_heal_applicable auto?
There was a problem hiding this comment.
Yes I think so. Changed to auto_recovery.
| int | ||
| ds_rebuild_regenerate_task(struct ds_pool *pool, daos_prop_t *prop, uint64_t sys_self_heal, | ||
| uint64_t delay_sec); | ||
| bool self_heal_applicable, uint64_t delay_sec); |
There was a problem hiding this comment.
[Nit] self_heal_applicable or perhaps auto? No strong opinion though.
There was a problem hiding this comment.
Changed to auto_recovery.
|
|
||
| rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal, 0); | ||
| rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal, | ||
| true /* self_heal_applicable */, 0 /* delay_sec*/); |
There was a problem hiding this comment.
[Nit] A missing space between delay_sec and */.
| self_heal_applicable = (opc == MAP_EXCLUDE && src == MUS_SWIM); | ||
|
|
||
| if (sys_self_heal_applicable) { | ||
| /* do not update pool map if system.self_heal is applicable but does not enable exclude */ |
There was a problem hiding this comment.
[Nit] The comment isn't that helpful; if one must be added, I think a conciser one like "If applicable, check the system self-heal policy." might be more helpful.
| } | ||
| } | ||
|
|
||
| /* Update pool map if pool.self_heal is applicable and enables exclude. */ |
There was a problem hiding this comment.
[Nit] Could we say "The pool self-heal policy is checked by the following call." instead? Considering that the call performs many other checks too...
| d_freeenv_str(&env); | ||
|
|
||
| if (sys_self_heal_applicable && !(sys_self_heal & DS_MGMT_SELF_HEAL_POOL_REBUILD)) { | ||
| /* Do not trigger rebuild if system.self_heal is applicable but does not enable rebuild. */ |
There was a problem hiding this comment.
[Nit] The log message has already said it.
There was a problem hiding this comment.
Removed to de-duplicate the information in the source code.
| } | ||
|
|
||
| if (!is_pool_rebuild_allowed(svc->ps_pool, true)) { | ||
| /* Do not trigger rebuild if pool.self_heal is applicable but does not enable rebuild. */ |
There was a problem hiding this comment.
[Nit] The log message has already said it.
There was a problem hiding this comment.
Removed to de-duplicate the information in the source code.
| if (pool->sp_disable_rebuild) | ||
| return false; | ||
| if (!(pool->sp_self_heal & flags)) | ||
| if (self_heal_applicable && !(self_heal & flags)) |
There was a problem hiding this comment.
[Question] Does this change affect the delayed rebuild case? I don't know the answer; just making sure this has been considered.
There was a problem hiding this comment.
I don't think so. What I've done is remove that argument, since no callers currently specify anything other than true.
Also I did experiment with some manual testing of a pool whose self_heal property value was "exclude;delay_rebuild" and it seemed to work as expected (no exclude rebuilds occur, deferring until a subsequent reintegrate).
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/2/testReport/ |
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
|
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/2/execution/node/1100/log |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/2/execution/node/1141/log |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/3/testReport/ |
|
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17345/5/display/redirect |
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/6/testReport/ |
|
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/7/testReport/ |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/7/testReport/ |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/8/testReport/ |
to prevent rebuild from starting (and finishing) that affects the verification logic of this test case. Features: rebuild pool Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
|
@liuxuezhao could you help evaluate this aspect of the code change / impact on tests? daos_degrade_ec.c test cases configure pool self_heal:"exclude" (no rebuild), and use "dmg pool exclude" to cause the pool map update. However, with this PR, we will actually be performing both the pool map update and triggering a rebuild. This affected the DEGRADE24 degrade_ec_partial_update_agg test in CI testing with the first version of this patch. In general, it may be a concern that rebuild is triggered in so many test cases that (ideally) may not want a rebuild running (and potentially finishing too early? -- taking the pool out of "degraded mode" that it needs to be in for testing). |
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/10/testReport/ An instance of existing issue https://daosio.atlassian.net/browse/DAOS-17416 |
|
Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/10/testReport/ rebuild/widely_striped.py failure (pool query timed out with 5 minute deadline) seems to be an instance of existing issue https://daosio.atlassian.net/browse/DAOS-18302 |
| if (!(pool->sp_self_heal & flags)) | ||
|
|
||
| if (auto_recovery && | ||
| !(self_heal & (DAOS_SELF_HEAL_AUTO_REBUILD | DAOS_SELF_HEAL_DELAY_REBUILD))) |
There was a problem hiding this comment.
[nit]It takes some time to think a bit this, probably could leave some comments and modify codes as:
/* If auto recovery is requested, only allow if self_heal enables auto or delay rebuild */
if (auto_recovery &&
!(self_heal & DAOS_SELF_HEAL_AUTO_REBUILD || self_heal & DAOS_SELF_HEAL_DELAY_REBUILD))
return false;
/* Otherwise, rebuild is allowed */
return true;
There was a problem hiding this comment.
Good idea, should be addressed in the latest version of the patch
|
|
||
| rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal, 0); | ||
| rc = ds_rebuild_regenerate_task(svc->ps_pool, prop, sys_self_heal, true /* auto_recovery */, | ||
| 0 /* delay_sec */); |
There was a problem hiding this comment.
this may need some discussions -
because the pool map maybe with UP ranks, that was triggered by admin's "dmg pool reint" cmd and want to trigger rebuild for it, if system restart or PS leader switched, after new PS leader stepup with this change possibly cannot trigger rebuild for the reint right?
probably need check a few details and discuss what's the good way to handle it,
I'll leave a -1 for it temporarily and get back to it if we are clear about it
There was a problem hiding this comment.
sorry I checked some details looks my above comment is incorrect, please ignore it.
We do have some issues in ds_rebuild_regenerate_task() but unrelated with this PR, I can consider to refine something later.
Do you think can you refine the daos_degrade_ec.c test issue? see the other comment in the test code.
There was a problem hiding this comment.
Right, it seems if pool self_heal=exclude (no rebuild) it will prevent targets in down or draining state from rebuilding during pool service step up. But the reintegrations will proceed.
I do see a problem perhaps at the top of the function where it checks the system self_heal policy that if it does not include "pool_rebuild" could prevent an interrupted reintegration rebuild from restarting.
There was a problem hiding this comment.
right, that probably is not the intention to prevent REINT's rebuild, may confirm with @liw
The complex thing is if trigger rebuild for UP tgt, it actually will also rebuild the DOWN tgt.
and DOWN tgt also don't know if it was excluded by SWIM or admin after system restart
There was a problem hiding this comment.
On system restart, for the first pool_svc_step_up_cb() can the engine know?
- if targets in the DOWN state are that way because of a manual exclude, or an automatic exclude (SWIM, or NVMe fault) before the restart
- if targets in the DRAINING state are that way because of a manual drain, or automatic (CSUM scrubber) before the restart
- if targets in the UP state are that way because of a manual reintegrate, or automatic (NVMe hotplug reintegration) before the restart
Is it any different for a PS leader change (as opposed to engine/system restart)? I forget, but maybe the SWIM / cart event history can be used in some cases @liw ? Sorry if I have terminology wrong here.
The easy assumption I guess is to assume that step_up is always an "automatic_recovery" case and do not start any rebuilds if the system self_heal and pool self_heal properties do not enable rebuild. In theory, that would apply to targets in DOWN, DRAINING, and UP states - so it would prevent all rebuilding - even if they were manually started before.
There was a problem hiding this comment.
@liuxuezhao, @kccain, I remember we've considered this point earlier last year. It would require recording the source info persistently, either inside or along with the pool map---sounds like an overkill. So my understanding is that we always treat pool_svc_step_up_cb as "auto".
Currently, however, there happens to be a theorem that we only have automatic exclusion (i.e., no automatic reint, drain, etc.). It might be possible to improve the decisions on whether a map comp status is due to an "auto" or "manual" change. That said, if we do that, we will need to undo it once we introduce auto reint, extend, etc.
I'd vote for just treating pool_svc_step_up_cb as "auto".
| rebuild_pools_ranks(&arg, 1, &rank, 1, false); | ||
| arg->no_rebuild = 1; | ||
| rebuild_pools_ranks(&arg, 1, &rank, 1, true /* kill */); | ||
| dmg_system_start_rank(dmg_config_file, rank); |
There was a problem hiding this comment.
this looks changed the original test intention.
since this PR changed the self_heal controlling, for daos_degrade_ec.c's usage of self_heal:exclude, may provide another FAIL_LOC or another approach to disable rebuild for the test cases in daos_degrade_ec.c to be basically similar as before. what do you think?
If no better idea, one option is adding a FAILLOC, and disable the rebuild in pool_svc_update_map() in the same place of checking "REBUILD_ENV_DISABLED" ENV. and set the FAIL LOC in daos_degrade_ec.c test cases. and please check if there are some other similar test cases used self_heal:exclude.
There was a problem hiding this comment.
this particular change here keeps the intention of the test - it causes exclude to occur, and no rebuild to run. So the pool stays in degraded mode while the test is performing its subsequent verifications.
yes, probably need to take a look at how to preserve the intention of the test cases in this file. I can think about it some more.
I think the problem now is with many of the other test cases that perform a "dmg pool exclude" (by specifying rebuild_pools_ranks(..., kill=false). In those test cases now (with this patch), they actually perform both the exclude and kick off a rebuild. So there is a time window where the pool is in the degraded mode. If the tests are lucky I guess the verifications will work OK because the pool doesn't finish its rebuild during the verification?
There was a problem hiding this comment.
if the test case original with self_heal:exclude set, let's keep the original behavior that don't rigger rebuild for it if possible.
that can be satisfied by calling daos_debug_set_params() with a new fail inject code defined to disable the rebuild, if no better idea for it. FYI
sorry did not see it before, I replied in the test code just now. |
- improve clarity of is_pool_rebuild_allowed(). - in setup functions, use fault injection DAOS_REBUILD_DISABLE to prevent manual rebuild from occurring in degraded mode tests. Features: pool rebuild Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/11/testReport/ |
|
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/11/execution/node/1364/log
|
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/11/execution/node/1282/log
|
liuxuezhao
left a comment
There was a problem hiding this comment.
there are quite some test failed, may need to check if related with this PR.
Thanks for the review. I think the failures are unrelated, and are associated with the PR using |
| bool auto_rebuild_enabled = self_heal & DAOS_SELF_HEAL_AUTO_REBUILD; | ||
| bool delay_rebuild_enabled = self_heal & DAOS_SELF_HEAL_DELAY_REBUILD; |
There was a problem hiding this comment.
Strictly speaking, these require !! to make sure the cast to bool retains "nonzero-ness". (Of course, current flags will work because they are not define with higher bits. But I would normally consider this a defect.)
There was a problem hiding this comment.
Yes, I'm wrong: We use standard bool now, which is special and does not require !!. Sorry for the noise.
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17345/13/testReport/ |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17345/13/execution/node/1141/log daos_test/suite.py test_daos_rebulid_interactive failure is an intermittent failure, known issue DAOS-18501 I'll re-trigger functional hw testing to try to get a cleaner test result (with just NLT failures that are affecting many PRs currently) |
|
@daos-stack/daos-gatekeeper requesting forced landing for the NLT failures that are being experienced my most PRs. Build 11 included |
Consider a quick maintenance scenario in which a daos_engine is stopped briefly, and the administrator does not wish to have the DAOS automatic recovery / rebuild mechanism occur. That is, a pool map update (targets from UP_IN to DOWN) is to occur, the pool to enter a degraded mode (still allowing ongoing I/O), and NO rebuild to be triggered during this brief time window.
The above can be arranged by modifying the system or pool-specific self_heal property value (to not set the rebuild bit), and then stopping the engine.
Now also consider the conclusion of the maintenance that involes re-starting the engine, and reintegrating that rank back into the pool. It is most convenient to directly issue a dmg pool reintegrate command from the maintenance state.
Before this change, manual administration commands such as dmg pool exclude/reintegrate were prevented from triggering rebuilds due to the pool self_heal property setting. However, the intention of the self_heal (aka auto recovery) feature is to only apply to automatic rebuilds.
With this change, the is_pool_rebuild_allowed() function is updated to accept an indication of whether the self_heal checks are applicable. Manual pool map update and rebuild cases supply false for this argument (allowing those cases to result in a rebuild being scheduled).
Features: rebuild pool
Steps for the author:
After all prior steps are complete: