DAOS-17535 chk: misc improvements for CR logic #17329
Ticket title is 'DAOS checker cannot completed on Aurora after some engines excluded' |
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/1/testReport/ |
8e4ad6a to 639a8ec
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/1/execution/node/1388/log |
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/2/testReport/ |
639a8ec to 78579dd
|
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/2/execution/node/1076/log |
aa39da7 to 476d0f9
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/5/testReport/ |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/5/execution/node/1324/log |
476d0f9 to 09aaf91
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/6/testReport/ |
The NLT failure is DAOS-17435, not related to the patch. All required CR tests passed. |
|
Ping reviewers, thanks! |
kjacque left a comment:
Overall looks good, just a couple minor comments/questions.
    /* Let secondary rank == primary rank. */
    rc = crt_group_secondary_modify(ins->ci_iv_group, ins->ci_ranks, ins->ci_ranks,
                                    CRT_GROUP_MOD_OP_REPLACE, ns_ver);
Since this function call is repeated in a few places with the same comment, you could put it in a simple (maybe even inline) wrapper function that takes only ins and ns_ver as params. Also, I think it would be good to add detail to the comment as to why the primary and secondary groups must be the same for the checker.
Good idea. I will do that the next time I need to refresh the patch, or in a subsequent CR-related PR.
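For reference, the wrapper suggested above might look roughly like the sketch below. This is only an illustration: the name `chk_iv_group_sync` is hypothetical, and the type definitions and the stubbed `crt_group_secondary_modify` are stand-ins so the snippet compiles on its own (in the tree they would come from the CaRT and checker headers). The comment on why the groups must match is taken from the commit message of this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the CaRT/DAOS types (assumptions, for a self-contained sketch). */
typedef struct {
	uint32_t *rl_ranks;
	uint32_t  rl_nr;
} d_rank_list_t;

typedef struct crt_group crt_group_t; /* opaque here */

typedef enum { CRT_GROUP_MOD_OP_REPLACE } crt_group_mod_op_t;

/* Minimal stand-in for the checker instance; only the fields used below. */
struct chk_instance {
	crt_group_t   *ci_iv_group;
	d_rank_list_t *ci_ranks;
};

/* Stub that records its arguments; the real function lives in CaRT. */
static d_rank_list_t *stub_last_sec;
static d_rank_list_t *stub_last_prim;
static uint32_t       stub_last_ver;

static int
crt_group_secondary_modify(crt_group_t *grp, d_rank_list_t *sec_ranks,
			   d_rank_list_t *prim_ranks, crt_group_mod_op_t op,
			   uint32_t version)
{
	(void)grp;
	(void)op;
	stub_last_sec  = sec_ranks;
	stub_last_prim = prim_ranks;
	stub_last_ver  = version;
	return 0;
}

/*
 * Hypothetical wrapper: replace the membership of the CHK IV group so that
 * the secondary ranks are identical to the primary ranks. Per the commit
 * message, the CHK logic may otherwise hit DER_NONEXIST when communicating
 * via IV.
 */
static inline int
chk_iv_group_sync(struct chk_instance *ins, uint32_t ns_ver)
{
	assert(ins != NULL);

	return crt_group_secondary_modify(ins->ci_iv_group, ins->ci_ranks,
					  ins->ci_ranks,
					  CRT_GROUP_MOD_OP_REPLACE, ns_ver);
}
```

Each existing call site would then shrink to `rc = chk_iv_group_sync(ins, ns_ver);`, with the explanation kept in one place.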
|
Ping reviewers, thanks! |
|
Resolve merge conflict. |
|
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/7/execution/node/301/log |
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/13/testReport/ |
|
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/13/testReport/ |
baa1ffc to 40d16b7
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/15/execution/node/1139/log |
|
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/15/testReport/ |
|
A lot of CI tests failed because of DAOS-18004; need to retest. |
40d16b7 to ec23619
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/17/execution/node/1098/log |
|
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/17/testReport/ |
ec23619 to 8b375bb
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/18/testReport/ |
|
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/18/testReport/ |
1226b77 to 14caebf
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/23/execution/node/646/log |
test_dangling_rank_entry failed because the test logic cannot detect md-on-ssd mode properly. That is DAOS-18505, not related to the patch. All the other required CR-related tests passed. |
|
Ping reviewers. Please help review the patch; it is important for the 2.8 release. Thanks! |
    pool:
      scm_size: 1G
Is it intentional to only use SCM?
No, that is wrong. I will refresh the patch.
Include the following:
1. When creating the CHK IV namespace, make the secondary group the same as the primary group. Otherwise, the CHK logic may hit DER_NONEXIST when communicating via IV.
2. Integrate the CHK IV namespace create and destroy APIs, clean up related logic, and redefine the version.
3. Get the rank list and IV namespace version from the CHK leader on rejoin. Adjust the CHK_REJOIN RPC for the related changes.
4. Remove the unsupported functionality for checking a specified 'phase'.
5. Add a new test for the case of losing some engine(s) before starting the checker.
6. Use a dedicated ULT to handle dead rank events, so that handling is not affected by checker start or stop. Then, even if the check scheduler has exited, a subsequent check query still works against the latest rank list.

Test-tag: recovery
Signed-off-by: Fan Yong <fan.yong@hpe.com>
14caebf to 14f35b5
|
Loading the long PR push history is very slow, so I cloned this PR into another one, #17427. Let's use that PR for further review and landing. |