DAOS-17535 chk: misc improvements for CR logic#17329

Closed
Nasf-Fan wants to merge 1 commit into master from Nasf-Fan/DAOS-17535_7

@Nasf-Fan
Contributor

@Nasf-Fan Nasf-Fan commented Dec 30, 2025

This patch includes the following changes:

  1. When creating the CHK IV namespace, make the secondary group the same
    as the primary group. Otherwise, the CHK logic may hit DER_NONEXIST
    when communicating via IV.

  2. Integrate the CHK IV namespace create and destroy APIs, clean up the
    related logic, and redefine the version.

  3. Get the rank list and the IV namespace version from the CHK leader on
    rejoin. Adjust the CHK_REJOIN RPC accordingly.

  4. Remove the unsupported functionality for checking a specified 'phase'.

  5. Add a new test for the case where some engine(s) are lost before the
    checker starts.

  6. Use a dedicated ULT to handle dead rank events, so that the handling
    is not affected by checker start or stop. Then, even if the check
    scheduler has exited, a subsequent check query still works against
    the latest rank list.

Test-tag: recovery

Signed-off-by: Fan Yong <fan.yong@hpe.com>

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

github-actions Bot commented Dec 30, 2025

Ticket title is 'DAOS checker cannot completed on Aurora after some engines excluded'
Status is 'In Review'
Labels: 'scrubbed_2.6.5'
https://daosio.atlassian.net/browse/DAOS-17535

Nasf-Fan changed the title from "DAOS-17535 chk: secondary group should be same as primary group for C…" to "DAOS-17535 chk: misc improvements for CR logic" on Dec 31, 2025
Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7 branch from 8e4ad6a to 639a8ec on December 31, 2025 03:16
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/1/execution/node/1388/log

Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7 branch from 639a8ec to 78579dd on December 31, 2025 07:33
Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7 branch 2 times, most recently from aa39da7 to 476d0f9, on January 1, 2026 03:09
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/5/execution/node/1324/log

Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7 branch from 476d0f9 to 09aaf91 on January 4, 2026 02:41
@Nasf-Fan
Contributor Author

Nasf-Fan commented Jan 4, 2026

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/6/testReport/

The NLT failure is DAOS-17435 and is not related to this patch. All required CR tests passed.

Nasf-Fan marked this pull request as ready for review on January 4, 2026 09:34
Nasf-Fan requested review from gnailzenh, kjacque and wangshilong and removed request for kjacque on January 4, 2026 09:34
@Nasf-Fan
Contributor Author

Nasf-Fan commented Jan 9, 2026

Ping reviewers, thanks!

Contributor

@kjacque kjacque left a comment

Overall looks good, just a couple minor comments/questions.

Comment thread src/chk/chk_engine.c
Comment thread src/chk/chk_internal.h Outdated
Comment thread src/chk/chk_iv.c Outdated
Comment on lines +217 to +219
/* Let secondary rank == primary rank. */
rc = crt_group_secondary_modify(ins->ci_iv_group, ins->ci_ranks, ins->ci_ranks,
                                CRT_GROUP_MOD_OP_REPLACE, ns_ver);
Contributor

Since this function call is repeated in a few places with the same comment, you could put it in a simple (maybe even inline) wrapper function that takes only ins and ns_ver as params. Also, I think it would be good to add detail to the comment as to why the primary and secondary groups must be the same for the checker.

Contributor Author

Good idea. I will do that the next time I need to refresh the patch, or in a subsequent CR-related PR.

Contributor

OK with me. Thanks!
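
For illustration only, the wrapper suggested in this thread might look roughly like the sketch below. The wrapper name `chk_iv_group_sync`, the `uint32_t` type of `ns_ver`, and all of the stand-in type definitions are assumptions made so the example compiles on its own; in the DAOS tree the real types and `crt_group_secondary_modify()` come from the CaRT and chk headers, and only the call itself is taken from the quoted diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in declarations so the sketch compiles on its own. In the DAOS tree
 * the real definitions come from the CaRT and chk headers; the stub_* fields
 * are invented for this example only.
 */
typedef uint32_t d_rank_t;

typedef struct {
	d_rank_t *rl_ranks;
	uint32_t  rl_nr;
} d_rank_list_t;

typedef enum { CRT_GROUP_MOD_OP_REPLACE } crt_group_mod_op_t;

typedef struct {
	const d_rank_list_t *stub_sec;
	const d_rank_list_t *stub_prim;
	uint32_t             stub_version;
} crt_group_t;

struct chk_instance {
	crt_group_t   *ci_iv_group;
	d_rank_list_t *ci_ranks;
};

/* Stub that only records its arguments; the order mirrors the call in the diff. */
static int
crt_group_secondary_modify(crt_group_t *grp, d_rank_list_t *sec_ranks,
			   d_rank_list_t *prim_ranks, crt_group_mod_op_t op,
			   uint32_t version)
{
	(void)op;
	grp->stub_sec     = sec_ranks;
	grp->stub_prim    = prim_ranks;
	grp->stub_version = version;
	return 0;
}

/*
 * Hypothetical wrapper (the name is an assumption). Per the PR description,
 * the CHK IV namespace needs its secondary group to match the primary group,
 * otherwise CHK may hit DER_NONEXIST when communicating via IV. The same rank
 * list is therefore passed as both the secondary and the primary list.
 */
static inline int
chk_iv_group_sync(struct chk_instance *ins, uint32_t ns_ver)
{
	assert(ins != NULL && ins->ci_iv_group != NULL && ins->ci_ranks != NULL);

	return crt_group_secondary_modify(ins->ci_iv_group, ins->ci_ranks, ins->ci_ranks,
					  CRT_GROUP_MOD_OP_REPLACE, ns_ver);
}
```

Each existing call site would then reduce to `rc = chk_iv_group_sync(ins, ns_ver);`, and the explanation of why the two groups must match would live in one place.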

@Nasf-Fan
Contributor Author

Ping reviewers, thanks!

kjacque self-requested a review on January 13, 2026 00:21
kjacque previously approved these changes on Jan 13, 2026
@Nasf-Fan
Contributor Author

Resolve merge conflict.

@daosbuild3
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/7/execution/node/301/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/13/testReport/

Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7 branch from baa1ffc to 40d16b7 on January 20, 2026 10:28
@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/15/execution/node/1139/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/15/testReport/

@Nasf-Fan
Contributor Author

Many CI tests failed because of DAOS-18004, so a retest is needed.

Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7 branch from 40d16b7 to ec23619 on January 21, 2026 09:58
@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/17/execution/node/1098/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/17/testReport/

Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7 branch from ec23619 to 8b375bb on January 22, 2026 03:42
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17329/18/testReport/

Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7 branch 4 times, most recently from 1226b77 to 14caebf, on January 22, 2026 17:35
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/23/execution/node/646/log

@Nasf-Fan
Contributor Author

Nasf-Fan commented Jan 23, 2026

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17329/23/execution/node/646/log

test_dangling_rank_entry failed because the test logic cannot properly detect md-on-ssd mode. That is DAOS-18505 and is not related to this patch. All other required CR-related tests passed.

@Nasf-Fan
Contributor Author

Ping reviewers. Please help review the patch; it is important for the 2.8 release. Thanks!

Comment on lines +42 to +43
pool:
  scm_size: 1G
Contributor

Is it intentional to only use SCM?

Contributor Author

No, that is wrong. I will refresh the patch.

Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7 branch from 14caebf to 14f35b5 on January 26, 2026 06:20
Nasf-Fan requested review from a team as code owners on January 26, 2026 06:20
@Nasf-Fan
Contributor Author

Loading the long push history of this PR is very slow, so I cloned it into another PR, #17427. Let's use that PR for further review and landing.

Nasf-Fan closed this on Jan 26, 2026
Nasf-Fan deleted the Nasf-Fan/DAOS-17535_7 branch on January 28, 2026 04:06