
Add backfill support and fix topology-sensitive green context tests #2031

Merged
leofang merged 1 commit into NVIDIA:main from leofang:leof/green-ctx-v1
May 6, 2026

leofang commented May 6, 2026

Fixes #2025.

Summary

  • Add backfill field to SMResourceOptions for SMResource.split().
  • Fix green context tests that fail on certain GPU topologies.

Details

Backfill support

SMResourceOptions gains a backfill field (bool | Sequence[bool], default False). When True, CU_DEV_SM_RESOURCE_GROUP_BACKFILL is set in the group's flags, which allows the driver to relax the co-scheduling constraint when assigning SMs to groups. This makes it possible to request arbitrary aligned SM counts that the driver would otherwise reject because of hardware topology constraints.

Without backfill, cuDevSmResourceSplit may reject a request even when the total SM count is sufficient and the requested count is properly aligned to coscheduledSmCount, because the driver's internal assignment algorithm has constraints beyond simple alignment. With backfill=True, the driver satisfies the requested SM count, but some of the assigned SMs may lack the co-scheduling guarantee.

See the CUDA driver API documentation for details on CU_DEV_SM_RESOURCE_GROUP_BACKFILL.
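As a rough illustration of the bool | Sequence[bool] semantics described above, the sketch below shows one way a scalar or per-group backfill value could be expanded into per-group flags. It is not the cuda.core implementation: FakeSMResourceOptions, its count field, per_group_flags, and the BACKFILL_FLAG value are all hypothetical stand-ins chosen so the example runs without a GPU.

```python
from __future__ import annotations

from collections.abc import Sequence
from dataclasses import dataclass

# Placeholder bit standing in for CU_DEV_SM_RESOURCE_GROUP_BACKFILL.
# The real value comes from the CUDA driver headers, not from this PR.
BACKFILL_FLAG = 0x1


@dataclass
class FakeSMResourceOptions:
    """Hypothetical stand-in for SMResourceOptions; only two fields are modeled."""

    count: Sequence[int]                    # assumed: one requested SM count per group
    backfill: bool | Sequence[bool] = False


def per_group_flags(opts: FakeSMResourceOptions) -> list[int]:
    """Expand a scalar or per-group backfill setting into one flag word per group."""
    n_groups = len(opts.count)
    if isinstance(opts.backfill, bool):
        # A single bool applies to every group.
        backfill = [opts.backfill] * n_groups
    else:
        backfill = list(opts.backfill)
        if len(backfill) != n_groups:
            raise ValueError(
                f"backfill has {len(backfill)} entries but {n_groups} groups were requested"
            )
    return [BACKFILL_FLAG if enabled else 0 for enabled in backfill]


# One bool for all groups, or one bool per group.
all_groups = per_group_flags(FakeSMResourceOptions(count=[8, 8], backfill=True))
mixed = per_group_flags(FakeSMResourceOptions(count=[8, 8, 8], backfill=[True, False, True]))
```

Under these assumptions, a sequence lets a caller keep the co-scheduling guarantee on some groups while relaxing it on others.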

Test fixes

Replace the _aligned_half heuristic, which computed half the SMs rounded to alignment and is not a valid partition on every GPU topology, with _safe_two_group_count, which uses min_partition_size and is always valid. Add a dedicated test_two_groups_backfill that exercises the aggressive even split with backfill=True.
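The difference between the two test helpers can be sketched as plain arithmetic. The functions below are guesses at the idea, not the code in tests/test_green_context.py: the signatures, the round-down behavior, the None return, and the example numbers (132 SMs, alignment 8) are all assumptions made for illustration.

```python
from __future__ import annotations


def aligned_half(total_sms: int, alignment: int) -> int:
    """Old heuristic (assumed form): half of the SMs, rounded down to the alignment."""
    return (total_sms // 2) // alignment * alignment


def safe_two_group_count(total_sms: int, min_partition_size: int) -> int | None:
    """Conservative choice (assumed form): request the minimum partition size per group.

    Returns None when two minimum-size groups cannot fit, so a test could skip.
    """
    if 2 * min_partition_size > total_sms:
        return None
    return min_partition_size


# Illustrative numbers only, not taken from the PR.
aggressive = aligned_half(132, 8)            # 64 SMs per group
conservative = safe_two_group_count(132, 8)  # 8 SMs per group
```

The aggressive value is the kind of request the description says may be rejected without backfill, which is why the even split is exercised separately in test_two_groups_backfill with backfill=True.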

Validation

python -m pytest tests/test_green_context.py -v  # 32 passed, 1 skipped (arch)

-- Leo's bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions Bot added the cuda.core label (Everything related to the cuda.core module) May 6, 2026
leofang requested a review from mdboom May 6, 2026 02:10
leofang self-assigned this May 6, 2026
leofang added the bug (Something isn't working) and P1 (Medium priority - Should do) labels May 6, 2026
leofang added this to the cuda.core v1.0.0 milestone May 6, 2026
…tests

Add `backfill` field to SMResourceOptions. When True, sets
CU_DEV_SM_RESOURCE_GROUP_BACKFILL on each group's flags, allowing the
driver to relax the co-scheduling constraint when assigning SMs to
groups. This enables requesting arbitrary aligned SM counts that the
driver would otherwise reject due to hardware topology constraints.

Fix test_two_groups and related tests: replace _aligned_half (which
computed half the SMs rounded to alignment — not always a valid
partition on all GPU topologies) with _safe_two_group_count (uses
min_partition_size, always valid). Add dedicated test_two_groups_backfill
that exercises the aggressive even-split with backfill=True.

Fixes NVIDIA#2025.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
leofang force-pushed the leof/green-ctx-v1 branch from 06e69d4 to d703209 May 6, 2026 02:20

leofang commented May 6, 2026

/ok to test d703209

mdboom left a comment

LGTM. And I confirmed the tests are passing on my local machine where I first found the issue with the tests.

leofang marked this pull request as ready for review May 6, 2026 13:59
leofang merged commit fec00bf into NVIDIA:main May 6, 2026
96 checks passed
leofang deleted the leof/green-ctx-v1 branch May 6, 2026 13:59

Development

Successfully merging this pull request may close these issues.

Local failures in test_green_context.py tests