-
Notifications
You must be signed in to change notification settings - Fork 631
[Pytorch] Make test script generate checkpoints if they don't exist #2650
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Greptile OverviewGreptile SummaryThis PR improves the first-run experience for users running PyTorch tests by automatically generating checkpoint files if they don't exist. The changes address previous review feedback by properly exporting the Key changes:
Impact: Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Script as test.sh
participant Env as Environment
participant FS as File System
participant CheckpointGen as test_checkpoint.py
participant PyTest as pytest
Script->>Env: export NVTE_TEST_CHECKPOINT_ARTIFACT_PATH
Script->>FS: Check if directory exists
alt Directory does not exist
Script->>CheckpointGen: python3 test_checkpoint.py --save-checkpoint all
CheckpointGen->>FS: mkdir(parents=True, exist_ok=True)
loop For each checkpoint type
CheckpointGen->>CheckpointGen: Create module
CheckpointGen->>FS: Save checkpoint file
end
CheckpointGen-->>Script: Success or error_exit
else Directory exists
Script->>Script: Skip generation
end
Script->>PyTest: Run pytest test_checkpoint.py
PyTest->>FS: Load checkpoint files
PyTest->>PyTest: Test module loading
PyTest-->>Script: test_fail or success
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 files reviewed, 1 comment
qa/L0_pytorch_unittest/test.sh
Outdated
| NVTE_TEST_CHECKPOINT_ARTIFACT_PATH=$TE_PATH/artifacts/tests/pytorch/test_checkpoint python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_checkpoint.xml $TE_PATH/tests/pytorch/test_checkpoint.py || test_fail "test_checkpoint.py" | ||
| NVTE_TEST_CHECKPOINT_ARTIFACT_PATH=$TE_PATH/artifacts/tests/pytorch/test_checkpoint | ||
| if [ ! -d "$NVTE_TEST_CHECKPOINT_ARTIFACT_PATH" ]; then | ||
| python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NVTE_TEST_CHECKPOINT_ARTIFACT_PATH not exported; checkpoint generation won't use it
| python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all | |
| NVTE_TEST_CHECKPOINT_ARTIFACT_PATH=$NVTE_TEST_CHECKPOINT_ARTIFACT_PATH python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
5eb12d9 to
63b6471
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 files reviewed, 1 comment
qa/L0_pytorch_unittest/test.sh
Outdated
| NVTE_TEST_CHECKPOINT_ARTIFACT_PATH=$TE_PATH/artifacts/tests/pytorch/test_checkpoint python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_checkpoint.xml $TE_PATH/tests/pytorch/test_checkpoint.py || test_fail "test_checkpoint.py" | ||
| export NVTE_TEST_CHECKPOINT_ARTIFACT_PATH=$TE_PATH/artifacts/tests/pytorch/test_checkpoint | ||
| if [ ! -d "$NVTE_TEST_CHECKPOINT_ARTIFACT_PATH" ]; then | ||
| python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing error handling for checkpoint generation - if this command fails, the script continues and pytest will fail with a less clear error message
| python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all | |
| python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all || error_exit "Failed to generate checkpoint files" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
63b6471 to
66aaead
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 files reviewed, no comments
|
/te-ci |
Description
For users who run pytorch tests for the first time, they are likely to haven't generate checkpoint files and they have to go to
tests/pytorch/test_checkpoint.pyto figure out how to run this file to generate them.Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
tests/pytorch/test_checkpoint.pyto generate all the checkpoint files under that pathcheckpoint_dir.mkdircreate parent directories if they don't exist, which is likely to be the case for people who doexport TE_PATH=$(pwd)before run the testsChecklist: