ci: upgrade to GPU runner via moveit_pro_ci v0.3.1, enrich test diagnostics #655
davetcoleman wants to merge 3 commits into
Consider whether the change should land upstream in Overlapping files
5efd377 to 8640403
failing because of https://github.com/PickNikRobotics/moveit_pro/issues/19269
8640403 to 608fabe
📊 Integration test report
Per-distro HTML reports (status table + per-test ROS log slices) are attached to this run's artifacts:
Download the zip, extract, and open
@JWhitleyWork @shaur-k — two design questions on #655 before I keep going. State today:
Looking for yes / no / yes-but on either.
Three CI-infra changes folded together:
1. Bump the reusable workflow ref. v0.3.1 (d490a1d) had the GPU + CUDA-suffix
fix. v0.3.2 (90b506e, currently pre-tag) adds the test-results artifact-name
suffix `-${{ matrix.ros_distro }}`, so the humble and jazzy jobs no longer
both upload to the same artifact name (moveit_pro_ci#27). The pin is by
SHA, not tag, so this works against the merged branch before the tag is
formally cut.
2. Add render-report job (matrix on humble/jazzy) that downloads each distro's
test-results artifact and runs .github/scripts/render_report.py against it,
uploading report.html as integration-test-report-${{ matrix.ros_distro }}.
Runs whether the integration test passed, failed, or timed out -- the
report is most useful for failure post-mortem.
3. Add post-report-comment job that posts (or updates in place via a sticky
marker) a single PR comment linking to the rendered reports for that run.
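The "sticky marker" idea in item 3 can be sketched as a pure decision function: look for an earlier comment carrying a hidden marker and update it, otherwise create a new one. This is only an illustration. The marker string, the function name, and the shape of the comment dicts are assumptions, not taken from the workflow itself, and the actual API call is left out.

```python
# Hypothetical marker; the real workflow may use a different string.
MARKER = "<!-- integration-test-report -->"


def plan_sticky_comment(existing_comments, body, marker=MARKER):
    """Decide whether to update an earlier marked comment or create a new one.

    existing_comments: iterable of dicts with at least "id" and "body" keys
    (an assumed shape, loosely modelled on an issue-comment listing).
    """
    stamped = f"{marker}\n{body}"
    for comment in existing_comments:
        # A comment that already carries the marker is the one to edit in place.
        if marker in (comment.get("body") or ""):
            return {"action": "update", "comment_id": comment["id"], "body": stamped}
    # No marked comment yet: a fresh one is needed.
    return {"action": "create", "comment_id": None, "body": stamped}
```

Keeping the decision separate from the network call makes the update-in-place behavior easy to check without a token or a live PR.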
Also retained from before:
- enable_gpu: true on a picknik-16-amd64-gpu runner so MuJoCo EGL rendering
uses the GPU instead of llvmpipe.
- src/lab_sim/test/conftest.py pytest hooks (logstart, logreport) writing to
fd 2 directly so per-test progress survives a CTest timeout.
Reads pytest's JUnit xunit XML (the test artifact already published by moveit_pro_ci's reusable workflow) and produces a self-contained, single-file HTML report. Groups tests by their parent XML directory, shows the human-readable objective name extracted from each objective XML's main_tree_to_execute attribute, and surfaces filter/search/collapse UI without any external JS dependencies.
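The core of that transformation can be sketched with the standard library alone. This is not render_report.py itself: the function names are hypothetical, and the grouping, objective-name lookup, and filter/search UI described above are left out. It assumes the usual JUnit layout, where a `<testcase>` carries `name` and `time` attributes and optional `<failure>`, `<error>`, or `<skipped>` children.

```python
import html
import xml.etree.ElementTree as ET


def parse_xunit(xunit_text):
    """Return (name, status, seconds) for every <testcase> in the XML text."""
    rows = []
    for case in ET.fromstring(xunit_text).iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            status = "failed"
        elif case.find("skipped") is not None:
            status = "skipped"
        else:
            status = "passed"
        rows.append((case.get("name", ""), status, float(case.get("time") or 0)))
    return rows


def render_html(rows):
    """Render the rows as one standalone HTML page with no external assets."""
    body = "\n".join(
        f"<tr class='{status}'><td>{html.escape(name)}</td>"
        f"<td>{status}</td><td>{seconds:.2f}s</td></tr>"
        for name, status, seconds in rows
    )
    return (
        "<!doctype html>\n<html><body><table>\n"
        "<tr><th>Test</th><th>Status</th><th>Time</th></tr>\n"
        f"{body}\n</table></body></html>\n"
    )
```

Escaping the test names matters here because parametrized test IDs can contain characters that are significant in HTML.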
ament_add_pytest_test does not set ROS_LOG_DIR, so launched nodes write to the default ~/.ros/log/<ts>/ inside the doomed CI container -- never uploaded. Point ROS_LOG_DIR at build/lab_sim/test_results/lab_sim/ros_logs/ instead, which is already inside the existing 'test-results' artifact glob, so launch.log + per-node *.log come back with each CI run.
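One way the redirection could be expressed from the Python side is sketched below. How this PR actually sets the variable is not shown here (it may well be done in the CMake test registration instead), and the helper name is hypothetical. `ROS_LOG_DIR` is the environment variable named in the commit message.

```python
import os
from pathlib import Path


def point_ros_logs_at(test_results_dir, env=None):
    """Create <test_results_dir>/ros_logs and export it as ROS_LOG_DIR.

    env defaults to os.environ, so processes launched afterwards inherit it.
    """
    env = os.environ if env is None else env
    log_dir = Path(test_results_dir) / "ros_logs"
    # The directory has to exist inside the path the artifact upload collects.
    log_dir.mkdir(parents=True, exist_ok=True)
    env["ROS_LOG_DIR"] = str(log_dir)
    return log_dir
```

The variable must be set before the nodes are launched, since child processes only inherit the environment they are started with.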
608fabe to 62e8e3a
The latest version only publishes the comment if the integration test fails, so I think it should also publish the HTML file only if the test fails. How about that?
I don't mind if this is public, but my goal is for this to be used in moveit_pro... @shaur-k's new update runs the example_ws integration tests for every moveit_pro PR, right?
This is fine with me.
I don't think he has done this yet, but I think it is planned.
CI infra upgrade and test-diagnostics improvements for `objectives_integration_test`. Three commits, kept separate intentionally (please do not squash).

Commit 1: `enable_gpu: true` on a `picknik-16-amd64-gpu` runner

Bumps the reusable workflow ref to `PickNikRobotics/moveit_pro_ci@v0.3.1`, sets `enable_gpu: true`, and switches the runner label from `picknik-16-amd64` to `picknik-16-amd64-gpu`. v0.3.1 appends the CUDA suffix to the image when `enable_gpu` is true (moveit_pro_ci#26); without that, `v0.3.0` set the runner label but kept the non-CUDA image, so MuJoCo's EGL rendering still went through llvmpipe on CPU. `image_tag` is pinned to `9.3.0-rc9` until the `main-*-cuda12.6-cudnn9` images are being published.

Test-diagnostics addition: `src/lab_sim/test/conftest.py`

Two pytest hooks (`pytest_runtest_logstart`, `pytest_runtest_logreport`) write directly to fd 2, bypassing pytest's `--capture=fd`. Without this, a CTest timeout kills pytest before any per-test output is flushed, leaving the CI log silent past "collected N items". Now each test prints `START <nodeid>` on entry and `PASSED|FAILED|SKIPPED <nodeid> (<elapsed>s)` on completion, so CI logs always show which objective was running and how long each one took, which is critical for triaging flakes and timeouts.

Commit 2: `.github/scripts/render_report.py` (HTML report)

Self-contained Python script that turns pytest's `objectives_integration_test.xunit.xml` artifact into a single-file HTML report:
`move_flasks_to_burners.xml` → `Move Flasks to Burners`) by reading each XML's `main_tree_to_execute` attribute against the local moveit_pro/moveit_pro_example_ws checkouts.

Usage: `python3 .github/scripts/render_report.py <xunit.xml> <out.html>`.

Commit 3: redirect ROS node logs into the test_results artifact

`ament_add_pytest_test` does not set `ROS_LOG_DIR`, so launched nodes write to `~/.ros/log/<ts>/` inside the doomed CI container; those logs never get uploaded, making post-mortem of objective failures impossible. Points `ROS_LOG_DIR` at `build/lab_sim/test_results/lab_sim/ros_logs/`, which is already inside the existing `test-results` artifact glob, so `launch.log` + per-node `*.log` come back with each CI run. `reset_simulation_before_test` relaunches the stack per-test, so each test gets its own timestamped `<ts>/` subdirectory under `ros_logs/`.