Elastic events goodput recording integration#3901
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
lukebaumann
left a comment
There was a problem hiding this comment.
Were you able to run at a small scale with these changes and confirm that the events are logged correctly?
| from pathwaysutils.elastic import manager | ||
|
|
||
| elastic_manager: manager.Manager | None = None | ||
| _pending_reinit_recorder = None |
There was a problem hiding this comment.
Nit: This shouldn't have an underscore prefix since it is used in train.py, right?
| if recorder: | ||
| recorder.record_custom_badput_event_end_time(custom_badput_event_type=event_type) | ||
| recorder.record_custom_badput_event_start_time(custom_badput_event_type='elastic_reinitialization') | ||
| elastic_utils._pending_reinit_recorder = recorder |
There was a problem hiding this comment.
Shouldn't this be updated through a setter function in elastic_utils?
|
@gemini-cli /investigate |
|
🤖 Hi @shralex, I've received your request, and I'm working on it now! You can track my progress in the logs for more details. |
🤖 CI Failure Investigation ReportI have analyzed the recent test failures in the CI pipeline for PR #3901 and identified that they are not caused by the changes in this PR. Instead, they appear to be regressions introduced by the JAX 0.10.0 upgrade (PR #3854), which was recently integrated into the CI environment. 🔍 What Failed
🪵 Error Details & Stack TraceImportError in Jupyter Notebooks: File "/__w/maxtext/maxtext/.venv/lib/python3.12/site-packages/huggingface_hub/cli/hf.py", line 19, in <module>
import typer
File "/__w/maxtext/maxtext/.venv/lib/python3.12/site-packages/typer/__init__.py", line 12, in <module>
from click.termui import get_terminal_size as get_terminal_size
ImportError: cannot import name 'get_terminal_size' from 'click.termui'Segmentation fault in Post-Training Tests: Segmentation fault $PYTHON_EXE -m pytest ${INPUTS_PYTEST_ADDOPTS} -v -m "${FINAL_PYTEST_MARKER}" --durations=0 $PYTEST_COV_ARGS $SPLIT_ARGS ${INPUTS_PYTEST_EXTRA_ARGS}💡 Root Cause Analysis & ContextConfidence: High The failures are highly correlated with the JAX 0.10.0 upgrade and the associated dependency updates in PR #3854:
🛠️ Recommended ActionSince these issues are regressions on |
Description
This PR instruments the elastic training implementation with GoodputRecorder API calls to capture the breakdown of time during elastic events. This is replica of this PR which tested these changes on replica-resize for GA2.
If the change fixes a bug or a Github issue, please include a link, e.g.,:
FIXES: b/123456
FIXES: #123456
Tests
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-reviewlabel.