@RexBearIU
Collaborator

@RexBearIU RexBearIU commented Jan 22, 2026

Description

This pull request updates the RL and SFT demo notebooks to improve compatibility with non-interactive execution environments (such as Papermill), replacing notebook magic commands with Python subprocess calls and providing more robust error handling and logging. It also updates kernel and Python version metadata and enhances output visibility for key initialization steps.

Error log:

ValueError                                Traceback (most recent call last)
Cell In[8], line 3
      1 if not os.path.exists(MODEL_CHECKPOINT_PATH):
      2     # install torch for the conversion script
----> 3     get_ipython().system('python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu')
      5     get_ipython().system('JAX_PLATFORMS=cpu PYTHONPATH={MAXTEXT_REPO_ROOT} {sys.executable} -m MaxText.utils.ckpt_conversion.to_maxtext        {MAXTEXT_REPO_ROOT}/configs/base.yml        model_name={MODEL_NAME}        base_output_directory={MODEL_CHECKPOINT_PATH}        hf_access_token={HF_TOKEN}        use_multimodal=false        scan_layers=true        skip_jax_distributed_system=True')
      7 if not os.path.exists(MODEL_CHECKPOINT_PATH):

File ~/maxtext/maxtext_venv/lib/python3.12/site-packages/ipykernel/zmqshell.py:788, in ZMQInteractiveShell.system_piped(self, cmd)
    786         self.user_ns["_exit_code"] = system(cmd)
    787 else:
--> 788     self.user_ns["_exit_code"] = system(self.var_expand(cmd, depth=1))

File ~/maxtext/maxtext_venv/lib/python3.12/site-packages/IPython/utils/_process_posix.py:130, in ProcessHandler.system(self, cmd)
    126 flush = sys.stdout.flush
    127 while True:
    128     # res is the index of the pattern that caused the match, so we
    129     # know whether we've finished (if we matched EOF) or not
--> 130     res_idx = child.expect_list(patterns, self.read_timeout)
    131     print(child.before[out_size:].decode(enc, 'replace'), end='')
    132     flush()

File ~/maxtext/maxtext_venv/lib/python3.12/site-packages/pexpect/spawnbase.py:383, in SpawnBase.expect_list(self, pattern_list, timeout, searchwindowsize, async_, **kw)
    381     return expect_async(exp, timeout)
    382 else:
--> 383     return exp.expect_loop(timeout)

File ~/maxtext/maxtext_venv/lib/python3.12/site-packages/pexpect/expect.py:169, in Expecter.expect_loop(self, timeout)
    167     return self.timeout()
    168 # Still have time left, so read more data
--> 169 incoming = spawn.read_nonblocking(spawn.maxread, timeout)
    170 if self.spawn.delayafterread is not None:
    171     time.sleep(self.spawn.delayafterread)

File ~/maxtext/maxtext_venv/lib/python3.12/site-packages/pexpect/pty_spawn.py:458, in spawn.read_nonblocking(self, size, timeout)
    450         return select_ignore_interrupts([self.child_fd], [], [], timeout)[0]
    452 # If there is data available to read right now, read as much as
    453 # we can. We do this to increase performance if there are a lot
    454 # of bytes to be read. This also avoids calling isalive() too
    455 # often. See also:
    456 # * https://github.com/pexpect/pexpect/pull/304
    457 # * http://trac.sagemath.org/ticket/10295
--> 458 if select(0):
    459     try:
    460         incoming = super(spawn, self).read_nonblocking(size)

File ~/maxtext/maxtext_venv/lib/python3.12/site-packages/pexpect/pty_spawn.py:450, in spawn.read_nonblocking.<locals>.select(timeout)
    449 def select(timeout):
--> 450     return select_ignore_interrupts([self.child_fd], [], [], timeout)[0]

File ~/maxtext/maxtext_venv/lib/python3.12/site-packages/pexpect/utils.py:143, in select_ignore_interrupts(iwtd, owtd, ewtd, timeout)
    141 while True:
    142     try:
--> 143         return select.select(iwtd, owtd, ewtd, timeout)
    144     except InterruptedError:
    145         err = sys.exc_info()[1]

ValueError: filedescriptor out of range in select()
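
The traceback above ends inside pexpect's `select.select` call, which the `!` shell magic reaches through `get_ipython().system(...)`. This `ValueError` is typically raised when a file descriptor number exceeds what `select()` supports, which can happen in long-running, non-interactive runs. The following is a minimal sketch of the subprocess-based replacement described in this PR, not the exact code in the notebooks; the helper name `run_command` and its logging format are assumptions made for illustration.

```python
import logging
import subprocess
import sys

logger = logging.getLogger(__name__)


def run_command(args, env=None):
  """Run a command without a shell, echo its output, and raise on failure.

  `run_command` is a hypothetical helper, not a function from the notebooks.
  """
  logger.info("Running: %s", " ".join(args))
  # capture_output=True collects stdout/stderr through pipes rather than a
  # pty, so the pexpect code path shown in the traceback is not involved.
  result = subprocess.run(
      args, env=env, capture_output=True, text=True, check=False
  )
  if result.stdout:
    print(result.stdout, end="")
  if result.returncode != 0:
    logger.error(
        "Command failed (exit %d): %s", result.returncode, result.stderr
    )
    raise RuntimeError(f"Command failed with exit code {result.returncode}")
  return result.stdout
```

With a helper like this, the failing cell's install step could be written as `run_command([sys.executable, "-m", "pip", "install", "torch", "--index-url", "https://download.pytorch.org/whl/cpu"])`. Passing the command as a list avoids shell quoting issues, and raising on a non-zero exit code makes Papermill stop at the failing cell instead of continuing silently.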

Tests

Manually triggered the three notebooks and monitored the execution flow step by step. Confirmed that the training loop finished and resources were released.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@SurbhiJainUSC
Collaborator

Please check why notebook CI test is failing

@RexBearIU RexBearIU force-pushed the jackyf/fix_posttraining_notebook branch from 2b26a59 to 9ccdbea Compare January 23, 2026 06:14
@RexBearIU RexBearIU force-pushed the jackyf/fix_posttraining_notebook branch from 9ccdbea to 3d8d9ad Compare January 23, 2026 09:22
@RexBearIU
Collaborator Author

Please check why notebook CI test is failing

The notebook CI is blocked right now, but the notebooks run locally with Papermill after I added back the pyconfig import.

@SurbhiJainUSC
Collaborator

Please rebase after this is merged: #3000
