Recipe Modularization, DAG Integration, and BigQuery Integration by DannyLiCom · Pull Request #2715 · AI-Hypercomputer/maxtext

DannyLiCom · 2025-11-19T05:38:21Z

Description

When integrating Recipe Modularization, DAG Integration, and BigQuery, the DAG will pass in a command that includes multiple command-line flags.

Tests

Users can use commands similar to the following to run tests

python3 -m benchmarks.recipes.pw_mcjax_benchmark_recipe \
--user='user' \
--cluster_name='pw-scale-test-v5e-32' \
--project='cloud-tpu-multipod-dev' \
--zone='us-south1-a' \
--device_type='v5litepod-32' \
--benchmark_steps=20 \
--num_slices_list='2' \
--server_image='gcr.io/tpu-prod-env-one-vm/lidanny/unsanitized_server:latest' \
--proxy_image='gcr.io/tpu-prod-env-one-vm/lidanny/unsanitized_proxy_server:latest' \
--runner='gcr.io/tpu-prod-env-one-vm/lidanny_latest:latest' \
--selected_model_framework='pathways' \
--selected_model_names='llama3_1_8b_8192_v5e_256' \
--priority='medium' \
--max_restarts=0 \
--bq_enable=True \
--bq_db_project='cloud-tpu-multipod-dev' \
--bq_db_dataset="chzheng_test_100steps"

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

DannyLiCom · 2025-11-28T03:02:24Z

benchmarks/recipes/args_helper.py



-def handle_cmd_args(cluster_config: XpkClusterConfig, *actions: str, **kwargs) -> bool:
+def handle_cmd_args(cluster_config: XpkClusterConfig, is_delete: bool, user: str, **kwargs) -> bool:


Since the original recipe could only accept the --delete flag, this section had to be modified to allow it to accept multiple flags.

DannyLiCom · 2025-11-28T03:03:07Z

benchmarks/recipes/pw_mcjax_benchmark_recipe.py

+  parser_utils.add_arguments(parser)
+  args = parser.parse_args()
+
+  if len(sys.argv) > 2:


This logic is implemented to determine whether multiple flags are being used.

codecov · 2025-12-29T02:31:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

github-actions · 2026-01-28T16:11:54Z

This PR has been automatically marked as stale because it has not had recent activity. It will be closed soon if no further activity occurs. Thank you for your contributions.

github-actions · 2026-02-04T16:19:02Z

This PR was closed because it has been inactive for a while. Please reopen it if you are still working on it.

DannyLiCom · 2026-02-05T08:11:44Z

src/MaxText/layers/attention_op.py

+from MaxText.layers import nnx_wrappers
+from MaxText.layers.initializers import variable_to_logically_partitioned
+from MaxText.layers.quantizations import AqtQuantization as Quant
+from MaxText.sharding import logical_to_mesh_axes, maybe_shard_with_name


I moved it because Pylint failed.

This fix has been merged #3093

DannyLiCom · 2026-02-05T08:11:50Z

src/MaxText/layers/moe.py

 from MaxText.sharding import logical_to_mesh_axes
 from MaxText.layers import attentions, linears, nnx_wrappers, quantizations
 from MaxText.layers.initializers import NdInitializer, default_bias_init, nd_dense_init, variable_to_logically_partitioned
+from maxtext.kernels import megablox as mblx


I moved it because Pylint failed.

This fix has been merged in #3093

DannyLiCom · 2026-02-05T08:12:53Z

tools/dev/code_style.sh

 done

-echo "Successfully clean up all codes."
+echo "Successfully clean up all codes."


I’m not sure why this was flagged as a change; it looks exactly the same to me.

You can remove the unintended change from the commit git restore <path/to/file>

SujeethJinesh · 2026-02-05T18:30:42Z

LGTM, but I would just verify that the imports work properly at the top of the files. Since the MaxText modularization refactor, we may need to check that since a lot of these changes were from before that.

aireenmei

Thanks for the work! I run the PR through AI and got these comments: https://paste.googleplex.com/6085130735190016. I looked at some points and they make sense to me, so I left inline comments. But there are more points, could you take a look?

aireenmei · 2026-02-05T20:36:10Z

benchmarks/recipes/parser_utils.py

+      "--selected_model_framework",
+      type=parse_str_list,
+      default=["pathways"],
+      help="List of model frameworks (e.g., pathways, mcjax",


Need a closing ")" for help message.

aireenmei · 2026-02-05T20:36:22Z

benchmarks/recipes/parser_utils.py

+      "--selected_model_names",
+      type=parse_str_list,
+      default=["llama3_1_8b_8192_v5e_256"],
+      help="List of model names (e.g., llama3_1_8b_8192_v5e_256, llama2-7b-v5e-256",


Need a closing ")" for help message.

aireenmei · 2026-02-05T20:40:51Z

tools/dev/code_style.sh

 done

-echo "Successfully clean up all codes."
+echo "Successfully clean up all codes."


You can remove the unintended change from the commit git restore <path/to/file>

aireenmei · 2026-02-05T20:41:59Z

src/MaxText/layers/attention_op.py

+from MaxText.layers import nnx_wrappers
+from MaxText.layers.initializers import variable_to_logically_partitioned
+from MaxText.layers.quantizations import AqtQuantization as Quant
+from MaxText.sharding import logical_to_mesh_axes, maybe_shard_with_name


This fix has been merged #3093

aireenmei · 2026-02-05T20:42:31Z

src/MaxText/layers/moe.py

 from MaxText.sharding import logical_to_mesh_axes
 from MaxText.layers import attentions, linears, nnx_wrappers, quantizations
 from MaxText.layers.initializers import NdInitializer, default_bias_init, nd_dense_init, variable_to_logically_partitioned
+from maxtext.kernels import megablox as mblx


This fix has been merged in #3093

aireenmei · 2026-02-05T21:47:59Z

benchmarks/recipes/pw_suspend_resume.py

  user_configs.USER_CONFIG.headless = False
  should_continue = helper.handle_cmd_args(
-      user_configs.USER_CONFIG.cluster_config, helper.DELETE, xpk_path=user_configs.USER_CONFIG.xpk_path
+      user_configs.USER_CONFIG.cluster_config, user_configs.USER_CONFIG.DELETE, user_configs.USER_CONFIG.user


Should this be user_configs.USER_CONFIG.delete (lowercase)?

aireenmei · 2026-02-05T21:53:06Z

benchmarks/recipes/parser_utils.py

+  parser.add_argument(
+      "--server_image",
+      type=str,
+      default="us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server",


The default values of server_image and proxy_image looks swapped.

github-actions · 2026-02-05T22:02:14Z

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

This pull request introduces significant refactoring to the benchmarking recipes, modularizing argument parsing and improving configuration management. The changes make the recipes more flexible and easier to use from the command line, especially for DAG integration.

🔍 General Feedback

The introduction of parser_utils.py is a good step towards centralizing command-line argument handling.
The use of a UserConfig class helps in managing configurations cleanly.
The addition of check_and_create_bucket is a useful utility, though its error handling could be improved in the calling code.

github-actions · 2026-02-05T22:04:16Z

benchmarks/recipes/parser_utils.py

+  parser.add_argument(
+      "--colocated_python_image",
+      type=str,
+      default=None,
+      help="Colocated Python image.",
+  )
+  parser.add_argument("--worker_flags", type=str, default="", help="Worker flags.")
+  parser.add_argument("--proxy_flags", type=str, default="", help="Proxy flags.")
+  parser.add_argument("--server_flags", type=str, default="", help="Server flags.")
+
+  # Model Configuration
+  parser.add_argument("--benchmark_steps", type=int, default=20, help="Number of benchmark steps.")


🟠 The default values for `server_image` and `proxy_image` appear to be swapped. In `benchmarks/recipes/user_configs.py` they were swapped, and the help texts here ("Docker image for the proxy server." for `server_image`) also indicate they are swapped. Please verify and correct the default values and help texts.

Suggested change

parser.add_argument(

"--colocated_python_image",

type=str,

default=None,

help="Colocated Python image.",

)

parser.add_argument("--worker_flags", type=str, default="", help="Worker flags.")

parser.add_argument("--proxy_flags", type=str, default="", help="Proxy flags.")

parser.add_argument("--server_flags", type=str, default="", help="Server flags.")

# Model Configuration

parser.add_argument("--benchmark_steps", type=int, default=20, help="Number of benchmark steps.")

parser.add_argument(

"--server_image",

type=str,

default="us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server",

help="Docker image for the server.",

)

parser.add_argument(

"--proxy_image",

type=str,

default="us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server",

help="Docker image for the proxy server.",

)

github-actions · 2026-02-05T22:04:17Z

benchmarks/recipes/pw_mcjax_benchmark_recipe.py


+  Returns:
+    An integer exit code from the workload execution (0 for success, non-zero for failure).
+  """
+  storage_client = storage.Client(project=user_config.project)
+  check_and_create_bucket(


🟡 The return value of `check_and_create_bucket` is not checked. If the bucket creation fails (e.g., due to permissions), the function returns `None` and logs an error, but the script continues. This could lead to failures later on. It would be more robust to check the return value and exit if the bucket is not available.

Suggested change

Returns:

An integer exit code from the workload execution (0 for success, non-zero for failure).

"""

storage_client = storage.Client(project=user_config.project)

check_and_create_bucket(

storage_client = storage.Client(project=user_config.project)

bucket = check_and_create_bucket(

storage_client,

user_config.base_output_directory[5:].split("/")[0],

user_config.region,

)

if not bucket:

print("Exiting due to failure in GCS bucket creation or access.")

return 1

github-actions · 2026-02-05T22:04:19Z

benchmarks/recipes/pw_utils.py

 """
 This module provides utility functions for Pathways-related benchmark recipes.
-
-It includes helpers for building lists of model configurations based on user
-selections and for generating `XpkClusterConfig` and `PathwaysConfig` objects.
 """



🟢 The docstring for this module was shortened and is now less descriptive. Consider restoring the previous level of detail to help others understand the module's purpose.

Suggested change

"""

This module provides utility functions for Pathways-related benchmark recipes.

It includes helpers for building lists of model configurations based on user

selections and for generating `XpkClusterConfig` and `PathwaysConfig` objects.

"""

"""

This module provides utility functions for Pathways-related benchmark recipes.

It includes helpers for building lists of model configurations based on user

selections and for generating `XpkClusterConfig` and `PathwaysConfig` objects.

"""

github-actions · 2026-02-05T22:04:33Z

🤖 I'm sorry @aireenmei, but I was unable to process your request. Please see the logs for more details.

DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch from a961644 to f2a46a6 Compare November 19, 2025 05:46

DannyLiCom marked this pull request as ready for review November 19, 2025 06:05

DannyLiCom requested review from NuojCheng, Obliviour, RissyRan, SujeethJinesh, bvandermoon, gobbleturk, khatwanimohit, mitalisi, richjames0, shauryagup, shralex and vipannalla as code owners November 19, 2025 06:06

DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch from f2a46a6 to 3840670 Compare November 23, 2025 16:39

DannyLiCom requested a review from notabee as a code owner November 23, 2025 16:39

DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch 2 times, most recently from 8adde2d to b08ea95 Compare November 28, 2025 02:17

DannyLiCom commented Nov 28, 2025

View reviewed changes

DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch 3 times, most recently from 0303bcf to 78b8073 Compare December 17, 2025 09:02

DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch 3 times, most recently from ce0fc18 to c37edc0 Compare December 29, 2025 02:19

DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch from c37edc0 to 57a9040 Compare December 29, 2025 02:43

DannyLiCom requested review from SurbhiJainUSC, gagika and hengtaoguo as code owners December 29, 2025 02:43

DannyLiCom requested review from A9isha, NicoGrande, aireenmei, jiangjy1982 and suexu1025 as code owners December 29, 2025 02:43

github-actions bot added the stale Automatically applied to stale PRs. label Jan 28, 2026

github-actions bot closed this Feb 4, 2026

ycchenzheng reopened this Feb 4, 2026

ycchenzheng requested a review from jesselu-google as a code owner February 4, 2026 22:20

DannyLiCom added 2 commits February 5, 2026 06:29

Recipe Modularization, DAG Integration, and BigQuery Integration

32c369f

rebase

991aaf6

DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch from 57a9040 to 991aaf6 Compare February 5, 2026 06:32

fixed pylint

0cefe4c

DannyLiCom requested review from michelle-yooh, parambole and shuningjin as code owners February 5, 2026 07:16

fixed pylint

cd24087

DannyLiCom commented Feb 5, 2026

View reviewed changes

github-actions bot removed the stale Automatically applied to stale PRs. label Feb 5, 2026

SujeethJinesh approved these changes Feb 5, 2026

View reviewed changes

aireenmei reviewed Feb 5, 2026

View reviewed changes

aireenmei added the gemini-review label Feb 5, 2026

github-actions bot reviewed Feb 5, 2026

View reviewed changes



		def handle_cmd_args(cluster_config: XpkClusterConfig, actions: str, *kwargs) -> bool:
		def handle_cmd_args(cluster_config: XpkClusterConfig, is_delete: bool, user: str, **kwargs) -> bool:

-  Returns:
-    An integer exit code from the workload execution (0 for success, non-zero for failure).
-  """
-  storage_client = storage.Client(project=user_config.project)
-  check_and_create_bucket(
+  storage_client = storage.Client(project=user_config.project)
+  bucket = check_and_create_bucket(
+      storage_client,
+      user_config.base_output_directory[5:].split("/")[0],
+      user_config.region,
+  )
+  if not bucket:
+    print("Exiting due to failure in GCS bucket creation or access.")
+    return 1

Conversation

DannyLiCom commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Tests

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Dec 29, 2025

Codecov Report

Uh oh!

github-actions bot commented Jan 28, 2026

Uh oh!

github-actions bot commented Feb 4, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SujeethJinesh commented Feb 5, 2026

Uh oh!

aireenmei left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Feb 5, 2026

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

🔍 General Feedback

Uh oh!

github-actions bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Feb 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

DannyLiCom commented Nov 19, 2025 •

edited

Loading