Skip to content

Recipe Modularization, DAG Integration, and BigQuery Integration#2715

Open
DannyLiCom wants to merge 4 commits intomainfrom
lidanny/pw-DAG-integrate2
Open

Recipe Modularization, DAG Integration, and BigQuery Integration#2715
DannyLiCom wants to merge 4 commits intomainfrom
lidanny/pw-DAG-integrate2

Conversation

@DannyLiCom
Copy link
Collaborator

@DannyLiCom DannyLiCom commented Nov 19, 2025

Description

When integrating Recipe Modularization, DAG Integration, and BigQuery, the DAG will pass in a command that includes multiple command-line flags.

Tests

Users can use commands similar to the following to run tests

python3 -m benchmarks.recipes.pw_mcjax_benchmark_recipe \
--user='user' \
--cluster_name='pw-scale-test-v5e-32' \
--project='cloud-tpu-multipod-dev' \
--zone='us-south1-a' \
--device_type='v5litepod-32' \
--benchmark_steps=20 \
--num_slices_list='2' \
--server_image='gcr.io/tpu-prod-env-one-vm/lidanny/unsanitized_server:latest' \
--proxy_image='gcr.io/tpu-prod-env-one-vm/lidanny/unsanitized_proxy_server:latest' \
--runner='gcr.io/tpu-prod-env-one-vm/lidanny_latest:latest' \
--selected_model_framework='pathways' \
--selected_model_names='llama3_1_8b_8192_v5e_256' \
--priority='medium' \
--max_restarts=0 \
--bq_enable=True \
--bq_db_project='cloud-tpu-multipod-dev' \
--bq_db_dataset="chzheng_test_100steps"

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@DannyLiCom DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch from a961644 to f2a46a6 Compare November 19, 2025 05:46
@DannyLiCom DannyLiCom marked this pull request as ready for review November 19, 2025 06:05
@DannyLiCom DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch from f2a46a6 to 3840670 Compare November 23, 2025 16:39
@DannyLiCom DannyLiCom requested a review from notabee as a code owner November 23, 2025 16:39
@DannyLiCom DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch 2 times, most recently from 8adde2d to b08ea95 Compare November 28, 2025 02:17


def handle_cmd_args(cluster_config: XpkClusterConfig, *actions: str, **kwargs) -> bool:
def handle_cmd_args(cluster_config: XpkClusterConfig, is_delete: bool, user: str, **kwargs) -> bool:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the original recipe could only accept the --delete flag, this section had to be modified to allow it to accept multiple flags.

parser_utils.add_arguments(parser)
args = parser.parse_args()

if len(sys.argv) > 2:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is implemented to determine whether multiple flags are being used.

@DannyLiCom DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch 3 times, most recently from 0303bcf to 78b8073 Compare December 17, 2025 09:02
@DannyLiCom DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch 3 times, most recently from ce0fc18 to c37edc0 Compare December 29, 2025 02:19
@codecov
Copy link

codecov bot commented Dec 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@github-actions
Copy link

This PR has been automatically marked as stale because it has not had recent activity. It will be closed soon if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Automatically applied to stale PRs. label Jan 28, 2026
@github-actions
Copy link

github-actions bot commented Feb 4, 2026

This PR was closed because it has been inactive for a while. Please reopen it if you are still working on it.

@DannyLiCom DannyLiCom force-pushed the lidanny/pw-DAG-integrate2 branch from 57a9040 to 991aaf6 Compare February 5, 2026 06:32
from MaxText.layers import nnx_wrappers
from MaxText.layers.initializers import variable_to_logically_partitioned
from MaxText.layers.quantizations import AqtQuantization as Quant
from MaxText.sharding import logical_to_mesh_axes, maybe_shard_with_name
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it because Pylint failed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix has been merged #3093

from MaxText.sharding import logical_to_mesh_axes
from MaxText.layers import attentions, linears, nnx_wrappers, quantizations
from MaxText.layers.initializers import NdInitializer, default_bias_init, nd_dense_init, variable_to_logically_partitioned
from maxtext.kernels import megablox as mblx
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it because Pylint failed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix has been merged in #3093

done

echo "Successfully clean up all codes."
echo "Successfully clean up all codes." No newline at end of file
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure why this was flagged as a change; it looks exactly the same to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove the unintended change from the commit git restore <path/to/file>

@github-actions github-actions bot removed the stale Automatically applied to stale PRs. label Feb 5, 2026
@SujeethJinesh
Copy link
Collaborator

LGTM, but I would just verify that the imports work properly at the top of the files. Since the MaxText modularization refactor, we may need to check that since a lot of these changes were from before that.

Copy link
Collaborator

@aireenmei aireenmei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work! I run the PR through AI and got these comments: https://paste.googleplex.com/6085130735190016. I looked at some points and they make sense to me, so I left inline comments. But there are more points, could you take a look?

"--selected_model_framework",
type=parse_str_list,
default=["pathways"],
help="List of model frameworks (e.g., pathways, mcjax",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a closing ")" for help message.

"--selected_model_names",
type=parse_str_list,
default=["llama3_1_8b_8192_v5e_256"],
help="List of model names (e.g., llama3_1_8b_8192_v5e_256, llama2-7b-v5e-256",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a closing ")" for help message.

done

echo "Successfully clean up all codes."
echo "Successfully clean up all codes." No newline at end of file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove the unintended change from the commit git restore <path/to/file>

from MaxText.layers import nnx_wrappers
from MaxText.layers.initializers import variable_to_logically_partitioned
from MaxText.layers.quantizations import AqtQuantization as Quant
from MaxText.sharding import logical_to_mesh_axes, maybe_shard_with_name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix has been merged #3093

from MaxText.sharding import logical_to_mesh_axes
from MaxText.layers import attentions, linears, nnx_wrappers, quantizations
from MaxText.layers.initializers import NdInitializer, default_bias_init, nd_dense_init, variable_to_logically_partitioned
from maxtext.kernels import megablox as mblx
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix has been merged in #3093

user_configs.USER_CONFIG.headless = False
should_continue = helper.handle_cmd_args(
user_configs.USER_CONFIG.cluster_config, helper.DELETE, xpk_path=user_configs.USER_CONFIG.xpk_path
user_configs.USER_CONFIG.cluster_config, user_configs.USER_CONFIG.DELETE, user_configs.USER_CONFIG.user
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be user_configs.USER_CONFIG.delete (lowercase)?

parser.add_argument(
"--server_image",
type=str,
default="us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default values of server_image and proxy_image looks swapped.

@github-actions
Copy link

github-actions bot commented Feb 5, 2026

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## 📋 Review Summary

This pull request introduces significant refactoring to the benchmarking recipes, modularizing argument parsing and improving configuration management. The changes make the recipes more flexible and easier to use from the command line, especially for DAG integration.

🔍 General Feedback

  • The introduction of parser_utils.py is a good step towards centralizing command-line argument handling.
  • The use of a UserConfig class helps in managing configurations cleanly.
  • The addition of check_and_create_bucket is a useful utility, though its error handling could be improved in the calling code.

Comment on lines +95 to +106
parser.add_argument(
"--colocated_python_image",
type=str,
default=None,
help="Colocated Python image.",
)
parser.add_argument("--worker_flags", type=str, default="", help="Worker flags.")
parser.add_argument("--proxy_flags", type=str, default="", help="Proxy flags.")
parser.add_argument("--server_flags", type=str, default="", help="Server flags.")

# Model Configuration
parser.add_argument("--benchmark_steps", type=int, default=20, help="Number of benchmark steps.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 The default values for `server_image` and `proxy_image` appear to be swapped. In `benchmarks/recipes/user_configs.py` they were swapped, and the help texts here ("Docker image for the proxy server." for `server_image`) also indicate they are swapped. Please verify and correct the default values and help texts.
Suggested change
parser.add_argument(
"--colocated_python_image",
type=str,
default=None,
help="Colocated Python image.",
)
parser.add_argument("--worker_flags", type=str, default="", help="Worker flags.")
parser.add_argument("--proxy_flags", type=str, default="", help="Proxy flags.")
parser.add_argument("--server_flags", type=str, default="", help="Server flags.")
# Model Configuration
parser.add_argument("--benchmark_steps", type=int, default=20, help="Number of benchmark steps.")
parser.add_argument(
"--server_image",
type=str,
default="us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server",
help="Docker image for the server.",
)
parser.add_argument(
"--proxy_image",
type=str,
default="us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server",
help="Docker image for the proxy server.",
)

Comment on lines 43 to +48

Returns:
An integer exit code from the workload execution (0 for success, non-zero for failure).
"""
storage_client = storage.Client(project=user_config.project)
check_and_create_bucket(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The return value of `check_and_create_bucket` is not checked. If the bucket creation fails (e.g., due to permissions), the function returns `None` and logs an error, but the script continues. This could lead to failures later on. It would be more robust to check the return value and exit if the bucket is not available.
Suggested change
Returns:
An integer exit code from the workload execution (0 for success, non-zero for failure).
"""
storage_client = storage.Client(project=user_config.project)
check_and_create_bucket(
storage_client = storage.Client(project=user_config.project)
bucket = check_and_create_bucket(
storage_client,
user_config.base_output_directory[5:].split("/")[0],
user_config.region,
)
if not bucket:
print("Exiting due to failure in GCS bucket creation or access.")
return 1

Comment on lines 15 to 18
"""
This module provides utility functions for Pathways-related benchmark recipes.

It includes helpers for building lists of model configurations based on user
selections and for generating `XpkClusterConfig` and `PathwaysConfig` objects.
"""

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 The docstring for this module was shortened and is now less descriptive. Consider restoring the previous level of detail to help others understand the module's purpose.
Suggested change
"""
This module provides utility functions for Pathways-related benchmark recipes.
It includes helpers for building lists of model configurations based on user
selections and for generating `XpkClusterConfig` and `PathwaysConfig` objects.
"""
"""
This module provides utility functions for Pathways-related benchmark recipes.
It includes helpers for building lists of model configurations based on user
selections and for generating `XpkClusterConfig` and `PathwaysConfig` objects.
"""

@github-actions
Copy link

github-actions bot commented Feb 5, 2026

🤖 I'm sorry @aireenmei, but I was unable to process your request. Please see the logs for more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants