
Add gemma4-26b E2E test scripts for pre-training#4278

Draft
chiajunglien wants to merge 3 commits into AI-Hypercomputer:main from CIeNET-International:emma/e2e-training


chiajunglien commented Jun 26, 2026


Description

This PR introduces end-to-end (E2E) testing and validation pipelines for the Gemma 4 26B model in MaxText.
Blocked by: https://buganizer.corp.google.com/issues/527749229

Tests

Pre-Training:

export RUN_ID=$(date +%Y-%m-%d-%H-%M)
bash tests/end_to_end/tpu/gemma4/26b/test_gemma4_to_mt.sh $RUN_ID
bash tests/end_to_end/tpu/gemma4/26b/test_gemma4.sh $RUN_ID
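The two scripts above share one RUN_ID and are run in sequence: the conversion script (test_gemma4_to_mt.sh) first, then the pre-training script (test_gemma4.sh). A minimal wrapper sketch, not part of this PR, assuming each script takes RUN_ID as its only positional argument as in the commands above. The helper name run_gemma4_e2e is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch (not part of this PR): run the gemma4-26b E2E steps in order with one RUN_ID.
# Assumption: each script accepts RUN_ID as its only positional argument.
set -euo pipefail

# run_gemma4_e2e SCRIPT_DIR RUN_ID  (hypothetical helper)
# Runs the conversion script, then the pre-training script, and returns 1
# as soon as one of them fails so that later steps are skipped.
run_gemma4_e2e() {
  local script_dir="$1"
  local run_id="$2"
  local step
  for step in test_gemma4_to_mt.sh test_gemma4.sh; do
    echo "==> ${step} (RUN_ID=${run_id})"
    bash "${script_dir}/${step}" "${run_id}" || return 1
  done
}
```

Example: run_gemma4_e2e tests/end_to_end/tpu/gemma4/26b "$(date +%Y-%m-%d-%H-%M)". The explicit "|| return 1" keeps the fail-fast behavior even when the function is called from an if or || context, where set -e is suspended.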

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have added necessary comments to my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

…-26b

Introduces E2E test configurations and scripts for the gemma4-26b model,
covering both inference decoding and pre-training validation pipelines.
# Non-Googlers please remember to point `BASE_OUTPUT_DIRECTORY` to the GCS paths where you want to store scanned and unscanned checkpoints
BASE_OUTPUT_DIRECTORY=gs://runner-maxtext-logs/${MODEL_NAME}/to_maxtext

# Step 1: Install torch

Collaborator


Please test whether we still need it.

chiajunglien Jun 26, 2026

Author


I tested with the command below:

xpk workload create \
  --cluster=mesa-v6e32-eu \
  --workload=gemma4-logit-check-$(date +%m%d%H%M) \
  --device-type=v6e-32 \
  --num-slices=1 \
  --priority=high \
  --project=cienet-cmcs \
  --zone=europe-west4-a \
  --skip-validation \
  --docker-image=gcr.io/tpu-prod-env-multipod/maxtext_jax_nightly:28210168847 \
  --command="set -xue; \
    python3 -m tests.utils.forward_pass_logit_checker \
      load_parameters_path='gs://us-central1-emmalien-test-83df3ecd-bucket/gemma4-26b-0625/unscanned/2026-06-25-06-21-12/0/items' \
      model_name=gemma4-26b \
      use_multimodal=false \
      scan_layers=false \
      --hf_model_path=google/gemma-4-26b-a4b-it \
      --max_kl_div=0.03 \
      --run_hf_model=true \
      hardware=cpu \
      skip_jax_distributed_system=True"

and got the following error:

[transformers] PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
[transformers] DeepseekV32Config got `key=rope_scaling` in kwargs but hasn't set it as attribute. For RoPE standardization you need to set `self.rope_parameters` in model's config. 
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/deps/tests/utils/forward_pass_logit_checker.py", line 50, in <module>
    from maxtext.checkpoint_conversion.utils.hf_utils import convert_jax_weight_to_torch
  File "/deps/src/maxtext/checkpoint_conversion/utils/hf_utils.py", line 25, in <module>
    import torch.nn.functional as F
ModuleNotFoundError: No module named 'torch'
XPK End: Fri Jun 26 09:08:08 UTC 2026
EXIT_CODE=1

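It seems torch is still required: forward_pass_logit_checker imports convert_jax_weight_to_torch from maxtext.checkpoint_conversion.utils.hf_utils, which imports torch.nn.functional at module load time, so the torch install step is still needed.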
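A possible follow-up, sketched here as an assumption rather than part of this PR: check for torch before launching the logit checker, so a missing install fails immediately instead of surfacing as a ModuleNotFoundError after the workload starts. The helper name require_python_module is hypothetical; it only tests whether python3 can import a given module.

```shell
#!/usr/bin/env bash
# Sketch (not part of this PR): fail fast when a Python module the checker needs is missing.
set -euo pipefail

# require_python_module NAME  (hypothetical helper)
# Returns 0 if `import NAME` succeeds under python3, otherwise prints to stderr and returns 1.
require_python_module() {
  local module="$1"
  if python3 -c "import ${module}" >/dev/null 2>&1; then
    echo "found python module: ${module}"
    return 0
  fi
  echo "missing python module: ${module}" >&2
  return 1
}
```

For example, require_python_module torch || python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu would install a CPU-only build when torch is absent. Whether a CPU wheel is sufficient for this image (the command above runs with hardware=cpu) is an assumption worth verifying.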
