Merge latest changes from main to 'Documentation' branch by rsareddy0329 · Pull Request #192 · aws/sagemaker-hyperpod-cli

rsareddy0329 · 2025-08-05T23:05:11Z

PR Approval Steps

For Requester

Description
- Check the PR title and description for clarity. It should describe the changes made and the reason behind them.
- Ensure that the PR follows the contribution guidelines, if applicable.
Security requirements
- Use git-secrets to confirm that the pull request (PR) does not expose passwords or other sensitive information, and upload the relevant evidence: https://github.com/awslabs/git-secrets
- Ensure every commit has a GitHub commit signature.
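As a hedged illustration of the commit-signature requirement, the sketch below parses the output of `git log --format="%H %G?"`, in which git prints a one-letter signature status for each commit (`G` for a good signature, `N` for none, `B` for bad, and so on). The helper names and the choice to accept only `G` are assumptions for illustration; they are not part of this repository's tooling, and a local check like this is not the same thing as GitHub's "Verified" badge.

```python
import subprocess

# Assumption: only a fully valid signature ("G") is treated as acceptable.
ACCEPTED_STATUSES = {"G"}


def commits_needing_attention(log_output: str) -> list[tuple[str, str]]:
    """Return (sha, status) pairs whose signature status is not accepted.

    Expects one "<sha> <status>" pair per line, as produced by
    `git log --format="%H %G?"`.
    """
    flagged = []
    for line in log_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        sha, status = parts
        if status not in ACCEPTED_STATUSES:
            flagged.append((sha, status))
    return flagged


def read_signature_log(rev_range: str) -> str:
    """Run git for a revision range such as "main..HEAD" and return its output."""
    result = subprocess.run(
        ["git", "log", "--format=%H %G?", rev_range],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

A requester could call `commits_needing_attention(read_signature_log("main..HEAD"))` before opening the PR and treat a non-empty result as a reason to re-sign commits.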
Manual review
1. Click on the Files changed tab to see the code changes. Review the changes thoroughly:
  - Code Quality: Check for coding standards, naming conventions, and readability.
  - Functionality: Ensure that the changes meet the requirements and that all necessary code paths are tested.
  - Security: Check for any security issues or vulnerabilities.
  - Documentation: Confirm that any necessary documentation (code comments, README updates, etc.) has been updated.
Check for Merge Conflicts:
- Verify whether there are any merge conflicts with the base branch; GitHub usually highlights them. If there are conflicts, resolve them.

For Reviewer

Go through the For Requester section to double-check each item.
Request Changes or Approve the PR:
1. If the PR is ready to be merged, click Review changes and select Approve.
2. If changes are required, select Request changes and provide feedback. Be constructive and clear in your feedback.
Merging the PR
1. Check the Merge Method:
  1. Decide on the appropriate merge method based on your repository's guidelines (e.g., Squash and merge, Rebase and merge, or Merge).
2. Merge the PR:
  1. Click the Merge pull request button.
  2. Confirm the merge by clicking Confirm merge.

* Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * CLI: Enable Telemetry * CLI: Enable Telemetry --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

…8s (#138)
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Fix unit test cases
* Move regex to a constant.
  **Description**
  - Removed an integration test case that was being mocked.
  - Moved a regex to a constant.
  **Testing Done**
  Unit test cases pass; no changes were made to integration test cases, and they should not be affected.
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Add ref link for version compatibility constraints
  **Description**
  Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server.
  **Testing Done**
  No breaking changes.
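The client/server version check described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from #138: it assumes the kubectl skew rule from the Kubernetes version skew policy (the client is supported within one minor version, older or newer, of the API server), and the names `VERSION_PATTERN`, `MAX_MINOR_SKEW`, and `is_within_skew` are hypothetical.

```python
import re

# Regex kept as a module-level constant, in the spirit of "Move regex to a constant".
# Matches the leading "major.minor" of strings such as "v1.29.3-eks-1234" or "1.28".
VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)")

# Assumption: one minor version of skew in either direction is tolerated.
MAX_MINOR_SKEW = 1


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a Kubernetes version string."""
    match = VERSION_PATTERN.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_within_skew(client_version: str, server_version: str) -> bool:
    """Return True when client and server differ by at most MAX_MINOR_SKEW minors."""
    client_major, client_minor = parse_major_minor(client_version)
    server_major, server_minor = parse_major_minor(server_version)
    if client_major != server_major:
        return False
    return abs(client_minor - server_minor) <= MAX_MINOR_SKEW
```

A caller could warn rather than fail when `is_within_skew` returns False, since an out-of-policy client often still works for basic operations.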

* Update documentation-with-new-changes branch with latest changes from main (#190)
* Fix training test (#184)
* Fix SDK training test: Add wait time before refresh
* Fix training tests in canaries
* Update logging information for submitting and deleting training job (#189)
Co-authored-by: pintaoz <pintaoz@amazon.com>
---------
Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com>
Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com>
Co-authored-by: pintaoz <pintaoz@amazon.com>
* Documentation Fixes (#191)
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* update documentation with new changes branch with latest changes (#194)
* Fix training test (#184)
* Fix SDK training test: Add wait time before refresh
* Fix training tests in canaries
* Update logging information for submitting and deleting training job (#189)
Co-authored-by: pintaoz <pintaoz@amazon.com>
---------
Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com>
Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com>
Co-authored-by: pintaoz <pintaoz@amazon.com>
* Documentation Fixes (#195)
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Documentation Fixes (#197)
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Documentation Fixes (#198)
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Documentation fixes (#199)
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
---------
Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com>
Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com>
Co-authored-by: pintaoz <pintaoz@amazon.com>
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

* Model customization Init Experience Flow (#290)
* model customization init/find model
* Adding direct create exp
* Model customization Init/Create/Find
* Latest model cust changes
* init migration done with template validation
* Init full experience migrated, CRUDL simple addition in hyp_cli.py, unit tests added, pending nova forge happy case for integ test
* remove argcomplete since it is not supported yet
* add reset command for dynamic template
* fix integ test error for init flow
* remove recipe finder and discovery changes
---------
Co-authored-by: Amarjeet LNU <jamjee@amazon.com>
* add direct create with interactive session for model customization, refactor code for modularization, unit test added (#292)
* Add pre-training-job and evaluation-job, set instance-type to optional, remove direct create support (#297)
* bug fix for matching instance type for override params and delete command:
* add pre-training-job and evaluation-job, set instance-type to optional, remove direct create support
* update checkpointless flag to framework to support more modes
* revert support for pre-training and framework flag (#299)
* Add debug parameter to init create standard template case
* Rename fine-tuning and eval jobs to hyp-recipe-job
* Support private hub by providing full arn to model_name
* Update unit test
* IN PROGRESS: Add model id resolution for recipe jobs
* Make technique required and combine with eval, regex check for private hub support, remove dynamic template
* Update recipe-job all commands, add params order pending review
* Update huggingface model-id search resolve mechanism
* Fix arn as private hub support input
* Update parameter grouping for recipe jobs, fix instance type handling
* Address callouts from kiro self-review
* Add and update unit tests, fix type handler for special cases
* Fix unit test for training_recipe
* Update according to comment and appsec review, add documentation, integ test and example notebook, pending recipe update
* Integ test passes locally, update error handling
* Bug bash and dog fooding improvements, update interactive cluster selection
* Fix integ test for recipe init
* Update create command message from Kubernetes to Hyperpod
---------
Co-authored-by: Amarjeet LNU <jamjee@amazon.com>

Bump all GPU Operator component versions to resolve critical CVEs in v25.3.4 images. Pure version bump: no image name changes, no regional-values changes, no behavioral change.
- operator: v25.3.4 → v25.10.1
- toolkit: v1.17.9-ubi8 → v1.18.1
- devicePlugin: v0.17.4 → v0.18.1
- gfd: v0.17.4 → v0.18.1
- migManager: v0.12.3-ubuntu20.04 → v0.13.1
- validator: v25.3.4 → v25.10.1
- toolkit.enabled left absent (defaults true), safe for upgrades
Co-authored-by: Stephen Via <svia@amazon.com>

…latest CRDs (#416) Update inference operator chart from AWSCrescendoInferenceOperator dist. Includes new CRD schemas, init container support, custom service accounts flag, and templated manager configuration. Excludes pdSpec (disaggregated prefill/decode) as it is not yet GA.

…oints (#417) * update chart versions * updated InferenceEndpointConfig CRD * Update template versions * nodeaffinity fixed * fix: dns and data capture * final fix * adding test and src --------- Co-authored-by: Chad Chiang <chadchc@amazon.com>

CVE remediation: Mirador findings against v25.10.1 base images resolved in v26.3.1. Pure version bump, toolkit stays enabled (parallel coexistence).
- gpu-operator: v25.10.1 → v26.3.1
- device-plugin: v0.18.1 → v0.19.0
- container-toolkit: v1.18.1 → v1.19.0
- mig-manager: v0.13.1 → v0.14.0
- gfd: v0.18.1 → v0.19.0
- validator: v25.10.1 → v26.3.1 (consolidated into gpu-operator image)
SIM: https://t.corp.amazon.com/V2203884559
Co-authored-by: Stephen Via <svia@amazon.com>

Add regional-values file for ap-south-2 so customers in Hyderabad can pull GPU operator images from the local ECR mirror (580982410692) during helm install. ECR account verified via isengardcli: aws-crescendo-dockerregistry+prod-hyd-service Co-authored-by: Stephen Via <svia@amazon.com>

Aditi2424 and others added 25 commits July 18, 2025 12:24

Update telemetry status to be Integer for parity (#130)

223af40

Co-authored-by: adishaa <adishaa@amazon.com>

Release new version for Health Monitoring Agent (1.0.643.0_1.0.192.0)…

cf77296

… with minor improvements and bug fixes (#137)

Release new version for Health Monitoring Agent (1.0.674.0_1.0.199.0)…

0342f60

… with minor improvements and bug fixes. (#139)

update inference CLI describe command print for better visualization …

631ddf9

…and ux (#136)

Update inference integ test to add dependency to improve telemetry ex…

dc440c3

…ception count data (#140)

Manual release v3.0.1 (#143)

cc08405

* manual release v3.0.1

change security-monitoring metrics data destination to us-east-2 for …

079fafd

…alarm fix (#147)

feat: Add region detection to install Health Monitoring Agent and use…

29a16c5

… regionalized HMA URI (#141)

Add unique time string to integ test (#150)

66232ed

* Add unique time string to integ test * Update syntax
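A common way to add a unique time string to integration-test resource names is to append a UTC timestamp, so that repeated or overlapping runs do not reuse the same name. The sketch below is an assumption about the general technique rather than the change made in #150; `unique_name` and its format string are hypothetical.

```python
import time
from typing import Optional


def unique_name(prefix: str, now: Optional[float] = None) -> str:
    """Return "<prefix>-YYYYMMDD-HHMMSS" using UTC.

    `now` is seconds since the epoch; when None, the current time is used.
    Passing it explicitly keeps the helper deterministic in unit tests.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    return f"{prefix}-{stamp}"
```

Second-level granularity can still collide when two tests start within the same second; appending a short random suffix is one way to reduce that risk.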

update example notebook for inference CLI (#151)

9fbec4a

Training: Main documentation update (#153)

8034a24

* Training CLI & SDK: example notebook and README update * Update training cli example notebook --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

Update inference SDK examples (#155)

0bcee6d

* Update inference SDK examples * Update readme

update help text to avoid truncation (#158)

d2130e9

Add an option to disable the deployment of KubeFlow TrainingOperator (#…

293f9b9

…102)

Remove unused param from documentation (#170)

9f534b4

Update volume flag to support hostPath and pvc (#171)

ec8800d

* update help text to avoid truncation * update volume flag to support hostPath and pvc, before e2e testing * clean up and e2e working * Minor updates after PR * update * Added unit tests for volume, all cli unit tests passed

Restructure list-cluster output (#173)

95e073e

Co-authored-by: pintaoz <pintaoz@amazon.com>

Update inference config and integ tests (#167)

a8a2baf

* Update inference config and integ tests * Update integ tests for new canaries

Update readme for volume flag (#176)

2908a62

Manual release v3.0.2 (#177)

9b7220c

* Manual release v3.0.2 * Update changelog --------- Co-authored-by: pintaoz <pintaoz@amazon.com>

Add schema pattern check to pytorch-job template (#178)

36fac66

* Update readme for volume flag * Add schema pattern check to pytorch-job template, unit test added, all test passed locally

Fix training test (#184)

dcbc8fb

* Fix SDK training test: Add wait time before refresh * Fix training tests in canaries

Update logging information for submitting and deleting training job (#…

28424e4

…189) Co-authored-by: pintaoz <pintaoz@amazon.com>

rsareddy0329 requested a review from a team as a code owner August 5, 2025 23:05

rsareddy0329 and others added 4 commits August 6, 2025 13:51

Added new column 'deployment configs' to the itable that allows user'…

6553766

…s to view SDK config code (#188) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>

Add instance type support for ml.p6e-gb200.36xlarge (#204)

63ff3b4

* Add instance type support for ml.p6e-gb200.36xlarge Updated support for ml.p6-b200.48xlarge as well * Add ml.p6e-gb200.36xlarge to efa plugin

changed endpoint name from value user has to manually insert to place…

e3f697a

…holder value (#206) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>

mollyheamazon had a problem deploying to manual-approval April 16, 2026 05:38 — with GitHub Actions Error

mollyheamazon had a problem deploying to manual-approval April 16, 2026 20:20 — with GitHub Actions Error

fix: exclude DELETE_COMPLETE stacks at API level to prevent ListStack…

c2726ce

…s throttling (#410)
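One way to exclude `DELETE_COMPLETE` stacks at the API level is to pass an explicit `StackStatusFilter` to CloudFormation's `list_stacks` call, so deleted stacks are never returned or paged through. The sketch below illustrates that idea and is not the code from #410: `iter_live_stacks` and the status list are hypothetical, the list is deliberately non-exhaustive, and `cfn_client` is assumed to be a boto3 CloudFormation client such as `boto3.client("cloudformation")`.

```python
from typing import Any, Dict, Iterator

# Assumption: a non-exhaustive set of statuses worth listing.
# DELETE_COMPLETE is intentionally left out so deleted stacks are filtered server-side.
LIVE_STACK_STATUSES = [
    "CREATE_IN_PROGRESS",
    "CREATE_FAILED",
    "CREATE_COMPLETE",
    "ROLLBACK_IN_PROGRESS",
    "ROLLBACK_FAILED",
    "ROLLBACK_COMPLETE",
    "DELETE_IN_PROGRESS",
    "DELETE_FAILED",
    "UPDATE_IN_PROGRESS",
    "UPDATE_COMPLETE",
    "UPDATE_ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_FAILED",
    "UPDATE_ROLLBACK_COMPLETE",
]


def iter_live_stacks(cfn_client: Any) -> Iterator[Dict[str, Any]]:
    """Yield stack summaries, letting the API drop deleted stacks."""
    paginator = cfn_client.get_paginator("list_stacks")
    for page in paginator.paginate(StackStatusFilter=LIVE_STACK_STATUSES):
        yield from page.get("StackSummaries", [])
```

Filtering on the server side means fewer pages to fetch in accounts with a long history of deleted stacks, which is where repeated `list_stacks` calls are most likely to be throttled.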

mollyheamazon had a problem deploying to manual-approval April 16, 2026 21:12 — with GitHub Actions Error

Release v3.8.0 (#411)

f0c8641

mollyheamazon had a problem deploying to manual-approval April 16, 2026 23:39 — with GitHub Actions Error

feat: support generic {ide}-remote connection types for space access (#…

8840eaa

…414)

mollyheamazon had a problem deploying to manual-approval April 30, 2026 18:49 — with GitHub Actions Error

github-advanced-security AI found potential problems Apr 30, 2026

Comment thread test/unit_tests/cli/test_space_access.py

        ])
        assert result.exit_code == 0
        assert "https://kiro-url.com" in result.output

Comment thread test/unit_tests/cli/test_space_access.py

        ])
        assert result.exit_code == 0
        assert "https://cursor-url.com" in result.output

aviruthen had a problem deploying to manual-approval May 5, 2026 22:19 — with GitHub Actions Error

mollyheamazon had a problem deploying to manual-approval May 6, 2026 19:18 — with GitHub Actions Error

fix(helm): Bump inference-operator dependency to 2.1.1 (#418)

20eb538

mollyheamazon had a problem deploying to manual-approval May 6, 2026 22:26 — with GitHub Actions Error

mollyheamazon had a problem deploying to manual-approval May 7, 2026 21:48 — with GitHub Actions Error

aviruthen had a problem deploying to manual-approval May 11, 2026 21:27 — with GitHub Actions Error

aviruthen had a problem deploying to manual-approval June 1, 2026 23:01 — with GitHub Actions Error

fix: rename lr_warmup_ratio to lr_warmup_steps_ratio to match Hub rec…

04306c2

…ipe schema (#428)
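A rename such as `lr_warmup_ratio` to `lr_warmup_steps_ratio` can leave older configuration dictionaries carrying the previous key. The helper below is a hypothetical sketch of how a caller might migrate such a dictionary; nothing in this PR indicates that #428 includes a compatibility shim, and `rename_key` is an illustrative name only.

```python
from typing import Any, Dict


def rename_key(params: Dict[str, Any], old: str, new: str) -> Dict[str, Any]:
    """Return a copy of `params` with `old` renamed to `new`.

    Raises ValueError when both keys are present, since it is then
    unclear which value the caller intended.
    """
    updated = dict(params)  # shallow copy so the input is left untouched
    if old in updated:
        if new in updated:
            raise ValueError(f"Both {old!r} and {new!r} are set")
        updated[new] = updated.pop(old)
    return updated
```

For example, `rename_key(cfg, "lr_warmup_ratio", "lr_warmup_steps_ratio")` leaves `cfg` unchanged and returns a dictionary using the new key.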

aviruthen had a problem deploying to manual-approval June 4, 2026 18:41 — with GitHub Actions Error

Release new version for Health Monitoring Agent 1.0.1892.0_1.0.424.0 …

7ae0500

…with bug fixes. (#425) Bug Fixes * Add 2 min sleep before nvml component checks to prevent missing GPU false positives

aviruthen had a problem deploying to manual-approval June 4, 2026 22:10 — with GitHub Actions Error

fix: exclude kubernetes==36.0.0 which breaks EKS auth and relax pyyam…

9644fe5

…l pin (#426)

aviruthen had a problem deploying to manual-approval June 4, 2026 22:41 — with GitHub Actions Error

[DPD] DPD CRD changes with version bump to v3.2 (#427)

959b62b

* [DPD] DPD CRD changes with version bump to v3.2 * Pin operator to amd64 nodes via nodeAffinity and version bump to 2.2.1

aviruthen requested a deployment to manual-approval June 8, 2026 23:52 — with GitHub Actions Waiting

Merge latest changes from main to 'Documentation' branch #192
rsareddy0329 wants to merge 217 commits into documentation from main

20 participants
