Merge latest changes from main to 'Documentation' branch #192

Open
rsareddy0329 wants to merge 206 commits into documentation from main

Conversation

@rsareddy0329
Collaborator

PR Approval Steps

For Requester

  1. Description
    • Check the PR title and description for clarity. It should describe the changes made and the reason behind them.
    • Ensure that the PR follows the contribution guidelines, if applicable.
  2. Security requirements
    • Use git-secrets to ensure that the pull request (PR) does not expose passwords or other sensitive information, and upload relevant evidence: https://github.com/awslabs/git-secrets
    • Ensure that each commit has a GitHub commit signature.
  3. Manual review
    1. Click on the Files changed tab to see the code changes. Review the changes thoroughly:
      • Code Quality: Check for coding standards, naming conventions, and readability.
      • Functionality: Ensure that the changes meet the requirements and that all necessary code paths are tested.
      • Security: Check for any security issues or vulnerabilities.
      • Documentation: Confirm that any necessary documentation (code comments, README updates, etc.) has been updated.
  4. Check for Merge Conflicts:
    • Verify if there are any merge conflicts with the base branch. GitHub will usually highlight this. If there are conflicts, you should resolve them.

For Reviewer

  1. Go through the For Requester section and double-check each item.
  2. Request Changes or Approve the PR:
    1. If the PR is ready to be merged, click Review changes and select Approve.
    2. If changes are required, select Request changes and provide feedback. Be constructive and clear in your feedback.
  3. Merging the PR
    1. Check the Merge Method:
      1. Decide on the appropriate merge method based on your repository's guidelines (e.g., Squash and merge, Rebase and merge, or Merge).
    2. Merge the PR:
      1. Click the Merge pull request button.
      2. Confirm the merge by clicking Confirm merge.

Aditi2424 and others added 25 commits July 18, 2025 12:24
Co-authored-by: adishaa <adishaa@amazon.com>
* manual release v3.0.1
* Add unique time string to integ test

* Update syntax
* Training CLI & SDK: example notebook and README update

* Update training cli example notebook

---------

Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Update inference SDK examples

* Update readme
* Enable Hyperpod telemetry

* Enable Hyperpod telemetry

* Enable Hyperpod telemetry

* Enable Hyperpod telemetry

* Enable Hyperpod telemetry

* Enable Hyperpod telemetry

* CLI: Enable Telemetry

* CLI: Enable Telemetry

---------

Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* update help text to avoid truncation

* update volume flag to support hostPath and pvc, before e2e testing

* clean up and e2e working

* Minor updates after PR

* update

* Added unit tests for volume, all cli unit tests passed
Co-authored-by: pintaoz <pintaoz@amazon.com>
* Update inference config and integ tests

* Update integ tests for new canaries
* Manual release v3.0.2

* Update changelog

---------

Co-authored-by: pintaoz <pintaoz@amazon.com>
* Update readme for volume flag

* Add schema pattern check to pytorch-job template, unit test added, all test passed locally
…8s (#138)

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Fix unit test cases

* Move regex to a constant.

**Description**
- Removed an integration test case that was being mocked.
- Moved a regex to a constant.

**Testing Done**
Unit test cases pass. No changes were made to integration test cases, so they should not be affected.

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s

* Add ref link for version compatibility constraints

**Description**
Added a link to k8s documentation mentioning the constraints that rule the version compatibility of client and server.

**Testing Done**
No breaking changes.
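
For readers unfamiliar with the constraint referenced above: the Kubernetes version skew policy supports kubectl within one minor version (older or newer) of the API server. The following is a minimal sketch of such a check, not the code from this PR. The names `K8S_VERSION_PATTERN`, `parse_major_minor`, and `is_client_compatible` are hypothetical, and the exact rule the CLI enforces may differ.

```python
import re
from typing import Tuple

# Hypothetical constant name; the commit only says the regex was moved to a constant.
K8S_VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)")


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.31.2' or '1.30+'."""
    match = K8S_VERSION_PATTERN.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_client_compatible(client_version: str, server_version: str,
                         max_minor_skew: int = 1) -> bool:
    """Return True when client and server share a major version and
    their minor versions differ by at most max_minor_skew."""
    client_major, client_minor = parse_major_minor(client_version)
    server_major, server_minor = parse_major_minor(server_version)
    if client_major != server_major:
        return False
    return abs(client_minor - server_minor) <= max_minor_skew
```

Keeping the pattern in a module-level constant, as the commit describes, means it is compiled once and can be reused by tests.
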
* Fix SDK training test: Add wait time before refresh

* Fix training tests in canaries
rsareddy0329 requested a review from a team as a code owner August 5, 2025 23:05
rsareddy0329 and others added 4 commits August 6, 2025 13:51
* Update documentation-with-new-changes branch with latest changes from main (#190)

* Fix training test (#184)

* Fix SDK training test: Add wait time before refresh

* Fix training tests in canaries

* Update logging information for submitting and deleting training job (#189)

Co-authored-by: pintaoz <pintaoz@amazon.com>

---------

Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com>
Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com>
Co-authored-by: pintaoz <pintaoz@amazon.com>

* Documentation Fixes (#191)

Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

* update documentation with new changes branch with latest changes (#194)

* Fix training test (#184)

* Fix SDK training test: Add wait time before refresh

* Fix training tests in canaries

* Update logging information for submitting and deleting training job (#189)

Co-authored-by: pintaoz <pintaoz@amazon.com>

---------

Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com>
Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com>
Co-authored-by: pintaoz <pintaoz@amazon.com>

* Documentation Fixes (#195)

* Documentation Fixes

* Documentation Fixes

---------

Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

* Documentation Fixes (#197)

* Documentation Fixes

* Documentation Fixes

* Documentation Fixes

* Documentation Fixes

---------

Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

* Documentation Fixes (#198)

* Documentation Fixes

* Documentation Fixes

* Documentation Fixes

* Documentation Fixes

* Documentation Fixes

---------

Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

* Documentation fixes (#199)

* Documentation Fixes

* Documentation Fixes

* Documentation Fixes

* Documentation Fixes

* Documentation Fixes

* Documentation Fixes

---------

Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>

---------

Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com>
Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com>
Co-authored-by: pintaoz <pintaoz@amazon.com>
Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
…s to view SDK config code (#188)

Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
* Add instance type support for ml.p6e-gb200.36xlarge

Updated support for ml.p6-b200.48xlarge as well

* Add ml.p6e-gb200.36xlarge to efa plugin
…holder value (#206)

Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
…boto3 client creation (#395)

* Support AWS_REGION env var, cluster context fallback, and centralize boto3 client creation

* fix: update test mocks to use create_boto3_client instead of boto3.client

---------

Co-authored-by: Farhan Tejani <8650465+FarhanTejani@users.noreply.github.com>
Co-authored-by: Farhan Tejani <8650465+FarhanTejani@users.noreply.github.com>
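
The region fallback described in this commit can be illustrated roughly as follows. This is a sketch under stated assumptions, not the PR's implementation: `resolve_region`, `region_from_arn`, and `create_client_sketch` are hypothetical names, the precedence of an explicit argument over `AWS_REGION` is assumed, and the cluster context is assumed to be an EKS cluster ARN (the default context name written by `aws eks update-kubeconfig`). The actual signature of `create_boto3_client` is not shown in this conversation.

```python
import os
from typing import Mapping, Optional


def region_from_arn(arn: str) -> Optional[str]:
    """Return the region field of an ARN, or None if the string is not an ARN."""
    parts = arn.split(":")
    if len(parts) >= 6 and parts[0] == "arn" and parts[3]:
        return parts[3]
    return None


def resolve_region(explicit_region: Optional[str] = None,
                   cluster_context: Optional[str] = None,
                   environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Assumed precedence: explicit argument, then AWS_REGION, then cluster context."""
    env = os.environ if environ is None else environ
    if explicit_region:
        return explicit_region
    if env.get("AWS_REGION"):
        return env["AWS_REGION"]
    if cluster_context:
        return region_from_arn(cluster_context)
    return None


def create_client_sketch(service_name: str, region_name: Optional[str] = None,
                         cluster_context: Optional[str] = None, **kwargs):
    """Single place to build clients, so every caller resolves the region the same way."""
    import boto3  # imported lazily so region resolution is testable without boto3

    region = resolve_region(region_name, cluster_context)
    if region is not None:
        kwargs["region_name"] = region
    # With no resolved region, boto3 falls back to its own configuration chain.
    return boto3.client(service_name, **kwargs)
```

Routing all client creation through one helper is also what makes the test change in this commit natural: mocks patch the helper instead of `boto3.client`.
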
…393)

* Require --instance-type when specifying accelerator resources (#317)

* fix: move validation before early return, re-enable and improve resource allocation tests

---------

Co-authored-by: Farhan Tejani <8650465+FarhanTejani@users.noreply.github.com>
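
The ordering issue named in the fix ("move validation before early return") can be sketched like this. The function and parameter names are hypothetical and the returned structure is a placeholder; only the ordering of the check relative to the early return is the point.

```python
from typing import Dict, Optional, Union


def plan_resources(instance_type: Optional[str],
                   accelerators: Optional[int] = None) -> Dict[str, Union[str, int]]:
    """Hypothetical resource planner illustrating validation order."""
    # Validate first. If this check sat below the early return, a request
    # with accelerators but no instance type would silently return {}.
    if accelerators is not None and instance_type is None:
        raise ValueError(
            "--instance-type is required when specifying accelerator resources"
        )

    # Early return: nothing can be derived without an instance type.
    if instance_type is None:
        return {}

    resources: Dict[str, Union[str, int]] = {"instance_type": instance_type}
    if accelerators is not None:
        resources["accelerators"] = accelerators
    return resources
```
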
Add ml.p6-b300.48xlarge to INSTANCE_TYPE_MIG_PROFILES in constants.py
with the correct B300 MIG profiles derived from the NVIDIA GPU Operator
v25.3.0 upstream ConfigMap (device-filter 0x318210DE):

- mig-1g.34gb, mig-1g.67gb, mig-2g.67gb
- mig-3g.135gb, mig-4g.135gb, mig-7g.269gb

Also add the corresponding uniform and mixed MIG partition profiles
to the Helm chart default-mig-config.yaml ConfigMap, following the
same pattern used for existing GPU types (H100, H200, B200).

The B300 GPU (288GB HBM3e, ~269GB usable) was already registered in
INSTANCE_RESOURCES but had no MIG profile mapping, causing HyperPod
MIG validation to reject accelerator partition requests on this
instance type.
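
A rough illustration of why a missing mapping rejects every partition request: the sketch below assumes the constant maps an instance type to a collection of allowed profile names. The real layout of `INSTANCE_TYPE_MIG_PROFILES` in constants.py is not shown here, and `validate_mig_profile` is a hypothetical helper; the profile names are the ones listed in this commit message.

```python
# Assumed shape: instance type -> set of allowed MIG profile names.
INSTANCE_TYPE_MIG_PROFILES = {
    "ml.p6-b300.48xlarge": {
        "mig-1g.34gb", "mig-1g.67gb", "mig-2g.67gb",
        "mig-3g.135gb", "mig-4g.135gb", "mig-7g.269gb",
    },
}


def validate_mig_profile(instance_type: str, profile: str) -> None:
    """Raise ValueError unless the profile is allowed on the instance type."""
    allowed = INSTANCE_TYPE_MIG_PROFILES.get(instance_type)
    if allowed is None:
        # The failure mode described above: the instance type is known
        # elsewhere but has no MIG mapping, so every request is rejected.
        raise ValueError(f"No MIG profiles registered for {instance_type}")
    if profile not in allowed:
        raise ValueError(
            f"{profile} is not valid for {instance_type}; "
            f"expected one of {sorted(allowed)}"
        )
```
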
* update chart versions

* updated InferenceEndpointConfig CRD

---------

Co-authored-by: Chad Chiang <chadchc@amazon.com>
…00) (#403)

Remove MIG-specific configuration for g7e instances while keeping
instance type recognition and hardware specs intact:

- Remove g7e entries from INSTANCE_TYPE_MIG_PROFILES in constants.py
- Remove g7e MIG config block from GPU operator default-mig-config.yaml

g7e instances remain valid for HyperPod CLI operations in whole-GPU mode.
MIG partitioning will be re-enabled in a future PR.

Partially reverts: 902e88f (PR #390), fully reverts MIG portion of 51b342f (PR #391)
…with bug fixes. (#405)

Bug Fixes

* Added handling for Nvidia GPU Xid 94 errors (ROBUST_CHANNEL_CONTAINED_ERROR) as a new fault category with no action triggering on Kubernetes platforms
The p6-b200.48xlarge key was missing the ml. prefix in both
INSTANCE_TYPE_MIG_PROFILES (training) and INSTANCE_MIG_PROFILES
(inference), causing MIG validation to always reject B200 instances.
The instance type flowing through the system from the Kubernetes
node label (node.kubernetes.io/instance-type) is always
ml.p6-b200.48xlarge, so the dict lookup never matched.

Additionally, the inference constant had the wrong MIG profiles
for B200 — it used GB200 values (47gb, 93gb, 186gb) instead of
the correct B200 values (45gb, 90gb, 180gb), likely a copy-paste
from the ml.p6e-gb200.36xlarge entry.

Fixes:
- training/constants.py: 'p6-b200.48xlarge' -> 'ml.p6-b200.48xlarge'
- inference/constant.py: key prefix + correct B200 profiles
- test: update to use ml. prefixed instance type
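
The lookup failure described above is easy to reproduce in isolation. In this sketch the dictionary values are only the memory sizes quoted in the commit message, not the real profile names, and `find_unprefixed_keys` is a hypothetical guard that a unit test could use to catch the same mistake in future entries.

```python
from typing import Iterable, List

# Values are placeholders taken from the sizes quoted above, not real profile names.
BUGGY_PROFILES = {"p6-b200.48xlarge": ("45gb", "90gb", "180gb")}
FIXED_PROFILES = {"ml.p6-b200.48xlarge": ("45gb", "90gb", "180gb")}

# Value carried by the node.kubernetes.io/instance-type label, per the commit message.
NODE_INSTANCE_TYPE = "ml.p6-b200.48xlarge"


def find_unprefixed_keys(instance_types: Iterable[str]) -> List[str]:
    """Return instance-type keys that lack the 'ml.' prefix."""
    return sorted(key for key in instance_types if not key.startswith("ml."))


# The buggy key never matches the label value, so validation always rejects.
buggy_lookup = BUGGY_PROFILES.get(NODE_INSTANCE_TYPE)
fixed_lookup = FIXED_PROFILES.get(NODE_INSTANCE_TYPE)
```

A test asserting that `find_unprefixed_keys` returns an empty list for both the training and inference constants would flag this class of bug without needing a B200 node.
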
…#383) (#401)

* Introduce replica_count support for training and deprecate node_count (#383)

* Fix integration tests for replica-count validation changes
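
One common way to introduce a replacement parameter while deprecating the old one is sketched below. This is an assumption about the general pattern, not the code from #383: `resolve_replica_count` is a hypothetical helper, and both the behavior when the two values conflict and the positive-integer check are choices made for this illustration.

```python
import warnings
from typing import Optional


def resolve_replica_count(replica_count: Optional[int] = None,
                          node_count: Optional[int] = None) -> Optional[int]:
    """Prefer replica_count; accept node_count with a deprecation warning."""
    if node_count is not None:
        warnings.warn(
            "node_count is deprecated; use replica_count instead",
            DeprecationWarning,
            stacklevel=2,
        )
    # Assumed policy: conflicting values are an error rather than a silent override.
    if (replica_count is not None and node_count is not None
            and replica_count != node_count):
        raise ValueError("replica_count and node_count disagree; set only replica_count")

    value = replica_count if replica_count is not None else node_count
    if value is not None and value < 1:
        raise ValueError("replica_count must be a positive integer")
    return value
```
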
* Model customization Init Experience Flow (#290)

* model customization init/find model

* Adding direct create exp

* Model customization Init/Create/Find

* Latest model cust changes

* init migration done with template validation

* Init full experience migrated, CRUDL simple addition in hyp_cli.py, unit tests added, pending nova forge happy case for integ test

* remove argcomplete since it is not supported yet

* add reset command for dynamic template

* fix integ test error for init flow

* remove recipe finder and discovery changes

---------

Co-authored-by: Amarjeet LNU <jamjee@amazon.com>

* add direct create with interactive session for model customization, refactor code for modularization, unit test added (#292)

* Add pre-training-job and evaluation-job, set instance-type to optional, remove direct create support (#297)

* bug fix for matching instance type for override params and delete command:

* add pre-training-job and evaluation-job, set instance-type to optional, remove direct create support

* update checkpointless flag to framework to support more modes

* revert support for pre-training and framework flag (#299)

* Add debug parameter to init create standard template case

* Rename fine-tuning and eval jobs to hyp-recipe-job

* Support private hub by providing full arn to model_name

* Update unit test

* IN PROGRESS: Add model id resolution for recipe jobs

* Make technique required and combine with eval, regex check for private hub support, remove dynamic template

* Update recipe-job all commands, add params order pending review

* Update huggingface model-id search resolve mechanism

* Fix arn as private hub support input

* Update parameter grouping for recipe jobs, fix instance type handling

* Address callouts from kiro self-review

* Add and update unit tests, fix type handler for special cases

* Fix unit test for training_recipe

* Update according to comment and appsec review, add documentation, integ test and example notebook, pending recipe update

* Integ test passes locally, update error handling

* Bug bash and dogfooding improvements, update interactive cluster selection

* Fix integ test for recipe init

* Update create command message from Kubernetes to Hyperpod

---------

Co-authored-by: Amarjeet LNU <jamjee@amazon.com>