Merge latest changes from main to 'Documentation' branch #192
Open
rsareddy0329 wants to merge 206 commits into documentation from
Co-authored-by: adishaa <adishaa@amazon.com>
… with minor improvements and bug fixes (#137)
… with minor improvements and bug fixes. (#139)
…ception count data (#140)
* manual release v3.0.1
… regionalized HMA URI (#141)
* Add unique time string to integ test * Update syntax
* Training CLI & SDK: example notebook and README update * Update training cli example notebook --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* Update inference SDK examples * Update readme
* Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * Enable Hyperpod telemetry * CLI: Enable Telemetry * CLI: Enable Telemetry --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
* update help text to avoid truncation * update volume flag to support hostPath and pvc, before e2e testing * clean up and e2e working * Minor updates after PR * update * Added unit tests for volume, all cli unit tests passed
Co-authored-by: pintaoz <pintaoz@amazon.com>
* Update inference config and integ tests * Update integ tests for new canaries
* Manual release v3.0.2 * Update changelog --------- Co-authored-by: pintaoz <pintaoz@amazon.com>
* Update readme for volume flag * Add schema pattern check to pytorch-job template, unit test added, all test passed locally
…8s (#138) * Add k8s version validation check between server and client version according to the versioning constraints supported by k8s * Fix unit test cases * Move regex to a constant. **Description** - Removed an integration test case that was being mocked. - Moved a regex to a constant. **Testing Done** Unit test cases pass; no changes were made to integration test cases, and they should not be affected. * Add k8s version validation check between server and client version according to the versioning constraints supported by k8s * Add ref link for version compatibility constraints **Description** Added a link to the k8s documentation describing the constraints that govern version compatibility between client and server. **Testing Done** No breaking changes.
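For readers unfamiliar with this kind of check, the following is a minimal sketch of a client/server version comparison, not the code from this commit. It assumes the one-minor-version skew that the Kubernetes version skew policy documents for kubectl against kube-apiserver; the names `K8S_VERSION_PATTERN`, `MAX_MINOR_SKEW`, `parse_major_minor`, and `is_compatible` are hypothetical and do not come from the PR.

```python
import re

# Hypothetical constant: the commit only says a regex was moved to a constant.
# Matches the leading "major.minor" of strings such as "v1.29.3-eks-abc123".
K8S_VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)")

# Assumption: allow at most one minor version of skew in either direction.
MAX_MINOR_SKEW = 1


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a Kubernetes version string."""
    match = K8S_VERSION_PATTERN.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_compatible(client_version: str, server_version: str) -> bool:
    """Return True when client and server are within the allowed minor skew."""
    client_major, client_minor = parse_major_minor(client_version)
    server_major, server_minor = parse_major_minor(server_version)
    if client_major != server_major:
        return False
    return abs(client_minor - server_minor) <= MAX_MINOR_SKEW
```

Keeping the pattern in a module-level constant means it is compiled once and can be reused by unit tests, which may be the motivation behind the "move regex to a constant" change.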
* Fix SDK training test: Add wait time before refresh * Fix training tests in canaries
…189) Co-authored-by: pintaoz <pintaoz@amazon.com>
* Update documentation-with-new-changes branch with latest changes from main (#190) * Fix training test (#184) * Fix SDK training test: Add wait time before refresh * Fix training tests in canaries * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <pintaoz@amazon.com> --------- Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com> Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com> Co-authored-by: pintaoz <pintaoz@amazon.com> * Documentation Fixes (#191) Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> * update documentation with new changes branch with latest changes (#194) * Fix training test (#184) * Fix SDK training test: Add wait time before refresh * Fix training tests in canaries * Update logging information for submitting and deleting training job (#189) Co-authored-by: pintaoz <pintaoz@amazon.com> --------- Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com> Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com> Co-authored-by: pintaoz <pintaoz@amazon.com> * Documentation Fixes (#195) * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> * Documentation Fixes (#197) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> * Documentation Fixes (#198) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> * Documentation fixes (#199) * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes * Documentation Fixes --------- Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com> --------- Co-authored-by: Zhaoqi <zhaoqiwang.baruch@gmail.com> Co-authored-by: pintaoz-aws <167920275+pintaoz-aws@users.noreply.github.com> Co-authored-by: 
pintaoz <pintaoz@amazon.com> Co-authored-by: Roja Reddy Sareddy <rsareddy@amazon.com>
…s to view SDK config code (#188) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
* Add instance type support for ml.p6e-gb200.36xlarge Updated support for ml.p6-b200.48xlarge as well * Add ml.p6e-gb200.36xlarge to efa plugin
…holder value (#206) Co-authored-by: Mohamed Zeidan <zeidmo@amazon.com>
…boto3 client creation (#395) * Support AWS_REGION env var, cluster context fallback, and centralize boto3 client creation * fix: update test mocks to use create_boto3_client instead of boto3.client --------- Co-authored-by: Farhan Tejani <8650465+FarhanTejani@users.noreply.github.com>
Add ml.p6-b300.48xlarge to INSTANCE_TYPE_MIG_PROFILES in constants.py with the correct B300 MIG profiles derived from the NVIDIA GPU Operator v25.3.0 upstream ConfigMap (device-filter 0x318210DE): - mig-1g.34gb, mig-1g.67gb, mig-2g.67gb - mig-3g.135gb, mig-4g.135gb, mig-7g.269gb Also add the corresponding uniform and mixed MIG partition profiles to the Helm chart default-mig-config.yaml ConfigMap, following the same pattern used for existing GPU types (H100, H200, B200). The B300 GPU (288GB HBM3e, ~269GB usable) was already registered in INSTANCE_RESOURCES but had no MIG profile mapping, causing HyperPod MIG validation to reject accelerator partition requests on this instance type.
* update chart versions * updated InferenceEndpointConfig CRD --------- Co-authored-by: Chad Chiang <chadchc@amazon.com>
…00) (#403) Remove MIG-specific configuration for g7e instances while keeping instance type recognition and hardware specs intact: - Remove g7e entries from INSTANCE_TYPE_MIG_PROFILES in constants.py - Remove g7e MIG config block from GPU operator default-mig-config.yaml g7e instances remain valid for HyperPod CLI operations in whole-GPU mode. MIG partitioning will be re-enabled in a future PR. Partially reverts: 902e88f (PR #390), fully reverts MIG portion of 51b342f (PR #391)
…with bug fixes. (#405) Bug Fixes * Added handling for Nvidia GPU Xid 94 errors (ROBUST_CHANNEL_CONTAINED_ERROR) as a new fault category with no action triggering on Kubernetes platforms
The p6-b200.48xlarge key was missing the ml. prefix in both INSTANCE_TYPE_MIG_PROFILES (training) and INSTANCE_MIG_PROFILES (inference), causing MIG validation to always reject B200 instances. The instance type flowing through the system from the Kubernetes node label (node.kubernetes.io/instance-type) is always ml.p6-b200.48xlarge, so the dict lookup never matched. Additionally, the inference constant had the wrong MIG profiles for B200 — it used GB200 values (47gb, 93gb, 186gb) instead of the correct B200 values (45gb, 90gb, 180gb), likely a copy-paste from the ml.p6e-gb200.36xlarge entry. Fixes: - training/constants.py: 'p6-b200.48xlarge' -> 'ml.p6-b200.48xlarge' - inference/constant.py: key prefix + correct B200 profiles - test: update to use ml. prefixed instance type
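The lookup failure described above can be reproduced with a small sketch. This is an illustration under assumptions, not the contents of `constants.py`: the dictionaries below are hypothetical stand-ins, their values are abbreviated to the B200 memory sizes quoted in the commit message rather than full MIG profile names, and `is_partition_allowed` is an invented helper.

```python
# Hypothetical stand-ins for the profile mapping before and after the fix.
# Values are abbreviated to memory sizes; real entries hold full profile names.
BROKEN_PROFILES = {"p6-b200.48xlarge": ["45gb", "90gb", "180gb"]}
FIXED_PROFILES = {"ml.p6-b200.48xlarge": ["45gb", "90gb", "180gb"]}


def is_partition_allowed(profiles: dict, instance_type: str, size: str) -> bool:
    """Return True if the requested partition size is listed for the instance type."""
    return size in profiles.get(instance_type, [])


# The instance type read from the node label always carries the "ml." prefix.
node_instance_type = "ml.p6-b200.48xlarge"

# Before the fix the key never matches, so every request is rejected.
rejected_before_fix = not is_partition_allowed(BROKEN_PROFILES, node_instance_type, "45gb")

# After the fix the key matches and valid sizes are accepted.
accepted_after_fix = is_partition_allowed(FIXED_PROFILES, node_instance_type, "45gb")
```

Because `dict.get` silently returns the default on a missing key, a mismatch like this shows up as a validation rejection rather than an exception, which is consistent with the symptom reported in the commit message.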
* Model customization Init Experience Flow (#290) * model customization init/find model * Adding direct create exp * Model customization Init/Create/Find * Latest model cust changes * init migration done with template validation * Init full experience migrated, CRUDL simple addition in hyp_cli.py, unit tests added, pending nova forge happy case for integ test * remove argcomplete since it is not supported yet * add reset command for dynamic template * fix integ test error for init flow * remove recipe finder and discovery changes --------- Co-authored-by: Amarjeet LNU <jamjee@amazon.com> * add direct create with interactive session for model customization, refactor code for modularization, unit test added (#292) * Add pre-training-job and evaluation-job, set instance-type to optional, remove direct create support (#297) * bug fix for matching instance type for override params and delete command: * add pre-training-job and evaluation-job, set instance-type to optional, remove direct create support * update checkpointless flag to framework to support more modes * revert support for pre-training and framework flag (#299) * Add debug parameter to init create standard template case * Rename fine-tuning and eval jobs to hyp-recipe-job * Support private hub by providing full arn to model_name * Update unit test * IN PROGRESS: Add model id resolution for recipe jobs * Make technique required and combine with eval, regex check for private hub support, remove dynamic template * Update recipe-job all commands, add params order pending review * Update huggingface model-id search resolve mechanism * Fix arn as private hub support input * Update parameter grouping for recipe jobs, fix instance type handling * Address callouts from kiro self-review * Add and update unit tests, fix type handler for special cases * Fix unit test for training_recipe * Update according to comment and appsec review, add documentation, integ test and example notebook, pending recipe update * Integ test 
passes locally, update error handling * Bug bash and dog fooding improvements, update interactive cluster selection * Fix integ test for recipe init * Update create command message from Kubernetes to Hyperpod --------- Co-authored-by: Amarjeet LNU <jamjee@amazon.com>
PR Approval Steps
For Requester
For Reviewer
For Requester section to double check each item.