Merged
23 commits
32bb552
Model customization Init Experience Flow (#290)
mollyheamazon Nov 13, 2025
d49e992
add direct create with interactive session for model customization, r…
mollyheamazon Nov 14, 2025
7f19cf9
Add pre-training-job and evaluation-job, set instance-type to optiona…
mollyheamazon Nov 19, 2025
ea19fe9
revert support for pre-training and framework flag (#299)
mollyheamazon Nov 20, 2025
58fb47c
Add debug parameter to init create standard template case
mollyheamazon Dec 16, 2025
d635e01
Rename fine-tuning and eval jobs to hyp-recipe-job
mollyheamazon Mar 18, 2026
50d6919
Support private hub by providing full arn to model_name
mollyheamazon Mar 18, 2026
072c86f
Update unit test
mollyheamazon Mar 19, 2026
39ec00e
IN PROGRESS: Add model id resolution for recipe jobs
mollyheamazon Mar 19, 2026
4c41f32
Make technique required and combine with eval, regex check for privat…
mollyheamazon Mar 19, 2026
f7d8c80
Update recipe-job all commands, add params order pending review
mollyheamazon Mar 23, 2026
6d0d9bb
Update huggingface model-id search resolve mechanism
mollyheamazon Mar 24, 2026
19ee5e6
Fix arn as private hub support input
mollyheamazon Mar 24, 2026
fed3dd0
Merge model-id branch: HF model ID resolution + private hub ARN fixes
mollyheamazon Mar 24, 2026
afc264c
Update parameter grouping for recipe jobs, fix instance type handling
mollyheamazon Mar 25, 2026
dfabafc
Address callouts from kiro self-review
mollyheamazon Mar 25, 2026
dcae80b
Add and update unit tests, fix type handler for special cases
mollyheamazon Mar 25, 2026
2159d56
Fix unit test for training_recipe
mollyheamazon Mar 25, 2026
6773494
Update according to comment and appsec review, add documentation, int…
mollyheamazon Apr 2, 2026
83ac236
Integ test passes locally, update error handling
mollyheamazon Apr 4, 2026
48b8234
Bug bash and dog fooding improvements, update interactive cluster sel…
mollyheamazon Apr 15, 2026
7ab0ee8
Fix integ test for recipe init
mollyheamazon Apr 15, 2026
f8d9b5a
Update create command message from Kubernetes to Hyperpod
mollyheamazon Apr 16, 2026
99 changes: 99 additions & 0 deletions README.md
@@ -429,6 +429,105 @@ hyp get-operator-logs hyp-pytorch-job --since-hours 0.5
```bash
hyp delete hyp-pytorch-job --job-name <job-name>
```

### Recipe Job

Use `hyp-recipe-job` to submit fine-tuning and evaluation jobs from pre-built recipes in SageMaker JumpStart Hub. No YAML authoring is required.


#### Initialize Recipe Job Configuration

```bash
mkdir my-recipe-job && cd my-recipe-job

# Option A: HuggingFace model ID
hyp init hyp-recipe-job . \
--huggingface-model-id Qwen/Qwen3-0.6B \
--technique SFT \
--instance-type ml.g5.48xlarge

# Option B: JumpStart model ID
hyp init hyp-recipe-job . \
--model-id huggingface-reasoning-qwen3-06b \
--technique SFT \
--instance-type ml.g5.48xlarge
```

Supported job types:
- **Fine-tuning**: `SFT`, `DPO`, `CPT`, `PPO`, `RLAIF`, `RLVR`
- **Evaluation**: `deterministic`, `LLMAJ`

> **Note**: If you omit `--instance-type`, the CLI queries your HyperPod clusters, identifies those with instance types supported by the selected recipe and technique, and presents a list of compatible clusters to choose from.
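
Because `--technique` accepts only the values listed above, a small pre-flight check can catch typos before you run `hyp init`. The sketch below is an illustration and not part of the CLI: `is_supported_technique` is a hypothetical helper, and it assumes the values are matched exactly as written (case-sensitive), which this README does not confirm.

```bash
# Hypothetical helper (not part of the hyp CLI): succeed only when the
# argument is one of the techniques listed in this README.
# Assumption: values are case-sensitive, exactly as documented.
is_supported_technique() {
  case "$1" in
    SFT|DPO|CPT|PPO|RLAIF|RLVR) return 0 ;;   # fine-tuning techniques
    deterministic|LLMAJ)        return 0 ;;   # evaluation techniques
    *)                          return 1 ;;   # anything else is rejected
  esac
}
```

In a script, `is_supported_technique "$TECHNIQUE" || echo "unsupported technique: $TECHNIQUE" >&2` could run ahead of `hyp init`.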

#### Configure Recipe Job Parameters

```bash
hyp configure \
--name my-recipe-job \
--namespace default \
--data-path /data/recipes-data/sft/train.jsonl \
--global-batch-size 8 \
--learning-rate 0.0001 \
--max-epochs 1 \
--output-path /data/output/my-model \
--instance-type ml.g5.48xlarge
```

#### Validate Configuration

```bash
hyp validate
```

#### Reset Configuration

To reset `config.yaml` to its default values:

```bash
hyp reset
```
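
Since resetting discards your edits to `config.yaml`, you may want to keep a copy first. This is a minimal sketch, assuming it runs from the job directory created by `hyp init`; the `backup_then_reset` function and the `config.yaml.bak` file name are hypothetical choices, not CLI features.

```bash
# Hypothetical wrapper: keep a copy of the current config.yaml, then reset it.
# Run from the job directory created by `hyp init`.
backup_then_reset() {
  if [ ! -f config.yaml ]; then
    echo "config.yaml not found in $(pwd)" >&2
    return 1
  fi
  # The backup file name is an arbitrary choice for this sketch.
  cp config.yaml config.yaml.bak || return 1
  hyp reset
}
```

If the reset was a mistake, `cp config.yaml.bak config.yaml` restores the previous values.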

#### Submit Recipe Job

```bash
hyp create
```

#### List Recipe Jobs

```bash
hyp list hyp-recipe-job --namespace default
```

#### Describe a Recipe Job

```bash
hyp describe hyp-recipe-job --job-name <job-name> --namespace default
```

#### List Pods for a Recipe Job

```bash
hyp list-pods hyp-recipe-job --job-name <job-name> --namespace default
```

#### Get Logs from a Recipe Job Pod

```bash
hyp get-logs hyp-recipe-job --job-name <job-name> --pod-name <pod-name> --namespace default
```
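
The command above targets a single pod. To collect logs from several pods of the same job, one option is a small loop around it. In this sketch, `recipe_job_logs` is a hypothetical helper; pod names are passed in explicitly (for example, copied from `hyp list-pods`) because the output format of the CLI is not assumed here.

```bash
# Hypothetical helper: fetch logs for each named pod of one recipe job.
# Usage: recipe_job_logs <job-name> <namespace> <pod-name>...
# Only flags shown in this README are used; CLI output is not parsed.
recipe_job_logs() {
  if [ "$#" -lt 3 ]; then
    echo "usage: recipe_job_logs <job-name> <namespace> <pod-name>..." >&2
    return 2
  fi
  local job="$1" namespace="$2" pod
  shift 2
  for pod in "$@"; do
    echo "===== $pod ====="   # separator so each pod's logs are easy to find
    hyp get-logs hyp-recipe-job --job-name "$job" --pod-name "$pod" --namespace "$namespace"
  done
}
```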

#### Get Operator Logs

```bash
hyp get-operator-logs hyp-recipe-job
```

#### Delete a Recipe Job

```bash
hyp delete hyp-recipe-job --job-name <job-name> --namespace default
```

### Inference

### Jumpstart Endpoint Creation
7 changes: 7 additions & 0 deletions doc/examples.md
@@ -66,6 +66,13 @@ For detailed examples of training with HyperPod, see:
**Training Examples** Refer to the Training SDK Example.
:::

:::{grid-item-card} Recipe Job CLI Example
:link: https://github.com/aws/sagemaker-hyperpod-cli/blob/main/examples/end_to_end_walkthrough/01-training-job-submission/02-recipe-job-cli.ipynb
:class-card: sd-border-primary

**Recipe Job Example** Submit a fine-tuning job with `hyp-recipe-job`, which uses pre-built recipes from SageMaker JumpStart Hub; no YAML authoring is required.
:::

::::


102 changes: 102 additions & 0 deletions doc/getting_started/training.md
@@ -104,6 +104,107 @@ This will:
- Initialize the job creation process


## Creating Training Jobs -- Recipe Job Init Experience

The `hyp-recipe-job` experience lets you submit fine-tuning and evaluation jobs using pre-built recipes published to SageMaker JumpStart Hub. No YAML authoring is required: the CLI fetches the Kubernetes job template and parameter spec automatically.

### 1. Initialize a Recipe Job

`````{tab-set}
````{tab-item} CLI (HuggingFace model ID)
```bash
mkdir my-recipe-job
cd my-recipe-job
hyp init hyp-recipe-job . \
--huggingface-model-id Qwen/Qwen3-0.6B \
--technique SFT \
--instance-type ml.g5.48xlarge
```
````
````{tab-item} CLI (JumpStart model ID)
```bash
mkdir my-recipe-job
cd my-recipe-job
hyp init hyp-recipe-job . \
--model-id huggingface-reasoning-qwen3-06b \
--technique SFT \
--instance-type ml.g5.48xlarge
```
````
`````

Supported job types:
- **Fine-tuning**: `SFT`, `DPO`, `CPT`, `PPO`, `RLAIF`, `RLVR`
- **Evaluation**: `deterministic`, `LLMAJ`

```{note}
If you omit `--instance-type`, the CLI queries your HyperPod clusters, identifies those with instance types supported by the selected recipe and technique, and presents a list of compatible clusters to choose from. This interactive prompt requires a terminal and is not supported in Jupyter notebooks.
```
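
Because the cluster prompt needs a terminal, scripts and notebooks should pass `--instance-type` explicitly. One way to detect that situation up front is sketched below. `can_prompt_for_cluster` is a hypothetical helper, and treating "stdin and stdout are attached to a terminal" as the requirement is an assumption about what the prompt needs.

```bash
# Hypothetical pre-flight check: can an interactive prompt work in this shell?
# Assumption: the prompt needs both stdin (fd 0) and stdout (fd 1) on a terminal.
can_prompt_for_cluster() {
  [ -t 0 ] && [ -t 1 ]
}
```

For example, `can_prompt_for_cluster || echo "no terminal: pass --instance-type explicitly" >&2` could precede `hyp init` in an automated workflow.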

### 2. Review the Generated Files

`hyp init` creates three files in your job directory:
- `config.yaml` — your editable training parameters
- `.override_spec.json` — the parameter schema
- `k8s.jinja` — the Kubernetes job template
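
A quick way to confirm that initialization produced the three files listed above is sketched here. `check_recipe_job_dir` is a hypothetical helper, not a CLI command; it only checks that the files exist and does not inspect their contents.

```bash
# Hypothetical check: confirm the three files listed above exist in a job directory.
# Prints each missing file name and returns non-zero if any is absent.
# Usage: check_recipe_job_dir [directory]   (defaults to the current directory)
check_recipe_job_dir() {
  local dir="${1:-.}" missing=0 f
  for f in config.yaml .override_spec.json k8s.jinja; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}
```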

### 3. Configure Recipe Job Parameters

```bash
hyp configure \
--name my-recipe-job \
--namespace default \
--data-path /data/recipes-data/sft/train.jsonl \
--global-batch-size 8 \
--learning-rate 0.0001 \
--max-epochs 1 \
--output-path /data/output/my-model \
--instance-type ml.g5.48xlarge
```

### 4. Validate Configuration

```bash
hyp validate
```

### 4a. Reset Configuration (Optional)

To reset `config.yaml` to its default values:

```bash
hyp reset
```

### 5. Submit the Recipe Job

```bash
hyp create
```
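
In scripts it can be useful to gate submission on validation. The following is a minimal sketch, assuming `hyp validate` exits with a non-zero status when the configuration is invalid (this guide does not state its exit codes); `validate_then_create` is a hypothetical wrapper.

```bash
# Hypothetical wrapper: submit the job only if validation succeeds.
# Assumption: `hyp validate` returns non-zero for an invalid config.yaml.
validate_then_create() {
  if ! hyp validate; then
    echo "validation failed; job not submitted" >&2
    return 1
  fi
  hyp create
}
```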

### 6. Manage Recipe Jobs

```bash
# List jobs
hyp list hyp-recipe-job --namespace default

# Describe a job
hyp describe hyp-recipe-job --job-name <job-name> --namespace default

# List pods
hyp list-pods hyp-recipe-job --job-name <job-name> --namespace default

# Get logs
hyp get-logs hyp-recipe-job --job-name <job-name> --pod-name <pod-name> --namespace default

# Get operator logs
hyp get-operator-logs hyp-recipe-job

# Exec into pods
hyp exec hyp-recipe-job --job-name <job-name> --namespace default --all-pods -- echo hello

# Delete job
hyp delete hyp-recipe-job --job-name <job-name> --namespace default
```
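
Most of the commands above repeat `--job-name` and `--namespace`. If you run them often, a thin shell wrapper can cut the repetition. This is only a sketch: `recipe_job` and the `RECIPE_JOB_NAMESPACE` variable are hypothetical names, and the wrapper simply assembles the same flags shown above. It fits the job-scoped subcommands (`describe`, `list-pods`, `get-logs`, `exec`, `delete`), not `list` or `get-operator-logs`, which take no `--job-name` in the examples above.

```bash
# Hypothetical convenience wrapper for job-scoped hyp-recipe-job subcommands.
# Usage: recipe_job <subcommand> <job-name> [extra args...]
# The namespace comes from $RECIPE_JOB_NAMESPACE (hypothetical variable),
# falling back to "default".
recipe_job() {
  if [ "$#" -lt 2 ]; then
    echo "usage: recipe_job <subcommand> <job-name> [extra args...]" >&2
    return 2
  fi
  local subcommand="$1" job="$2"
  shift 2
  # Extra arguments (e.g. --pod-name <pod-name>) are appended unchanged.
  hyp "$subcommand" hyp-recipe-job --job-name "$job" \
    --namespace "${RECIPE_JOB_NAMESPACE:-default}" "$@"
}
```

With this in place, `recipe_job describe <job-name>` and `recipe_job get-logs <job-name> --pod-name <pod-name>` expand to the corresponding commands shown above.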

## Creating Training Jobs -- CLI/SDK

You can create training jobs using either the CLI or SDK approach:
@@ -295,5 +396,6 @@ For detailed examples of training with HyperPod, see:
- <a href="https://github.com/aws/sagemaker-hyperpod-cli/blob/main/examples/training/CLI/training-init-experience.ipynb" target="_blank">CLI Training Init Experience Example</a>
- <a href="https://github.com/aws/sagemaker-hyperpod-cli/blob/main/examples/training/CLI/training-e2e-cli.ipynb" target="_blank">CLI Training Example</a>
- <a href="https://github.com/aws/sagemaker-hyperpod-cli/blob/main/examples/training/SDK/training_sdk_example.ipynb" target="_blank">SDK Training Example</a>
- <a href="https://github.com/aws/sagemaker-hyperpod-cli/blob/main/examples/end_to_end_walkthrough/01-training-job-submission/02-recipe-job-cli.ipynb" target="_blank">Recipe Job CLI Example</a>

These examples demonstrate end-to-end workflows for creating and managing training jobs using both the CLI and SDK approaches.