
Commit d221dae

Merge pull request #18 from VectorInstitute/develop
v0.4.0
2 parents: 97f22a6 + f74c4f6

11 files changed (+367 −143 lines)

Dockerfile

Lines changed: 8 additions & 2 deletions

@@ -48,19 +48,25 @@ RUN wget https://bootstrap.pypa.io/get-pip.py && \
     rm get-pip.py

 # Ensure pip for Python 3.10 is used
-RUN python3.10 -m pip install --upgrade pip
+RUN python3.10 -m pip install --upgrade pip setuptools wheel

 # Install Poetry using Python 3.10
 RUN python3.10 -m pip install poetry

 # Don't create venv
 RUN poetry config virtualenvs.create false

+# Set working directory
+WORKDIR /vec-inf
+
+# Copy current directory
+COPY . /vec-inf
+
 # Update Poetry lock file if necessary
 RUN poetry lock

 # Install vec-inf
-RUN python3.10 -m pip install vec-inf[dev]
+RUN poetry install --extras "dev"

 # Install Flash Attention 2 backend
 RUN python3.10 -m pip install flash-attn --no-build-isolation
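
With the source now copied into the image and installed through Poetry, the image can be built straight from a checkout of the repo. A minimal sketch, assuming you build from the repository root; the image tag and the weights mount are illustrative, not part of this commit:

```bash
# Build the image from the repository root, where this Dockerfile lives
docker build -t vec-inf:0.4.0 .

# Hypothetical interactive run; mount a local weights directory if you have one
docker run --rm -it --gpus all \
    -v /model-weights:/model-weights \
    vec-inf:0.4.0 bash
```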

README.md

Lines changed: 23 additions & 4 deletions

@@ -9,16 +9,23 @@ pip install vec-inf
 Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package

 ## Launch an inference server
+### `launch` command
 We will use the Llama 3.1 model as example, to launch an OpenAI compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
 ```bash
 vec-inf launch Meta-Llama-3.1-8B-Instruct
 ```
 You should see an output like the following:

-<img width="400" alt="launch_img" src="https://github.com/user-attachments/assets/557eb421-47db-4810-bccd-c49c526b1b43">
+<img width="700" alt="launch_img" src="https://github.com/user-attachments/assets/ab658552-18b2-47e0-bf70-e539c3b898d5">

-The model would be launched using the [default parameters](vec_inf/models/models.csv), you can override these values by providing additional options, use `--help` to see the full list. You can also launch your own customized model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html), you'll need to specify all model launching related options to run a successful run.
+The model would be launched using the [default parameters](vec_inf/models/models.csv), you can override these values by providing additional parameters, use `--help` to see the full list. You can also launch your own customized model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html), and make sure to follow the instructions below:
+* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
+* Your model weights directory should contain HF format weights.
+* The following launch parameters will conform to default value if not specified: `--max-num-seqs`, `--partition`, `--data-type`, `--venv`, `--log-dir`, `--model-weights-parent-dir`, `--pipeline-parallelism`, `--enforce-eager`. All other launch parameters need to be specified for custom models.
+* Example for setting the model weights parent directory: `--model-weights-parent-dir /h/user_name/my_weights`.
+* For other model launch parameters you can reference the default values for similar models using the [`list` command](#list-command).

+### `status` command
 You can check the inference server status by providing the Slurm job ID to the `status` command:
 ```bash
 vec-inf status 13014393
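
The custom-model checklist in the hunk above implies invocations along these lines. This is a hedged sketch only: the model name, weights path, and parameter values are hypothetical, and the flags are the ones added to `launch` in this release (see the `vec_inf/cli/_cli.py` diff below). Per the checklist, any remaining non-defaulted parameters would also need to be supplied:

```bash
# Hypothetical custom model at /h/user_name/my_weights/My-Model-1B
# (directory named $MODEL_FAMILY-$MODEL_VARIANT, containing HF-format weights)
vec-inf launch My-Model-1B \
    --model-weights-parent-dir /h/user_name/my_weights \
    --max-model-len 4096 \
    --vocab-size 32000 \
    --num-nodes 1 \
    --num-gpus 1
```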
@@ -38,24 +45,36 @@ There are 5 possible states:

 Note that the base URL is only available when model is in `READY` state, and if you've changed the Slurm log directory path, you also need to specify it when using the `status` command.

+### `metrics` command
+Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:
+```bash
+vec-inf metrics 13014393
+```
+
+And you will see the performance metrics streamed to your console. Note that the metrics are updated with a 10-second interval.
+
+<img width="400" alt="metrics_img" src="https://github.com/user-attachments/assets/e5ff2cd5-659b-4c88-8ebc-d8f3fdc023a4">
+
+### `shutdown` command
 Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
 ```bash
 vec-inf shutdown 13014393

 > Shutting down model with Slurm Job ID: 13014393
 ```

+### `list` command
 You can view the full list of available models by running the `list` command:
 ```bash
 vec-inf list
 ```
-<img width="1200" alt="list_img" src="https://github.com/user-attachments/assets/a4f0d896-989d-43bf-82a2-6a6e5d0d288f">
+<img width="900" alt="list_img" src="https://github.com/user-attachments/assets/7cb2b2ac-d30c-48a8-b773-f648c27d9de2">

 You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:
 ```bash
 vec-inf list Meta-Llama-3.1-70B-Instruct
 ```
-<img width="400" alt="list_model_img" src="https://github.com/user-attachments/assets/5dec7a33-ba6b-490d-af47-4cf7341d0b42">
+<img width="400" alt="list_model_img" src="https://github.com/user-attachments/assets/30e42ab7-dde2-4d20-85f0-187adffefc3d">

 The `launch`, `list`, and `status` commands support `--json-mode`, where the command output would be structured as a JSON string.
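
Since the launched server is OpenAI-compatible, a model in `READY` state can be queried directly at the base URL that `status` reports. A sketch, assuming a hypothetical host and port, vLLM's standard `/v1/completions` route, and that the served model name matches the launch name:

```bash
# Base URL is taken from `vec-inf status` output once the state is READY;
# this value is illustrative
export BASE_URL=http://gpu001:8080/v1

curl "$BASE_URL/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "prompt": "What is a Slurm cluster?",
        "max_tokens": 64
    }'
```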
pyproject.toml

Lines changed: 4 additions & 3 deletions

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "vec-inf"
-version = "0.3.3"
+version = "0.4.0"
 description = "Efficient LLM inference on Slurm clusters using vLLM."
 authors = ["Marshall Wang <marshall.wang@vectorinstitute.ai>"]
 license = "MIT license"

@@ -11,8 +11,9 @@ python = "^3.10"
 requests = "^2.31.0"
 click = "^8.1.0"
 rich = "^13.7.0"
-pandas = "^2.2.2"
-vllm = { version = "^0.5.0", optional = true }
+polars = "^1.15.0"
+numpy = "^1.24.0"
+vllm = { version = "^0.6.0", optional = true }
 vllm-nccl-cu12 = { version = ">=2.18,<2.19", optional = true }
 ray = { version = "^2.9.3", optional = true }
 cupy-cuda12x = { version = "12.1.0", optional = true }
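
The `optional = true` entries (vllm, ray, cupy, and friends) sit behind an extras group, which the Dockerfile above installs as `dev`. A sketch of the two equivalent installs, assuming the `dev` extra is what gates these optional packages:

```bash
# Via Poetry, from inside a clone of the repository
poetry install --extras "dev"

# Or via pip (quotes keep the shell from interpreting the brackets)
python3.10 -m pip install "vec-inf[dev]"
```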

vec_inf/README.md

Lines changed: 2 additions & 1 deletion

@@ -1,7 +1,8 @@
 # `vec-inf` Commands

 * `launch`: Specify a model family and other optional parameters to launch an OpenAI compatible inference server, `--json-mode` supported. Check [`here`](./models/README.md) for complete list of available options.
-* `list`: List all available model names, `--json-mode` supported.
+* `list`: List all available model names, or append a supported model name to view the default configuration, `--json-mode` supported.
+* `metrics`: Streams performance metrics to the console.
 * `status`: Check the model status by providing its Slurm job ID, `--json-mode` supported.
 * `shutdown`: Shutdown a model by providing its Slurm job ID.
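
Because `launch`, `list`, and `status` take `--json-mode`, they compose with shell scripting. A small sketch (the job ID is illustrative):

```bash
# Capture structured launch output for later scripting
vec-inf launch Meta-Llama-3.1-8B-Instruct --json-mode > launch.json

# Poll status in structured form
vec-inf status 13014393 --json-mode
```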

vec_inf/cli/_cli.py

Lines changed: 151 additions & 31 deletions

@@ -1,9 +1,13 @@
 import os
-from typing import Optional
+import time
+from typing import Optional, cast

 import click
+
+import polars as pl
 from rich.columns import Columns
 from rich.console import Console
+from rich.live import Live
 from rich.panel import Panel

 import vec_inf.cli._utils as utils

@@ -24,9 +28,19 @@ def cli():
 @click.option(
     "--max-model-len",
     type=int,
-    help="Model context length. If unspecified, will be automatically derived from the model config.",
+    help="Model context length. Default value set based on suggested resource allocation.",
+)
+@click.option(
+    "--max-num-seqs",
+    type=int,
+    help="Maximum number of sequences to process in a single request",
+)
+@click.option(
+    "--partition",
+    type=str,
+    default="a40",
+    help="Type of compute partition, default to a40",
 )
-@click.option("--partition", type=str, help="Type of compute partition, default to a40")
 @click.option(
     "--num-nodes",
     type=int,

@@ -40,24 +54,48 @@ def cli():
 @click.option(
     "--qos",
     type=str,
-    help="Quality of service, default depends on suggested resource allocation required for the model",
+    help="Quality of service",
 )
 @click.option(
     "--time",
     type=str,
-    help="Time limit for job, this should comply with QoS, default to max walltime of the chosen QoS",
+    help="Time limit for job, this should comply with QoS limits",
 )
 @click.option(
     "--vocab-size",
     type=int,
     help="Vocabulary size, this option is intended for custom models",
 )
-@click.option("--data-type", type=str, help="Model data type, default to auto")
-@click.option("--venv", type=str, help="Path to virtual environment")
+@click.option(
+    "--data-type", type=str, default="auto", help="Model data type, default to auto"
+)
+@click.option(
+    "--venv",
+    type=str,
+    default="singularity",
+    help="Path to virtual environment, default to preconfigured singularity container",
+)
 @click.option(
     "--log-dir",
     type=str,
-    help="Path to slurm log directory, default to .vec-inf-logs in home directory",
+    default="default",
+    help="Path to slurm log directory, default to .vec-inf-logs in user home directory",
+)
+@click.option(
+    "--model-weights-parent-dir",
+    type=str,
+    default="/model-weights",
+    help="Path to parent directory containing model weights, default to '/model-weights' for supported models",
+)
+@click.option(
+    "--pipeline-parallelism",
+    type=str,
+    help="Enable pipeline parallelism, accepts 'True' or 'False', default to 'True' for supported models",
+)
+@click.option(
+    "--enforce-eager",
+    type=str,
+    help="Always use eager-mode PyTorch, accepts 'True' or 'False', default to 'False' for custom models if not set",
 )
 @click.option(
     "--json-mode",

@@ -69,6 +107,7 @@ def launch(
     model_family: Optional[str] = None,
     model_variant: Optional[str] = None,
     max_model_len: Optional[int] = None,
+    max_num_seqs: Optional[int] = None,
     partition: Optional[str] = None,
     num_nodes: Optional[int] = None,
     num_gpus: Optional[int] = None,

@@ -78,30 +117,40 @@ def launch(
     data_type: Optional[str] = None,
     venv: Optional[str] = None,
     log_dir: Optional[str] = None,
+    model_weights_parent_dir: Optional[str] = None,
+    pipeline_parallelism: Optional[str] = None,
+    enforce_eager: Optional[str] = None,
     json_mode: bool = False,
 ) -> None:
     """
     Launch a model on the cluster
     """
+
+    if isinstance(pipeline_parallelism, str):
+        pipeline_parallelism = (
+            "True" if pipeline_parallelism.lower() == "true" else "False"
+        )
+
     launch_script_path = os.path.join(
         os.path.dirname(os.path.dirname(os.path.realpath(__file__))), "launch_server.sh"
     )
     launch_cmd = f"bash {launch_script_path}"

     models_df = utils.load_models_df()

-    if model_name in models_df["model_name"].values:
+    if model_name in models_df["model_name"].to_list():
         default_args = utils.load_default_args(models_df, model_name)
         for arg in default_args:
             if arg in locals() and locals()[arg] is not None:
                 default_args[arg] = locals()[arg]
             renamed_arg = arg.replace("_", "-")
             launch_cmd += f" --{renamed_arg} {default_args[arg]}"
     else:
-        model_args = models_df.columns.tolist()
-        excluded_keys = ["model_name", "pipeline_parallelism"]
+        model_args = models_df.columns
+        model_args.remove("model_name")
+        model_args.remove("model_type")
         for arg in model_args:
-            if arg not in excluded_keys and locals()[arg] is not None:
+            if locals()[arg] is not None:
                 renamed_arg = arg.replace("_", "-")
                 launch_cmd += f" --{renamed_arg} {locals()[arg]}"

@@ -225,40 +274,111 @@ def shutdown(slurm_job_id: int) -> None:
     is_flag=True,
     help="Output in JSON string",
 )
-def list(model_name: Optional[str] = None, json_mode: bool = False) -> None:
+def list_models(model_name: Optional[str] = None, json_mode: bool = False) -> None:
     """
     List all available models, or get default setup of a specific model
     """
-    models_df = utils.load_models_df()

-    if model_name:
-        if model_name not in models_df["model_name"].values:
+    def list_model(model_name: str, models_df: pl.DataFrame, json_mode: bool):
+        if model_name not in models_df["model_name"].to_list():
             raise ValueError(f"Model name {model_name} not found in available models")

-        excluded_keys = {"venv", "log_dir", "pipeline_parallelism"}
-        model_row = models_df.loc[models_df["model_name"] == model_name]
+        excluded_keys = {"venv", "log_dir"}
+        model_row = models_df.filter(models_df["model_name"] == model_name)

         if json_mode:
-            # click.echo(model_row.to_json(orient='records'))
-            filtered_model_row = model_row.drop(columns=excluded_keys, errors="ignore")
-            click.echo(filtered_model_row.to_json(orient="records"))
+            filtered_model_row = model_row.drop(excluded_keys, strict=False)
+            click.echo(filtered_model_row.to_dicts()[0])
             return
         table = utils.create_table(key_title="Model Config", value_title="Value")
-        for _, row in model_row.iterrows():
+        for row in model_row.to_dicts():
             for key, value in row.items():
                 if key not in excluded_keys:
                     table.add_row(key, str(value))
         CONSOLE.print(table)
-        return

-    if json_mode:
-        click.echo(models_df["model_name"].to_json(orient="records"))
-        return
-    panels = []
-    for _, row in models_df.iterrows():
-        styled_text = f"[magenta]{row['model_family']}[/magenta]-{row['model_variant']}"
-        panels.append(Panel(styled_text, expand=True))
-    CONSOLE.print(Columns(panels, equal=True))
+    def list_all(models_df: pl.DataFrame, json_mode: bool):
+        if json_mode:
+            click.echo(models_df["model_name"].to_list())
+            return
+        panels = []
+        model_type_colors = {
+            "LLM": "cyan",
+            "VLM": "bright_blue",
+            "Text Embedding": "purple",
+            "Reward Modeling": "bright_magenta",
+        }
+
+        models_df = models_df.with_columns(
+            pl.when(pl.col("model_type") == "LLM")
+            .then(0)
+            .when(pl.col("model_type") == "VLM")
+            .then(1)
+            .when(pl.col("model_type") == "Text Embedding")
+            .then(2)
+            .when(pl.col("model_type") == "Reward Modeling")
+            .then(3)
+            .otherwise(-1)
+            .alias("model_type_order")
+        )
+
+        models_df = models_df.sort("model_type_order")
+        models_df = models_df.drop("model_type_order")
+
+        for row in models_df.to_dicts():
+            panel_color = model_type_colors.get(row["model_type"], "white")
+            styled_text = (
+                f"[magenta]{row['model_family']}[/magenta]-{row['model_variant']}"
+            )
+            panels.append(Panel(styled_text, expand=True, border_style=panel_color))
+        CONSOLE.print(Columns(panels, equal=True))
+
+    models_df = utils.load_models_df()
+
+    if model_name:
+        list_model(model_name, models_df, json_mode)
+    else:
+        list_all(models_df, json_mode)
+
+
+@cli.command("metrics")
+@click.argument("slurm_job_id", type=int, nargs=1)
+@click.option(
+    "--log-dir",
+    type=str,
+    help="Path to slurm log directory. This is required if --log-dir was set in model launch",
+)
+def metrics(slurm_job_id: int, log_dir: Optional[str] = None) -> None:
+    """
+    Stream performance metrics to the console
+    """
+    status_cmd = f"scontrol show job {slurm_job_id} --oneliner"
+    output = utils.run_bash_command(status_cmd)
+    slurm_job_name = output.split(" ")[1].split("=")[1]
+
+    with Live(refresh_per_second=1, console=CONSOLE) as live:
+        while True:
+            out_logs = utils.read_slurm_log(
+                slurm_job_name, slurm_job_id, "out", log_dir
+            )
+            # if out_logs is a string, then it is an error message
+            if isinstance(out_logs, str):
+                live.update(out_logs)
+                break
+            out_logs = cast(list, out_logs)
+            latest_metrics = utils.get_latest_metric(out_logs)
+            # if latest_metrics is a string, then it is an error message
+            if isinstance(latest_metrics, str):
+                live.update(latest_metrics)
+                break
+            latest_metrics = cast(dict, latest_metrics)
+            table = utils.create_table(key_title="Metric", value_title="Value")
+            for key, value in latest_metrics.items():
+                table.add_row(key, value)

+            live.update(table)
+
+            time.sleep(2)


 if __name__ == "__main__":
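
The new `metrics` command resolves the job name via `scontrol`, then tails the Slurm `.out` log inside a `rich` Live view, sleeping 2 seconds between reads (the metrics themselves refresh roughly every 10 seconds, per the README). A usage sketch; the job ID and log path are illustrative:

```bash
# Default log location (.vec-inf-logs in the user home directory)
vec-inf metrics 13014393

# If the model was launched with a custom --log-dir, pass the same path here
vec-inf metrics 13014393 --log-dir /h/user_name/custom-logs
```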
