
Commit d221dae

Merge pull request #18 from VectorInstitute/develop
v0.4.0
2 parents: 97f22a6 + f74c4f6

11 files changed (+367 −143 lines)

Dockerfile

Lines changed: 8 additions & 2 deletions

@@ -48,19 +48,25 @@ RUN wget https://bootstrap.pypa.io/get-pip.py && \
     rm get-pip.py

 # Ensure pip for Python 3.10 is used
-RUN python3.10 -m pip install --upgrade pip
+RUN python3.10 -m pip install --upgrade pip setuptools wheel

 # Install Poetry using Python 3.10
 RUN python3.10 -m pip install poetry

 # Don't create venv
 RUN poetry config virtualenvs.create false

+# Set working directory
+WORKDIR /vec-inf
+
+# Copy current directory
+COPY . /vec-inf
+
 # Update Poetry lock file if necessary
 RUN poetry lock

 # Install vec-inf
-RUN python3.10 -m pip install vec-inf[dev]
+RUN poetry install --extras "dev"

 # Install Flash Attention 2 backend
 RUN python3.10 -m pip install flash-attn --no-build-isolation
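
With the source now copied into the image and installed through Poetry, the image can be built straight from a checkout of the repo. A minimal sketch, assuming you build from the repository root; the image tag and the weights mount are illustrative, not part of this commit:

```bash
# Build the image from the repository root, where this Dockerfile lives
docker build -t vec-inf:0.4.0 .

# Hypothetical interactive run; mount a local weights directory if you have one
docker run --rm -it --gpus all \
    -v /model-weights:/model-weights \
    vec-inf:0.4.0 bash
```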

README.md

Lines changed: 23 additions & 4 deletions

@@ -9,16 +9,23 @@ pip install vec-inf
 Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package

 ## Launch an inference server
+### `launch` command
 We will use the Llama 3.1 model as example, to launch an OpenAI compatible inference server for Meta-Llama-3.1-8B-Instruct, run:
 ```bash
 vec-inf launch Meta-Llama-3.1-8B-Instruct
 ```
 You should see an output like the following:

-<img width="400" alt="launch_img" src="https://github.com/user-attachments/assets/557eb421-47db-4810-bccd-c49c526b1b43">
+<img width="700" alt="launch_img" src="https://github.com/user-attachments/assets/ab658552-18b2-47e0-bf70-e539c3b898d5">

-The model would be launched using the [default parameters](vec_inf/models/models.csv), you can override these values by providing additional options, use `--help` to see the full list. You can also launch your own customized model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html), you'll need to specify all model launching related options to run a successful run.
+The model would be launched using the [default parameters](vec_inf/models/models.csv), you can override these values by providing additional parameters, use `--help` to see the full list. You can also launch your own customized model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html), and make sure to follow the instructions below:
+* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
+* Your model weights directory should contain HF format weights.
+* The following launch parameters will conform to default value if not specified: `--max-num-seqs`, `--partition`, `--data-type`, `--venv`, `--log-dir`, `--model-weights-parent-dir`, `--pipeline-parallelism`, `--enforce-eager`. All other launch parameters need to be specified for custom models.
+* Example for setting the model weights parent directory: `--model-weights-parent-dir /h/user_name/my_weights`.
+* For other model launch parameters you can reference the default values for similar models using the [`list` command](#list-command).

+### `status` command
 You can check the inference server status by providing the Slurm job ID to the `status` command:
 ```bash
 vec-inf status 13014393
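
The custom-model checklist in the hunk above implies invocations along these lines. This is a hedged sketch only: the model name, weights path, and parameter values are hypothetical, and the flags are the ones added to `launch` in this release (see the `vec_inf/cli/_cli.py` diff below). Per the checklist, any remaining non-defaulted parameters would also need to be supplied:

```bash
# Hypothetical custom model at /h/user_name/my_weights/My-Model-1B
# (directory named $MODEL_FAMILY-$MODEL_VARIANT, containing HF-format weights)
vec-inf launch My-Model-1B \
    --model-weights-parent-dir /h/user_name/my_weights \
    --max-model-len 4096 \
    --vocab-size 32000 \
    --num-nodes 1 \
    --num-gpus 1
```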
@@ -38,24 +45,36 @@ There are 5 possible states:

 Note that the base URL is only available when model is in `READY` state, and if you've changed the Slurm log directory path, you also need to specify it when using the `status` command.

+### `metrics` command
+Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:
+```bash
+vec-inf metrics 13014393
+```
+
+And you will see the performance metrics streamed to your console. Note that the metrics are updated with a 10-second interval.
+
+<img width="400" alt="metrics_img" src="https://github.com/user-attachments/assets/e5ff2cd5-659b-4c88-8ebc-d8f3fdc023a4">
+
+### `shutdown` command
 Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
 ```bash
 vec-inf shutdown 13014393

 > Shutting down model with Slurm Job ID: 13014393
 ```

+### `list` command
 You can view the full list of available models by running the `list` command:
 ```bash
 vec-inf list
 ```
-<img width="1200" alt="list_img" src="https://github.com/user-attachments/assets/a4f0d896-989d-43bf-82a2-6a6e5d0d288f">
+<img width="900" alt="list_img" src="https://github.com/user-attachments/assets/7cb2b2ac-d30c-48a8-b773-f648c27d9de2">

 You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:
 ```bash
 vec-inf list Meta-Llama-3.1-70B-Instruct
 ```
-<img width="400" alt="list_model_img" src="https://github.com/user-attachments/assets/5dec7a33-ba6b-490d-af47-4cf7341d0b42">
+<img width="400" alt="list_model_img" src="https://github.com/user-attachments/assets/30e42ab7-dde2-4d20-85f0-187adffefc3d">

 The `launch`, `list`, and `status` commands support `--json-mode`, where the command output would be structured as a JSON string.
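
Since the launched server is OpenAI-compatible, a model in `READY` state can be queried directly at the base URL that `status` reports. A sketch, assuming a hypothetical host and port, vLLM's standard `/v1/completions` route, and that the served model name matches the launch name:

```bash
# Base URL is taken from `vec-inf status` output once the state is READY;
# this value is illustrative
export BASE_URL=http://gpu001:8080/v1

curl "$BASE_URL/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "prompt": "What is a Slurm cluster?",
        "max_tokens": 64
    }'
```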
pyproject.toml

Lines changed: 4 additions & 3 deletions

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "vec-inf"
-version = "0.3.3"
+version = "0.4.0"
 description = "Efficient LLM inference on Slurm clusters using vLLM."
 authors = ["Marshall Wang <marshall.wang@vectorinstitute.ai>"]
 license = "MIT license"

@@ -11,8 +11,9 @@ python = "^3.10"
 requests = "^2.31.0"
 click = "^8.1.0"
 rich = "^13.7.0"
-pandas = "^2.2.2"
-vllm = { version = "^0.5.0", optional = true }
+polars = "^1.15.0"
+numpy = "^1.24.0"
+vllm = { version = "^0.6.0", optional = true }
 vllm-nccl-cu12 = { version = ">=2.18,<2.19", optional = true }
 ray = { version = "^2.9.3", optional = true }
 cupy-cuda12x = { version = "12.1.0", optional = true }
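
The `optional = true` entries (vllm, ray, cupy, and friends) sit behind an extras group, which the Dockerfile above installs as `dev`. A sketch of the two equivalent installs, assuming the `dev` extra is what gates these optional packages:

```bash
# Via Poetry, from inside a clone of the repository
poetry install --extras "dev"

# Or via pip (quotes keep the shell from interpreting the brackets)
python3.10 -m pip install "vec-inf[dev]"
```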

vec_inf/README.md

Lines changed: 2 additions & 1 deletion

@@ -1,7 +1,8 @@
 # `vec-inf` Commands

 * `launch`: Specify a model family and other optional parameters to launch an OpenAI compatible inference server, `--json-mode` supported. Check [`here`](./models/README.md) for complete list of available options.
-* `list`: List all available model names, `--json-mode` supported.
+* `list`: List all available model names, or append a supported model name to view the default configuration, `--json-mode` supported.
+* `metrics`: Streams performance metrics to the console.
 * `status`: Check the model status by providing its Slurm job ID, `--json-mode` supported.
 * `shutdown`: Shutdown a model by providing its Slurm job ID.
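
Because `launch`, `list`, and `status` take `--json-mode`, they compose with shell scripting. A small sketch (the job ID is illustrative):

```bash
# Capture structured launch output for later scripting
vec-inf launch Meta-Llama-3.1-8B-Instruct --json-mode > launch.json

# Poll status in structured form
vec-inf status 13014393 --json-mode
```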

vec_inf/cli/_cli.py

Lines changed: 151 additions & 31 deletions

@@ -1,9 +1,13 @@
 import os
-from typing import Optional
+import time
+from typing import Optional, cast

 import click
+
+import polars as pl
 from rich.columns import Columns
 from rich.console import Console
+from rich.live import Live
 from rich.panel import Panel

 import vec_inf.cli._utils as utils

@@ -24,9 +28,19 @@ def cli():
 @click.option(
     "--max-model-len",
     type=int,
-    help="Model context length. If unspecified, will be automatically derived from the model config.",
+    help="Model context length. Default value set based on suggested resource allocation.",
+)
+@click.option(
+    "--max-num-seqs",
+    type=int,
+    help="Maximum number of sequences to process in a single request",
+)
+@click.option(
+    "--partition",
+    type=str,
+    default="a40",
+    help="Type of compute partition, default to a40",
 )
-@click.option("--partition", type=str, help="Type of compute partition, default to a40")
 @click.option(
     "--num-nodes",
     type=int,

@@ -40,24 +54,48 @@ def cli():
 @click.option(
     "--qos",
     type=str,
-    help="Quality of service, default depends on suggested resource allocation required for the model",
+    help="Quality of service",
 )
 @click.option(
     "--time",
     type=str,
-    help="Time limit for job, this should comply with QoS, default to max walltime of the chosen QoS",
+    help="Time limit for job, this should comply with QoS limits",
 )
 @click.option(
     "--vocab-size",
     type=int,
     help="Vocabulary size, this option is intended for custom models",
 )
-@click.option("--data-type", type=str, help="Model data type, default to auto")
-@click.option("--venv", type=str, help="Path to virtual environment")
+@click.option(
+    "--data-type", type=str, default="auto", help="Model data type, default to auto"
+)
+@click.option(
+    "--venv",
+    type=str,
+    default="singularity",
+    help="Path to virtual environment, default to preconfigured singularity container",
+)
 @click.option(
     "--log-dir",
     type=str,
-    help="Path to slurm log directory, default to .vec-inf-logs in home directory",
+    default="default",
+    help="Path to slurm log directory, default to .vec-inf-logs in user home directory",
+)
+@click.option(
+    "--model-weights-parent-dir",
+    type=str,
+    default="/model-weights",
+    help="Path to parent directory containing model weights, default to '/model-weights' for supported models",
+)
+@click.option(
+    "--pipeline-parallelism",
+    type=str,
+    help="Enable pipeline parallelism, accepts 'True' or 'False', default to 'True' for supported models",
+)
+@click.option(
+    "--enforce-eager",
+    type=str,
+    help="Always use eager-mode PyTorch, accepts 'True' or 'False', default to 'False' for custom models if not set",
 )
 @click.option(
     "--json-mode",

@@ -69,6 +107,7 @@ def launch(
     model_family: Optional[str] = None,
     model_variant: Optional[str] = None,
     max_model_len: Optional[int] = None,
+    max_num_seqs: Optional[int] = None,
     partition: Optional[str] = None,
     num_nodes: Optional[int] = None,
     num_gpus: Optional[int] = None,

@@ -78,30 +117,40 @@ def launch(
     data_type: Optional[str] = None,
     venv: Optional[str] = None,
     log_dir: Optional[str] = None,
+    model_weights_parent_dir: Optional[str] = None,
+    pipeline_parallelism: Optional[str] = None,
+    enforce_eager: Optional[str] = None,
     json_mode: bool = False,
 ) -> None:
     """
     Launch a model on the cluster
     """
+
+    if isinstance(pipeline_parallelism, str):
+        pipeline_parallelism = (
+            "True" if pipeline_parallelism.lower() == "true" else "False"
+        )
+
     launch_script_path = os.path.join(
         os.path.dirname(os.path.dirname(os.path.realpath(__file__))), "launch_server.sh"
     )
     launch_cmd = f"bash {launch_script_path}"

     models_df = utils.load_models_df()

-    if model_name in models_df["model_name"].values:
+    if model_name in models_df["model_name"].to_list():
         default_args = utils.load_default_args(models_df, model_name)
         for arg in default_args:
             if arg in locals() and locals()[arg] is not None:
                 default_args[arg] = locals()[arg]
             renamed_arg = arg.replace("_", "-")
             launch_cmd += f" --{renamed_arg} {default_args[arg]}"
     else:
-        model_args = models_df.columns.tolist()
-        excluded_keys = ["model_name", "pipeline_parallelism"]
+        model_args = models_df.columns
+        model_args.remove("model_name")
+        model_args.remove("model_type")
         for arg in model_args:
-            if arg not in excluded_keys and locals()[arg] is not None:
+            if locals()[arg] is not None:
                 renamed_arg = arg.replace("_", "-")
                 launch_cmd += f" --{renamed_arg} {locals()[arg]}"

@@ -225,40 +274,111 @@ def shutdown(slurm_job_id: int) -> None:
     is_flag=True,
     help="Output in JSON string",
 )
-def list(model_name: Optional[str] = None, json_mode: bool = False) -> None:
+def list_models(model_name: Optional[str] = None, json_mode: bool = False) -> None:
     """
     List all available models, or get default setup of a specific model
     """
-    models_df = utils.load_models_df()

-    if model_name:
-        if model_name not in models_df["model_name"].values:
+    def list_model(model_name: str, models_df: pl.DataFrame, json_mode: bool):
+        if model_name not in models_df["model_name"].to_list():
             raise ValueError(f"Model name {model_name} not found in available models")

-        excluded_keys = {"venv", "log_dir", "pipeline_parallelism"}
-        model_row = models_df.loc[models_df["model_name"] == model_name]
+        excluded_keys = {"venv", "log_dir"}
+        model_row = models_df.filter(models_df["model_name"] == model_name)

         if json_mode:
-            # click.echo(model_row.to_json(orient='records'))
-            filtered_model_row = model_row.drop(columns=excluded_keys, errors="ignore")
-            click.echo(filtered_model_row.to_json(orient="records"))
+            filtered_model_row = model_row.drop(excluded_keys, strict=False)
+            click.echo(filtered_model_row.to_dicts()[0])
             return
         table = utils.create_table(key_title="Model Config", value_title="Value")
-        for _, row in model_row.iterrows():
+        for row in model_row.to_dicts():
             for key, value in row.items():
                 if key not in excluded_keys:
                     table.add_row(key, str(value))
         CONSOLE.print(table)
-        return

-    if json_mode:
-        click.echo(models_df["model_name"].to_json(orient="records"))
-        return
-    panels = []
-    for _, row in models_df.iterrows():
-        styled_text = f"[magenta]{row['model_family']}[/magenta]-{row['model_variant']}"
-        panels.append(Panel(styled_text, expand=True))
-    CONSOLE.print(Columns(panels, equal=True))
+    def list_all(models_df: pl.DataFrame, json_mode: bool):
+        if json_mode:
+            click.echo(models_df["model_name"].to_list())
+            return
+        panels = []
+        model_type_colors = {
+            "LLM": "cyan",
+            "VLM": "bright_blue",
+            "Text Embedding": "purple",
+            "Reward Modeling": "bright_magenta",
+        }
+
+        models_df = models_df.with_columns(
+            pl.when(pl.col("model_type") == "LLM")
+            .then(0)
+            .when(pl.col("model_type") == "VLM")
+            .then(1)
+            .when(pl.col("model_type") == "Text Embedding")
+            .then(2)
+            .when(pl.col("model_type") == "Reward Modeling")
+            .then(3)
+            .otherwise(-1)
+            .alias("model_type_order")
+        )
+
+        models_df = models_df.sort("model_type_order")
+        models_df = models_df.drop("model_type_order")
+
+        for row in models_df.to_dicts():
+            panel_color = model_type_colors.get(row["model_type"], "white")
+            styled_text = (
+                f"[magenta]{row['model_family']}[/magenta]-{row['model_variant']}"
+            )
+            panels.append(Panel(styled_text, expand=True, border_style=panel_color))
+        CONSOLE.print(Columns(panels, equal=True))
+
+    models_df = utils.load_models_df()
+
+    if model_name:
+        list_model(model_name, models_df, json_mode)
+    else:
+        list_all(models_df, json_mode)
+
+
+@cli.command("metrics")
+@click.argument("slurm_job_id", type=int, nargs=1)
+@click.option(
+    "--log-dir",
+    type=str,
+    help="Path to slurm log directory. This is required if --log-dir was set in model launch",
+)
+def metrics(slurm_job_id: int, log_dir: Optional[str] = None) -> None:
+    """
+    Stream performance metrics to the console
+    """
+    status_cmd = f"scontrol show job {slurm_job_id} --oneliner"
+    output = utils.run_bash_command(status_cmd)
+    slurm_job_name = output.split(" ")[1].split("=")[1]
+
+    with Live(refresh_per_second=1, console=CONSOLE) as live:
+        while True:
+            out_logs = utils.read_slurm_log(
+                slurm_job_name, slurm_job_id, "out", log_dir
+            )
+            # if out_logs is a string, then it is an error message
+            if isinstance(out_logs, str):
+                live.update(out_logs)
+                break
+            out_logs = cast(list, out_logs)
+            latest_metrics = utils.get_latest_metric(out_logs)
+            # if latest_metrics is a string, then it is an error message
+            if isinstance(latest_metrics, str):
+                live.update(latest_metrics)
+                break
+            latest_metrics = cast(dict, latest_metrics)
+            table = utils.create_table(key_title="Metric", value_title="Value")
+            for key, value in latest_metrics.items():
+                table.add_row(key, value)

+            live.update(table)
+
+            time.sleep(2)


 if __name__ == "__main__":
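
The new `metrics` command resolves the job name via `scontrol`, then tails the Slurm `.out` log inside a `rich` Live view, sleeping 2 seconds between reads (the metrics themselves refresh roughly every 10 seconds, per the README). A usage sketch; the job ID and log path are illustrative:

```bash
# Default log location (.vec-inf-logs in the user home directory)
vec-inf metrics 13014393

# If the model was launched with a custom --log-dir, pass the same path here
vec-inf metrics 13014393 --log-dir /h/user_name/custom-logs
```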
