2 changes: 1 addition & 1 deletion .github/skills/learning-path-structure-review/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ Use this skill when a Learning Path needs a structural review. Focus on whether
- Prefer direct relevance, Arm Learning Paths, required tools, foundation knowledge, and logical next steps.
- Avoid link piles that pull readers away from the task.
6. Review recap and transition sections:
- Include concise recap and forward-looking transition at major instructional boundaries.
- Include a concise recap and a forward-looking transition at major instructional boundaries. Do not treat a transition sentence alone as a recap. Note the absence of a transition as a finding.
- Use `what you've learned` for conceptual sections and `what you've accomplished` for task sections.
- Avoid repeating earlier content verbatim.
7. If the Learning Path demonstrates Arm-specific performance features, apply the performance integrity checks.
Expand Down
Original file line number Diff line number Diff line change
@@ -1,18 +1,20 @@
---
title: Build and Run vLLM on Arm Servers
title: Build and run vLLM on Arm servers
description: Build vLLM from source on an Arm Linux server, run batch inference with a Hugging Face model, and expose the model through an OpenAI-compatible API.

minutes_to_complete: 45

who_is_this_for: This is an introductory topic for software developers and AI engineers interested in learning how to use the vLLM library on Arm servers.

learning_objectives:
- Build vLLM from source on an Arm server.
- Download a Qwen LLM from Hugging Face.
- Use a Qwen LLM from Hugging Face.
- Run local batch inference using vLLM.
- Create and interact with an OpenAI-compatible server provided by vLLM on your Arm server.

prerequisites:
- An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, or a local Arm Linux computer with at least 8 CPUs and 16 GB RAM.
- An [Arm-based Linux instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, or a local Arm Linux computer running Ubuntu 24.04 with at least 8 CPUs, 16 GB RAM, and 50 GB of disk storage.
- A system that includes support for BFloat16.

author: Jason Andrews

Expand Down Expand Up @@ -59,4 +61,3 @@ weight: 1 # _index.md always has weight of 1 to order corr
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---

Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
---
title: Run batch inference using vLLM
description: Use vLLM to load a Qwen model from Hugging Face and run batch inference prompts on an Arm server.
weight: 3

### FIXED, DO NOT MODIFY
Expand All @@ -8,27 +9,27 @@ layout: learningpathall

## Use a model from Hugging Face

vLLM is designed to work seamlessly with models from the Hugging Face Hub.
vLLM is designed to work with models from the Hugging Face Hub.

The first time you run vLLM, it downloads the required model. This means that you do not have to explicitly download any models.
The first time you run vLLM, it downloads the required model. You don't have to explicitly download any models.

If you want to use a model that requires you to request access or accept the terms, you need to log in to Hugging Face using a token.
To use a model that requires you to request access or accept terms and conditions, log in to Hugging Face using a token:

```bash
huggingface-cli login
```

Enter your Hugging Face token. You can generate a token from [Hugging Face Hub](https://huggingface.co/) by clicking your profile on the top right corner and selecting **Access Tokens**.

You also need to visit the Hugging Face link printed in the login output and accept the terms by clicking the **Agree and access repository** button or filling out the request-for-access form, depending on the model.
Visit the Hugging Face link printed in the login output and accept the terms and conditions. Click the **Agree and access repository** button or fill out the request-for-access form, depending on the model.

To run batched inference without the need for a login, you can use the `Qwen/Qwen2.5-0.5B-Instruct` model.
To run batched inference without logging in, use the `Qwen/Qwen2.5-0.5B-Instruct` model.

## Create a batch script

To run inference with multiple prompts, you can create a simple Python script to load a model and run the prompts.
To run inference with multiple prompts, create a Python script to load a model and run the prompts.

Use a text editor to save the Python script below in a file called `batch.py`:
Use a text editor to save the following Python script in a file called `batch.py`:

```python
import json
Expand Down Expand Up @@ -137,4 +138,10 @@ Processed prompts: 100%|██████████████████
}
```

You can try with other prompts and models such as `meta-llama/Llama-3.2-1B`. Continue to learn how to set up an OpenAI-compatible server.
## What you've accomplished and what's next

You've now created a Python batch inference script that loads the `Qwen/Qwen2.5-0.5B-Instruct` model from Hugging Face, configures `bfloat16` precision, and sends multiple prompts to vLLM.

You ran the script and confirmed that vLLM starts on the CPU backend, loads the model, processes the prompts, and returns generated text.

Next, you'll set up an OpenAI-compatible server so client applications can send requests to vLLM.
Original file line number Diff line number Diff line change
@@ -1,28 +1,31 @@
---
title: Run an OpenAI-compatible server
title: Run an OpenAI-compatible vLLM server
description: Start a local vLLM OpenAI-compatible server on Arm Linux and send a chat completion request with curl.
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Instead of a batch run from Python, you can create an OpenAI-compatible server. This allows you to leverage the power of Large Language Models without relying on external APIs.
## Create a local vLLM server compatible with OpenAI

To run Large Language Models (LLMs) without relying on external APIs, create an OpenAI-compatible server.

Running a local LLM offers several advantages:

* Cost-effective - it avoids the costs associated with using external APIs, especially for high-usage scenarios.
* Cost-effectiveness - it avoids the costs associated with using external APIs, especially for high-usage scenarios.
* Privacy - it keeps your data and prompts within your local environment, which enhances privacy and security.
* Offline Capability - it enables operation without an internet connection, making it ideal for scenarios with limited or unreliable network access.
* Offline capability - it enables operation without an internet connection, making it ideal for scenarios with limited or unreliable network access.

OpenAI compatibility means that you can reuse existing software which was designed to communicate with OpenAI and use it to communicate with your local vLLM service.
OpenAI compatibility means that you can reuse existing software written for the OpenAI API to communicate with your local vLLM service.

Run vLLM with the same `Qwen/Qwen2.5-0.5B-Instruct` model:

```bash
python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --dtype float16 --max-num-batched-tokens 32768
```

The server output displays that it is ready for requests:
The output is similar to:

```output
INFO 12-12 22:54:40 cpu_executor.py:186] # CPU blocks: 21845
Expand Down Expand Up @@ -51,7 +54,7 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

You can submit requests to the server using the `curl` command.

For example, run the command below using another terminal on the same server:
For example, run the following command using another terminal on the same server:

```bash
curl http://0.0.0.0:8000/v1/chat/completions \
Expand All @@ -72,12 +75,16 @@ curl http://0.0.0.0:8000/v1/chat/completions \
}'
```

The server processes the request, and the output prints the results:
The server processes the request, and the output is similar to:

```output
{"id":"chatcmpl-6677cb4263b34d18b436b9cb8c6a5a65","object":"chat.completion","created":1734044182,"model":"Qwen/Qwen2.5-0.5B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Certainly! Here is a simple \"Hello, World!\" program in C:\n\n```c\n#include <stdio.h>\n\nint main() {\n printf(\"Hello, World!\\n\");\n return 0;\n}\n```\n\nThis program defines a function called `main` which contains the body of the program. Inside the `main` function, it calls the `printf` function to display the text \"Hello, World!\" to the console. The `return 0` statement indicates that the program was successful and the program has ended.\n\nTo compile and run this program:\n\n1. Save the code above to a file named `hello.c`.\n2. Open a terminal or command prompt.\n3. Navigate to the directory where you saved the file.\n4. Compile the program using the following command:\n ```\n gcc hello.c -o hello\n ```\n5. Run the compiled program using the following command:\n ```\n ./hello\n ```\n Or simply type `hello` in the terminal.\n\nYou should see the output:\n\n```\nHello, World!\n```","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":26,"total_tokens":241,"completion_tokens":215,"prompt_tokens_details":null},"prompt_logprobs":null}
```

There are many other experiments you can try. Most Hugging Face models have a **Use this model** button on the top-right of the model card with the instructions for vLLM. You can now use these instructions on your Arm Linux computer.
## What you've accomplished

You've now set up a local OpenAI-compatible server and tested it by sending a chat completion request with `curl`.

You can use the instructions in this Learning Path to experiment with other models on your Arm Linux computer. Most Hugging Face models include a **Use this model** button with instructions for vLLM.

You can also try out OpenAI-compatible chat clients to connect to the served model.
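
As one possible starting point for a client, the following is a minimal Python sketch that sends a chat completion request using only the standard library. It assumes the server from this section is still running on port 8000 of the same machine. The helper names (`build_chat_request`, `extract_reply`, `chat`), the `max_tokens` value, and the `localhost` URL are illustrative assumptions, not part of vLLM.

```python
import json
import urllib.request

# Assumptions: the server started earlier is reachable on localhost:8000
# and serves the same model name that was passed to --model.
SERVER_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"


def build_chat_request(prompt, model=MODEL, max_tokens=256):
    """Build the JSON body for a single-turn chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response):
    """Return the assistant text from a chat completion response dict."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, url=SERVER_URL):
    """Send one prompt to the server and return the generated text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return extract_reply(json.load(resp))
```

With the server running, calling `chat("Write a hello world program in C")` from a Python session should return the generated text as a string. The prompt here is only an example.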
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
---
title: Build a vLLM from Source Code
title: Build vLLM from source code on Arm Linux
description: Prepare an Arm Ubuntu server for vLLM by checking BFloat16 support, installing dependencies, and building the CPU backend from source.
weight: 2

### FIXED, DO NOT MODIFY
Expand All @@ -8,35 +9,33 @@ layout: learningpathall

## Before you begin

To follow the instructions for this Learning Path, you will need an Arm server running Ubuntu 24.04 LTS with at least 8 cores, 16GB of RAM, and 50GB of disk storage. You also need a system which supports BFloat16.
[Virtual Large Language Model (vLLM)](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for inference and model serving.

To check if your system includes BFloat16, use the `lscpu` command:
You can use vLLM in batch mode, or by running an OpenAI-compatible server.

In this Learning Path, you'll learn how to build vLLM from source and run inference on an Arm-based server.

Start by checking if your system includes BFloat16 using the `lscpu` command:

```console
lscpu | grep bf16
```

If the `Flags` are printed, you have a processor with BFloat16.
If you have a processor with BFloat16, the output is similar to:

```output
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
```

If the result is blank, you do not have a processor with BFloat16.
If the output is blank, you don't have a processor with BFloat16.
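
If you want to run the same check from a script, the following is a minimal Python sketch. It assumes the CPU flags appear on a line labeled `Flags` (as in `lscpu` output) or `Features` (as in `/proc/cpuinfo` on Arm Linux). The function name `has_bf16` is a hypothetical helper. Unlike `grep bf16`, it matches the exact `bf16` flag, so a line that lists only `svebf16` doesn't count.

```python
def has_bf16(cpu_text: str) -> bool:
    """Return True if a 'Flags' or 'Features' line lists the exact bf16 flag."""
    for line in cpu_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            if "bf16" in value.split():
                return True
    return False


# Shortened sample taken from the lscpu output shown above.
sample = "Flags: fp asimd sve svebf16 i8mm bf16 dgh rng bti"
print(has_bf16(sample))  # True
```

To check the running system, you could pass the contents of `/proc/cpuinfo` to the function instead of the sample string.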

BFloat16 provides improved performance and smaller memory footprint with the same dynamic range. You might experience a drop in model inference accuracy with BFloat16, but the impact is acceptable for the majority of applications.
BFloat16 provides improved performance and a smaller memory footprint than 32-bit floating point while keeping the same dynamic range. You might experience a drop in model inference accuracy with BFloat16, but the tradeoff is acceptable for the majority of applications.
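
To illustrate the range and precision tradeoff, the following Python sketch emulates BFloat16 by keeping only the top 16 bits of a 32-bit float. This is a simplification for illustration: it truncates, whereas real hardware typically rounds, and the helper name `to_bfloat16` is hypothetical.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Emulate BFloat16 by truncating a float32 to its top 16 bits."""
    # Reinterpret the float32 value as a 32-bit unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Keep the sign, the 8 exponent bits, and the top 7 mantissa bits.
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# Precision drops: only about 2 to 3 decimal digits survive.
print(to_bfloat16(3.14159))  # 3.140625

# Range is preserved: a value near the float32 maximum stays finite.
print(to_bfloat16(1e38))
```

Because the 8 exponent bits are unchanged, very large and very small values still fit. Only the mantissa is shortened, which is where the accuracy loss comes from.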

The instructions have been tested on an AWS Graviton3 `m7g.2xlarge` instance.
The instructions in this Learning Path have been tested on an AWS Graviton3 `m7g.2xlarge` instance.

## What is vLLM?
## Install dependencies to build vLLM

[vLLM](https://github.com/vllm-project/vllm) stands for Virtual Large Language Model, and is a fast and easy-to-use library for inference and model serving.

You can use vLLM in batch mode, or by running an OpenAI-compatible server.

In this Learning Path, you will learn how to build vLLM from source and run inference on an Arm-based server, highlighting its effectiveness.

### What software do I need to install to build vLLM?
After validating that your system supports BFloat16, install vLLM dependencies.

First, ensure your system is up-to-date and install the required tools and libraries:

Expand All @@ -45,16 +44,16 @@ sudo apt-get update -y
sudo apt-get install -y curl ccache git wget vim numactl gcc g++ python3 python3-pip python3-venv python-is-python3 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 libssl-dev pkg-config
```

Next, install Rust. For more information, see the [Rust install guide](/install-guides/rust/).
Next, install Rust. For installation steps, see the [Rust install guide](/install-guides/rust/).

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
```

Four environment variables are required. You can enter these at the command line or add them to your `$HOME/.bashrc` file and source the file.
Next, set up required environment variables. You can either enter the variables at the command line or add them to your `$HOME/.bashrc` file and source the file.

To add them at the command line, use the following:
To add the environment variables at the command line, run:

```bash
export CCACHE_DIR=/home/ubuntu/.cache/ccache
Expand All @@ -72,14 +71,14 @@ source env/bin/activate

Your command-line prompt is prefixed by `(env)`, which indicates that you are in the Python virtual environment.

Now update Pip and install Python packages:
Now update `pip` and install Python packages:

```bash
pip install --upgrade pip
pip install py-cpuinfo
```

### How do I download vLLM and build it?
## Download and build vLLM

First, clone the vLLM repository from GitHub:

Expand All @@ -90,9 +89,9 @@ git checkout releases/v0.11.0
```

{{% notice Note %}}
The Git checkout specifies a specific hash known to work for this example.
The Git checkout specifies a version known to work for this example.

Omit this command to use the latest code on the main branch.
Omit this checkout command to use the latest code on the main branch.
{{% /notice %}}

Install the Python packages for vLLM:
Expand All @@ -102,7 +101,7 @@ pip install -r requirements/build.txt
pip install -v -r requirements/cpu.txt
```

Build vLLM using Pip:
Build vLLM using `pip`:

```bash
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
Expand All @@ -116,4 +115,8 @@ rm -rf dist
cd ..
```

You are now ready to download a large language model (LLM) and run vLLM.
## What you've accomplished and what's next

You've now verified BFloat16 support, installed the build dependencies, configured the required environment variables, and built vLLM from source for the CPU backend.

Next, you'll use vLLM with a Hugging Face model to run batch inference on your Arm server.