diff --git a/.github/skills/learning-path-structure-review/SKILL.md b/.github/skills/learning-path-structure-review/SKILL.md index a152c79db0..08b3aa94e4 100644 --- a/.github/skills/learning-path-structure-review/SKILL.md +++ b/.github/skills/learning-path-structure-review/SKILL.md @@ -29,7 +29,7 @@ Use this skill when a Learning Path needs a structural review. Focus on whether - Prefer direct relevance, Arm Learning Paths, required tools, foundation knowledge, and logical next steps. - Avoid link piles that pull readers away from the task. 6. Review recap and transition sections: - - Include concise recap and forward-looking transition at major instructional boundaries. + - Include concise recap and forward-looking transition at major instructional boundaries. Do not treat a transition sentence alone as a recap. Note the absence of a transition as a finding. - Use `what you've learned` for conceptual sections and `what you've accomplished` for task sections. - Avoid repeating earlier content verbatim. 7. If the Learning Path demonstrates Arm-specific performance features, apply the performance integrity checks. diff --git a/content/learning-paths/servers-and-cloud-computing/vllm/_index.md b/content/learning-paths/servers-and-cloud-computing/vllm/_index.md index 069912aca1..030c0ed599 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm/_index.md @@ -1,5 +1,6 @@ --- -title: Build and Run vLLM on Arm Servers +title: Build and run vLLM on Arm servers +description: Build vLLM from source on an Arm Linux server, run batch inference with a Hugging Face model, and expose the model through an OpenAI-compatible API. minutes_to_complete: 45 @@ -7,12 +8,13 @@ who_is_this_for: This is an introductory topic for software developers and AI en learning_objectives: - Build vLLM from source on an Arm server. - - Download a Qwen LLM from Hugging Face. + - Use a Qwen LLM from Hugging Face. 
- Run local batch inference using vLLM. - Create and interact with an OpenAI-compatible server provided by vLLM on your Arm server. prerequisites: - - An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, or a local Arm Linux computer with at least 8 CPUs and 16 GB RAM. + - An [Arm-based Linux instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, or a local Arm Linux computer running Ubuntu 24.04 with at least 8 CPUs, 16 GB RAM, and 50 GB of disk storage. + - A system that includes support for BFloat16. author: Jason Andrews @@ -59,4 +61,3 @@ weight: 1 # _index.md always has weight of 1 to order corr layout: "learningpathall" # All files under learning paths have this same wrapper learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. --- - diff --git a/content/learning-paths/servers-and-cloud-computing/vllm/vllm-run.md b/content/learning-paths/servers-and-cloud-computing/vllm/vllm-run.md index eb0f2a17f6..23cffc9ffc 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm/vllm-run.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm/vllm-run.md @@ -1,5 +1,6 @@ --- title: Run batch inference using vLLM +description: Use vLLM to load a Qwen model from Hugging Face and run batch inference prompts on an Arm server. weight: 3 ### FIXED, DO NOT MODIFY @@ -8,11 +9,11 @@ layout: learningpathall ## Use a model from Hugging Face -vLLM is designed to work seamlessly with models from the Hugging Face Hub. +vLLM is designed to work with models from the Hugging Face Hub. -The first time you run vLLM, it downloads the required model. This means that you do not have to explicitly download any models. +The first time you run vLLM, it downloads the required model. You don't have to explicitly download any models. 
-If you want to use a model that requires you to request access or accept the terms, you need to log in to Hugging Face using a token. +To use a model that requires you to request access or accept terms and conditions, log in to Hugging Face using a token: ```bash huggingface-cli login @@ -20,15 +21,15 @@ huggingface-cli login Enter your Hugging Face token. You can generate a token from [Hugging Face Hub](https://huggingface.co/) by clicking your profile on the top right corner and selecting **Access Tokens**. -You also need to visit the Hugging Face link printed in the login output and accept the terms by clicking the **Agree and access repository** button or filling out the request-for-access form, depending on the model. +Visit the Hugging Face link printed in the login output and accept the terms and conditions. Click the **Agree and access repository** button or fill out the request-for-access form, depending on the model. -To run batched inference without the need for a login, you can use the `Qwen/Qwen2.5-0.5B-Instruct` model. +To run batched inference without logging in, use the `Qwen/Qwen2.5-0.5B-Instruct` model. ## Create a batch script -To run inference with multiple prompts, you can create a simple Python script to load a model and run the prompts. +To run inference with multiple prompts, create a Python script to load a model and run the prompts. -Use a text editor to save the Python script below in a file called `batch.py`: +Use a text editor to save the following Python script in a file called `batch.py`: ```python import json @@ -137,4 +138,10 @@ Processed prompts: 100%|██████████████████ } ``` -You can try with other prompts and models such as `meta-llama/Llama-3.2-1B`. Continue to learn how to set up an OpenAI-compatible server. 
+## What you've accomplished and what's next + +You've now created a Python batch inference script that loads the `Qwen/Qwen2.5-0.5B-Instruct` model from Hugging Face, configures `bfloat16` precision, and sends multiple prompts to vLLM. + +You ran the script and confirmed that vLLM starts on the CPU backend, loads the model, processes the prompts, and returns generated text. + +Next, you'll set up an OpenAI-compatible server so client applications can send requests to vLLM. diff --git a/content/learning-paths/servers-and-cloud-computing/vllm/vllm-server.md b/content/learning-paths/servers-and-cloud-computing/vllm/vllm-server.md index 68b848503e..d9b73b8cd9 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm/vllm-server.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm/vllm-server.md @@ -1,20 +1,23 @@ --- -title: Run an OpenAI-compatible server +title: Run an OpenAI-compatible vLLM server +description: Start a local vLLM OpenAI-compatible server on Arm Linux and send a chat completion request with curl. weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -Instead of a batch run from Python, you can create an OpenAI-compatible server. This allows you to leverage the power of Large Language Models without relying on external APIs. +## Create a local vLLM server compatible with OpenAI + +To run Large Language Models (LLMs) without relying on external APIs, create an OpenAI-compatible server. Running a local LLM offers several advantages: -* Cost-effective - it avoids the costs associated with using external APIs, especially for high-usage scenarios. +* Cost-effectiveness - it avoids the costs associated with using external APIs, especially for high-usage scenarios. * Privacy - it keeps your data and prompts within your local environment, which enhances privacy and security. -* Offline Capability - it enables operation without an internet connection, making it ideal for scenarios with limited or unreliable network access. 
+* Offline capability - it enables operation without an internet connection, making it ideal for scenarios with limited or unreliable network access. -OpenAI compatibility means that you can reuse existing software which was designed to communicate with OpenAI and use it to communicate with your local vLLM service. +OpenAI compatibility means that you can reuse existing software written for the OpenAI API to communicate with your local vLLM service. Run vLLM with the same `Qwen/Qwen2.5-0.5B-Instruct` model: @@ -22,7 +25,7 @@ Run vLLM with the same `Qwen/Qwen2.5-0.5B-Instruct` model: python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --dtype float16 --max-num-batched-tokens 32768 ``` -The server output displays that it is ready for requests: +When the server is ready for requests, the output is similar to: ```output INFO 12-12 22:54:40 cpu_executor.py:186] # CPU blocks: 21845 @@ -51,7 +54,7 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) You can submit requests to the server using the `curl` command. -For example, run the command below using another terminal on the same server: +For example, run the following command using another terminal on the same server: ```bash curl http://0.0.0.0:8000/v1/chat/completions \ @@ -72,12 +75,16 @@ curl http://0.0.0.0:8000/v1/chat/completions \ }' ``` -The server processes the request, and the output prints the results: +The server processes the request, and the output is similar to: ```output "id":"chatcmpl-6677cb4263b34d18b436b9cb8c6a5a65","object":"chat.completion","created":1734044182,"model":"Qwen/Qwen2.5-0.5B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Certainly! Here is a simple \"Hello, World!\" program in C:\n\n```c\n#include \n\nint main() {\n printf(\"Hello, World!\\n\");\n return 0;\n}\n```\n\nThis program defines a function called `main` which contains the body of the program. Inside the `main` function, it calls the `printf` function to display the text \"Hello, World!\" to the console. 
The `return 0` statement indicates that the program was successful and the program has ended.\n\nTo compile and run this program:\n\n1. Save the code above to a file named `hello.c`.\n2. Open a terminal or command prompt.\n3. Navigate to the directory where you saved the file.\n4. Compile the program using the following command:\n ```\n gcc hello.c -o hello\n ```\n5. Run the compiled program using the following command:\n ```\n ./hello\n ```\n Or simply type `hello` in the terminal.\n\nYou should see the output:\n\n```\nHello, World!\n```","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":26,"total_tokens":241,"completion_tokens":215,"prompt_tokens_details":null},"prompt_logprobs":null} ``` -There are many other experiments you can try. Most Hugging Face models have a **Use this model** button on the top-right of the model card with the instructions for vLLM. You can now use these instructions on your Arm Linux computer. +## What you've accomplished + +You've now set up a local OpenAI-compatible server and tested sending requests to the server. + +You can use the instructions in this Learning Path to experiment with other models on your Arm Linux computer. Most Hugging Face models include a **Use this model** button with instructions for vLLM. You can also try out OpenAI-compatible chat clients to connect to the served model. 
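The `curl` request in this section can also be sent from Python. The following is a minimal sketch, assuming the server started above is still listening on port 8000 and serving `Qwen/Qwen2.5-0.5B-Instruct`. The helper names `build_chat_request` and `send_chat_request`, the `max_tokens` default, and the timeout are illustrative choices, not part of the Learning Path.

```python
import json
import urllib.request

# Assumption: the vLLM server from this section is running locally on port 8000.
SERVER_URL = "http://0.0.0.0:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"


def build_chat_request(prompt, model=MODEL, max_tokens=256):
    """Build the JSON body for a chat completion request (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, url=SERVER_URL, timeout=120):
    """POST the payload to the server and return the assistant's reply text."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # The reply text sits at choices[0].message.content, as in the sample output.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send_chat_request(build_chat_request("Write a hello world program in C")))` prints the generated reply. The sketch uses only the Python standard library, so it needs no extra packages in the virtual environment.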
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm/vllm-setup.md b/content/learning-paths/servers-and-cloud-computing/vllm/vllm-setup.md index 4f5a8d9b85..309bbc93a5 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm/vllm-setup.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm/vllm-setup.md @@ -1,5 +1,6 @@ --- -title: Build a vLLM from Source Code +title: Build vLLM from source code on Arm Linux +description: Prepare an Arm Ubuntu server for vLLM by checking BFloat16 support, installing dependencies, and building the CPU backend from source. weight: 2 ### FIXED, DO NOT MODIFY @@ -8,35 +9,33 @@ layout: learningpathall ## Before you begin -To follow the instructions for this Learning Path, you will need an Arm server running Ubuntu 24.04 LTS with at least 8 cores, 16GB of RAM, and 50GB of disk storage. You also need a system which supports BFloat16. +[Virtual Large Language Model (vLLM)](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for inference and model serving. -To check if your system includes BFloat16, use the `lscpu` command: +You can use vLLM in batch mode, or by running an OpenAI-compatible server. + +In this Learning Path, you'll learn how to build vLLM from source and run inference on an Arm-based server. + +Start by checking whether your system supports BFloat16 using the `lscpu` command: ```console lscpu | grep bf16 ``` -If the `Flags` are printed, you have a processor with BFloat16. +If you have a processor with BFloat16, the output is similar to: ```output Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti ``` -If the result is blank, you do not have a processor with BFloat16. +If the output is blank, you don't have a processor with BFloat16. 
-BFloat16 provides improved performance and smaller memory footprint with the same dynamic range. You might experience a drop in model inference accuracy with BFloat16, but the impact is acceptable for the majority of applications. +BFloat16 provides improved performance and a smaller memory footprint than FP32 while keeping the same dynamic range. You might experience a drop in model inference accuracy with BFloat16, but the tradeoff is acceptable for the majority of applications. -The instructions have been tested on an AWS Graviton3 `m7g.2xlarge` instance. +The instructions in this Learning Path have been tested on an AWS Graviton3 `m7g.2xlarge` instance. -## What is vLLM? +## Install dependencies to build vLLM -[vLLM](https://github.com/vllm-project/vllm) stands for Virtual Large Language Model, and is a fast and easy-to-use library for inference and model serving. - -You can use vLLM in batch mode, or by running an OpenAI-compatible server. - -In this Learning Path, you will learn how to build vLLM from source and run inference on an Arm-based server, highlighting its effectiveness. - -### What software do I need to install to build vLLM? +After validating that your system supports BFloat16, install the dependencies needed to build vLLM. First, ensure your system is up-to-date and install the required tools and libraries: @@ -45,16 +44,16 @@ sudo apt-get update -y sudo apt-get install -y curl ccache git wget vim numactl gcc g++ python3 python3-pip python3-venv python-is-python3 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 libssl-dev pkg-config ``` -Next, install Rust. For more information, see the [Rust install guide](/install-guides/rust/). +Next, install Rust. For installation steps, see the [Rust install guide](/install-guides/rust/). ```bash curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y source "$HOME/.cargo/env" ``` -Four environment variables are required. 
You can enter these at the command line or add them to your `$HOME/.bashrc` file and source the file. +Next, set up required environment variables. You can either enter the variables at the command line or add them to your `$HOME/.bashrc` file and source the file. -To add them at the command line, use the following: +To add the environment variables at the command line, run: ```bash export CCACHE_DIR=/home/ubuntu/.cache/ccache @@ -72,14 +71,14 @@ source env/bin/activate Your command-line prompt is prefixed by `(env)`, which indicates that you are in the Python virtual environment. -Now update Pip and install Python packages: +Now update `pip` and install Python packages: ```bash pip install --upgrade pip pip install py-cpuinfo ``` -### How do I download vLLM and build it? +## Download and build vLLM First, clone the vLLM repository from GitHub: @@ -90,9 +89,9 @@ git checkout releases/v0.11.0 ``` {{% notice Note %}} -The Git checkout specifies a specific hash known to work for this example. +The Git checkout specifies a version known to work for this example. -Omit this command to use the latest code on the main branch. +Omit this checkout command to use the latest code on the main branch. {{% /notice %}} Install the Python packages for vLLM: @@ -102,7 +101,7 @@ pip install -r requirements/build.txt pip install -v -r requirements/cpu.txt ``` -Build vLLM using Pip: +Build vLLM using `pip`: ```bash VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel @@ -116,4 +115,8 @@ rm -rf dist cd .. ``` -You are now ready to download a large language model (LLM) and run vLLM. +## What you've accomplished and what's next + +You've now verified BFloat16 support, installed the build dependencies, configured the required environment variables, and built vLLM from source for the CPU backend. + +Next, you'll use vLLM with a Hugging Face model to run batch inference on your Arm server.
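The BFloat16 check from the setup steps can also be scripted, for example to stop a build early on an unsupported machine. The following is a minimal sketch, assuming the CPU flags are available as text, either from `lscpu` (a `Flags:` line) or from `/proc/cpuinfo` on Arm Linux (a `Features` line). The helper names `has_bf16` and `read_cpuinfo` are hypothetical. Unlike `grep bf16`, which also matches `svebf16`, this sketch looks for `bf16` as a whole flag.

```python
def has_bf16(cpu_flags_text):
    """Return True if a Flags/Features line lists bf16 as a whole flag."""
    for line in cpu_flags_text.splitlines():
        name, _, value = line.partition(":")
        # lscpu labels the line "Flags"; /proc/cpuinfo on Arm Linux uses "Features".
        if name.strip().lower() in ("flags", "features"):
            if "bf16" in value.split():
                return True
    return False


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description (assumes a Linux system)."""
    with open(path, encoding="utf-8") as cpuinfo:
        return cpuinfo.read()
```

On the target server, `has_bf16(read_cpuinfo())` returns `True` when the processor reports the `bf16` flag.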