
Testing workflow for llama3.1-8b (MLC2 self hosted runner)#286

Draft
anandhu-eng wants to merge 2 commits into main from anandhu-eng-patch-3

Conversation

@anandhu-eng
Contributor

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup
  • New GitHub Actions workflow

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

TBD

  • Set up GitHub runner in endpoints repo
  • Allow necessary permission for triggering workflows to maintainers

Extra Details


This workflow runs an automated performance benchmark comparing the PR branch against main, using a self-hosted GitHub GPU runner on MLC2.

Important: Only one pipeline run executes at a time. The endpts-gpu-benchmark-testing-pipeline concurrency group (with cancel-in-progress: false) ensures no two runs overlap — a new run queues and waits rather than cancelling an in-progress one.
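The queueing behaviour described above maps onto a workflow-level `concurrency` block along these lines (a sketch; only the group name and `cancel-in-progress: false` come from this PR):

```yaml
# Serialize all runs of this pipeline on the shared GPU runner:
# a newly triggered run waits in the queue instead of cancelling
# the run already in progress.
concurrency:
  group: endpts-gpu-benchmark-testing-pipeline
  cancel-in-progress: false
```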


Jobs

1. setup_vllm_server

  • Requires manual approval via the sef-hosted-runner-benchmark-approval environment before any GPU work begins.
  • Polls nvidia-smi every 60 s (up to 2 hours) until at least one GPU is free, then selects it.
  • Starts the vllm/vllm-openai:latest container (vllm_server_llama3_endpts) on port 9000, serving meta-llama/Llama-3.1-8B-Instruct.
  • Polls the /health endpoint until the server is ready (up to 20 min), then passes the base and head commit SHAs to downstream jobs.
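The GPU wait and readiness checks in this job could be sketched as steps like the following (illustrative only; step names are assumptions, and the real logic selects a specific free GPU rather than waiting for all GPUs to be idle):

```yaml
- name: Wait for a free GPU (poll every 60 s, up to 2 hours)
  timeout-minutes: 120
  run: |
    # Simplified sketch: treat "no compute processes on any GPU" as free.
    until [ -z "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
      echo "All GPUs busy; retrying in 60 s"
      sleep 60
    done

- name: Wait for vLLM /health (up to 20 min)
  run: |
    for i in $(seq 1 120); do
      curl -sf http://localhost:9000/health && exit 0
      sleep 10
    done
    echo "vLLM server never became healthy" >&2
    exit 1
```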

2. run_benchmarks (matrix: concurrency [1, 4, 16] · max-parallel: 1)
Each matrix leg runs sequentially (one at a time) and covers both branches in a single job:

  • Checks out main, installs dependencies, runs a 50-sample warmup (discarded), then a full 2000-sample benchmark → uploads results as a GitHub artifact named llama-3.1-8b_vllm_perf_concurrency${{ matrix.target_concurrencies }}-${{ needs.setup_vllm_server.outputs.base_sha }}-${{ needs.setup_vllm_server.outputs.head_sha }}-Main.
  • Checks out the PR head, repeats the same warmup + benchmark → uploads results as llama-3.1-8b_vllm_perf_concurrency${{ matrix.target_concurrencies }}-${{ needs.setup_vllm_server.outputs.base_sha }}-${{ needs.setup_vllm_server.outputs.head_sha }}--PR.

This produces 6 artifacts in total (3 concurrency levels × 2 branches), each retained for 30 days.
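The sequential matrix can be expressed with `max-parallel: 1`, roughly like this (a sketch; the `target_concurrencies` key is taken from the artifact names above):

```yaml
run_benchmarks:
  needs: setup_vllm_server
  strategy:
    max-parallel: 1              # one matrix leg at a time on the shared GPU
    matrix:
      target_concurrencies: [1, 4, 16]
```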

3. post_pr_comment

  • Downloads all 6 artifacts.
  • A Python script loads each result_summary.json pair (main vs PR) per concurrency level, computes percentage deltas for QPS, TTFT (median/p90/p99), and latency (median/p90/p99), and flags any metric that regresses by more than 2%.
  • Posts the resulting Markdown table as a comment on the PR via the GitHub REST API.
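The regression check can be sketched in Python as below. The field names are assumptions; only the 2% threshold, the metric set, and the main-vs-PR comparison come from the description above.

```python
REGRESSION_THRESHOLD_PCT = 2.0

def pct_delta(main_value: float, pr_value: float) -> float:
    """Percentage change of the PR value relative to main."""
    return (pr_value - main_value) / main_value * 100.0

def flag_regressions(main: dict, pr: dict) -> dict:
    """Return {metric: delta_pct} for metrics regressing by more than 2%.

    For throughput (qps) a *drop* is a regression; for TTFT/latency
    metrics an *increase* is a regression.
    """
    flagged = {}
    for metric, main_val in main.items():
        delta = pct_delta(main_val, pr[metric])
        worse_by = -delta if metric == "qps" else delta
        if worse_by > REGRESSION_THRESHOLD_PCT:
            flagged[metric] = delta
    return flagged
```

For example, a drop from 100 QPS on main to 95 QPS on the PR is a -5% delta and would be flagged, while a 1% rise in p99 TTFT stays under the threshold.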

4. teardown_server (runs if: always())
Stops and removes the vLLM container, deletes the Python venv, and cleans the workspace — regardless of whether earlier jobs succeeded or failed.
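A minimal teardown job matching this description might look like the following (sketch; the `needs` list and venv path are assumptions):

```yaml
teardown_server:
  if: always()                   # run even when earlier jobs fail
  needs: [setup_vllm_server, run_benchmarks, post_pr_comment]
  steps:
    - name: Stop container and clean workspace
      run: |
        docker rm -f vllm_server_llama3_endpts || true
        rm -rf .venv               # assumed venv location
```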


Trigger

on:
  pull_request:
    types: [opened, synchronize, reopened]
    paths:
      - 'src/inference_endpoints/**'
      - 'examples/**'
      - 'pyproject.toml'

@anandhu-eng anandhu-eng requested a review from a team April 20, 2026 16:44
@anandhu-eng anandhu-eng marked this pull request as draft April 20, 2026 16:45
@gemini-code-assist

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@github-actions

github-actions Bot commented Apr 20, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested review from arekay-nv and nvzhihanj April 20, 2026 16:45