
Testing workflow for llama3.1-8b (MLC2 self hosted runner)#286

Draft
anandhu-eng wants to merge 2 commits into main from anandhu-eng-patch-3

Conversation

@anandhu-eng
Contributor

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup
  • New GitHub Actions workflow

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

TBD

  • Set up GitHub runner in endpoints repo
  • Allow necessary permission for triggering workflows to maintainers

Extra Details


This workflow runs an automated performance benchmark comparing the PR branch against main, using a self-hosted GitHub GPU runner on MLC2.

Important: Only one pipeline run executes at a time. The endpts-gpu-benchmark-testing-pipeline concurrency group (with cancel-in-progress: false) ensures no two runs overlap — a new run queues and waits rather than cancelling an in-progress one.
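The queueing behaviour described above maps onto a workflow-level `concurrency` block along these lines (a sketch; only the group name and `cancel-in-progress: false` come from this PR):

```yaml
# Serialize all runs of this pipeline on the shared GPU runner:
# a newly triggered run waits in the queue instead of cancelling
# the run already in progress.
concurrency:
  group: endpts-gpu-benchmark-testing-pipeline
  cancel-in-progress: false
```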


Jobs

1. setup_vllm_server

  • Requires manual approval via the sef-hosted-runner-benchmark-approval environment before any GPU work begins.
  • Polls nvidia-smi every 60 s (up to 2 hours) until at least one GPU is free, then selects it.
  • Starts the vllm/vllm-openai:latest container (vllm_server_llama3_endpts) on port 9000, serving meta-llama/Llama-3.1-8B-Instruct.
  • Polls the /health endpoint until the server is ready (up to 20 min), then passes the base and head commit SHAs to downstream jobs.
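The GPU wait and readiness checks in this job could be sketched as steps like the following (illustrative only; step names are assumptions, and the real logic selects a specific free GPU rather than waiting for all GPUs to be idle):

```yaml
- name: Wait for a free GPU (poll every 60 s, up to 2 hours)
  timeout-minutes: 120
  run: |
    # Simplified sketch: treat "no compute processes on any GPU" as free.
    until [ -z "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
      echo "All GPUs busy; retrying in 60 s"
      sleep 60
    done

- name: Wait for vLLM /health (up to 20 min)
  run: |
    for i in $(seq 1 120); do
      curl -sf http://localhost:9000/health && exit 0
      sleep 10
    done
    echo "vLLM server never became healthy" >&2
    exit 1
```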

2. run_benchmarks (matrix: concurrency [1, 4, 16] · max-parallel: 1)
Each matrix leg runs sequentially (one at a time) and covers both branches in a single job:

  • Checks out main, installs dependencies, runs a 50-sample warmup (discarded), then a full 2000-sample benchmark → uploads results as a GitHub artifact named llama-3.1-8b_vllm_perf_concurrency${{ matrix.target_concurrencies }}-${{ needs.setup_vllm_server.outputs.base_sha }}-${{ needs.setup_vllm_server.outputs.head_sha }}-Main.
  • Checks out the PR head, repeats the same warmup + benchmark → uploads results as llama-3.1-8b_vllm_perf_concurrency${{ matrix.target_concurrencies }}-${{ needs.setup_vllm_server.outputs.base_sha }}-${{ needs.setup_vllm_server.outputs.head_sha }}--PR.

This produces 6 artifacts in total (3 concurrency levels × 2 branches), each retained for 30 days.
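The sequential matrix can be expressed with `max-parallel: 1`, roughly like this (a sketch; the `target_concurrencies` key is taken from the artifact names above):

```yaml
run_benchmarks:
  needs: setup_vllm_server
  strategy:
    max-parallel: 1              # one matrix leg at a time on the shared GPU
    matrix:
      target_concurrencies: [1, 4, 16]
```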

3. post_pr_comment

  • Downloads all 6 artifacts.
  • A Python script loads each result_summary.json pair (main vs PR) per concurrency level, computes percentage deltas for QPS, TTFT (median/p90/p99), and latency (median/p90/p99), and flags any metric that regresses by more than 2%.
  • Posts the resulting Markdown table as a comment on the PR via the GitHub REST API.
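The regression check can be sketched in Python as below. The field names are assumptions; only the 2% threshold, the metric set, and the main-vs-PR comparison come from the description above.

```python
REGRESSION_THRESHOLD_PCT = 2.0

def pct_delta(main_value: float, pr_value: float) -> float:
    """Percentage change of the PR value relative to main."""
    return (pr_value - main_value) / main_value * 100.0

def flag_regressions(main: dict, pr: dict) -> dict:
    """Return {metric: delta_pct} for metrics regressing by more than 2%.

    For throughput (qps) a *drop* is a regression; for TTFT/latency
    metrics an *increase* is a regression.
    """
    flagged = {}
    for metric, main_val in main.items():
        delta = pct_delta(main_val, pr[metric])
        worse_by = -delta if metric == "qps" else delta
        if worse_by > REGRESSION_THRESHOLD_PCT:
            flagged[metric] = delta
    return flagged
```

For example, a drop from 100 QPS on main to 95 QPS on the PR is a -5% delta and would be flagged, while a 1% rise in p99 TTFT stays under the threshold.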

4. teardown_server (runs if: always())
Stops and removes the vLLM container, deletes the Python venv, and cleans the workspace — regardless of whether earlier jobs succeeded or failed.
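A minimal teardown job matching this description might look like the following (sketch; the `needs` list and venv path are assumptions):

```yaml
teardown_server:
  if: always()                   # run even when earlier jobs fail
  needs: [setup_vllm_server, run_benchmarks, post_pr_comment]
  steps:
    - name: Stop container and clean workspace
      run: |
        docker rm -f vllm_server_llama3_endpts || true
        rm -rf .venv               # assumed venv location
```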


Trigger

on:
  pull_request:
    types: [opened, synchronize, reopened]
    paths:
      - 'src/inference_endpoints/**'
      - 'examples/**'
      - 'pyproject.toml'

@anandhu-eng anandhu-eng requested a review from a team April 20, 2026 16:44
@anandhu-eng anandhu-eng marked this pull request as draft April 20, 2026 16:45
@gemini-code-assist

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@github-actions

github-actions Bot commented Apr 20, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested review from arekay-nv and nvzhihanj April 20, 2026 16:45