ArmDeveloperEcosystem · kieranhejmadi01 · May 21, 2026 · Jun 8, 2026
diff --git a/.../learning-paths/servers-and-cloud-computing/performix-instruction-mix/_index.md b/.../learning-paths/servers-and-cloud-computing/performix-instruction-mix/_index.md
@@ -0,0 +1,68 @@
+---
+title: Profile GPT-2 instruction mix with Arm Performix
+
+description: Learn how to profile GPT-2 inference on Arm Neoverse with the Arm Performix Instruction Mix recipe, identify scalar versus vector execution patterns, and improve throughput with NEON, SVE, and KleidiAI kernels.
+
+minutes_to_complete: 45
+
+who_is_this_for: This is an introductory topic for developers who want to get started using the instruction mix recipe in Arm Performix through a practical example.
+
+learning_objectives: 
+    - Explain how the Instruction Mix recipe combines static disassembly with runtime sampling to show execution behavior
+    - Build and run the GPT-2 inference example on an Arm Linux server
+    - Identify why matrix multiplication dominates runtime and how vectorization changes the instruction mix
+    - Compare throughput and instruction mix across scalar, NEON, SVE, and KleidiAI implementations
+
+prerequisites:
+    - Access to Arm Performix configured with a remote Arm Linux target. For setup, see the [Arm Performix install guide](/install-guides/performix/)
+    - Basic understanding of C++ and compiler optimization
+    - Basic understanding of matrix multiplication
+    - Basic understanding of writing SIMD code with Neon and/or SVE.
+
+author:
+    - Kieran Hejmadi
+    - Oliver Grainge
+
+### Tags
+skilllevels: Introductory
+subjects: Performance and Architecture
+armips:
+    - Neoverse
+tools_software_languages:
+    - Arm Performix
+    - C++
+    - LLM
+    - NEON
+    - SVE
+operatingsystems:
+    - Linux
+further_reading:
+    - resource:
+        title: Arm Performix User Guide
+        link: https://developer.arm.com/documentation/110163/latest
+        type: documentation
+    - resource:
+        title: Find code hotspots with Arm Performix
+        link: /learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/
+        type: learning-path
+    - resource:
+        title: Identify code hotspots using Arm Performix through the Arm MCP Server
+        link: /learning-paths/servers-and-cloud-computing/performix-mcp-agent/
+        type: learning-path
+    - resource:
+        title: Arm MCP Server GitHub Repository
+        link: https://github.com/arm/mcp
+        type: website
+    - resource:
+        title: GPT-2 Example repository
+        link: https://github.com/arm-education/GPT-2-Example
+        type: website
+
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1                       # _index.md always has weight of 1 to order correctly
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/...ning-paths/servers-and-cloud-computing/performix-instruction-mix/_next-steps.md b/...ning-paths/servers-and-cloud-computing/performix-instruction-mix/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+#       FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21                  # The weight controls the order of the pages. _index.md always has weight 1.
+title: "Next Steps"         # Always the same, html page title.
+layout: "learningpathall"   # All files under learning paths have this same wrapper for Hugo processing.
+---
diff --git a/...g-paths/servers-and-cloud-computing/performix-instruction-mix/code_hotspot.webp b/...g-paths/servers-and-cloud-computing/performix-instruction-mix/code_hotspot.webp
diff --git a/...servers-and-cloud-computing/performix-instruction-mix/code_hotspot_results.webp b/...servers-and-cloud-computing/performix-instruction-mix/code_hotspot_results.webp
diff --git a/...ervers-and-cloud-computing/performix-instruction-mix/configuring-performix.webp b/...ervers-and-cloud-computing/performix-instruction-mix/configuring-performix.webp
diff --git a/...hs/servers-and-cloud-computing/performix-instruction-mix/dynamic-functions.webp b/...hs/servers-and-cloud-computing/performix-instruction-mix/dynamic-functions.webp
diff --git a/...g-paths/servers-and-cloud-computing/performix-instruction-mix/gpt2-baseline.gif b/...g-paths/servers-and-cloud-computing/performix-instruction-mix/gpt2-baseline.gif
diff --git a/...arning-paths/servers-and-cloud-computing/performix-instruction-mix/hotspot.webp b/...arning-paths/servers-and-cloud-computing/performix-instruction-mix/hotspot.webp
diff --git a/...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-1.md b/...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-1.md
@@ -0,0 +1,44 @@
+---
+title: Background
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## What the instruction mix recipe shows
+
+The Arm Performix Instruction Mix recipe shows the types and proportions of machine instructions your workload executes at runtime and in static analysis, so you can see how efficiently your code uses Arm CPU hardware resources.
+
+The Instruction Mix recipe classifies each instruction into a group. The available groups depend on the Neoverse architecture version you are profiling. Therefore the categories you see may vary depending on the version of Arm Neoverse you are using. Typical categories include:
+
+- integer and floating-point arithmetic
+- memory loads and stores (including exclusive operations)
+- control flow instructions, such as branches and loops
+- specialized instructions, such as cryptographic operations
+- SIMD (Single Instruction, Multiple Data) instructions, including NEON (fixed 128-bit) and SVE (scalable vector length)
+
+The instruction mix result gives you two complementary views:
+
+- static analysis, which inspects compiled machine code without running it
+- dynamic analysis, which measures instruction usage during real execution
+
+Together, these views help you verify whether architecture-specific features are actually active in hot code paths.
+
+## Why instruction mix is useful
+
+Instruction mix is useful when you need to confirm that performance-critical code uses Arm CPU features effectively. This is especially helpful when you are, for example, validating the effectiveness of compiler autovectorization.
+
+For example, if a hot function is mostly scalar at runtime when you expected NEON or SVE activity, that often indicates missed vectorization opportunities. You can then focus optimization work on compiler flags, data layout, loop structure, and kernel implementation to improve throughput where it matters most.
+
+## Why use a GPT-2 workload
+
+In this Learning Path, you run the [GPT-2 Medium](https://huggingface.co/openai-community/gpt2-medium) model on a minimal C++ inference engine to analyze instruction mix and throughput. This model is available under a [modified MIT License](https://github.com/openai/gpt-2/blob/master/LICENSE). You will confirm that matrix multiplication (`matmul`) is the hot path, then compare how scalar, NEON, and SVE implementations change instruction behavior and token generation speed. 
+
+This example implements only the forward inference path, with no back propagation or training. You do not need to understand the full transformer architecture to complete this Learning Path. Familiarity with matrix multiplication is enough. For background on GPT-2, see the original 2019 paper, [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
+
+You will also try implementing your own `matmul` kernels that target NEON and SVE, then use instruction mix data to verify that these vector paths are active and improving throughput.
+
+## What you've learned and what's next
+
+In this section, you learned what instruction mix represents and why it is useful for LLM inference optimization on Arm. Next, you will set up the GPT-2 example, build the binaries, and run a baseline test.
diff --git a/...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-2.md b/...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-2.md
@@ -0,0 +1,112 @@
+---
+title: Set up and run GPT-2 baseline
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Prepare the environment
+
+Use an Arm Linux target, such as an Arm Neoverse cloud instance. The results in this Learning Path were collected on a Graviton 3 instance based on Neoverse V1 running Ubuntu 24.04 LTS. If you have not configured Arm Performix yet, complete setup and target connection using the [Arm Performix install guide](/install-guides/performix/).
+
+Install build prerequisites and clone the GPT-2 example repository:
+
+```bash
+sudo apt update
+sudo apt install -y git g++ cmake python3 python3-venv
+git clone --recurse-submodules https://github.com/arm-education/GPT-2-Example.git
+cd GPT-2-Example
+git checkout tags/v0.0.2
+```
+
+## Export GPT-2 model assets
+
+The C++ runtime expects exported model binaries. Create a Python virtual environment, install dependencies, and export GPT-2 Medium weights and vocabulary:
+
+This Learning Path uses [openai-community/gpt2-medium on Hugging Face](https://huggingface.co/openai-community/gpt2-medium), which corresponds to the GPT-2 Medium model from the original OpenAI GPT-2 release in 2019. The model has 355 million parameters, and in this workflow it runs with unquantized FP32 (32-bit floating-point) weights.
+
+```bash
+python3 -m venv venv
+source venv/bin/activate
+pip install -r src/requirements.txt
+python3 src/export_gpt2.py --model gpt2-medium
+```
+
+This creates:
+
+- `models/gpt2-medium/weights.bin`
+- `models/gpt2-medium/vocab.bin`
+
+## Review the source code
+
+The `src/gpt2.cpp` file implements the end-to-end GPT-2 inference loop. Each generated token triggers a forward pass over all 24 transformer layers. Inside each layer, `matmul` is called multiple times: for the query/key/value projection, the attention output projection, and both feed-forward layers. It is called once more at the end for logits projection over the vocabulary:
+
+```cpp
+// Attention QKV projection
+matmul(s.qkv.data(), s.xb.data(),
+       w.c_attn_w.data()+(size_t)l*3*E*E,
+       w.c_attn_b.data()+(size_t)l*3*E, E, 3*E);
+
+// FFN expand
+matmul(s.mlp_h.data(), s.xb.data(),
+       w.mlp_fc_w.data()+(size_t)l*4*E*E,
+       w.mlp_fc_b.data()+(size_t)l*4*E, E, 4*E);
+
+// Logits projection (vocab_size x n_embd)
+matmul(s.logits.data(), s.x.data(), w.wte.data(), nullptr, E, cfg.vocab_size);
+```
+
+The `matmul` dispatch in `gpt2.cpp` selects a kernel at compile time based on a preprocessor flag:
+
+```cpp
+static void matmul(float *out, const float *x, const float *W, const float *b,
+                   int n_in, int n_out) {
+#if defined(GPT2_KERNEL_NEON)
+    kernels::matmul_neon(out, x, W, b, n_in, n_out);
+#elif defined(GPT2_KERNEL_SVE)
+    kernels::matmul_sve(out, x, W, b, n_in, n_out);
+#elif defined(GPT2_KERNEL_USER)
+    kernels::matmul_user(out, x, W, b, n_in, n_out);
+#else
+    kernels::matmul_ref(out, x, W, b, n_in, n_out);
+#endif
+}
+```
+
+The baseline kernel (`src/kernels/matmul_ref.cpp`) is a straightforward scalar nested for loop: for each output row, it walks the weight matrix row and accumulates a dot product with the input vector:
+
+```cpp
+void matmul_ref(float *out, const float *x, const float *W, const float *b,
+                int n_in, int n_out) {
+    for (int i = 0; i < n_out; i++) {
+        float acc = b ? b[i] : 0.f;
+        const float *row = W + (size_t)i * n_in;
+        for (int j = 0; j < n_in; j++) acc += row[j] * x[j];
+        out[i] = acc;
+    }
+}
+```
+
+This scalar implementation can leave NEON and SVE vector units underused if the compiler cannot efficiently autovectorize it. Because `matmul` is called hundreds of times per token, explicitly optimizing this kernel guarantees SIMD execution where most of the available compute is spent.
+
+## Build and run the baseline
+
+Configure and build the project with CMake. The project uses `-O2 -g`, which keeps optimization enabled while preserving debug symbols for profiling.
+
+```bash
+cmake -S . -B build -DBUILD_USER_MATMUL=ON
+cmake --build build --parallel
+```
+
+Run the scalar baseline binary:
+
+```bash
+./build/gpt2 --model gpt2-medium "Once upon a time" -n 20
+```
+
+![Animated terminal output showing GPT-2 baseline inference running on Arm Linux, including generated text and the final tokens-per-second summary used for baseline comparison.#center](./gpt2-baseline.gif "GPT-2 baseline runtime output on Arm Linux")
+
+## What you've learned and what's next
+
+You now have a working baseline binary and model files. Next, you will use the Instruction Mix recipe in Arm Performix to inspect static disassembly and dynamic runtime behavior.
diff --git a/...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-3.md b/...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-3.md
@@ -0,0 +1,65 @@
+---
+title: Profile with instruction mix
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Find the code hotspot
+
+Before you optimize, identify where the application spends most of its time. Use the Code Hotspots recipe to periodically sample the running application and build a profile of the functions that execute most often.
+
+Open Arm Performix and select the **Code Hotspots** recipe. If this is your first run on the target, complete tool deployment as prompted.
+
+Set the launch command to your baseline binary with the number of tokens (`-n`) set to 150. This value keeps startup overhead small compared to inference time, so the profile minimizes the time taken to load the model weights:
+
+![Arm Performix Code Hotspots recipe configuration showing launch arguments for the GPT-2 baseline run with -n 150 to emphasize inference runtime.#center](./code_hotspot.webp "Code Hotspots recipe configuration for GPT-2 baseline")
+
+The results show that `kernels::matmul_ref()` is the hottest function. Double-clicking on the function with show which lines of source code the samples are mostly attributed to the accumulate step of `kernels::matmul_ref()`.
+
+![Arm Performix hotspot results table showing matmul_ref as the dominant runtime function during GPT-2 baseline inference.#center](./code_hotspot_results.webp "Hotspot results highlighting matmul_ref")
+
+This confirms that matrix multiplication is the highest-impact optimization target.
+
+## Assess compiler output
+
+We can use online tools such as [Compiler Explorer](https://godbolt.org/) to conveniently see how this function is being compiled with the `-O2 -g` flags.
+
+
+{{< godbolt width="100%" height="400px" mode="assembly" opt="-O2 -g" src="void matmul_ref(float *out, const float *x, const float *W, const float *b, int n_in, int n_out)\n{\n  for (int i = 0; i < n_out; i++) {\n    float acc = b ? b[i] : 0.f;\n    const float *row = W + (unsigned long long)i * (unsigned long long)n_in;\n    for (int j = 0; j < n_in; j++) {\n      acc += row[j] * x[j];\n    }\n    out[i] = acc;\n  }\n}" >}}
+
+This view helps you spot missed vectorization opportunities. In an optimized build, you would expect the accumulation step to use SIMD instructions, for example `fmla v0.4s, v3.4s, v2.4s` with use of the vector register (`v0->v3`). However, assembly inspection has limitations. First, you need familiarity with SIMD mnemonics to recognize vectorized code. Second, this narrow snippet does not show whether changing compiler flags introduces regressions in other parts of the codebase. Third, and most importantly, this static view does not show which instructions in this function run most often on the CPU.
+
+The Instruction Mix recipe helps fill this gap.
+
+## Configure the Instruction Mix recipe
+
+Open Arm Performix and select the **Instruction Mix** recipe. If this is your first run on the target, complete tool deployment as prompted.
+Set the launch command to your baseline binary with the same runtime arguments used for baseline testing:
+
+```output
+</path/to/GPT-2-Example>/build/gpt2 --model gpt2-medium "Once upon a time" -n 150`
+```
+
+Use the same model and prompt arguments as your baseline terminal run so the measurements are comparable.
+
+![Arm Performix recipe setup screen showing Instruction Mix recipe selected with launch settings configured for the GPT-2 baseline executable.#center](./configuring-performix.webp "Configure Arm Performix Instruction Mix recipe")
+
+### Analyze static disassembly
+
+After the run completes, review static disassembly first. This view is ordered by percentage contribution and provides a high-level profile of the application’s generated instruction stream. It can help you identify broad characteristics, such as whether the code is branch-heavy, dominated by memory operations, or making effective use of SIMD instructions. Use this static view to understand overall code generation patterns rather than to attribute performance to specific functions or source lines. Dynamic analysis is typically more relevant for optimization because it reflects the instructions that are actually executed at runtime.
+
+![Arm Performix static disassembly view showing instruction category breakdown for GPT-2 hot paths, highlighting scalar-heavy sections in baseline matmul code.#center](./static_disassembly.webp "Static disassembly instruction classification")
+
+### Dynamic analysis
+
+Then inspect dynamic analysis bar chart to see where sampled runtime work is concentrated. Dynamic data is typically more useful for optimization because it reflects actual execution behavior for your input, runtime settings, and call frequencies.
+
+![Arm Performix dynamic functions table showing most runtime samples in matmul-related functions for baseline GPT-2 inference.#center](./instruction_mix_dynamic_analysis.webp "Dynamic function sample distribution")
+
+Finally, in dynamic functions, you can break down operation types to individual functions. This is particularly useful when no single function dominates the profile, allowing you to inspect dynamic instruction patterns for specific functions.
+
+## What you've learned and what's next
+
+You used Instruction Mix to confirm that baseline runtime is dominated by scalar-heavy `matmul` execution. Next, you will compare updated instruction mix and throughput across scalar, NEON, SVE, and KleidiAI variants.
diff --git a/...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-4.md b/...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-4.md
@@ -0,0 +1,71 @@
+---
+title: Optimize
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Complete the challenge (optional)
+
+In this project, `src/kernels/matmul_user.cpp` is your editable implementation file. The baseline behavior in this file is scalar, and the build uses `-O2 -g`, so compiler optimization is enabled but vector hardware is still underused in the hot loop.
+
+Use the profiling evidence from Performix to implement your own NEON or SVE intrinsics in `src/kernels/matmul_user.cpp`, then rebuild and profile `gpt2_user`.
+
+{{% notice Hint %}}
+
+Focus on the accumulation loop in `matmul_user` (`acc += row[j] * x[j];`). Think about lane utilization, loop unrolling, and handling the tail when the input width is not an exact multiple of the vector width.
+
+{{% /notice %}}
+
+Rebuild after your edits:
+
+```bash
+cmake -S . -B build -DBUILD_USER_MATMUL=ON
+cmake --build build --parallel
+```
+
+Then profile the `build/gpt2_user` binary with the same runtime arguments and compare the Instruction Mix and throughput against baseline.
+
+Example solutions are available in:
+
+- `src/kernels/matmul_neon.cpp`
+- `src/kernels/matmul_sve.cpp`
+
+You can use `AGENTS.md` in the GPT-2 example repository for guided learning support.
+
+### Use the Arm MCP Server with Performix (optional)
+
+You can also use an MCP-compatible coding assistant, such as GitHub Copilot or Codex, with the Arm MCP Server. This gives the assistant direct tool access to run Performix recipes on your remote Arm target and create a faster feedback loop while you iterate on `matmul_user`.
+
+For setup details, see [Automate x86-to-Arm application migration using Arm MCP Server](/learning-paths/servers-and-cloud-computing/arm-mcp-server/).
+
+Install Docker if needed, then pull the MCP server image:
+
+```bash
+docker pull armlimited/arm-mcp:latest
+```
+
+To allow Performix access to remote targets from inside the container, mount your workspace plus SSH key and known hosts in your Codex MCP configuration (example `~/.codex/config.toml`):
+
+```output
+[mcp_servers.arm-mcp]
+command = "docker"
+args = [
+	"run",
+	"--rm",
+	"-i",
+	"-v", "/path/to/your/workspace:/workspace",
+	"-v", "/path/to/your/ssh/private_key:/run/keys/ssh-key.pem:ro",
+	"-v", "/path/to/your/ssh/known_hosts:/run/keys/known_hosts:ro",
+	"armlimited/arm-mcp"
+]
+```
+
+Restart your coding assistant, then prompt it to run Performix Instruction Mix and Code Hotspots on your `gpt2_user` binary and suggest Arm intrinsics improvements.
+
+![Screenshot of a coding assistant prompt configured to use Arm MCP Server tools for running Performix recipes and analyzing matmul_user optimization opportunities in the GPT-2 workload.#center](./mcp-performix-prompt.webp "Coding assistant prompt for Performix analysis through Arm MCP Server")
+
+## What you've learned and what's next
+
+In this optional section, you implemented and profiled a custom `matmul_user` kernel using the same workflow you used for baseline analysis. Next, you will compare instruction mix and throughput across scalar, NEON, SVE, and KleidiAI variants.