Documentation review #1451

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open · wants to merge 10 commits into master
92 changes: 46 additions & 46 deletions docs/XProf_Explore.ipynb
Original file line number Diff line number Diff line change
@@ -1,40 +1,26 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Explore Xprof Tools"
],
"metadata": {
"id": "54kErBXYL_t5"
}
},
"source": [
"# Explore Xprof Tools"
]
},
{
"cell_type": "markdown",
"source": [
"This is a intial version of the exploration colab, more instructions to be added."
],
"metadata": {
"id": "KvFO8vdBMKOB"
}
},
"source": [
"This notebook demonstrates how to load the XProf tools with Tensorboard."
]
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {
"id": "UouZW4zcr7sY"
},
@@ -46,48 +32,62 @@
},
{
"cell_type": "code",
"source": [
"# git clone the xprof repo so we have access to the demo data there\n",
"!git clone http://github.com/openxla/xprof"
],
"execution_count": null,
"metadata": {
"id": "u6LAn2_VEysw"
},
"execution_count": 8,
"outputs": []
"outputs": [],
"source": [
"# git clone the xprof repo so we have access to the demo data there\n",
"!git clone https://github.com/openxla/xprof"
]
},
{
"cell_type": "code",
"source": [
"# Load the TensorBoard notebook extension.\n",
"%load_ext tensorboard"
],
"execution_count": null,
"metadata": {
"id": "talMdGsuEoGm"
},
"execution_count": 8,
"outputs": []
"outputs": [],
"source": [
"# Load the TensorBoard notebook extension.\n",
"%load_ext tensorboard"
]
},
{
"cell_type": "code",
"source": [
"# Launch TensorBoard and navigate to the Profile tab to view performance profile\n",
"%tensorboard --logdir=xprof/demo"
],
"execution_count": null,
"metadata": {
"id": "6vSJOEThEoNY"
},
"execution_count": 8,
"outputs": []
"outputs": [],
"source": [
"# Launch TensorBoard and navigate to the Profile tab to view performance profile\n",
"%tensorboard --logdir=xprof/demo"
]
},
{
"cell_type": "markdown",
"source": [
"Once tensorboard loads the profile plugin, Use the tools drop down to explore the tools."
],
"metadata": {
"id": "BKcLu97YMV6x"
}
},
"source": [
"Once TensorBoard loads the profile plugin, use the _Tools_ drop-down to select the tool you want to explore."
]
}
]
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
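For reference outside Colab, the three code cells above reduce to two plain commands; a hypothetical helper (the function name is mine, not part of the notebook) that builds them, assuming `git` and `tensorboard` are on your PATH:

```python
import subprocess

def xprof_demo_commands(logdir="xprof/demo"):
    """Plain-shell equivalents of the notebook cells: clone the repo that
    ships the demo profiles, then point TensorBoard at them."""
    return [
        ["git", "clone", "https://github.com/openxla/xprof"],
        ["tensorboard", "--logdir", logdir],
    ]

# Running them requires network access for the clone:
# for cmd in xprof_demo_commands():
#     subprocess.run(cmd, check=True)
```

The `%load_ext tensorboard` / `%tensorboard` magics are notebook-only; from a shell you launch the `tensorboard` binary directly.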
8 changes: 6 additions & 2 deletions docs/_toc.yaml
Original file line number Diff line number Diff line change
@@ -15,6 +15,10 @@ toc:
- heading: XProf developer guide
- title: Getting started
section:
- title: XProf
path: /xprof/index
- title: Overview Page
path: /xprof/overview_page
- title: Graph Viewer
path: /xprof/graph_viewer
- title: HLO Op Profile
@@ -25,8 +29,8 @@
path: /xprof/memory_profile
- title: Memory Viewer
path: /xprof/memory_viewer
- title: Overview Page
path: /xprof/overview_page
- title: GPU Kernel Stats
path: /xprof/gpu_kernel_stats
- title: Trace Viewer
path: /xprof/trace_viewer
- title: Roofline Analysis
40 changes: 21 additions & 19 deletions docs/gpu_kernel_stats.md
Original file line number Diff line number Diff line change
@@ -1,54 +1,56 @@
## GPU Kernel Stats Tool

You can use the GPU Kernel Stats tool to see performance statistics and the
You can use the GPU Kernel Stats tool to visualize performance statistics and the
originating framework operation for every GPU-accelerated kernel that was
launched during a profiling session. This tool helps identify potential
bottlenecks at a low level and surfaces optimization opportunities.

### Supported Platforms
![Tensorboard GPU Kernel Stats](images/gpu_kernel_stats.png)

GPU: Supported
### Supported Platforms

TPU: Not supported
The GPU Kernel Stats tool is only supported on GPUs.

### Interface Components

The GPU Kernel Stats tool interface is a table with one row for each unique
kernel-framework operation pair. You can click any of the column headings to
sort the table accordingly. The default order is based on the total duration of
the kernel-op pair.
the kernel-op pair. Search boxes let you filter rows by GPU Kernel Name or by
Op Name. You can export the table to a CSV file by clicking the "Export as CSV"
button.

The table includes the following information for each kernel-op pair:

* Kernel name: The name of the kernel that was launched.
* Registers per thread: The number of GPU
* **Kernel name**: The name of the kernel that was launched.
* **Registers per thread**: The number of GPU
[registers](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-multithreading)
used by the kernel per thread.
* Shared memory used: The total size of
* **Shared memory used**: The total size of
[shared memory](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#shared-memory)
used by the kernel in bytes.
* Block dimension: The dimensions of the thread block expressed as blockDim.x,
* **Block dimension**: The dimensions of the thread block expressed as blockDim.x,
blockDim.y, blockDim.z.
* Grid dimensions: The dimensions of the grid of thread blocks expressed as
* **Grid dimensions**: The dimensions of the grid of thread blocks expressed as
gridDim.x, gridDim.y, gridDim.z.
* Theoretical occupancy: The theoretical
* **Theoretical occupancy**: The theoretical
[occupancy](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
of the GPU expressed as a percentage. This indicates the ability of the
shared memory to hide latencies.
* Kernel uses Tensor Core: A heuristic indicating whether the kernel itself
* **Kernel uses Tensor Core**: A heuristic indicating whether the kernel itself
contains Tensor Core instructions, based on the presence of common Tensor
Core instructions.
* Tensor Cores eligibility: A heuristic indicating whether the originating
* **Tensor Cores eligibility**: A heuristic indicating whether the originating
framework operation is eligible to use Tensor Cores, based on commonly
occurring ops that employ the Tensor Core.
* Op name: The name of the framework operation that launched this kernel.
* Occurrences: The number of times this specific kernel-operation pair was
* **Op name**: The name of the framework operation that launched this kernel.
* **Occurrences**: The number of times this specific kernel-operation pair was
executed during the profiling period.
* Total duration (us): The cumulative sum of the execution time of all
* **Total duration (us)**: The cumulative sum of the execution time of all
occurrences of this kernel-operation pair.
* Average duration (us): The average execution time across all occurrences of
* **Average duration (us)**: The average execution time across all occurrences of
this kernel-operation pair.
* Minimum duration (us): The shortest execution time observed for this
* **Minimum duration (us)**: The shortest execution time observed for this
kernel-operation pair.
* Maximum duration (us): The longest execution time observed for this
* **Maximum duration (us)**: The longest execution time observed for this
kernel-operation pair.
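Since the table exports to CSV, the kernel-op pairs can also be ranked offline; a minimal sketch (the exact column names `Kernel name`, `Op name`, and `Total duration (us)` are assumptions based on the list above — check the header of your actual export):

```python
import csv
from io import StringIO

def top_kernel_op_pairs(csv_text, n=3):
    """Return the n kernel-op pairs with the largest total duration."""
    rows = list(csv.DictReader(StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["Total duration (us)"]), reverse=True)
    return [(r["Kernel name"], r["Op name"]) for r in rows[:n]]
```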
40 changes: 20 additions & 20 deletions docs/graph_viewer.md
Original file line number Diff line number Diff line change
@@ -3,55 +3,55 @@
You can use Graph Viewer to visualize the graph structure of your XLA program.
It displays the High Level Operations (HLO) graph.

### Supported Platforms
![Graph Viewer](images/graph_viewer.png)

TPU: supported
### Supported Platforms

GPU: supported
Both TPU and GPU are supported.

### Interface Components

Graph Viewer can be controlled with the following options:
1. A Select XLA
Module dropdown menu, from which you choose an HLO module from the available
list for visualization.

2. An Op Name box, where you enter the name of an
1. An **Op Name** box, where you enter the name of an
operation to select it as the primary node and visualize its neighbors. Pressing
enter after typing the op name will visualize the surrounding nodes.

3. A Graph
Width selector, which controls the maximum distance of nodes from the primary
node that is included in the visualization.
1. A **Graph Width** selector, which
controls the maximum distance of nodes from the primary node that is included in
the visualization.

4. A Metadata selector, which you
1. A **Metadata** selector, which you
use to toggle the display of operation metadata included by the compiler, if
any. Note that you need to click “Search Graph” again after selecting this
checkbox.

5. A Merge Fusion checkbox that you can use to collapse or expand the
1. A **Merge Fusion** checkbox that you can use to collapse or expand the
components of fused ops. Note that you will need to click “Search Graph” again
after toggling this checkbox.

6. Zoom in/out buttons or keys, though you can
1. Zoom in/out buttons or keys, though you can
also use your mouse to zoom.

7. Links to download the graph as SVG or dot files.
1. Links to download the graph as SVG or dot files.

8. By using the search functionality, you can quickly locate and focus on
1. By using the search functionality, you can quickly locate and focus on
specific operations within the potentially large graph.

9. To freeze the runtime
1. To freeze the runtime
data panel on a particular op, right click on the op.
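A downloaded dot file can also be inspected offline without Graph Viewer; a stdlib-only sketch, assuming the common dot layout where each node is declared as `name [attrs]` on its own line:

```python
import re

def dot_node_names(dot_text):
    """Collect node identifiers such as 'reduce.111' from a dot graph,
    skipping edge lines like 'a -> b [color=red]'."""
    names = []
    for line in dot_text.splitlines():
        # A node line starts with an identifier followed directly by '[attrs]';
        # edge lines have '->' between the identifier and the bracket.
        m = re.match(r"\s*([\w.\-]+)\s*\[", line)
        if m:
            names.append(m.group(1))
    return names
```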

### Using Graph Viewer

By default, no operations are selected in Graph Viewer, so you will see an empty
screen. Search for the operation of interest to zoom into it and begin
traversing the graph. Often, you start with one of the other tools (say, Op
Profile to identify the most time consuming op, or Trace Viewer to identify the
cause of a pipeline bubble). Clicking the op in those tools will give you a
direct link into the same op within Graph Viewer.
traversing the graph. Often, you start with one of the other tools (say,
[Op Profile](hlo_op_profile.md) to identify the most time consuming op, or
[Trace Viewer](trace_viewer.md) to identify the cause of a pipeline bubble).
Clicking the op in those tools will give you a direct link into the same op
within Graph Viewer.

![Graph Viewer button appears when the reduce.111 operation is clicked on the HLO Op Profile tool](images/graph_viewer_from_op_profile.png)

By hovering over a box (representing an HLO operation), you might often see the
line of your user code corresponding to the op where the function was defined.
10 changes: 6 additions & 4 deletions docs/hlo_op_profile.md
Original file line number Diff line number Diff line change
@@ -4,11 +4,11 @@ You can use the HLO Op Profile tool to understand hardware performance for
different categories of High Level Operation (HLO) ops executed during the
profiling period.

### Supported Platforms
![HLO Op Profile](images/hlo_op_profile.png)

TPU: Supported
### Supported Platforms

GPU: Supported
Both TPU and GPU are supported.

### Using the HLO Op Profile tool

@@ -36,7 +36,7 @@ HLO Op Profile has the following components:
displaying more details about the op or the operation category. Clicking a
table entry pins the op detail card. These cards typically include the
following, as appropriate:
* A link to the op in the Graph Viewer tool.
* A link to the op in the [Graph Viewer](graph_viewer.md) tool.
* Average execution time.
* Absolute rates of usage (in contrast with utilization reported in the
table) of the compute (TFLOP/s), HBM bandwidth (GB/s), and on-chip read
@@ -47,6 +47,8 @@ HLO Op Profile has the following components:
* The number of occurrences of the op, and total time spent on the op in
aggregate.

![HLO Op Profile details for a loop fusion op](images/hlo_op_profile_details.png)

Note that raw bandwidths (GB/s) or compute rates (TFLOP/s) are computed by
combining static compiler data on FLOPs or bytes required for the op (the
numerator) with duration information per op from the profile (the denominator).
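That numerator/denominator split is simple arithmetic; a sketch with illustrative numbers (the FLOP and byte counts below are made up, not taken from a real profile):

```python
def achieved_rates(flops, bytes_accessed, duration_us):
    """Achieved compute rate (TFLOP/s) and memory bandwidth (GB/s) for one op:
    compiler-reported FLOPs/bytes divided by the profiled duration."""
    duration_s = duration_us * 1e-6
    return flops / duration_s / 1e12, bytes_accessed / duration_s / 1e9

# e.g. 2 GFLOP and 4 MB moved in 100 us:
tflops, gbps = achieved_rates(2e9, 4e6, 100)
```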