Documentation review #1451

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open · wants to merge 10 commits into master
92 changes: 46 additions & 46 deletions docs/XProf_Explore.ipynb
Original file line number Diff line number Diff line change
@@ -1,40 +1,26 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Explore Xprof Tools"
],
"metadata": {
"id": "54kErBXYL_t5"
}
},
"source": [
"# Explore Xprof Tools"
]
},
{
"cell_type": "markdown",
"source": [
"This is a intial version of the exploration colab, more instructions to be added."
],
"metadata": {
"id": "KvFO8vdBMKOB"
}
},
"source": [
"This notebook demonstrates how to load the XProf tools with Tensorboard."
]
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {
"id": "UouZW4zcr7sY"
},
@@ -46,48 +32,62 @@
},
{
"cell_type": "code",
"source": [
"# git clone the xprof repo so we have access to the demo data there\n",
"!git clone http://github.com/openxla/xprof"
],
"execution_count": null,
"metadata": {
"id": "u6LAn2_VEysw"
},
"execution_count": 8,
"outputs": []
"outputs": [],
"source": [
"# git clone the xprof repo so we have access to the demo data there\n",
"!git clone https://github.com/openxla/xprof"
]
},
{
"cell_type": "code",
"source": [
"# Load the TensorBoard notebook extension.\n",
"%load_ext tensorboard"
],
"execution_count": null,
"metadata": {
"id": "talMdGsuEoGm"
},
"execution_count": 8,
"outputs": []
"outputs": [],
"source": [
"# Load the TensorBoard notebook extension.\n",
"%load_ext tensorboard"
]
},
{
"cell_type": "code",
"source": [
"# Launch TensorBoard and navigate to the Profile tab to view performance profile\n",
"%tensorboard --logdir=xprof/demo"
],
"execution_count": null,
"metadata": {
"id": "6vSJOEThEoNY"
},
"execution_count": 8,
"outputs": []
"outputs": [],
"source": [
"# Launch TensorBoard and navigate to the Profile tab to view performance profile\n",
"%tensorboard --logdir=xprof/demo"
]
},
{
"cell_type": "markdown",
"source": [
"Once tensorboard loads the profile plugin, Use the tools drop down to explore the tools."
],
"metadata": {
"id": "BKcLu97YMV6x"
}
},
"source": [
"Once TensorBoard loads the profile plugin, use the _Tools_ drop-down to select the tool you want to explore."
]
}
]
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
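For reference outside Colab, the three code cells above reduce to two plain commands; a hypothetical helper (the function name is mine, not part of the notebook) that builds them, assuming `git` and `tensorboard` are on your PATH:

```python
import subprocess

def xprof_demo_commands(logdir="xprof/demo"):
    """Plain-shell equivalents of the notebook cells: clone the repo that
    ships the demo profiles, then point TensorBoard at them."""
    return [
        ["git", "clone", "https://github.com/openxla/xprof"],
        ["tensorboard", "--logdir", logdir],
    ]

# Running them requires network access for the clone:
# for cmd in xprof_demo_commands():
#     subprocess.run(cmd, check=True)
```

The `%load_ext tensorboard` / `%tensorboard` magics are notebook-only; from a shell you launch the `tensorboard` binary directly.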
8 changes: 6 additions & 2 deletions docs/_toc.yaml
Original file line number Diff line number Diff line change
@@ -15,6 +15,10 @@ toc:
- heading: XProf developer guide
- title: Getting started
section:
- title: XProf
path: /xprof/index
- title: Overview Page
path: /xprof/overview_page
- title: Graph Viewer
path: /xprof/graph_viewer
- title: HLO Op Profile
@@ -25,8 +29,8 @@
path: /xprof/memory_profile
- title: Memory Viewer
path: /xprof/memory_viewer
- title: Overview Page
path: /xprof/overview_page
- title: GPU Kernel Stats
path: /xprof/gpu_kernel_stats
- title: Trace Viewer
path: /xprof/trace_viewer
- title: Roofline Analysis
40 changes: 21 additions & 19 deletions docs/gpu_kernel_stats.md
Original file line number Diff line number Diff line change
@@ -1,54 +1,56 @@
## GPU Kernel Stats Tool

You can use the GPU Kernel Stats tool to see performance statistics and the
You can use the GPU Kernel Stats tool to visualize performance statistics and the
originating framework operation for every GPU-accelerated kernel that was
launched during a profiling session. This tool helps identify potential
bottlenecks at a low level and surfaces optimization opportunities.

### Supported Platforms
![Tensorboard GPU Kernel Stats](images/gpu_kernel_stats.png)

GPU: Supported
### Supported Platforms

TPU: Not supported
The GPU Kernel Stats tool is only supported on GPUs.

### Interface Components

The GPU Kernel Stats tool interface is a table with one row for each unique
kernel-framework operation pair. You can click any of the column headings to
sort the table accordingly. The default order is based on the total duration of
the kernel-op pair.
the kernel-op pair. Search boxes let you filter rows by GPU Kernel Name or by
Op Name. You can export the table to a CSV file by clicking the "Export as CSV"
button.

The table includes the following information for each kernel-op pair:

* Kernel name: The name of the kernel that was launched.
* Registers per thread: The number of GPU
* **Kernel name**: The name of the kernel that was launched.
* **Registers per thread**: The number of GPU
[registers](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-multithreading)
used by the kernel per thread.
* Shared memory used: The total size of
* **Shared memory used**: The total size of
[shared memory](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#shared-memory)
used by the kernel in bytes.
* Block dimension: The dimensions of the thread block expressed as blockDim.x,
* **Block dimension**: The dimensions of the thread block expressed as blockDim.x,
blockDim.y, blockDim.z.
* Grid dimensions: The dimensions of the grid of thread blocks expressed as
* **Grid dimensions**: The dimensions of the grid of thread blocks expressed as
gridDim.x, gridDim.y, gridDim.z.
* Theoretical occupancy: The theoretical
* **Theoretical occupancy**: The theoretical
[occupancy](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy)
of the GPU expressed as a percentage. This indicates the ability of the
shared memory to hide latencies.
* Kernel uses Tensor Core: A heuristic indicating whether the kernel itself
* **Kernel uses Tensor Core**: A heuristic indicating whether the kernel itself
contains Tensor Core instructions, based on the presence of common Tensor
Core instructions.
* Tensor Cores eligibility: A heuristic indicating whether the originating
* **Tensor Cores eligibility**: A heuristic indicating whether the originating
framework operation is eligible to use Tensor Cores, based on commonly
occurring ops that employ the Tensor Core.
* Op name: The name of the framework operation that launched this kernel.
* Occurrences: The number of times this specific kernel-operation pair was
* **Op name**: The name of the framework operation that launched this kernel.
* **Occurrences**: The number of times this specific kernel-operation pair was
executed during the profiling period.
* Total duration (us): The cumulative sum of the execution time of all
* **Total duration (us)**: The cumulative sum of the execution time of all
occurrences of this kernel-operation pair.
* Average duration (us): The average execution time across all occurrences of
* **Average duration (us)**: The average execution time across all occurrences of
this kernel-operation pair.
* Minimum duration (us): The shortest execution time observed for this
* **Minimum duration (us)**: The shortest execution time observed for this
kernel-operation pair.
* Maximum duration (us): The longest execution time observed for this
* **Maximum duration (us)**: The longest execution time observed for this
kernel-operation pair.
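Since the table exports to CSV, the kernel-op pairs can also be ranked offline; a minimal sketch (the exact column names `Kernel name`, `Op name`, and `Total duration (us)` are assumptions based on the list above — check the header of your actual export):

```python
import csv
from io import StringIO

def top_kernel_op_pairs(csv_text, n=3):
    """Return the n kernel-op pairs with the largest total duration."""
    rows = list(csv.DictReader(StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["Total duration (us)"]), reverse=True)
    return [(r["Kernel name"], r["Op name"]) for r in rows[:n]]
```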
40 changes: 20 additions & 20 deletions docs/graph_viewer.md
Original file line number Diff line number Diff line change
@@ -3,55 +3,55 @@
You can use Graph Viewer to visualize the graph structure of your XLA program.
It displays the High Level Operations (HLO) graph.

### Supported Platforms
![Graph Viewer](images/graph_viewer.png)

TPU: supported
### Supported Platforms

GPU: supported
Both TPU and GPU are supported.

### Interface Components

Graph Viewer can be controlled with the following options:
1. A Select XLA
Module dropdown menu, from which you choose an HLO module from the available
list for visualization.

2. An Op Name box, where you enter the name of an
1. An **Op Name** box, where you enter the name of an
operation to select it as the primary node and visualize its neighbors. Pressing
enter after typing the op name will visualize the surrounding nodes.

3. A Graph
Width selector, which controls the maximum distance of nodes from the primary
node that is included in the visualization.
1. A **Graph Width** selector, which
controls the maximum distance of nodes from the primary node that is included in
the visualization.

4. A Metadata selector, which you
1. A **Metadata** selector, which you
use to toggle the display of operation metadata included by the compiler, if
any. Note that you need to click “Search Graph” again after selecting this
checkbox.

5. A Merge Fusion checkbox that you can use to collapse or expand the
1. A **Merge Fusion** checkbox that you can use to collapse or expand the
components of fused ops. Note that you will need to click “Search Graph” again
after toggling this checkbox.

6. Zoom in/out buttons or keys, though you can
1. Zoom in/out buttons or keys, though you can
also use your mouse to zoom.

7. Links to download the graph as SVG or dot files.
1. Links to download the graph as SVG or dot files.

8. By using the search functionality, you can quickly locate and focus on
1. By using the search functionality, you can quickly locate and focus on
specific operations within the potentially large graph.

9. To freeze the runtime
1. To freeze the runtime
data panel on a particular op, right click on the op.
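A downloaded dot file can also be inspected offline without Graph Viewer; a stdlib-only sketch, assuming the common dot layout where each node is declared as `name [attrs]` on its own line:

```python
import re

def dot_node_names(dot_text):
    """Collect node identifiers such as 'reduce.111' from a dot graph,
    skipping edge lines like 'a -> b [color=red]'."""
    names = []
    for line in dot_text.splitlines():
        # A node line starts with an identifier followed directly by '[attrs]';
        # edge lines have '->' between the identifier and the bracket.
        m = re.match(r"\s*([\w.\-]+)\s*\[", line)
        if m:
            names.append(m.group(1))
    return names
```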

### Using Graph Viewer

By default, no operations are selected in Graph Viewer, so you will see an empty
screen. Search for the operation of interest to zoom into it and begin
traversing the graph. Often, you start with one of the other tools (say, Op
Profile to identify the most time consuming op, or Trace Viewer to identify the
cause of a pipeline bubble). Clicking the op in those tools will give you a
direct link into the same op within Graph Viewer.
traversing the graph. Often, you start with one of the other tools (say,
[Op Profile](hlo_op_profile.md) to identify the most time consuming op, or
[Trace Viewer](trace_viewer.md) to identify the cause of a pipeline bubble).
Clicking the op in those tools will give you a direct link into the same op
within Graph Viewer.

![Graph Viewer button appears when the reduce.111 operation is clicked on the HLO Op Profile tool](images/graph_viewer_from_op_profile.png)

By hovering over a box (representing an HLO operation), you might often see the
line of your user code corresponding to the op where the function was defined.
10 changes: 6 additions & 4 deletions docs/hlo_op_profile.md
Original file line number Diff line number Diff line change
@@ -4,11 +4,11 @@ You can use the HLO Op Profile tool to understand hardware performance for
different categories of High Level Operation (HLO) ops executed during the
profiling period.

### Supported Platforms
![HLO Op Profile](images/hlo_op_profile.png)

TPU: Supported
### Supported Platforms

GPU: Supported
Both TPU and GPU are supported.

### Using the HLO Op Profile tool

@@ -36,7 +36,7 @@ HLO Op Profile has the following components:
displaying more details about the op or the operation category. Clicking a
table entry pins the op detail card. These cards typically include the
following, as appropriate:
* A link to the op in the Graph Viewer tool.
* A link to the op in the [Graph Viewer](graph_viewer.md) tool.
* Average execution time.
* Absolute rates of usage (in contrast with utilization reported in the
table) of the compute (TFLOP/s), HBM bandwidth (GB/s), and on-chip read
@@ -47,6 +47,8 @@ HLO Op Profile has the following components:
* The number of occurrences of the op, and total time spent on the op in
aggregate.

![HLO Op Profile details for a loop fusion op](images/hlo_op_profile_details.png)

Note that raw bandwidths (GB/s) or compute rates (TFLOP/s) are computed by
combining static compiler data on FLOPs or bytes required for the op (the
numerator) with duration information per op from the profile (the denominator).
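That numerator/denominator split is simple arithmetic; a sketch with illustrative numbers (the FLOP and byte counts below are made up, not taken from a real profile):

```python
def achieved_rates(flops, bytes_accessed, duration_us):
    """Achieved compute rate (TFLOP/s) and memory bandwidth (GB/s) for one op:
    compiler-reported FLOPs/bytes divided by the profiled duration."""
    duration_s = duration_us * 1e-6
    return flops / duration_s / 1e12, bytes_accessed / duration_s / 1e9

# e.g. 2 GFLOP and 4 MB moved in 100 us:
tflops, gbps = achieved_rates(2e9, 4e6, 100)
```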