NVIDIA · shangxiaokang · Apr 15, 2026
diff --git a/docs/debug/1_getting_started.rst b/docs/debug/1_getting_started.rst
@@ -149,10 +149,11 @@ Inspecting the logs
 -------------------
 
 
-Let's look at the files with the logs. Two files will be created:
+Let's look at the files with the logs. At least two files will be created:
 
 1. debug logs.
 2. statistics logs.
+3. optional feature-specific logs (for example, AutoswitchGemm metrics).
 
 Let's look inside them!
 
@@ -214,6 +215,51 @@ The second log file (``nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-
     INFO - transformer_layer.self_attention.layernorm_qkv_activation_std                 iteration=000004                  value=0.9996
     INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm             iteration=000004                  value=130776.7969
 
+AutoswitchGemm quick guide
+--------------------------
+
+``AutoswitchGemm`` monitors quantization quality and can dynamically switch selected GEMMs
+to high precision when the configured underflow or MSE thresholds are exceeded.
+
+Minimal config example:
+
+.. code-block:: yaml
+
+    autoswitch_fc_layers:
+      enabled: True
+      layers:
+        layer_types: [fc1, fc2]
+      transformer_engine:
+        AutoswitchGemm:
+          enabled: True
+          gemms: [fprop, dgrad, wgrad]
+          underflow_threshold_pct: 1.0
+          mse_threshold: 1.0e-4
+          # Needed only if the layer uses fp8 model parameters and
+          # you want fprop/dgrad to be able to switch to high precision.
+          allow_fp8_model_params_dequantized_weight: False
+          freq: 1
+
+Behavior summary:
+
+1. For each ``(layer, gemm)``, AutoswitchGemm tracks the latest tensor metrics and applies
+   OR logic across monitored tensors: if any tensor breaches thresholds, that GEMM switches.
+2. Metrics computed in iteration ``n`` are consumed in iteration ``n`` only.
+3. If thresholds are not breached in the current iteration, the GEMM stays quantized.
+
+When AutoswitchGemm is enabled, an additional directory with one log file per rank is created under ``log_dir``:
+
+``nvdlfw_inspect_autoswitchgemm_logs/nvdlfw_inspect_globalrank-<rank>.log``
+
+It contains per-rank, per-iteration metrics such as:
+
+- ``<layer>_<gemm>_<tensor>_underflow_pct``
+- ``<layer>_<gemm>_<tensor>_mse``
+- ``<layer>_<gemm>_quantized_enabled``
+- ``<layer>_<gemm>_disable_until_iter``
+- ``<layer>_<gemm>_switch_blocked_fp8_model_params``
+- ``<layer>_<gemm>_fp8_model_params_dequantized_fallback``
+
 Logging using TensorBoard
 -------------------------
 

diff --git a/docs/debug/2_config_file_structure.rst b/docs/debug/2_config_file_structure.rst
@@ -220,6 +220,28 @@ We can use both structs for tensors and GEMMs. The tensors_struct should be nest
             tensor_feature_param2: value
           gemm_feature_param1: value
 
+AutoswitchGemm notes
+--------------------
+
+``AutoswitchGemm`` supports both global and per-GEMM configuration.
+
+- Use ``gemms: [...]`` for one shared policy.
+- Use ``gemms_struct`` to set per-GEMM thresholds.
+
+If ``tensors``/``tensors_struct`` are omitted, monitored tensors are inferred from GEMMs:
+
+- ``fprop`` -> ``activation``, ``weight``
+- ``dgrad`` -> ``gradient``, ``weight``
+- ``wgrad`` -> ``activation``, ``gradient``
+
+Other important keys and behavior:
+
+- ``underflow_threshold_pct``: switch trigger based on underflow percentage.
+- ``mse_threshold``: switch trigger based on quantization MSE.
+- ``allow_fp8_model_params_dequantized_weight``: allows ``fprop``/``dgrad`` switching
+  for layers with FP8 model parameters by using dequantized temporary weights.
+- Metrics are consumed in the same iteration in which they are computed.
+
 Enabling or Disabling Sections and Features
 -------------------------------------------
 

diff --git a/docs/debug/3_api_features.rst b/docs/debug/3_api_features.rst
@@ -10,6 +10,7 @@ Debug features
 .. autoapiclass:: transformer_engine.debug.features.log_fp8_tensor_stats.LogFp8TensorStats
 .. autoapiclass:: transformer_engine.debug.features.log_nvfp4_tensor_stats.LogNvfp4TensorStats
 .. autoapiclass:: transformer_engine.debug.features.disable_quantization_gemm.DisableQuantizationGEMM
+.. autoapiclass:: transformer_engine.debug.features.autoswitch_gemm.AutoswitchGemm
 .. autoapiclass:: transformer_engine.debug.features.disable_quantization_layer.DisableQuantizationLayer
 .. autoapiclass:: transformer_engine.debug.features.per_tensor_scaling.PerTensorScaling
 .. autoapiclass:: transformer_engine.debug.features.fake_quant.FakeQuant

diff --git a/docs/debug/autoswitch_gemm_example.yaml b/docs/debug/autoswitch_gemm_example.yaml
@@ -0,0 +1,72 @@
+# Example config for transformer_engine.debug.features.autoswitch_gemm.AutoswitchGemm
+#
+# Usage:
+#   import nvdlfw_inspect.api as debug_api
+#   debug_api.initialize(
+#       config_file="docs/debug/autoswitch_gemm_example.yaml",
+#       feature_dirs=["transformer_engine/debug/features"],
+#       log_dir="./log",
+#   )
+#   ...
+#   debug_api.step()  # call once per training step
+
+autoswitch_attention_blocks:
+  enabled: True
+  layers:
+    # Match attention linear layers, e.g. *.qkv / *.proj
+    layer_name_regex_pattern: ".*(qkv|proj).*"
+  transformer_engine:
+    AutoswitchGemm:
+      enabled: True
+
+      # Optional. If omitted, tensors are inferred from selected gemms:
+      # fprop -> [activation, weight], dgrad -> [gradient, weight],
+      # wgrad -> [activation, gradient].
+      tensors: [activation, weight, gradient]
+
+      # Per-GEMM switching policy.
+      gemms_struct:
+        - gemm: fprop
+          underflow_threshold_pct: 1.0
+          mse_threshold: 1.0e-4
+        - gemm: dgrad
+          underflow_threshold_pct: 1.5
+          mse_threshold: 1.5e-4
+        - gemm: wgrad
+          underflow_threshold_pct: 2.0
+          mse_threshold: 2.0e-4
+
+      # For layers with fp8 model parameters:
+      # - False: keep fprop/dgrad quantized
+      # - True: allow high-precision switch via temporary dequantized weights
+      allow_fp8_model_params_dequantized_weight: False
+
+      # Collect metrics every step between start_step (after warmup) and end_step.
+      freq: 1
+      start_step: 10
+      end_step: 5000
+
+
+autoswitch_mlp_blocks:
+  enabled: True
+  layers:
+    layer_types: [fc1, fc2]
+  transformer_engine:
+    AutoswitchGemm:
+      enabled: True
+
+      # Simpler global policy (shared by selected GEMMs).
+      gemms: [fprop, wgrad]
+      tensors: [activation, weight, gradient]
+
+      underflow_threshold_pct: 3.0
+      mse_threshold: 3.0e-4
+
+      # Example sparse monitoring windows.
+      freq: 2
+      start_end_list:
+        - [0, 300]
+        - [800, 3000]
+
+# Autoswitch per-rank metrics are written to:
+#   <log_dir>/nvdlfw_inspect_autoswitchgemm_logs/nvdlfw_inspect_globalrank-<rank>.log