Bug Fix - Fix NVBandwidth benchmark results parsing bug (#748)#782
Bug Fix - Fix NVBandwidth benchmark results parsing bug (#748)#782NJX-njx wants to merge 2 commits into
Conversation
…ency_sm_sum_None - Add validation before adding SUM results: only add when test_name and benchmark_type are valid - Use test name from SUM line as fallback when parse_status is stale (e.g., after waived tests) - Infer benchmark_type from test name when not set (latency tests -> lat, else bw) Fixes microsoft#748 Made-with: Cursor
There was a problem hiding this comment.
Pull request overview
Fixes NVBandwidth SUM-line parsing to prevent invalid metric keys (e.g., _sum_None) by making SUM parsing more resilient when parsing state is missing or stale.
Changes:
- Use the SUM line’s captured test name as a fallback when
parse_status['test_name']is empty. - Infer
benchmark_typefrom the test name when it wasn’t detected from the matrix header. - Guard SUM metric emission to only occur when both
test_nameandbenchmark_typeare valid.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if self.re_summary_pattern.match(line): | ||
| value = self.re_summary_pattern.match(line).group(2) | ||
| test_name = parse_status['test_name'] | ||
| match = self.re_summary_pattern.match(line) | ||
| value = match.group(2) |
There was a problem hiding this comment.
re_summary_pattern.match(line) is evaluated twice (once in the if and again to assign match). Consider assigning once (e.g., match = ...; if match:) to avoid duplicated work and keep the control flow simpler.
There was a problem hiding this comment.
@NJX-njx Could you please resolve the comments from Copilot? Thanks.
| # Use test_name from parse_status, fallback to group(1) from SUM line | ||
| test_name = parse_status['test_name'] or match.group(1).lower() | ||
| benchmark_type = parse_status['benchmark_type'] | ||
| parse_status['results'][f'{test_name}_sum_{benchmark_type}'] = float(value) | ||
| # Infer benchmark_type from test_name if not set (e.g., after waived tests) | ||
| if benchmark_type is None: | ||
| benchmark_type = 'lat' if 'latency' in test_name else 'bw' | ||
| # Only add result when we have valid metric name (avoid _sum_None or sum_None) | ||
| if test_name and benchmark_type: | ||
| parse_status['results'][f'{test_name}_sum_{benchmark_type}'] = float(value) |
There was a problem hiding this comment.
The new fallback/inference paths for SUM parsing (using match.group(1) when parse_status['test_name'] is empty, and inferring benchmark_type when it's None) aren’t covered by existing nvbandwidth tests (current tests only exercise the full header+matrix path). Adding a unit test with a minimal raw output containing a SUM ... line but no prior matrix header (and/or no preceding Running ...) would help prevent regressions of the _sum_None metric-name bug.
| @@ -190,10 +190,17 @@ def _process_raw_line(self, line, parse_status): | |||
|
|
|||
There was a problem hiding this comment.
re_summary_pattern.match(line) is still evaluated twice — once in this if, then again on the line below to assign match. Copilot raised this in #discussion_r2878517701 and @guoshzhao explicitly asked for it to be addressed in #discussion_r3070415674; please collapse the two calls:
match = self.re_summary_pattern.match(line)
if match:
value = match.group(2)
...| test_name = parse_status['test_name'] or match.group(1).lower() | ||
| benchmark_type = parse_status['benchmark_type'] | ||
| parse_status['results'][f'{test_name}_sum_{benchmark_type}'] = float(value) | ||
| # Infer benchmark_type from test_name if not set (e.g., after waived tests) |
There was a problem hiding this comment.
This fallback fixes the SUM symptom but masks the true root cause of device_to_device_latency_sm_sum_None from issue #748.
Upstream nvbandwidth prints the matrix header for this test as:
Device to Device Latency SM GPU(row) <-> GPU(column) (ns)
(see third_party/nvbandwidth/testcases_sm.cpp:141). That line does NOT start with memcpy or memory latency, so re_matrix_header_line at line 19 never fires for this test. As a consequence:
parse_status['benchmark_type']staysNone.metrix_row/metrix_colare never extracted.- The matrix-row branch at line 167 is guarded by
parse_status['benchmark_type']and therefore silently skips all per-cell results fordevice_to_device_latency_sm.
With this PR, the SUM is rescued via the 'latency' in test_name heuristic, but the per-cell device_to_device_latency_sm_gpu0_gpu1_lat (etc.) results are still lost — users will only see the SUM, not the matrix.
A more complete fix is to broaden the header detector itself, e.g.:
re_matrix_header_line = re.compile(r'^(memcpy|memory latency|Device to Device Latency)', re.IGNORECASE)The existing branch 'bw' if 'bandwidth' in line else 'lat' already classifies correctly for both new and old header lines, so the fallback + inference here can stay as a defense-in-depth net while the regex change recovers the lost matrix data.
| parse_status['results'][f'{test_name}_sum_{benchmark_type}'] = float(value) | ||
| # Infer benchmark_type from test_name if not set (e.g., after waived tests) | ||
| if benchmark_type is None: | ||
| benchmark_type = 'lat' if 'latency' in test_name else 'bw' |
There was a problem hiding this comment.
When this guard drops a SUM line (either test_name is empty or benchmark_type is still falsy after inference), nothing is logged. That trades a malformed metric name for silent data loss, which makes the next regression (a new upstream output format change) hard to detect from a passing run.
Please emit a warning here so the drop is observable, e.g.:
if test_name and benchmark_type:
parse_status['results'][f'{test_name}_sum_{benchmark_type}'] = float(value)
else:
logger.warning(
f'nvbandwidth: skipping SUM line, incomplete parse state '
f'(test_name={test_name!r}, benchmark_type={benchmark_type!r}, line={line!r})'
)The warning also gives users a clear signal in the runner logs when a future nvbandwidth release adds a test whose header line doesn't match the existing detection patterns.
| value = self.re_summary_pattern.match(line).group(2) | ||
| test_name = parse_status['test_name'] | ||
| match = self.re_summary_pattern.match(line) | ||
| value = match.group(2) |
There was a problem hiding this comment.
None of the three new behaviors here (or match.group(1).lower() fallback, 'latency' in test_name inference, if test_name and benchmark_type guard) is exercised by the existing test suite — tests/data/nvbandwidth_results.log and test_nvbandwidth.py::test_nvbandwidth_result_parsing_real_output only cover the happy memcpy ... / memory latency ... header path. Without a regression test, a future refactor can silently undo the fix from issue #748.
Copilot already flagged this in #discussion_r2878517758; please add a unit test using a synthetic raw output that triggers each new path. Minimal example exercising the device_to_device_latency_sm regression:
def test_nvbandwidth_sum_fallback_for_unknown_header(self):
benchmark_name = 'nvbandwidth'
(cls, _) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, Platform.CUDA)
benchmark = cls(benchmark_name, parameters='')
assert benchmark._preprocess()
raw_output = (
'Running device_to_device_latency_sm.\n'
'Device to Device Latency SM GPU(row) <-> GPU(column) (ns)\n'
' 0 1\n'
' 0 772.58 810.10\n'
' 1 810.10 772.58\n'
'\n'
'SUM device_to_device_latency_sm 3165.36\n'
)
assert benchmark._process_raw_result(0, raw_output)
assert benchmark.result['device_to_device_latency_sm_sum_lat'][0] == 3165.36
# \u2191 with the suggested regex broadening (see [comment on L198](https://github.com/microsoft/superbenchmark/pull/782#discussion_r3326258884)), per-cell metrics should also be present.A second case covering the test_name == '' and benchmark_type is None drop path (asserting the SUM is skipped and a warning is logged) would also be valuable, especially if the logger.warning suggestion on L200 is adopted.
|
Hi @NJX-njx, could address the current comments? We are planning to merge this PR. Thanks! |
Summary
Fixes #748 - NVBandwidth benchmark results parsing bug causing invalid metric names like _sum_None\ and \device_to_device_latency_sm_sum_None.
Root Cause
When parsing SUM lines, \parse_status['test_name']\ or \parse_status['benchmark_type']\ could be empty/None (e.g., after waived tests or when header parsing fails), leading to invalid metric keys.
Changes
Testing