add monitor #1360
Code Review
This pull request introduces lightllm_monitor.py, a real-time terminal dashboard for monitoring LightLLM metrics using the rich library. Feedback on the implementation highlights a critical bug where filtering metrics ending in _count or _sum inadvertently discards independent counters like lightllm_request_count and lightllm_batch_inference_count. Additionally, reviewers recommended using a regex-based parser to robustly handle commas in Prometheus label values, and extracting a shared calculate_rate helper function to eliminate duplicate rate calculation logic and handle counter resets gracefully.
    if name.endswith("_bucket"):
        base = name[: -len("_bucket")]
        le = labels.get("le", "+Inf")
        le_f = float("inf") if le == "+Inf" else float(le)
        hist[base][le_f] = hist[base].get(le_f, 0.0) + value
    elif name.endswith("_sum") or name.endswith("_count"):
        continue  # quantiles are computed from buckets, rates from the base counter; sum/count are not needed
    else:
        scalars[name] += value
Critical Bug: The check name.endswith("_sum") or name.endswith("_count") incorrectly filters out actual independent counters like lightllm_request_count (displayed as req total) and lightllm_batch_inference_count (displayed as infer steps). As a result, these metrics will always display as — on the dashboard.
Since keeping the histogram _sum and _count metrics in the scalars dictionary is harmless (they are simply ignored during rendering), you can safely remove this elif block entirely to ensure all independent counters are correctly aggregated.
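The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the code from lightllm_monitor.py: it assumes samples arrive as (name, labels, value) tuples, and the `aggregate` function, its `skip_sum_count` flag, and the `lightllm_request_duration` histogram name are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def aggregate(samples, skip_sum_count):
    """Aggregate parsed samples into scalars and histogram buckets.

    `samples` is an iterable of (name, labels, value) tuples. This shape,
    the function name, and the `skip_sum_count` flag are assumptions made
    for illustration only.
    """
    scalars = defaultdict(float)
    hist = defaultdict(dict)
    for name, labels, value in samples:
        if name.endswith("_bucket"):
            base = name[: -len("_bucket")]
            le = labels.get("le", "+Inf")
            le_f = float("inf") if le == "+Inf" else float(le)
            hist[base][le_f] = hist[base].get(le_f, 0.0) + value
        elif skip_sum_count and name.endswith(("_sum", "_count")):
            continue  # the suffix filter under review
        else:
            scalars[name] += value
    return scalars, hist


samples = [
    ("lightllm_request_count", {}, 42.0),
    ("lightllm_batch_inference_count", {}, 7.0),
    # Hypothetical histogram series, used only to show the bucket path.
    ("lightllm_request_duration_bucket", {"le": "+Inf"}, 42.0),
    ("lightllm_request_duration_count", {}, 42.0),
]

filtered, _ = aggregate(samples, skip_sum_count=True)
unfiltered, _ = aggregate(samples, skip_sum_count=False)

print("lightllm_request_count" in filtered)    # False
print("lightllm_request_count" in unfiltered)  # True
```

With the filter enabled, the independent counter never reaches `scalars`, which matches the placeholder the dashboard shows. With it disabled, the histogram's `_count` series also lands in `scalars`, but nothing reads it.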
    if name.endswith("_bucket"):
        base = name[: -len("_bucket")]
        le = labels.get("le", "+Inf")
        le_f = float("inf") if le == "+Inf" else float(le)
        hist[base][le_f] = hist[base].get(le_f, 0.0) + value
    else:
        scalars[name] += value

    labels = {}
    if labels_str:
        for kv in labels_str.split(","):
            if "=" in kv:
                k, v = kv.split("=", 1)
                labels[k.strip()] = v.strip().strip('"')
Robustness Issue: Splitting labels_str by , to parse Prometheus labels will fail if any label value contains a comma (e.g., a prompt or metadata string containing a comma).
Using a regular expression to extract key="value" pairs is much more robust and correctly handles commas and escaped quotes within label values.
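To make the difference concrete, here is a small side-by-side sketch of the two parsing approaches. The helper names `parse_labels_split` and `parse_labels_regex` and the sample label string are hypothetical. The pattern is the one proposed in this review, and the snippet only covers unescaping of `\"` and `\\`.

```python
import re

# Same pattern as the regex suggested in the review: a label name, "=",
# and a double-quoted value that may contain escaped characters.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"((?:[^"\\]|\\.)*)"')


def parse_labels_split(labels_str):
    """Naive parser: split on commas, then on the first '='."""
    labels = {}
    for kv in labels_str.split(","):
        if "=" in kv:
            k, v = kv.split("=", 1)
            labels[k.strip()] = v.strip().strip('"')
    return labels


def parse_labels_regex(labels_str):
    """Regex parser: match whole key="value" pairs."""
    labels = {}
    for k, v in LABEL_RE.findall(labels_str):
        labels[k] = v.replace('\\"', '"').replace("\\\\", "\\")
    return labels


# Hypothetical label string whose second value contains a comma.
raw = 'le="0.5",note="a, b"'

print(parse_labels_split(raw))  # {'le': '0.5', 'note': 'a'}
print(parse_labels_regex(raw))  # {'le': '0.5', 'note': 'a, b'}
```

The split-based parser silently truncates the value at the comma rather than raising, so the problem would only surface as wrong or merged label sets downstream.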
    labels = {}
    if labels_str:
        import re
        for k, v in re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"((?:[^"\\]|\\.)*)"', labels_str):
            labels[k] = v.replace('\\"', '"').replace('\\\\', '\\')

    def fmt_float(x, prec=1):
        if x is None:
            return "[dim]—[/dim]"
        return f"{x:.{prec}f}"
Improvement: To avoid duplicating the rate calculation logic between build_panel and main(), and to robustly handle counter resets (e.g., when the LightLLM server restarts and counters reset to 0), we can introduce a shared calculate_rate helper function.
    def fmt_float(x, prec=1):
        if x is None:
            return "[dim]—[/dim]"
        return f"{x:.{prec}f}"

    def calculate_rate(name, prev, scalars, now):
        if name in prev and name in scalars:
            pv, pt = prev[name]
            dt = now - pt
            if dt > 0:
                diff = scalars[name] - pv
                return diff / dt if diff >= 0 else 0.0
        return None
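The reset handling in the suggested helper can be checked with a few hand-picked samples. This sketch copies `calculate_rate` from the suggestion above and calls it with made-up values and timestamps; it assumes `prev` maps each metric name to a (value, timestamp) pair, as the unpacking in the helper implies.

```python
def calculate_rate(name, prev, scalars, now):
    """Per-second rate of a counter between two scrapes.

    `prev` maps metric name -> (previous value, previous timestamp).
    Returns None when no rate can be computed, and 0.0 when the counter
    went backwards (treated here as a reset).
    """
    if name in prev and name in scalars:
        pv, pt = prev[name]
        dt = now - pt
        if dt > 0:
            diff = scalars[name] - pv
            return diff / dt if diff >= 0 else 0.0
    return None


name = "lightllm_generation_tokens_total"

# Normal case: 500 new tokens over 2 seconds.
print(calculate_rate(name, {name: (1000.0, 10.0)}, {name: 1500.0}, 12.0))  # 250.0

# Counter reset: the value dropped from 1500 to 30, so the rate is clamped.
print(calculate_rate(name, {name: (1500.0, 12.0)}, {name: 30.0}, 14.0))  # 0.0

# First scrape: nothing to compare against.
print(calculate_rate(name, {}, {name: 30.0}, 14.0))  # None
```

Clamping to 0.0 avoids a large negative spike in the trend after a restart. Returning None for that interval instead, so the dashboard shows its placeholder, would be an equally defensible choice.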
    # rate helper (difference between two consecutive counter samples / dt)
    def rate(name):
        if name in prev and name in scalars:
            pv, pt = prev[name]
            dt = now - pt
            if dt > 0:
                return (scalars[name] - pv) / dt
        return None

    gen_tps = rate("lightllm_generation_tokens_total")
    in_tps = rate("lightllm_prompt_tokens_total")
    tpm = gen_tps * 60 if gen_tps is not None else None
Improvement: Use the newly introduced calculate_rate helper function to simplify the code and remove the nested rate function.
    gen_tps = calculate_rate("lightllm_generation_tokens_total", prev, scalars, now)
    in_tps = calculate_rate("lightllm_prompt_tokens_total", prev, scalars, now)
    tpm = gen_tps * 60 if gen_tps is not None else None

    # instantaneous gen tok/s, used for the trend
    gen_tps = None
    if "lightllm_generation_tokens_total" in prev:
        pv, pt = prev["lightllm_generation_tokens_total"]
        dt = now - pt
        if dt > 0:
            gen_tps = (scalars["lightllm_generation_tokens_total"] - pv) / dt
Improvement: Use the shared calculate_rate helper function here to eliminate duplicate rate calculation logic.
    # instantaneous gen tok/s, used for the trend
    gen_tps = calculate_rate("lightllm_generation_tokens_total", prev, scalars, now)
python ./tools/lightllm_monitor.py