Skip to content

add monitor#1360

Open
sufubao wants to merge 3 commits into
ModelTC:mainfrom
sufubao:QPS
Open

add monitor#1360
sufubao wants to merge 3 commits into
ModelTC:mainfrom
sufubao:QPS

Conversation

@sufubao

@sufubao sufubao commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

python ./tools/lightllm_monitor.py

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces lightllm_monitor.py, a real-time terminal dashboard for monitoring LightLLM metrics using the rich library. Feedback on the implementation highlights a critical bug where filtering metrics ending in _count or _sum inadvertently discards independent counters like lightllm_request_count and lightllm_batch_inference_count. Additionally, reviewers recommended using a regex-based parser to robustly handle commas in Prometheus label values, and extracting a shared calculate_rate helper function to eliminate duplicate rate calculation logic and handle counter resets gracefully.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread tools/lightllm_monitor.py
Comment on lines +73 to +81
if name.endswith("_bucket"):
base = name[: -len("_bucket")]
le = labels.get("le", "+Inf")
le_f = float("inf") if le == "+Inf" else float(le)
hist[base][le_f] = hist[base].get(le_f, 0.0) + value
elif name.endswith("_sum") or name.endswith("_count"):
continue # 分位数用 bucket 算, rate 用 base counter, 不需要 sum/count
else:
scalars[name] += value

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Critical Bug: The check name.endswith("_sum") or name.endswith("_count") incorrectly filters out actual independent counters like lightllm_request_count (displayed as req total) and lightllm_batch_inference_count (displayed as infer steps). As a result, these metrics will always display as on the dashboard.

Since keeping the histogram _sum and _count metrics in the scalars dictionary is harmless (they are simply ignored during rendering), you can safely remove this elif block entirely to ensure all independent counters are correctly aggregated.

        if name.endswith("_bucket"):
            base = name[: -len("_bucket")]
            le = labels.get("le", "+Inf")
            le_f = float("inf") if le == "+Inf" else float(le)
            hist[base][le_f] = hist[base].get(le_f, 0.0) + value
        else:
            scalars[name] += value

Comment thread tools/lightllm_monitor.py
Comment on lines +54 to +59
labels = {}
if labels_str:
for kv in labels_str.split(","):
if "=" in kv:
k, v = kv.split("=", 1)
labels[k.strip()] = v.strip().strip('"')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Robustness Issue: Splitting labels_str by , to parse Prometheus labels will fail if any label value contains a comma (e.g., a prompt or metadata string containing a comma).

Using a regular expression to extract key="value" pairs is much more robust and correctly handles commas and escaped quotes within label values.

        labels = {}
        if labels_str:
            import re
            for k, v in re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"((?:[^"\\]|\\.)*)"', labels_str):
                labels[k] = v.replace('\\"', '"').replace('\\\\', '\\')

Comment thread tools/lightllm_monitor.py
Comment on lines +136 to +140
def fmt_float(x, prec=1):
if x is None:
return "[dim]—[/dim]"
return f"{x:.{prec}f}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Improvement: To avoid duplicating the rate calculation logic between build_panel and main(), and to robustly handle counter resets (e.g., when the LightLLM server restarts and counters reset to 0), we can introduce a shared calculate_rate helper function.

Suggested change
def fmt_float(x, prec=1):
if x is None:
return "[dim]—[/dim]"
return f"{x:.{prec}f}"
def fmt_float(x, prec=1):
if x is None:
return "[dim]—[/dim]"
return f"{x:.{prec}f}"
def calculate_rate(name, prev, scalars, now):
if name in prev and name in scalars:
pv, pt = prev[name]
dt = now - pt
if dt > 0:
diff = scalars[name] - pv
return diff / dt if diff >= 0 else 0.0
return None

Comment thread tools/lightllm_monitor.py
Comment on lines +168 to +179
# —— rate helper (相邻两次 counter 差值 / dt) ——
def rate(name):
if name in prev and name in scalars:
pv, pt = prev[name]
dt = now - pt
if dt > 0:
return (scalars[name] - pv) / dt
return None

gen_tps = rate("lightllm_generation_tokens_total")
in_tps = rate("lightllm_prompt_tokens_total")
tpm = gen_tps * 60 if gen_tps is not None else None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Improvement: Use the newly introduced calculate_rate helper function to simplify the code and remove the nested rate function.

    gen_tps = calculate_rate("lightllm_generation_tokens_total", prev, scalars, now)
    in_tps = calculate_rate("lightllm_prompt_tokens_total", prev, scalars, now)
    tpm = gen_tps * 60 if gen_tps is not None else None

Comment thread tools/lightllm_monitor.py
Comment on lines +262 to +268
# 瞬时 gen tok/s 用于趋势
gen_tps = None
if "lightllm_generation_tokens_total" in prev:
pv, pt = prev["lightllm_generation_tokens_total"]
dt = now - pt
if dt > 0:
gen_tps = (scalars["lightllm_generation_tokens_total"] - pv) / dt

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Improvement: Use the shared calculate_rate helper function here to eliminate duplicate rate calculation logic.

Suggested change
# 瞬时 gen tok/s 用于趋势
gen_tps = None
if "lightllm_generation_tokens_total" in prev:
pv, pt = prev["lightllm_generation_tokens_total"]
dt = now - pt
if dt > 0:
gen_tps = (scalars["lightllm_generation_tokens_total"] - pv) / dt
# 瞬时 gen tok/s 用于趋势
gen_tps = calculate_rate("lightllm_generation_tokens_total", prev, scalars, now)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant