Merged
2 changes: 1 addition & 1 deletion lightllm/server/httpserver/manager.py
@@ -704,7 +704,7 @@ async def _wait_to_token_package(
prompt_cache_len = metadata.pop("prompt_cache_len", 0)
cpu_prompt_cache_len = metadata.pop("cpu_prompt_cache_len", 0)
disk_prompt_cache_len = metadata.pop("disk_prompt_cache_len", 0)
-metadata["prompt_cache_len"] = prompt_cache_len
+metadata["prompt_cache_len"] = prompt_cache_len + cpu_prompt_cache_len + disk_prompt_cache_len
medium

The update to metadata["prompt_cache_len"] correctly calculates the total cache length by summing GPU, CPU, and disk cache lengths. However, this change introduces an inconsistency with the internal metrics and logging found later in this function.

Specifically, the local variable prompt_cache_len (which still holds only the GPU cache length) is used for:

  • Calculating prompt_cache_ratio (line 731).
  • Logging gpu_prompt_cache_len (line 747).
  • Reporting the lightllm_cache_length metric (line 760).

If the goal is to report the total cache length consistently across the API and monitoring, these other locations should also be updated to use the sum. Note that if you update the local variable prompt_cache_len to be the sum, you should also update the log label at line 747 to avoid mislabeling the total as 'gpu'.
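One way to keep the API value, the ratio, and the log labels consistent is to derive them all from a single place. The following is a minimal sketch only, not the code in `manager.py`: the helper name `summarize_prompt_cache`, the `prompt_tokens` parameter, and the `total_prompt_cache_len` label are hypothetical, and it assumes the ratio should be computed against the total rather than the GPU-only length.

```python
from typing import Any, Dict


def summarize_prompt_cache(metadata: Dict[str, Any], prompt_tokens: int) -> Dict[str, Any]:
    """Hypothetical helper: compute every cache figure from one set of values."""
    # Same three pops as in the diff; each defaults to 0 when the key is absent.
    gpu_len = metadata.pop("prompt_cache_len", 0)
    cpu_len = metadata.pop("cpu_prompt_cache_len", 0)
    disk_len = metadata.pop("disk_prompt_cache_len", 0)
    total_len = gpu_len + cpu_len + disk_len

    # API-facing value: the total across all cache tiers, as in the diff.
    metadata["prompt_cache_len"] = total_len

    # Keep per-tier values under their own labels so the total is never
    # logged as "gpu". The ratio uses the total (an assumption of this sketch).
    ratio = total_len / prompt_tokens if prompt_tokens > 0 else 0.0
    return {
        "gpu_prompt_cache_len": gpu_len,
        "cpu_prompt_cache_len": cpu_len,
        "disk_prompt_cache_len": disk_len,
        "total_prompt_cache_len": total_len,
        "prompt_cache_ratio": ratio,
    }
```

With this shape, the log line and the metric can read from the returned dictionary, so the choice between reporting the GPU-only length and the total is made explicitly at each call site.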

sub_req_id_to_mtp_accepted_token_num[sub_req_id] = metadata.get("mtp_accepted_token_num", 0)

if is_first_token: