
Add GPU support and enhance monitoring features for jobs#2

Merged
DaniloBueno merged 12 commits into vios-s:main from stogiannidis:develop on Mar 2, 2026

Conversation

@stogiannidis (Contributor)

Introduce GPU support in job data generation and monitoring, improve GPU resource management, and refine error handling. Enhance the log viewer with an auto-refresh feature for better usability.

Copilot AI review requested due to automatic review settings January 21, 2026 17:25
Copilot AI (Contributor) left a comment


Pull request overview

This pull request adds GPU support and monitoring capabilities to a Kubernetes job monitoring tool. It introduces GPU resource tracking at both the cluster and job level, and adds an interactive log viewer with auto-refresh functionality.

Changes:

  • Added GPU information collection from cluster nodes and pods, including GPU type identification and allocation tracking
  • Enhanced job display to show GPU requests and types, with color-coded resource utilization indicators
  • Introduced interactive log viewer with keyboard navigation, syntax highlighting, and auto-refresh capability
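
The cluster-level GPU collection described above can be sketched roughly as follows. This is not the PR's actual `get_gpu_info` implementation; the node-dict field names mirror the Kubernetes node API, and the `gpu.product` label key is an assumption for illustration:

```python
def collect_gpu_info(nodes):
    """Aggregate GPU capacity and types across cluster nodes.

    `nodes` mimics `kubectl get nodes -o json` items; field names follow
    the Kubernetes node API, but this is a sketch, not the PR's code.
    """
    gpu_info = {'total_gpus': 0, 'gpu_types': set(), 'node_gpu_map': {}}
    for node in nodes:
        name = node['metadata']['name']
        capacity = node['status'].get('capacity', {})
        for key, value in capacity.items():
            if 'gpu' not in key.lower():
                continue
            try:
                count = int(value)
            except (ValueError, TypeError):
                continue  # skip malformed capacity values
            # The label key is a placeholder; real clusters often use
            # vendor labels such as nvidia.com/gpu.product.
            gpu_type = node['metadata'].get('labels', {}).get('gpu.product', 'unknown')
            gpu_info['total_gpus'] += count
            gpu_info['gpu_types'].add(gpu_type)
            gpu_info['node_gpu_map'][name] = gpu_type
    return gpu_info
```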

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 18 comments.

monitor.py: Added GPU tracking functions (get_gpu_info, _shorten_gpu_name), enhanced job/pod data collection with GPU information, added an interactive log viewer with navigation, and updated the UI to display GPU resources and selection indicators.
mock_data.py: Added a GPU field to all job definitions, implemented GPU resource requests in job specs, and added GPU-aware node assignment logic for mock pods.


stogiannidis and others added 6 commits January 26, 2026 11:50
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@stogiannidis (Contributor, Author) left a comment


Added all of Copilot's suggestions and addressed @DaniloBueno's comment.

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 16 comments.



Comment thread on mock_data.py:

    if gpu_req > 0:
        # Assign to a GPU node (prefer A100 for larger jobs)
        if gpu_req >= 4:
            node_name = 'gpu-node-01'  # A100

Copilot AI, Jan 26, 2026:

The comment indicates "# A100", but gpu-node-01 is actually defined as 'H100-80GB' in the _generate_gpu_info function. Update the comment to "# H100" to match the actual GPU type.

Suggested change:

    -node_name = 'gpu-node-01'  # A100
    +node_name = 'gpu-node-01'  # H100
Comment thread on monitor.py (lines +197 to +199):

    for node in gpu_info['nodes']:
        if node['name'] == node_name:
            node['allocated'] += gpu_req

Copilot AI, Jan 26, 2026:

The nested loop iterates through all GPU nodes for each container requesting GPUs, resulting in O(n*m) complexity, where n is the number of containers with GPU requests and m is the number of GPU nodes. Consider using a dictionary that maps node names to their corresponding entries in gpu_info['nodes'] for O(1) lookup, improving performance when there are many GPU nodes.
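
The dictionary-based lookup the reviewer suggests could look roughly like this. A sketch only: `allocate_gpus` and `pod_gpu_requests` are hypothetical names, and the `gpu_info['nodes']` layout (dicts with 'name' and 'allocated' keys) is inferred from the quoted snippet:

```python
def allocate_gpus(gpu_info, pod_gpu_requests):
    """Add per-pod GPU requests to node allocation counters.

    pod_gpu_requests: iterable of (node_name, gpu_req) pairs.
    Building a name->node dict once makes each lookup O(1) instead of
    scanning gpu_info['nodes'] for every request.
    """
    nodes_by_name = {node['name']: node for node in gpu_info['nodes']}
    for node_name, gpu_req in pod_gpu_requests:
        node = nodes_by_name.get(node_name)
        if node is not None:
            node['allocated'] += gpu_req
```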
Comment thread on monitor.py:

    table.add_column("Comp", justify="right")
    table.add_column("Duration", justify="right")

    all_rows = build_row_index(jobs)

Copilot AI, Jan 26, 2026:

The build_row_index function is called twice per render cycle: once in the main loop (line 861) and once inside generate_table (line 547). This duplicates the work of processing all jobs and pods. Consider passing the all_rows result from line 861 as a parameter to generate_table to avoid recalculating it.
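
The deduplication boils down to computing the row index once and threading it through. A minimal sketch, reusing the names from the quoted code (`build_row_index`, `generate_table`); the bodies here are stand-ins, not the real monitor.py implementations:

```python
def build_row_index(jobs):
    # Stand-in for the real function: one selectable row per job.
    return [{'job': job} for job in jobs]

def generate_table(jobs, all_rows=None):
    # Accept a precomputed row index; fall back to building it so
    # existing call sites keep working unchanged.
    if all_rows is None:
        all_rows = build_row_index(jobs)
    return all_rows  # the real code would render a rich Table from this

# Main loop: compute once, reuse in the renderer.
jobs = ['train-a', 'train-b']
all_rows = build_row_index(jobs)
table = generate_table(jobs, all_rows)
```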
Comment thread on monitor.py:

    start_line = min(scroll_offset + 1, total_lines)
    end_line = min(scroll_offset + (max_lines or total_lines), total_lines)
    scroll_info = f" ({start_line}-{end_line}/{total_lines})"

Copilot AI, Jan 26, 2026:

This line has trailing whitespace, which should be removed for consistency with code style guidelines.
Comment thread on mock_data.py (lines +433 to +434):

    for i, msg in enumerate(log_messages[:tail_lines]):
        timestamp = f"2026-01-21T10:{i:02d}:00Z"

Copilot AI, Jan 26, 2026:

The hardcoded timestamp "2026-01-21" in mock logs is in the past (the current date is January 26, 2026). Consider using dynamic timestamps generated relative to the current time, similar to how job timestamps are generated with the _time_ago function. This would make mock logs more realistic and avoid confusion.

Suggested change:

    -for i, msg in enumerate(log_messages[:tail_lines]):
    -    timestamp = f"2026-01-21T10:{i:02d}:00Z"
    +now = datetime.now(timezone.utc)
    +for i, msg in enumerate(log_messages[:tail_lines]):
    +    log_time = now - timedelta(minutes=(tail_lines - 1 - i))
    +    timestamp = log_time.strftime("%Y-%m-%dT%H:%M:%SZ")
Comment thread on monitor.py:

        return mock_data['logs']
    # Fallback if no mock_data provided
    return "Mock logs not available"

Copilot AI, Jan 26, 2026:

The namespace and pod_name parameters are used directly in a shell command without sanitization. Although pod_name comes from Kubernetes API responses (which should be safe), and namespace is validated by argparse, using shell=True with these parameters creates a potential command-injection risk. Consider adding input validation to ensure ns and pod_name contain only safe characters (alphanumeric, hyphens, underscores), or use subprocess with shell=False and pass arguments as a list: ['kubectl', '-n', ns, 'logs', pod_name, f'--tail={tail_lines}'].

Suggested change:

    # Validate namespace and pod name to prevent command injection when using shell commands.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", str(ns)):
        raise ValueError(f"Invalid namespace value: {ns!r}")
    if not re.fullmatch(r"[A-Za-z0-9_-]+", str(pod_name)):
        raise ValueError(f"Invalid pod name value: {pod_name!r}")
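
The list-based subprocess call the review recommends could be sketched as below. `build_logs_argv` and `fetch_pod_logs` are hypothetical helper names, not functions from monitor.py:

```python
import subprocess

def build_logs_argv(ns, pod_name, tail_lines=100):
    # Each list element is passed to kubectl verbatim (no shell parsing),
    # so metacharacters in ns or pod_name cannot inject extra commands.
    return ['kubectl', '-n', ns, 'logs', pod_name, f'--tail={tail_lines}']

def fetch_pod_logs(ns, pod_name, tail_lines=100):
    # Passing a list (not a string) means shell=False: no shell is involved.
    result = subprocess.run(build_logs_argv(ns, pod_name, tail_lines),
                            capture_output=True, text=True, check=False)
    return result.stdout if result.returncode == 0 else result.stderr
```

Even a hostile pod name such as `worker; rm -rf /` stays a single inert argument rather than being interpreted by a shell.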
Comment thread on monitor.py:

    # Create scroll indicator
    has_real_logs = bool(logs and logs.strip())
    if not has_real_logs:
        total_lines = 0

Copilot AI, Jan 26, 2026:

Variable total_lines is not used.

Suggested change: remove the line `total_lines = 0`.
Comment thread on monitor.py:

    if 'gpu' in key.lower():
        try:
            gpu_count = int(capacity[key])
        except (ValueError, TypeError):

Copilot AI, Jan 26, 2026:

The 'except' clause does nothing but pass, and there is no explanatory comment.

Suggested change:

    except (ValueError, TypeError):
        # If the GPU capacity value is not a valid integer, ignore it
        # and leave gpu_count at its default value of 0.
Comment thread on monitor.py:

        gpu_info['total_gpus'] += gpu_count
        gpu_info['gpu_types'].add(gpu_type)
        gpu_info['node_gpu_map'][node_name] = gpu_type
    except:

Copilot AI, Jan 26, 2026:

The 'except' clause does nothing but pass, and there is no explanatory comment.

Suggested change:

    except:
        # Best-effort: if node/GPU information cannot be parsed or is unavailable,
        # skip populating GPU node details and continue without this optional data.
Comment thread on monitor.py:

        for node in gpu_info['nodes']:
            if node['name'] == node_name:
                node['allocated'] += gpu_req
    except (ValueError, TypeError):

Copilot AI, Jan 26, 2026:

The 'except' clause does nothing but pass, and there is no explanatory comment.

Suggested change:

    except (ValueError, TypeError):
        # Ignore non-integer or otherwise invalid GPU request values
        # and continue processing remaining containers.
@DaniloBueno (Contributor) left a comment


Great job, @stogiannidis!
This looks good to me as it is. There are a few improvement suggestions from Copilot, but I don't think they're critical... feel free to tackle them if you want 🙂

Comment thread on mock_data.py:

    log_messages.extend(log_messages)

    for i, msg in enumerate(log_messages[:tail_lines]):
        timestamp = f"2026-01-21T10:{i:02d}:00Z"

@DaniloBueno (Contributor):

This can generate invalid timestamps, like:

    2026-01-21T10:100:00Z INFO: Initializing worker threads...
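
The failure mode is easy to reproduce: `log_messages.extend(log_messages)` doubles the list, so once `i` passes 59 the f-string writes an out-of-range value into the minute field (a minimal repro, not code from the PR):

```python
def make_timestamp(i):
    # Mirrors the mock-data f-string: i is written into the minute field,
    # and {i:02d} only pads, it never clamps values wider than two digits.
    return f"2026-01-21T10:{i:02d}:00Z"

print(make_timestamp(5))    # valid: minute 05
print(make_timestamp(100))  # invalid: minute field becomes 100
```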

Comment thread on monitor.py:

        return Panel(grid, title="Cluster Quota", border_style="blue")


    def generate_log_viewer(logs, pod_name, scroll_offset=0, max_lines=None):

@DaniloBueno (Contributor):

I was wondering if we should sort log messages by timestamp DESC instead of ASC. Also, we may want a "page up/down" feature in the future.
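
Newest-first ordering, as suggested, could be sketched like this. `sort_logs_desc` is a hypothetical helper; it assumes each log line starts with an ISO-8601 UTC timestamp, as the mock logs do, so lexicographic order matches chronological order:

```python
def sort_logs_desc(logs):
    # Lines are prefixed with ISO-8601 UTC timestamps, so plain string
    # sorting is chronological; reverse=True puts the newest line first.
    lines = [ln for ln in logs.splitlines() if ln.strip()]
    return "\n".join(sorted(lines, reverse=True))
```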

@DaniloBueno merged commit 096fc73 into vios-s:main on Mar 2, 2026
1 check passed
3 participants