Add GPU support and enhance monitoring features for jobs #2

DaniloBueno merged 12 commits into vios-s:main
Conversation
Pull request overview
This pull request adds GPU support and monitoring capabilities to a Kubernetes job monitoring tool. It introduces GPU resource tracking at both the cluster and job level, and adds an interactive log viewer with auto-refresh functionality.
Changes:
- Added GPU information collection from cluster nodes and pods, including GPU type identification and allocation tracking
- Enhanced job display to show GPU requests and types, with color-coded resource utilization indicators
- Introduced interactive log viewer with keyboard navigation, syntax highlighting, and auto-refresh capability
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| monitor.py | Added GPU tracking functions (get_gpu_info, _shorten_gpu_name), enhanced job/pod data collection with GPU information, added interactive log viewer with navigation, and updated UI to display GPU resources and selection indicators |
| mock_data.py | Added GPU field to all job definitions, implemented GPU resource requests in job specs, and added GPU-aware node assignment logic for mock pods |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
stogiannidis left a comment
Added all of Copilot's suggestions and addressed @DaniloBueno's comment
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 16 comments.
```python
if gpu_req > 0:
    # Assign to a GPU node (prefer A100 for larger jobs)
    if gpu_req >= 4:
        node_name = 'gpu-node-01'  # A100
```
The comment indicates "# A100" but gpu-node-01 is actually defined as 'H100-80GB' in the _generate_gpu_info function. Update the comment to "# H100" to match the actual GPU type.
```diff
-        node_name = 'gpu-node-01'  # A100
+        node_name = 'gpu-node-01'  # H100
```
```python
for node in gpu_info['nodes']:
    if node['name'] == node_name:
        node['allocated'] += gpu_req
```
The nested loop iterates through all GPU nodes for each container requesting GPUs, resulting in O(n*m) complexity where n is the number of containers with GPU requests and m is the number of GPU nodes. Consider using a dictionary to map node names to their corresponding entries in gpu_info['nodes'] for O(1) lookup, improving performance when there are many GPU nodes.
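A minimal sketch of the suggested lookup table; the `gpu_info` structure and node names are assumed from the surrounding `monitor.py` code:

```python
# Hypothetical gpu_info structure, as assumed from the surrounding code.
gpu_info = {'nodes': [{'name': 'gpu-node-01', 'allocated': 0},
                      {'name': 'gpu-node-02', 'allocated': 0}]}

# Build the name-to-entry mapping once, before the per-container loop.
node_by_name = {node['name']: node for node in gpu_info['nodes']}

# Inside the loop, the inner scan over gpu_info['nodes'] becomes an
# O(1) dictionary lookup:
node_name, gpu_req = 'gpu-node-01', 4
node = node_by_name.get(node_name)
if node is not None:
    node['allocated'] += gpu_req
```

Since the entries in `node_by_name` are the same dict objects stored in `gpu_info['nodes']`, mutating them through the mapping updates `gpu_info` in place.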
```python
table.add_column("Comp", justify="right")
table.add_column("Duration", justify="right")

all_rows = build_row_index(jobs)
```
The build_row_index function is called twice per render cycle: once in the main loop (line 861) and once inside generate_table (line 547). This duplicates work processing all jobs and pods. Consider passing the all_rows result from line 861 as a parameter to generate_table to avoid recalculating it.
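A sketch of the suggested refactor, with stand-in implementations (the real `build_row_index` and `generate_table` in `monitor.py` do more work):

```python
# Stand-in for the real row-index builder: one entry per (job, pod) pair.
def build_row_index(jobs):
    return [(job['name'], pod['name']) for job in jobs
            for pod in job.get('pods', [])]

# Accept the precomputed index as an optional parameter so existing
# callers that don't pass it keep working unchanged.
def generate_table(jobs, all_rows=None):
    if all_rows is None:
        all_rows = build_row_index(jobs)
    return all_rows  # stand-in for rendering the rows into a table

jobs = [{'name': 'train', 'pods': [{'name': 'train-pod-0'}]}]
all_rows = build_row_index(jobs)                 # once, in the main loop
rows = generate_table(jobs, all_rows=all_rows)   # reused, not recomputed
```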
```python
start_line = min(scroll_offset + 1, total_lines)
end_line = min(scroll_offset + (max_lines or total_lines), total_lines)
scroll_info = f" ({start_line}-{end_line}/{total_lines})"
```
This line has trailing whitespace which should be removed for consistency with code style guidelines.
```python
for i, msg in enumerate(log_messages[:tail_lines]):
    timestamp = f"2026-01-21T10:{i:02d}:00Z"
```
The hardcoded timestamp "2026-01-21" in mock logs is in the past (current date is January 26, 2026). Consider using dynamic timestamps generated relative to the current time, similar to how job timestamps are generated using the _time_ago function. This would make mock logs more realistic and avoid confusion.
```diff
-for i, msg in enumerate(log_messages[:tail_lines]):
-    timestamp = f"2026-01-21T10:{i:02d}:00Z"
+now = datetime.now(timezone.utc)
+for i, msg in enumerate(log_messages[:tail_lines]):
+    log_time = now - timedelta(minutes=(tail_lines - 1 - i))
+    timestamp = log_time.strftime("%Y-%m-%dT%H:%M:%SZ")
```
```python
        return mock_data['logs']
    # Fallback if no mock_data provided
    return "Mock logs not available"
```
The namespace and pod_name parameters are used directly in a shell command without sanitization. Although pod_name comes from Kubernetes API responses (which should be safe), and namespace is validated by argparse, using shell=True with these parameters creates a potential command injection risk. Consider adding input validation to ensure ns and pod_name contain only safe characters (alphanumeric, hyphens, underscores), or use subprocess with shell=False and pass arguments as a list: ['kubectl', '-n', ns, 'logs', pod_name, f'--tail={tail_lines}'].
```diff
+# Validate namespace and pod name to prevent command injection when using shell commands.
+if not re.fullmatch(r"[A-Za-z0-9_-]+", str(ns)):
+    raise ValueError(f"Invalid namespace value: {ns!r}")
+if not re.fullmatch(r"[A-Za-z0-9_-]+", str(pod_name)):
+    raise ValueError(f"Invalid pod name value: {pod_name!r}")
```
```python
# Create scroll indicator
has_real_logs = bool(logs and logs.strip())
if not has_real_logs:
    total_lines = 0
```
Variable total_lines is not used.
```diff
-    total_lines = 0
```
```python
    if 'gpu' in key.lower():
        try:
            gpu_count = int(capacity[key])
        except (ValueError, TypeError):
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
         except (ValueError, TypeError):
+            # If the GPU capacity value is not a valid integer, ignore it
+            # and leave gpu_count at its default value of 0.
```
```python
        gpu_info['total_gpus'] += gpu_count
        gpu_info['gpu_types'].add(gpu_type)
        gpu_info['node_gpu_map'][node_name] = gpu_type
    except:
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
     except:
+        # Best-effort: if node/GPU information cannot be parsed or is unavailable,
+        # skip populating GPU node details and continue without this optional data.
```
```python
        for node in gpu_info['nodes']:
            if node['name'] == node_name:
                node['allocated'] += gpu_req
    except (ValueError, TypeError):
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
     except (ValueError, TypeError):
+        # Ignore non-integer or otherwise invalid GPU request values
+        # and continue processing remaining containers.
```
DaniloBueno left a comment
Great job, @stogiannidis!
This looks good to me as it is. There are a few improvement suggestions from Copilot, but I don't think they're critical... feel free to tackle them if you want 🙂
```python
log_messages.extend(log_messages)

for i, msg in enumerate(log_messages[:tail_lines]):
    timestamp = f"2026-01-21T10:{i:02d}:00Z"
```
This can generate invalid timestamps once `i` exceeds 59, because `{i:02d}` is used as the minutes field, for example:
`2026-01-21T10:100:00Z INFO: Initializing worker threads...`
```python
    return Panel(grid, title="Cluster Quota", border_style="blue")


def generate_log_viewer(logs, pod_name, scroll_offset=0, max_lines=None):
```
I was wondering if we should sort log messages by timestamp DESC instead of ASC. Also, we may want a "page up/down" feature in the future.
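Both ideas could be sketched roughly like this; the function names and `page_size` are hypothetical, not the actual `monitor.py` API:

```python
# Newest-first ordering: mock log lines begin with an ISO-8601 timestamp,
# so lexicographic order matches chronological order.
def order_lines(lines, descending=True):
    return sorted(lines, reverse=descending)

# Page up/down: direction is +1 for PageDown, -1 for PageUp; the result
# is clamped so the viewport never scrolls past either end of the log.
def page_scroll(scroll_offset, total_lines, page_size, direction):
    new_offset = scroll_offset + direction * page_size
    return max(0, min(new_offset, max(total_lines - page_size, 0)))

lines = ['2026-01-21T10:00:00Z INFO: start', '2026-01-21T10:05:00Z INFO: done']
newest_first = order_lines(lines)
```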
Introduce GPU support in job data generation and monitoring, improve GPU resource management, and refine error handling. Enhance the log viewer with an auto-refresh feature for better usability.