Add GPU support and enhance monitoring features for jobs #2

DaniloBueno merged 12 commits into vios-s:main
Conversation
Pull request overview
This pull request adds GPU support and monitoring capabilities to a Kubernetes job monitoring tool. It introduces GPU resource tracking at both the cluster and job level, and adds an interactive log viewer with auto-refresh functionality.
Changes:
- Added GPU information collection from cluster nodes and pods, including GPU type identification and allocation tracking
- Enhanced job display to show GPU requests and types, with color-coded resource utilization indicators
- Introduced interactive log viewer with keyboard navigation, syntax highlighting, and auto-refresh capability
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| monitor.py | Added GPU tracking functions (get_gpu_info, _shorten_gpu_name), enhanced job/pod data collection with GPU information, added interactive log viewer with navigation, and updated UI to display GPU resources and selection indicators |
| mock_data.py | Added GPU field to all job definitions, implemented GPU resource requests in job specs, and added GPU-aware node assignment logic for mock pods |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
stogiannidis left a comment
Added all of Copilot's suggestions and addressed @DaniloBueno's comment
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 16 comments.
```python
if gpu_req > 0:
    # Assign to a GPU node (prefer A100 for larger jobs)
    if gpu_req >= 4:
        node_name = 'gpu-node-01'  # A100
```
The comment indicates "# A100" but gpu-node-01 is actually defined as 'H100-80GB' in the _generate_gpu_info function. Update the comment to "# H100" to match the actual GPU type.
```diff
-        node_name = 'gpu-node-01'  # A100
+        node_name = 'gpu-node-01'  # H100
```
```python
for node in gpu_info['nodes']:
    if node['name'] == node_name:
        node['allocated'] += gpu_req
```
The nested loop iterates through all GPU nodes for each container requesting GPUs, resulting in O(n*m) complexity where n is the number of containers with GPU requests and m is the number of GPU nodes. Consider using a dictionary to map node names to their corresponding entries in gpu_info['nodes'] for O(1) lookup, improving performance when there are many GPU nodes.
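A minimal sketch of the suggested lookup table; the `gpu_info` structure and node names are assumed from the surrounding `monitor.py` code:

```python
# Hypothetical gpu_info structure, as assumed from the surrounding code.
gpu_info = {'nodes': [{'name': 'gpu-node-01', 'allocated': 0},
                      {'name': 'gpu-node-02', 'allocated': 0}]}

# Build the name-to-entry mapping once, before the per-container loop.
node_by_name = {node['name']: node for node in gpu_info['nodes']}

# Inside the loop, the inner scan over gpu_info['nodes'] becomes an
# O(1) dictionary lookup:
node_name, gpu_req = 'gpu-node-01', 4
node = node_by_name.get(node_name)
if node is not None:
    node['allocated'] += gpu_req
```

Since the entries in `node_by_name` are the same dict objects stored in `gpu_info['nodes']`, mutating them through the mapping updates `gpu_info` in place.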
```python
table.add_column("Comp", justify="right")
table.add_column("Duration", justify="right")

all_rows = build_row_index(jobs)
```
The build_row_index function is called twice per render cycle: once in the main loop (line 861) and once inside generate_table (line 547). This duplicates work processing all jobs and pods. Consider passing the all_rows result from line 861 as a parameter to generate_table to avoid recalculating it.
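A sketch of the suggested refactor, with stand-in implementations (the real `build_row_index` and `generate_table` in `monitor.py` do more work):

```python
# Stand-in for the real row-index builder: one entry per (job, pod) pair.
def build_row_index(jobs):
    return [(job['name'], pod['name']) for job in jobs
            for pod in job.get('pods', [])]

# Accept the precomputed index as an optional parameter so existing
# callers that don't pass it keep working unchanged.
def generate_table(jobs, all_rows=None):
    if all_rows is None:
        all_rows = build_row_index(jobs)
    return all_rows  # stand-in for rendering the rows into a table

jobs = [{'name': 'train', 'pods': [{'name': 'train-pod-0'}]}]
all_rows = build_row_index(jobs)                 # once, in the main loop
rows = generate_table(jobs, all_rows=all_rows)   # reused, not recomputed
```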
```python
start_line = min(scroll_offset + 1, total_lines)
end_line = min(scroll_offset + (max_lines or total_lines), total_lines)
scroll_info = f" ({start_line}-{end_line}/{total_lines})"
```
This line has trailing whitespace which should be removed for consistency with code style guidelines.
```python
for i, msg in enumerate(log_messages[:tail_lines]):
    timestamp = f"2026-01-21T10:{i:02d}:00Z"
```
The hardcoded timestamp "2026-01-21" in mock logs is in the past (current date is January 26, 2026). Consider using dynamic timestamps generated relative to the current time, similar to how job timestamps are generated using the _time_ago function. This would make mock logs more realistic and avoid confusion.
```diff
-for i, msg in enumerate(log_messages[:tail_lines]):
-    timestamp = f"2026-01-21T10:{i:02d}:00Z"
+now = datetime.now(timezone.utc)
+for i, msg in enumerate(log_messages[:tail_lines]):
+    log_time = now - timedelta(minutes=(tail_lines - 1 - i))
+    timestamp = log_time.strftime("%Y-%m-%dT%H:%M:%SZ")
```
```python
        return mock_data['logs']
    # Fallback if no mock_data provided
    return "Mock logs not available"
```
The namespace and pod_name parameters are used directly in a shell command without sanitization. Although pod_name comes from Kubernetes API responses (which should be safe), and namespace is validated by argparse, using shell=True with these parameters creates a potential command injection risk. Consider adding input validation to ensure ns and pod_name contain only safe characters (alphanumeric, hyphens, underscores), or use subprocess with shell=False and pass arguments as a list: ['kubectl', '-n', ns, 'logs', pod_name, f'--tail={tail_lines}'].
```diff
+# Validate namespace and pod name to prevent command injection when using shell commands.
+if not re.fullmatch(r"[A-Za-z0-9_-]+", str(ns)):
+    raise ValueError(f"Invalid namespace value: {ns!r}")
+if not re.fullmatch(r"[A-Za-z0-9_-]+", str(pod_name)):
+    raise ValueError(f"Invalid pod name value: {pod_name!r}")
```
```python
# Create scroll indicator
has_real_logs = bool(logs and logs.strip())
if not has_real_logs:
    total_lines = 0
```
Variable total_lines is not used.
```diff
-    total_lines = 0
```
```python
    if 'gpu' in key.lower():
        try:
            gpu_count = int(capacity[key])
        except (ValueError, TypeError):
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
         except (ValueError, TypeError):
+            # If the GPU capacity value is not a valid integer, ignore it
+            # and leave gpu_count at its default value of 0.
```
```python
        gpu_info['total_gpus'] += gpu_count
        gpu_info['gpu_types'].add(gpu_type)
        gpu_info['node_gpu_map'][node_name] = gpu_type
    except:
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
     except:
+        # Best-effort: if node/GPU information cannot be parsed or is unavailable,
+        # skip populating GPU node details and continue without this optional data.
```
```python
        for node in gpu_info['nodes']:
            if node['name'] == node_name:
                node['allocated'] += gpu_req
    except (ValueError, TypeError):
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
     except (ValueError, TypeError):
+        # Ignore non-integer or otherwise invalid GPU request values
+        # and continue processing remaining containers.
```
DaniloBueno left a comment
Great job, @stogiannidis!
This looks good to me as it is. There are a few improvement suggestions from Copilot, but I don't think they're critical... feel free to tackle them if you want 🙂
```python
log_messages.extend(log_messages)

for i, msg in enumerate(log_messages[:tail_lines]):
    timestamp = f"2026-01-21T10:{i:02d}:00Z"
```
This can generate invalid timestamps once `i` exceeds 59, because `{i:02d}` is used as the minutes field, for example:
`2026-01-21T10:100:00Z INFO: Initializing worker threads...`
```python
    return Panel(grid, title="Cluster Quota", border_style="blue")


def generate_log_viewer(logs, pod_name, scroll_offset=0, max_lines=None):
```
I was wondering if we should sort log messages by timestamp DESC instead of ASC. Also, we may want a "page up/down" feature in the future.
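Both ideas could be sketched roughly like this; the function names and `page_size` are hypothetical, not the actual `monitor.py` API:

```python
# Newest-first ordering: mock log lines begin with an ISO-8601 timestamp,
# so lexicographic order matches chronological order.
def order_lines(lines, descending=True):
    return sorted(lines, reverse=descending)

# Page up/down: direction is +1 for PageDown, -1 for PageUp; the result
# is clamped so the viewport never scrolls past either end of the log.
def page_scroll(scroll_offset, total_lines, page_size, direction):
    new_offset = scroll_offset + direction * page_size
    return max(0, min(new_offset, max(total_lines - page_size, 0)))

lines = ['2026-01-21T10:00:00Z INFO: start', '2026-01-21T10:05:00Z INFO: done']
newest_first = order_lines(lines)
```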
Introduce GPU support in job data generation and monitoring, improve GPU resource management, and refine error handling. Enhance the log viewer with an auto-refresh feature for better usability.