Skip to content

Conversation

@ikegami-t
Copy link
Contributor

This is to monitor the system changes.

@ikegami-t
Copy link
Contributor Author

Will do consider to change the PR changes to add --continue or --delay option instead of the nvme top command. (Or just rename to nvme watch command instead.)

@ikegami-t ikegami-t force-pushed the nvme-top branch 2 times, most recently from 2a0fb7a to 50ece42 Compare December 29, 2025 07:23
@ikegami-t
Copy link
Contributor Author

Hi @igaw, @shroffni and @martin-belanger,

Just updated the PR changes to add --delay option instead of nvme top command (but not --continuous option) as mentioned by the matrix chat.

@shroffni
Copy link
Contributor

shroffni commented Jan 7, 2026

After looking at the code, it does not seem appropriate to add a --delay option at the global configuration level.

The main issue is that a global --delay does not make sense for all nvme-cli commands. For example, commands such as create-ns or delete-ns should not accept a --delay option. If create-ns were invoked with a non-zero delay, it could repeatedly create namespaces in a loop until the device runs out of space. Similarly, commands like delete-ns, set-feature, format, and several others have no meaningful semantics when executed periodically with a delay.

If the goal is simply to monitor or repeatedly execute a command, the existing watch utility already provides this functionality. While watch may not be the most efficient solution, it is sufficient for user-space tooling and works reasonably well today.

If we still believe that a --delay option is necessary, then it should be added selectively, only to those nvme-cli sub-commands where periodic execution actually makes sense. One such example is show-topology, and this could also be useful for upcoming tools like nvme-top.

In summary, rather than introducing a global --delay option, it would be better to:

  • Identify the specific sub-commands where delayed or periodic execution is meaningful, and
  • Add support for --delay only to those commands.

@igaw
Copy link
Collaborator

igaw commented Jan 7, 2026

I agree with @shroffni's arguments.

@ikegami-t
Copy link
Contributor Author

Just changed the PR commit to add the option only for the show-topology command at first. Thank you.

@shroffni
Copy link
Contributor

shroffni commented Jan 9, 2026

It’s good that the --delay option is now limited to show-topology. However, the current implementation lacks scrolling support, which makes it difficult to use in systems with many subsystems or namespaces. In such cases, the show-topology output may not fit on a single screen, and with the current behavior it becomes inconvenient—or effectively impossible—to view the full output.

For comparison, it would be useful to look at how tools like top handle scrolling and screen updates.

I added your patch on my system, and found that the screen flickers noticeably, and the rendering does not appear smooth when the output is repainted on every --delay interval. This also needs improvement. One possible optimization would be to compare the output between two intervals and avoid repainting the screen if the output has not changed.

To further minimize flickering, it would also help to collect the entire command output into a single buffer and render it in one operation, rather than printing each line individually as it is generated.

There would be some other alternatives as well to avoid flickering. I'd suggest review implementation of tools like watch/top etc.

@ikegami-t
Copy link
Contributor Author

Just refixed the patch as mentioned so please review again. Thank you.

@shroffni
Copy link
Contributor

The recent change that compares the current and previous output buffers and only prints the output when they differ is a good improvement. It significantly reduces screen flickering by avoiding unnecessary repaints, since the output is refreshed only when something has changed during the last interval.

I pulled this change and rebuilt nvme-cli locally. On my system, I have multiple NVMe disks with several namespaces, so the show-topology output spans multiple screens. When running show-topology with the --delay option, the output naturally extends beyond one screen.

With the current change, each time the output is refreshed it automatically scrolls to the last screen/page. If I want to inspect the first screen, I have to scroll back manually, which is acceptable initially. However, this behavior differs from tools such as top. In top, even when the output spans multiple screens, the display remains anchored at the first screen by default. If the user navigates using the arrow keys or Page Up/Page Down, top stays at the selected position and continues updating only the visible portion of the output, rather than jumping back to the end on each refresh.

In contrast, with show-topology, even if I scroll back to the first screen/page, the next refresh (when the output changes) automatically scrolls me back to the last screen/page. This makes it difficult to focus on a specific portion of the output, as the view keeps jumping away from where I am looking. From a usability standpoint, this behavior is inconvenient and should be addressed.

Additionally, while testing this change, I noticed that some junk characters occasionally appear or get appended—typically on the last screen—when show-topology detects a mismatch between the previous and current output and dumps the updated output.

I would strongly encourage testing this change on a system with multiple disks and namespaces, where show-topology produces multiple screens of output. That setup makes both of these issues—auto-scrolling and stray characters—much easier to observe and debug.

I may be asking too much here but we have to make this usable for the end user so please bear with me.

@ikegami-t
Copy link
Contributor Author

Understood. Will do study more as refering top and watch, etc. Thank you.

@igaw
Copy link
Collaborator

igaw commented Jan 13, 2026

If it helps you can create a bunch of nvmet-tcp targets on your system and connect to them. Maybe we should add a test setup for this from the beginning, e.g. having a script which setups a bunch of nvmet targets against null device backends etc. Just as idea.

@ikegami-t
Copy link
Contributor Author

Thank you. Noted.

This is to monitor the system changes.

Signed-off-by: Tokunori Ikegami <ikegami.t@gmail.com>
@ikegami-t
Copy link
Contributor Author

Just refixed the patch again so please review again and if any comment or advice please let me know.
Note: Also tested the delay option on locally by changing the show-topology command as simulated.

@shroffni
Copy link
Contributor

I just applied your patch on my system and reran the show-topology command with --delay option. Below are few of my observations and then code review comments:

Observations:

  1. Arrow key input is still leaking to the screen:
    While navigating using the up/down arrow keys, rendering is not smooth and I occasionally see literal ASCII escape sequences (^[[B for key-down and ^[[A for key-up) printed on the screen. For example, the last line shows the escape sequence instead of being handled as a navigation event:
nvme-subsys2 - NQN=nqn.2019-10.com.kioxia:KCM6DRUL1T92:22C0A031TGV8
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=numa
\
 +- ns 1
 \
  +- nvme2 pcie 0525:48:00.0 live optimized
 +- ns 2
 \
  +- nvme2 pcie 0525:48:00.0 live optimized
 +- ns 3
^[[B

And then this one:

nvme-subsys3 - NQN=nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=numa
\
 +- ns 1
 \
  +- nvme3 pcie 052e:78:00.0 live optimized
  +- nvme7 pcie 058e:78:00.0 live optimized
 +- ns 2
 \
  +- nvme3 pcie 052e:78:00.0 live optimized
^[[A
  1. Navigation sometimes becomes unresponsive:
    While the output is running, resizing the terminal window occasionally makes it impossible to navigate using arrow keys. In some cases, the dashboard misses key events, requiring multiple key presses before navigation is registered.

  2. reen rendering during navigation is still not smooth:
    Even when key events are handled correctly, the overall rendering experience while scrolling up/down remains visually uneven.

Code review comments:

  1. Controlling terminal handling
    The terminal input/output handling currently uses STDIN_FILENO as the controlling terminal file descriptor. I would suggest using ctermid() instead to explicitly determine the controlling terminal of the process. This is particularly important when standard input is redirected.
    For example, please try running, "nvme show-topology --delay 1 < /dev/null" and see what happens.

  2. Screen redraw strategy
    The delay_print() function currently clears the entire screen and repaints everything in one go. This can contribute to rendering artifacts and flicker.
    Instead, I suggest updating the screen line-by-line using the following approach:

  • Move the cursor to the target line
  • Clear that line
  • Print the new content
  • Repeat for each line
  1. Potential race condition in window resize handling:
    There appears to be a race window when handling terminal resize events. For example, if a SIGWINCH arrives just before pselect() is called—but after the dashboard is already rendered—the resize event may be missed.
    To handle window size changes robustly while the dashboard is running, I suggest the following sequence using pselect():
  • Block SIGWINCH
  • Read the current window size
  • Render the dashboard based on the current dimensions
  • Atomically unblock SIGWINCH and wait for either user input or a resize event via pselect()

This approach shall ensure that window resize events are not lost.

  1. Use of static variables for state tracking
    I noticed several static variables being used to track state such as the last row, column count, and navigation offsets across delay intervals. Instead, it would be cleaner and more maintainable to encapsulate this state in a dedicated context structure and pass it explicitly where needed. Relying on static variables for cross-event state makes the code harder to reason about and maintain.

  2. Avoid unnecessary file I/O for topology comparison
    The current implementation writes the topology output to file and then copies it into a memory buffer for comparison against the previous output before rendering. While the logic is good, it introduces unnecessary file I/O.
    I suggest using open_memstream() instead. A memstream provides a stdio-compatible interface (with file position, offsets, and fflush() semantics) while keeping all data in memory, which would significantly reduce I/O overhead and simplify the code.

Having said all of the above, as you may know, I have been spending some time now finalizing the design of nvme-top, and during that work I ran into—and addressed—the same issues listed earlier. I am preparing the implementation in a way that it can be reused for other nvme-cli commands that may need to operate as continuous dashboards.
At this point, the nvme-top changes are in the final stage of unit testing, and I plan to send them out for formal review soon. While I do not claim that the implementation will be perfect, I believe it would be useful to pause further parallel work for now so we can avoid duplicating effort. Once the nvme-top changes are available, we can collaborate again to review them together and evaluate how best to integrate the shared dashboard infrastructure with the show-topology dashboard.
That said, I don’t want to block you from continuing your current work on the show-topology dashboard if you’d prefer to pursue it further before the nvme-top changes are formally published.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants