In this issue, I propose a new feature and API sketch, and invite discussion.
We’ve talked about wanting to capture directly to GPU memory. This is much faster than capturing to CPU memory. If the user is going to encode to H.264 (such as for screen sharing or recording), or send to an AI library, this is definitely a performance win. It also has less impact on other programs.
On Linux/X11, captures to GPU memory can only come from a single window, not the entire screen. This is the opposite of Windows DXGI Desktop Duplication, which can only capture the entire screen. Windows WGC does both, but is going to be quite tough to implement, and isn’t supported on early versions of Windows. I don’t remember about Wayland or macOS.
That means that we’ll want to add single-window capture as an explicit capability of the MSS API, rather than having it be entirely based on a bounding box.
To be clear, this proposal isn’t about adding GPU capture to MSS, not yet. It’s about adding single-window capture. But we do want to anticipate the needs of GPU capture as we design the single-window capture API.
Example
Under this proposal, most of the time, the user can do something simple:
    with mss.MSS(window="GitHub - Firefox") as sct:
        ...
or
    with mss.MSS(window=re.compile(r"(^| - )Firefox$")) as sct:
        ...
More advanced use cases are still accommodated by this proposal, such as the CP2077 example below.
When to specify the window
It might seem most convenient for the user to specify the window as part of the grab() call. But that has some problems; we may need to have them specify the window as part of the MSS object construction. I’ll explore the problems here.
We want to make things easy for the users, to free them from having to think about the implementation. As an example, we let the users use backend="default" (the default value). In my head, I’ve been treating this as “let MSS decide at runtime”, not “an always-the-same default for the current platform”.
This is because of the limitations I mentioned above. If the user wants to capture a single window, there may be faster mechanisms available than to capture the whole screen. Or vice versa. Or their version of Windows might not support WGC. Specifying backend="default" means that the user wants MSS to choose which backend to use, in the given circumstance. We can write a small decision tree (just some nested if/else logic) for that.
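As a rough illustration of that decision tree, here is a minimal sketch. Everything in it is an assumption made for illustration: the function name choose_backend, the platform strings, the has_wgc flag, and the backend names are placeholders, not existing MSS API, and the real set of backends is still to be decided.

```python
from __future__ import annotations


def choose_backend(platform: str, single_window: bool, has_wgc: bool = False) -> str:
    """Resolve backend="default" to a concrete backend (all names hypothetical)."""
    if platform == "windows":
        if has_wgc:
            # WGC can do both single-window and full-screen capture.
            return "wgc"
        if single_window:
            # Placeholder name for whatever per-window fallback exists without WGC.
            return "per-window-fallback"
        # Desktop Duplication only captures the entire screen.
        return "dxgi"
    if platform == "linux-x11":
        if single_window:
            return "x11-window"
        return "x11-screen"
    # Raise rather than guess when no rule applies.
    raise ValueError(f"no default backend known for {platform!r}")
```

The point of the sketch is only that the choice depends on single_window, which is why that information has to be available when the MSS object is constructed.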
We currently create the MSSImplementation object as part of the MSS constructor. Since this is before grab has been called, then if the user is using backend="default", then we want to know whether they’ll be using single-window or full-screen capture at that time.
Again, while this isn’t necessary with the current backends, it will become necessary with anticipated future backends.
Alternatives (about when to specify the window)
One alternative might be for the user to specify whether they want to use single-window or full-screen capture at construction time, and defer specifying the window to use until later. I’m not sure this buys us anything.
Another alternative might be to defer backend instantiation until the first grab, and to create other backends if the user changes between full-screen and single-window mode. This may be awkward to implement, especially around things like monitor enumeration. I’m not sure that the extra complexity is worthwhile.
How to specify the window
Different platforms may have different ways that users might want to specify the window to use. Here, I’ll refer to this as the window selector. As an initial proposal, I suggest the following.
To accommodate a wide variety of window selection needs, the API I’m proposing is quite flexible, but still easy to implement and to use. The user may specify the window as any of the following:
- An integer, giving the native ID. This is a Window XID on Linux/X11, HWND on Windows, and CGWindowID on macOS. This identifies a specific currently-existing native window, not an application-level window that survives recreation.
- A string, giving the user-visible title, such as “GitHub - Firefox”.
- A re.Pattern object, to be matched against the user-visible title, such as re.compile(r"(^| - )Firefox$").
- A Mapping. The keys are platform-specific, but include a few common ones such as "id" (native ID), "exe" (process executable), "pid", "title", "class" (WM_CLASS / lpszClassName, usually more stable than the title), etc, in addition to some platform-specific values if appropriate. The values are strings or re.Patterns if appropriate, or integers if that makes more sense. The targeted window is whatever matches all the criteria. In practice, this will probably just have one entry, but using a Mapping makes it easy to allow for a flexible key/value pair. Examples: window={"class": "firefox-esr"}, or window={"pid": 12345}.
- A callback function. This will be called with a list of all the windows available. Specifically, it receives a Sequence of (MSS-defined) window objects, each with properties similar to those in the mapping described above. The callback must return whichever window object is desired. This is the most flexible option, but can be cumbersome to use in simple programs. The above options are provided as a convenience for the common cases.
- The window objects deliberately aren’t simple mappings. This is because it may be expensive to collect all the potentially-useful information about all windows, when the user only cares about one or two attributes. Specifying these as objects makes it simpler to lazily populate.
- 💭 We may want to extend the callback API in the future, so should be careful to make it forwards-compatible. There are a few ways to do this; we can discuss if desired.
💭 We should consider the specific convenience selectors we want to make available, and their semantics. For instance, is it clear enough that a bare string is shorthand for a title match? Wherever the docs use a string selector such as "Firefox", we should probably note right there that it’s shorthand for {"title": "Firefox"}.
💭 The convenience selectors (other than native ID) might not be unique: there could be several Firefox windows open. In that event, we need to decide whether we raise an exception, use some sort of predefined tie-breaker (like first created or top in window stack), or what. The user can, of course, choose their own tie-breaker by using a callback function instead of a convenience selector.
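To make the lazily populated window objects and the “matches all the criteria” rule for Mapping selectors more concrete, here is a minimal sketch. The class name WindowInfo, the fetch callable, and the matches helper are all hypothetical; fetch stands in for whatever platform-specific query a backend would really perform, and only the title and pid attributes are modeled.

```python
from __future__ import annotations

import re
from functools import cached_property
from typing import Any, Callable, Mapping


class WindowInfo:
    """Hypothetical window object; each attribute is fetched on first access."""

    def __init__(self, native_id: int, fetch: Callable[[int, str], Any]) -> None:
        self.id = native_id
        self._fetch = fetch  # Stand-in for the platform-specific query.

    @cached_property
    def title(self) -> str:
        return str(self._fetch(self.id, "title"))

    @cached_property
    def pid(self) -> int:
        return int(self._fetch(self.id, "pid"))


def matches(window: WindowInfo, criteria: Mapping[str, Any]) -> bool:
    """Return True only if the window satisfies every entry in the mapping."""
    for key, wanted in criteria.items():
        actual = getattr(window, key)  # Triggers the lazy fetch for this key only.
        if isinstance(wanted, re.Pattern):
            if not wanted.search(str(actual)):
                return False
        elif actual != wanted:
            return False
    return True
```

With this shape, a selector such as {"pid": 12345} never pays for fetching titles, which is the motivation for using objects rather than pre-filled mappings.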
Persistence
The target window is selected, according to the window selector, when the MSS object is created. After that time, the same window will continue to be used for future captures as long as it exists. For instance, if the user’s window selector specifies the window entitled “GitHub - Firefox”, then the MSS object keeps targeting that same OS window, even if its title later changes.
If the targeted window is destroyed, then when grab is next called, what should MSS do? I propose that it searches for a new window, based on the same selector as the user originally passed. This allows the program to continue following an application through common situations, such as switching between fullscreen and windowed modes, where the application may internally destroy and recreate its native window.
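The retargeting behavior proposed here could be sketched roughly as follows. This is only an illustration under stated assumptions: WindowTarget, list_windows, and resolve are hypothetical names, windows are reduced to plain integer IDs, and a real backend would more likely notice the destroyed window from a failed capture than by re-listing windows on every grab.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence


class WindowTarget:
    """Hypothetical helper that keeps following a window across recreation."""

    def __init__(
        self,
        selector: Callable[[Sequence[int]], int],
        list_windows: Callable[[], Sequence[int]],
    ) -> None:
        self._selector = selector
        self._list_windows = list_windows  # Stand-in for the platform query.
        # Select eagerly, as the proposal does when the MSS object is created.
        self._current: Optional[int] = selector(list_windows())

    def resolve(self) -> int:
        """Return the window to capture, re-running the selector if it is gone."""
        windows = self._list_windows()
        if self._current not in windows:
            # Forget the dead window first, so that an exception raised by
            # the selector leaves this object usable for a later retry.
            self._current = None
            self._current = self._selector(windows)
        return self._current
```

The same stored selector is reused for every retarget, so the user never has to rebuild the object when the application recreates its native window.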
Exceptions
If the callback function used as a window specifier raises an exception, then that is propagated upwards.
If this happens on a retargeting grab (one after the initial MSS construction), then the MSS object remains in a valid and usable state. This allows a user to do something like this:
    import time

    import mss

    class WindowNotFound(Exception):
        pass

    def window_selector(windows):
        for window in windows:
            if window.title == "Cyberpunk 2077 (C) 2020 by CD Projekt RED":
                return window
        raise WindowNotFound()

    wait_for_new_window_count = 0
    with mss.MSS(window=window_selector) as sct:
        while True:
            try:
                img = sct.grab()
            except WindowNotFound:
                # Poll for up to one second, to wait for CP2077 to create
                # its new window when switching between full-screen and
                # windowed mode.
                wait_for_new_window_count += 1
                if wait_for_new_window_count > 60:
                    # The game has probably exited.
                    break
                time.sleep(1/60)
                continue
            wait_for_new_window_count = 0  # Reset the poll counter
            process_screenshot(img)  # User-defined processing
💭 Note that this code snippet assumes that the initial call to mss.MSS will successfully find a window. It’s possible to provide ways for the selector to request a refresh of the window list, which could be used both in the initial call, and in updates. I’m not sure if this is worthwhile.
Alternatives (persistence)
We could simply specify that attempting to grab a window that’s been destroyed will raise a ScreenShotError, and the user will need to recreate the MSS object. This seems like some significant extra burden on users, especially those wanting to use a pattern like the CP2077 example above. It would also mean creating and destroying XCB connections in the delay loop. It seems that the persistence and exception features described above aren’t difficult to implement, and allow for the best reuse of resources.
We could also specify ways to retarget an existing MSS object to a new window after the grab fails, rather than having MSS automatically find the new target window. This isn’t a bad feature to provide, but I suspect it requires most of the same internal machinery anyway.
Window changes
I don’t recommend any explicit action from MSS if the targeted window size changes. The ScreenShot object already provides its size, and the user can take appropriate actions on change if needed. Different use cases have different needs; a user who is scaling their screenshots to a particular size (such as for streaming or neural net inputs) won’t care, while a user who is capturing to a video file might.
💭 I think that if a window is minimized (iconified) or unmapped, it should probably be treated the same as if it was destroyed.
Implementation note
The different convenience window selectors can be implemented as simple functions, reducing the bulk of the internal window selection code to just the callback API.
These convenience window selectors can raise a ScreenShotError (I recommend a new dedicated subclass) if they don’t match any windows. As specified above, the normal path will simply propagate this upwards.
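As a minimal sketch of that idea, a title selector can be reduced to an ordinary callback. The names title_selector and WindowNotFoundError are hypothetical, and ScreenShotError is redefined locally only to keep the example self-contained; in MSS itself the subclass would derive from the existing ScreenShotError. Taking the first match is just one possible answer to the open tie-breaker question.

```python
from __future__ import annotations

import re
from typing import Any, Callable, Sequence


class ScreenShotError(Exception):
    """Local stand-in for the MSS exception of the same name."""


class WindowNotFoundError(ScreenShotError):
    """Hypothetical dedicated subclass for a selector that matches nothing."""


def title_selector(wanted: str | re.Pattern[str]) -> Callable[[Sequence[Any]], Any]:
    """Turn a title string or compiled pattern into a callback-style selector."""

    def select(windows: Sequence[Any]) -> Any:
        for window in windows:
            if isinstance(wanted, re.Pattern):
                found = wanted.search(window.title) is not None
            else:
                found = window.title == wanted
            if found:
                # Tie-breaking is undecided; this sketch takes the first match.
                return window
        raise WindowNotFoundError(f"no window matches {wanted!r}")

    return select
```

Under this approach, window="GitHub - Firefox" would simply be rewritten to window=title_selector("GitHub - Firefox") inside the constructor, and everything downstream only ever sees callbacks.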
Alternatives (how to specify the window)
A couple of alternatives seem obvious, but have some subtle disadvantages. Whether these are worthwhile is a judgment call.
- Have the MSS object provide a method that returns a list of windows. However, this requires the MSS object to be constructed first. This means that a constructor using backend="default" would still need to be told if it will be used to target a single window or a full screen.
- Make an MSS class (rather than instance) method that returns a list of windows. The user is expected to choose their desired window and provide its native ID to the MSS constructor. Under Linux/X11, this would require making a separate XCB connection whenever the list of windows is requested. This also means that the user would need to recreate the MSS object each time the window is destroyed and recreated, such as when switching to fullscreen. Doing so may be quite annoying in the typical with/while pattern.
I like the ease-of-use of allowing the user to specify, in a simple way, the window they want to capture. The alternatives seem more cumbersome to use for simple use cases, and provide little additional clarity for advanced use cases.
The only use case I see that benefits from passing the window selector to grab, rather than to the MSS constructor, is one where the desired target window changes frequently. I don’t think this is likely to be a significant use case.
There’s no problem adding a method to the MSS objects that provides a list of windows. It might be useful for some users, such as when populating a window selection UI. But I don’t think it should be the basis of the window selection API.
Capabilities
This isn’t yet relevant, but I’ll mention it here.
If a particular combination of options is requested when the MSS constructor is invoked, and it’s not possible to fulfill these, I suggest that we simply raise an exception. For example, if a user requests single-window capture to the GPU on Windows 8, then MSS should raise an exception.
💭 We might consider an API that can be used to query the different option combinations available on the current platform.
I suggest that we do not try to track windows ourselves and crop a full-screen capture, if single-window capture is requested but not available. We simply raise an exception.
That is because full-screen capture is always the composited image, including overlapping windows. This risks inadvertent information exposure if, for instance, a user is streaming to Twitch and a popup with private information temporarily appears on their screen. In this example, the original developer of the app may not be aware of the risk; they may have tested on a different version of Windows that does support WGC.
Open questions
@halldorfannar pointed out two considerations.
- 💭 Should single-window capture grab the entire window, including borders, or just the client area? The OS API defaults vary. I suggest that we eventually add this as another option to the MSS constructor. For the initial implementation, we leave the option out and document the behavior as explicitly platform-specific.
- 💭 Should we allow the user to specify crop offsets from the outside, instead of having to give an explicit width and height with the crop rectangle (which would require already knowing the window size)? In the conversation context, this was about allowing the user to crop out system borders. Note, however, that using GetDCEx instead of GetWindowDC would provide a cleaner route, if the option described in the previous question is adopted. That said, there may be other reasons a user would want to crop out other borders, such as parts of the application’s UI.
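To illustrate the “crop offsets from the outside” idea, here is a small sketch of the arithmetic involved. The function name crop_from_insets and its parameters are hypothetical; the returned dict uses left/top/width/height keys only because that mirrors the rectangles MSS already works with. Whether this conversion lives in MSS or in user code is exactly the open question.

```python
from __future__ import annotations


def crop_from_insets(
    width: int,
    height: int,
    left: int = 0,
    top: int = 0,
    right: int = 0,
    bottom: int = 0,
) -> dict[str, int]:
    """Convert inward offsets from each window edge into a crop rectangle."""
    crop_width = width - left - right
    crop_height = height - top - bottom
    if min(left, top, right, bottom) < 0 or crop_width <= 0 or crop_height <= 0:
        raise ValueError("insets do not leave a visible area")
    return {"left": left, "top": top, "width": crop_width, "height": crop_height}
```

For an 800x600 window with a 30-pixel title bar and 1-pixel borders on the other sides, crop_from_insets(800, 600, left=1, top=30, right=1, bottom=1) gives a 798x569 rectangle at offset (1, 30). Because the insets are relative to the edges, the same call stays valid if the window is later resized.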