55 changes: 49 additions & 6 deletions tests/gold_tests/thread_config/check_threads.py
@@ -20,26 +20,45 @@
import psutil
import argparse
import sys
import time


def count_threads(ts_path, etnet_threads, accept_threads, task_threads, aio_threads):

for p in psutil.process_iter(['name', 'cwd', 'threads']):

# Use cached info from process_iter attrs to avoid race conditions
# where the process exits between iteration and inspection.
try:
proc_name = p.info.get('name', '')
proc_cwd = p.info.get('cwd', '')
except (psutil.NoSuchProcess, psutil.AccessDenied):
continue
Comment on lines +32 to +36
Copilot AI Mar 2, 2026

The try/except (psutil.NoSuchProcess, psutil.AccessDenied) block wrapping p.info.get() at lines 32–36 has no effect. p.info is an ordinary Python dict; calling .get() on it never raises psutil exceptions. When process_iter can't fetch an attribute (e.g., cwd is denied), it does not raise; it stores the ad_value placeholder (None by default) as the dict value. As a result:

  1. The except clause never fires, giving a false sense of protection.
  2. proc_cwd may be None instead of a string, because the '' default passed to .get() applies only when the key is missing. The == ts_path comparison then evaluates to False, which is harmless but misleading.

The try/except block can simply be removed since it provides no protection. If a string is required, normalize the value explicitly, e.g., proc_cwd = p.info.get('cwd') or ''.
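
A minimal sketch of the normalization described in this comment. The helpers `info_str` and `is_ts_main` are hypothetical names, not part of the PR, and the dicts below are hand-built stand-ins for `p.info`. The helper treats any non-string value as unreadable, so it does not depend on which placeholder is stored for a denied attribute.

```python
def info_str(info, key):
    """Return info[key] if it is a string, else ''.

    Hypothetical helper: covers both a missing key and a non-string
    placeholder stored when the attribute could not be read.
    """
    value = info.get(key)
    return value if isinstance(value, str) else ''


def is_ts_main(info, ts_path):
    """True if the cached info dict describes the ATS process under ts_path."""
    return info_str(info, 'name') == '[TS_MAIN]' and info_str(info, 'cwd') == ts_path


# Hand-built dicts standing in for p.info; '/opt/ts' is a made-up path.
denied = {'name': '[TS_MAIN]', 'cwd': None}      # cwd could not be read
match = {'name': '[TS_MAIN]', 'cwd': '/opt/ts'}
results = [is_ts_main(denied, '/opt/ts'), is_ts_main(match, '/opt/ts')]
```

In count_threads this would presumably be called as `is_ts_main(p.info, ts_path)` inside the process_iter loop, with no try/except around it.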

# Find the pid corresponding to the ats process we started in autest.
# It needs to match the process name and the binary path.
# If autest can expose the pid of the process this is not needed anymore.
if p.name() == '[TS_MAIN]' and p.cwd() == ts_path:
if proc_name == '[TS_MAIN]' and proc_cwd == ts_path:

etnet_check = set()
accept_check = set()
task_check = set()
aio_check = set()

for t in p.threads():
try:
threads = p.threads()
except (psutil.NoSuchProcess, psutil.AccessDenied):
sys.stderr.write(f'Process {p.pid} disappeared while reading threads.\n')
return 1
Copilot AI Mar 2, 2026

When a process disappears while reading threads (line 52), count_threads returns 1, and the retry logic in main() will retry because result == 1. However, the diagnostic message printed on line 51 only describes the disappearance — the retry message printed on line 161 then says "process not found, retrying in 2s..." which is misleading (the process was found but then disappeared). On a subsequent attempt, ATS has likely fully exited, so the diagnostic output will report "No [TS_MAIN] processes found at all." rather than explaining that the process disappeared mid-read.

Consider distinguishing between the "process not found" case (return 1) and "process disappeared mid-inspection" case with a different return code so that the retry message and diagnostics are accurate.

Suggested change
- return 1
+ return 12
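
One way the retry loop in main() could report the two cases separately, as a sketch only. It assumes both codes should be retried, which the comment implies but does not state. `run_with_retries`, `RETRY_REASONS`, and the `sleep`/`log` parameters are hypothetical names introduced for illustration; code 12 is taken from the suggested change.

```python
import sys
import time

NOT_FOUND = 1      # existing code: no matching [TS_MAIN] process
DISAPPEARED = 12   # code proposed in the suggested change

# Assumption: both conditions are transient and worth retrying.
RETRY_REASONS = {
    NOT_FOUND: 'process not found',
    DISAPPEARED: 'process disappeared during inspection',
}


def run_with_retries(check, max_attempts=3, delay=2.0,
                     sleep=time.sleep, log=sys.stderr.write):
    """Call check() until it returns a non-retryable code or attempts run out."""
    result = NOT_FOUND
    for attempt in range(max_attempts):
        result = check()
        if result not in RETRY_REASONS:
            break
        if attempt < max_attempts - 1:
            log(f'Attempt {attempt + 1}/{max_attempts}: '
                f'{RETRY_REASONS[result]}, retrying in {delay:g}s...\n')
            sleep(delay)
    return result
```

Passing `sleep` and `log` as parameters is a design choice for testability; main() could call `run_with_retries(lambda: count_threads(...))` and pass the result to exit().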

for t in threads:

# Get the name of the thread.
thread_name = psutil.Process(t.id).name()
# Get the name of the thread. The thread may have exited
# between p.threads() and this call, so handle that.
try:
thread_name = psutil.Process(t.id).name()
except (psutil.NoSuchProcess, psutil.AccessDenied):
continue
Copilot AI Mar 2, 2026

When a thread exits between p.threads() and the per-thread psutil.Process(t.id).name() call, the thread is silently skipped via continue at line 61. This means that thread is not added to the relevant check set (e.g., etnet_check, task_check). When the final size checks run (lines 110–121), the set will have fewer entries than expected, triggering a "wrong count" error with exit codes 4, 6, 9, or 11.

The retry logic in main() only retries on exit code 1 ("process not found"). Exit codes 4, 6, 9, or 11 are treated as definitive failures and not retried. This creates a scenario where a transient TOCTOU race causes a non-retried definitive failure, which is worse than the original behavior.

One option is to also retry on these exit codes, or alternatively, to count the number of skipped threads and fall back to exit code 1 (to trigger a retry) if any threads were skipped.

Suggested change
- continue
+ sys.stderr.write(f'Thread {t.id} disappeared while reading name.\n')
+ return 1
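
The other option mentioned in this comment, counting skipped threads and falling back to exit code 1 only when the count is wrong, might look like the sketch below. `collect_thread_names` and `result_code` are hypothetical helpers, and the name lookup and exception types are passed in as parameters so the sketch runs without psutil. In the real script the lookup would presumably be `lambda tid: psutil.Process(tid).name()` with `(psutil.NoSuchProcess, psutil.AccessDenied)` as the transient errors.

```python
def collect_thread_names(thread_ids, get_name, transient_errors):
    """Return (names, skipped): names read successfully and the number of failed lookups."""
    names = []
    skipped = 0
    for tid in thread_ids:
        try:
            names.append(get_name(tid))
        except transient_errors:
            # The thread exited between listing and lookup; remember it.
            skipped += 1
    return names, skipped


def result_code(count_ok, skipped, wrong_count_code):
    """Map the outcome to an exit code; 1 asks main() to retry."""
    if count_ok:
        return 0
    # A wrong count is only definitive when no lookup was skipped.
    return 1 if skipped else wrong_count_code


# Stand-in lookup: thread 2 has "exited"; LookupError replaces the psutil errors.
def fake_name(tid):
    if tid == 2:
        raise LookupError(tid)
    return f'[ET_NET {tid}]'


names, skipped = collect_thread_names([1, 2, 3], fake_name, (LookupError,))
```

Compared with returning 1 on the first vanished thread, this still passes when the skipped thread did not belong to any counted category.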

if thread_name.startswith('[ET_NET'):

@@ -103,7 +122,20 @@ def count_threads(ts_path, etnet_threads, accept_threads, task_threads, aio_thre
else:
return 0

# Return 1 if no pid is found to match the ats process.
# No matching process found. Print diagnostic info to help debug CI failures.
ts_main_procs = []
for p in psutil.process_iter(['name', 'cwd']):
try:
if p.info.get('name') == '[TS_MAIN]':
ts_main_procs.append(f' pid={p.pid} cwd={p.info.get("cwd")}')
except (psutil.NoSuchProcess, psutil.AccessDenied):
pass

sys.stderr.write(f'No [TS_MAIN] process found with cwd={ts_path}.\n')
if ts_main_procs:
sys.stderr.write('Found [TS_MAIN] processes:\n' + '\n'.join(ts_main_procs) + '\n')
else:
sys.stderr.write('No [TS_MAIN] processes found at all.\n')
return 1


@@ -118,7 +150,18 @@ def main():
'-t', '--task-threads', type=int, dest='task_threads', help='expected number of TASK threads', required=True)
parser.add_argument('-c', '--aio-threads', type=int, dest='aio_threads', help='expected number of AIO threads', required=True)
args = parser.parse_args()
exit(count_threads(args.ts_path, args.etnet_threads, args.accept_threads, args.task_threads, args.aio_threads))

max_attempts = 3
result = 1
for attempt in range(max_attempts):
result = count_threads(args.ts_path, args.etnet_threads, args.accept_threads, args.task_threads, args.aio_threads)
if result != 1: # Only retry when process not found (exit code 1).
break
if attempt < max_attempts - 1:
sys.stderr.write(f'Attempt {attempt + 1}/{max_attempts}: process not found, retrying in 2s...\n')
time.sleep(2)

exit(result)


if __name__ == '__main__':