thread_config test - add visibility and some defensive improvements.#12916
thread_config test - add visibility and some defensive improvements.#12916brbzull0 wants to merge 1 commit intoapache:masterfrom
thread_config test - add visibility and some defensive improvements.#12916Conversation
The thread_config test intermittently fails on CI with exit code 1 (no matching [TS_MAIN] process found) despite ATS being fully initialized. The script produced no diagnostic output, making CI failures impossible to debug. Root causes: - p.name() and p.cwd() make fresh system calls even though process_iter already pre-fetches attributes into p.info, creating a race window where the process can exit between iteration and inspection. - psutil.Process(t.id).name() can throw NoSuchProcess if a thread exits between p.threads() and the per-thread lookup. - On loaded CI machines, transient process visibility issues cause the single-attempt scan to miss the ATS process. Changes: - Use cached p.info dict instead of p.name()/p.cwd() method calls to eliminate the iteration-vs-inspection race. - Handle psutil.NoSuchProcess and psutil.AccessDenied exceptions in process iteration, thread listing, and per-thread name lookup. - Add retry logic (3 attempts, 2s delay) only for the "process not found" case (exit code 1). Real failures (thread count mismatches, exit codes 2-11) return immediately and are never masked. - Print diagnostic output on failure: the searched path and any [TS_MAIN] processes found with their cwds.
thread_config test - add visibility with some defensive improvements.thread_config test - add visibility and some defensive improvements.
There was a problem hiding this comment.
Pull request overview
This PR improves the resilience and debuggability of the thread_config autest by hardening the check_threads.py helper script against transient CI failures. ATS is process-scanned via psutil to verify thread counts, and the prior code had TOCTOU races and no diagnostic output when the target process was not found.
Changes:
- Use cached
p.infodict values frompsutil.process_iter()instead of livep.name()/p.cwd()method calls to reduce TOCTOU race window. - Add
psutil.NoSuchProcess/psutil.AccessDeniedhandling in thread listing and per-thread name lookup. - Add retry logic (3 attempts, 2s delay) for the "process not found" case, plus diagnostic stderr output when the process cannot be found.
| try: | ||
| proc_name = p.info.get('name', '') | ||
| proc_cwd = p.info.get('cwd', '') | ||
| except (psutil.NoSuchProcess, psutil.AccessDenied): | ||
| continue |
There was a problem hiding this comment.
The try/except (psutil.NoSuchProcess, psutil.AccessDenied) block wrapping p.info.get() at lines 32–36 is incorrect. p.info is an ordinary Python dict; calling .get() on it never raises psutil exceptions. When process_iter can't fetch an attribute (e.g., cwd is denied), it stores the exception object as the dict value rather than raising it. As a result:
- The
exceptclause never fires, giving a false sense of protection. proc_cwdmay silently be an exception object instead of a string, which causes the== ts_pathcomparison to silently fail — harmless but misleading.
The correct approach here is to check whether the value stored in p.info is an exception instance before using it, e.g., proc_cwd = p.info.get('cwd', '') if not isinstance(p.info.get('cwd'), Exception) else ''. Alternatively, for modern psutil versions (≥6), attribute errors are represented differently, so consult the psutil docs for the version in use. The try/except block can simply be removed since it provides no protection.
| try: | ||
| thread_name = psutil.Process(t.id).name() | ||
| except (psutil.NoSuchProcess, psutil.AccessDenied): | ||
| continue |
There was a problem hiding this comment.
When a thread exits between p.threads() and the per-thread psutil.Process(t.id).name() call, the thread is silently skipped via continue at line 61. This means that thread is not added to the relevant check set (e.g., etnet_check, task_check). When the final size checks run (lines 110–121), the set will have fewer entries than expected, triggering a "wrong count" error with exit codes 4, 6, 9, or 11.
The retry logic in main() only retries on exit code 1 ("process not found"). Exit codes 4, 6, 9, or 11 are treated as definitive failures and not retried. This creates a scenario where a transient TOCTOU race causes a non-retried definitive failure, which is worse than the original behavior.
One option is to also retry on these exit codes, or alternatively, to count the number of skipped threads and fall back to exit code 1 (to trigger a retry) if any threads were skipped.
| continue | |
| sys.stderr.write(f'Thread {t.id} disappeared while reading name.\n') | |
| return 1 |
| threads = p.threads() | ||
| except (psutil.NoSuchProcess, psutil.AccessDenied): | ||
| sys.stderr.write(f'Process {p.pid} disappeared while reading threads.\n') | ||
| return 1 |
There was a problem hiding this comment.
When a process disappears while reading threads (line 52), count_threads returns 1, and the retry logic in main() will retry because result == 1. However, the diagnostic message printed on line 51 only describes the disappearance — the retry message printed on line 161 then says "process not found, retrying in 2s..." which is misleading (the process was found but then disappeared). On a subsequent attempt, ATS has likely fully exited, so the diagnostic output will report "No [TS_MAIN] processes found at all." rather than explaining that the process disappeared mid-read.
Consider distinguishing between the "process not found" case (return 1) and "process disappeared mid-inspection" case with a different return code so that the retry message and diagnostics are accurate.
| return 1 | |
| return 12 |
cmcfarlen
left a comment
There was a problem hiding this comment.
Can you address or resolve the copilot comments?
With AI help:
Improve resilience and diagnostics in thread_config test
Problem
The
thread_configautest intermittently fails on CI with exit code 1(
[TS_MAIN]process not found) even though ATS logs show it reached"fully initialized". The script produces no diagnostic output on
failure, making CI failures impossible to investigate.
Analysis
The CI failure is likely caused by ATS exiting under resource pressure
(12 ATS instances running, one with 100 threads). The
check_threads.pyscript also has defensive gaps:
Race between
process_iterand method calls: The script passesattrs=['name', 'cwd', 'threads']topsutil.process_iter()(whichpre-fetches data into
p.info), but then callsp.name()andp.cwd()which make fresh system calls — creating a window where the process
can exit between iteration and inspection.
Unhandled
psutil.NoSuchProcess:psutil.Process(t.id).name()can throw if a thread exits between
p.threads()and the per-threadlookup. With 100+ threads, this is not unlikely.
No diagnostic output: When the process isn't found, the script
silently returns exit code 1 with nothing on stderr.
Changes
p.infodict instead ofp.name()/p.cwd()method calls.psutil.NoSuchProcessandpsutil.AccessDeniedin processiteration, thread listing, and per-thread name lookup.
found" case (exit code 1). Thread count mismatches (exit codes 2–11)
return immediately and are never masked.
[TS_MAIN]processes found with their cwds.Limitations
This does not address the root cause of the intermittent CI failure,
which is likely ATS exiting under resource pressure. These changes improve
resilience to transient issues and ensure future failures produce
actionable diagnostic output for further investigation.