Skip to content

thread_config test - add visibility and some defensive improvements.#12916

Open
brbzull0 wants to merge 1 commit intoapache:masterfrom
brbzull0:thread_config_test_stablity
Open

thread_config test - add visibility and some defensive improvements.#12916
brbzull0 wants to merge 1 commit intoapache:masterfrom
brbzull0:thread_config_test_stablity

Conversation

@brbzull0
Copy link
Contributor

With AI help:

Improve resilience and diagnostics in thread_config test

Problem

The thread_config autest intermittently fails on CI with exit code 1
([TS_MAIN] process not found) even though ATS logs show it reached
"fully initialized". The script produces no diagnostic output on
failure, making CI failures impossible to investigate.

Analysis

The CI failure is likely caused by ATS exiting under resource pressure
(12 ATS instances running, one with 100 threads). The check_threads.py
script also has defensive gaps:

  1. Race between process_iter and method calls: The script passes
    attrs=['name', 'cwd', 'threads'] to psutil.process_iter() (which
    pre-fetches data into p.info), but then calls p.name() and p.cwd()
    which make fresh system calls — creating a window where the process
    can exit between iteration and inspection.

  2. Unhandled psutil.NoSuchProcess: psutil.Process(t.id).name()
    can throw if a thread exits between p.threads() and the per-thread
    lookup. With 100+ threads, this is not unlikely.

  3. No diagnostic output: When the process isn't found, the script
    silently returns exit code 1 with nothing on stderr.

Changes

  • Use cached p.info dict instead of p.name()/p.cwd() method calls.
  • Handle psutil.NoSuchProcess and psutil.AccessDenied in process
    iteration, thread listing, and per-thread name lookup.
  • Add retry logic (3 attempts, 2s delay) only for the "process not
    found" case (exit code 1). Thread count mismatches (exit codes 2–11)
    return immediately and are never masked.
  • Print diagnostic output on failure: the searched path and any
    [TS_MAIN] processes found with their cwds.

Limitations

This does not address the root cause of the intermittent CI failure,
which is likely ATS exiting under resource pressure. These changes improve
resilience to transient issues and ensure future failures produce
actionable diagnostic output for further investigation.

The thread_config test intermittently fails on CI with exit code 1
(no matching [TS_MAIN] process found) despite ATS being fully
initialized. The script produced no diagnostic output, making CI
failures impossible to debug.

Root causes:
- p.name() and p.cwd() make fresh system calls even though
  process_iter already pre-fetches attributes into p.info, creating
  a race window where the process can exit between iteration and
  inspection.
- psutil.Process(t.id).name() can throw NoSuchProcess if a thread
  exits between p.threads() and the per-thread lookup.
- On loaded CI machines, transient process visibility issues cause
  the single-attempt scan to miss the ATS process.

Changes:
- Use cached p.info dict instead of p.name()/p.cwd() method calls
  to eliminate the iteration-vs-inspection race.
- Handle psutil.NoSuchProcess and psutil.AccessDenied exceptions
  in process iteration, thread listing, and per-thread name lookup.
- Add retry logic (3 attempts, 2s delay) only for the "process not
  found" case (exit code 1). Real failures (thread count mismatches,
  exit codes 2-11) return immediately and are never masked.
- Print diagnostic output on failure: the searched path and any
  [TS_MAIN] processes found with their cwds.
@brbzull0 brbzull0 self-assigned this Feb 26, 2026
@brbzull0 brbzull0 added this to the 10.2.0 milestone Feb 26, 2026
@brbzull0 brbzull0 marked this pull request as ready for review February 26, 2026 14:08
@brbzull0 brbzull0 changed the title thread_config test - add visibility with some defensive improvements. thread_config test - add visibility and some defensive improvements. Feb 26, 2026
@bryancall bryancall requested a review from Copilot March 2, 2026 22:54
@bryancall bryancall modified the milestones: 10.2.0, 11.0.0 Mar 2, 2026
@bryancall bryancall requested a review from cmcfarlen March 2, 2026 22:55
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the resilience and debuggability of the thread_config autest by hardening the check_threads.py helper script against transient CI failures. ATS is process-scanned via psutil to verify thread counts, and the prior code had TOCTOU races and no diagnostic output when the target process was not found.

Changes:

  • Use cached p.info dict values from psutil.process_iter() instead of live p.name()/p.cwd() method calls to reduce TOCTOU race window.
  • Add psutil.NoSuchProcess/psutil.AccessDenied handling in thread listing and per-thread name lookup.
  • Add retry logic (3 attempts, 2s delay) for the "process not found" case, plus diagnostic stderr output when the process cannot be found.

Comment on lines +32 to +36
try:
proc_name = p.info.get('name', '')
proc_cwd = p.info.get('cwd', '')
except (psutil.NoSuchProcess, psutil.AccessDenied):
continue
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try/except (psutil.NoSuchProcess, psutil.AccessDenied) block wrapping p.info.get() at lines 32–36 is incorrect. p.info is an ordinary Python dict; calling .get() on it never raises psutil exceptions. When process_iter can't fetch an attribute (e.g., cwd is denied), it stores the exception object as the dict value rather than raising it. As a result:

  1. The except clause never fires, giving a false sense of protection.
  2. proc_cwd may silently be an exception object instead of a string, which causes the == ts_path comparison to silently fail — harmless but misleading.

The correct approach here is to check whether the value stored in p.info is an exception instance before using it, e.g., proc_cwd = p.info.get('cwd', '') if not isinstance(p.info.get('cwd'), Exception) else ''. Alternatively, for modern psutil versions (≥6), attribute errors are represented differently, so consult the psutil docs for the version in use. The try/except block can simply be removed since it provides no protection.

Copilot uses AI. Check for mistakes.
try:
thread_name = psutil.Process(t.id).name()
except (psutil.NoSuchProcess, psutil.AccessDenied):
continue
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a thread exits between p.threads() and the per-thread psutil.Process(t.id).name() call, the thread is silently skipped via continue at line 61. This means that thread is not added to the relevant check set (e.g., etnet_check, task_check). When the final size checks run (lines 110–121), the set will have fewer entries than expected, triggering a "wrong count" error with exit codes 4, 6, 9, or 11.

The retry logic in main() only retries on exit code 1 ("process not found"). Exit codes 4, 6, 9, or 11 are treated as definitive failures and not retried. This creates a scenario where a transient TOCTOU race causes a non-retried definitive failure, which is worse than the original behavior.

One option is to also retry on these exit codes, or alternatively, to count the number of skipped threads and fall back to exit code 1 (to trigger a retry) if any threads were skipped.

Suggested change
continue
sys.stderr.write(f'Thread {t.id} disappeared while reading name.\n')
return 1

Copilot uses AI. Check for mistakes.
threads = p.threads()
except (psutil.NoSuchProcess, psutil.AccessDenied):
sys.stderr.write(f'Process {p.pid} disappeared while reading threads.\n')
return 1
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a process disappears while reading threads (line 52), count_threads returns 1, and the retry logic in main() will retry because result == 1. However, the diagnostic message printed on line 51 only describes the disappearance — the retry message printed on line 161 then says "process not found, retrying in 2s..." which is misleading (the process was found but then disappeared). On a subsequent attempt, ATS has likely fully exited, so the diagnostic output will report "No [TS_MAIN] processes found at all." rather than explaining that the process disappeared mid-read.

Consider distinguishing between the "process not found" case (return 1) and "process disappeared mid-inspection" case with a different return code so that the retry message and diagnostics are accurate.

Suggested change
return 1
return 12

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@cmcfarlen cmcfarlen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you address or resolve the copilot comments?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants