`thread_config` test - add visibility and some defensive improvements. by brbzull0 · Pull Request #12916 · apache/trafficserver

brbzull0 · 2026-02-26T13:41:06Z

With AI help:

Improve resilience and diagnostics in thread_config test

Problem

The thread_config autest intermittently fails on CI with exit code 1
([TS_MAIN] process not found) even though ATS logs show it reached
"fully initialized". The script produces no diagnostic output on
failure, making CI failures impossible to investigate.

Analysis

The CI failure is likely caused by ATS exiting under resource pressure
(12 ATS instances running, one with 100 threads). The check_threads.py
script also has defensive gaps:

Race between process_iter and method calls: The script passes
attrs=['name', 'cwd', 'threads'] to psutil.process_iter() (which
pre-fetches data into p.info), but then calls p.name() and p.cwd()
which make fresh system calls — creating a window where the process
can exit between iteration and inspection.
Unhandled psutil.NoSuchProcess: psutil.Process(t.id).name()
can throw if a thread exits between p.threads() and the per-thread
lookup. With 100+ threads, this is not unlikely.
No diagnostic output: When the process isn't found, the script
silently returns exit code 1 with nothing on stderr.

Changes

Use cached p.info dict instead of p.name()/p.cwd() method calls.
Handle psutil.NoSuchProcess and psutil.AccessDenied in process
iteration, thread listing, and per-thread name lookup.
Add retry logic (3 attempts, 2s delay) only for the "process not
found" case (exit code 1). Thread count mismatches (exit codes 2–11)
return immediately and are never masked.
Print diagnostic output on failure: the searched path and any
[TS_MAIN] processes found with their cwds.

Limitations

This does not address the root cause of the intermittent CI failure,
which is likely ATS exiting under resource pressure. These changes improve
resilience to transient issues and ensure future failures produce
actionable diagnostic output for further investigation.

The thread_config test intermittently fails on CI with exit code 1 (no matching [TS_MAIN] process found) despite ATS being fully initialized. The script produced no diagnostic output, making CI failures impossible to debug. Root causes: - p.name() and p.cwd() make fresh system calls even though process_iter already pre-fetches attributes into p.info, creating a race window where the process can exit between iteration and inspection. - psutil.Process(t.id).name() can throw NoSuchProcess if a thread exits between p.threads() and the per-thread lookup. - On loaded CI machines, transient process visibility issues cause the single-attempt scan to miss the ATS process. Changes: - Use cached p.info dict instead of p.name()/p.cwd() method calls to eliminate the iteration-vs-inspection race. - Handle psutil.NoSuchProcess and psutil.AccessDenied exceptions in process iteration, thread listing, and per-thread name lookup. - Add retry logic (3 attempts, 2s delay) only for the "process not found" case (exit code 1). Real failures (thread count mismatches, exit codes 2-11) return immediately and are never masked. - Print diagnostic output on failure: the searched path and any [TS_MAIN] processes found with their cwds.

Copilot

Pull request overview

This PR improves the resilience and debuggability of the thread_config autest by hardening the check_threads.py helper script against transient CI failures. ATS is process-scanned via psutil to verify thread counts, and the prior code had TOCTOU races and no diagnostic output when the target process was not found.

Changes:

Use cached p.info dict values from psutil.process_iter() instead of live p.name()/p.cwd() method calls to reduce TOCTOU race window.
Add psutil.NoSuchProcess/psutil.AccessDenied handling in thread listing and per-thread name lookup.
Add retry logic (3 attempts, 2s delay) for the "process not found" case, plus diagnostic stderr output when the process cannot be found.

Copilot · 2026-03-02T22:56:50Z

tests/gold_tests/thread_config/check_threads.py

+        try:
+            proc_name = p.info.get('name', '')
+            proc_cwd = p.info.get('cwd', '')
+        except (psutil.NoSuchProcess, psutil.AccessDenied):
+            continue


The try/except (psutil.NoSuchProcess, psutil.AccessDenied) block wrapping p.info.get() at lines 32–36 is incorrect. p.info is an ordinary Python dict; calling .get() on it never raises psutil exceptions. When process_iter can't fetch an attribute (e.g., cwd is denied), it stores the exception object as the dict value rather than raising it. As a result:

The except clause never fires, giving a false sense of protection.

proc_cwd may silently be an exception object instead of a string, which causes the == ts_path comparison to silently fail — harmless but misleading.

The correct approach here is to check whether the value stored in p.info is an exception instance before using it, e.g., proc_cwd = p.info.get('cwd', '') if not isinstance(p.info.get('cwd'), Exception) else ''. Alternatively, for modern psutil versions (≥6), attribute errors are represented differently, so consult the psutil docs for the version in use. The try/except block can simply be removed since it provides no protection.

Copilot · 2026-03-02T22:56:50Z

tests/gold_tests/thread_config/check_threads.py

+                try:
+                    thread_name = psutil.Process(t.id).name()
+                except (psutil.NoSuchProcess, psutil.AccessDenied):
+                    continue


When a thread exits between p.threads() and the per-thread psutil.Process(t.id).name() call, the thread is silently skipped via continue at line 61. This means that thread is not added to the relevant check set (e.g., etnet_check, task_check). When the final size checks run (lines 110–121), the set will have fewer entries than expected, triggering a "wrong count" error with exit codes 4, 6, 9, or 11.

The retry logic in main() only retries on exit code 1 ("process not found"). Exit codes 4, 6, 9, or 11 are treated as definitive failures and not retried. This creates a scenario where a transient TOCTOU race causes a non-retried definitive failure, which is worse than the original behavior.

One option is to also retry on these exit codes, or alternatively, to count the number of skipped threads and fall back to exit code 1 (to trigger a retry) if any threads were skipped.

Suggested change

continue

sys.stderr.write(f'Thread {t.id} disappeared while reading name.\n')

return 1

Copilot · 2026-03-02T22:56:50Z

tests/gold_tests/thread_config/check_threads.py

+                threads = p.threads()
+            except (psutil.NoSuchProcess, psutil.AccessDenied):
+                sys.stderr.write(f'Process {p.pid} disappeared while reading threads.\n')
+                return 1


When a process disappears while reading threads (line 52), count_threads returns 1, and the retry logic in main() will retry because result == 1. However, the diagnostic message printed on line 51 only describes the disappearance — the retry message printed on line 161 then says "process not found, retrying in 2s..." which is misleading (the process was found but then disappeared). On a subsequent attempt, ATS has likely fully exited, so the diagnostic output will report "No [TS_MAIN] processes found at all." rather than explaining that the process disappeared mid-read.

Consider distinguishing between the "process not found" case (return 1) and "process disappeared mid-inspection" case with a different return code so that the retry message and diagnostics are accurate.

Suggested change

return 1

return 12

cmcfarlen

Can you address or resolve the copilot comments?

brbzull0 self-assigned this Feb 26, 2026

brbzull0 added Configuration Tests AuTest labels Feb 26, 2026

brbzull0 added this to the 10.2.0 milestone Feb 26, 2026

brbzull0 marked this pull request as ready for review February 26, 2026 14:08

brbzull0 changed the title ~~thread_config test - add visibility with some defensive improvements.~~ thread_config test - add visibility and some defensive improvements. Feb 26, 2026

bryancall requested a review from Copilot March 2, 2026 22:54

Copilot started reviewing on behalf of bryancall March 2, 2026 22:55 View session

bryancall modified the milestones: 10.2.0, 11.0.0 Mar 2, 2026

bryancall requested a review from cmcfarlen March 2, 2026 22:55

Copilot AI reviewed Mar 2, 2026

View reviewed changes

cmcfarlen reviewed Mar 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`thread_config` test - add visibility and some defensive improvements.#12916

`thread_config` test - add visibility and some defensive improvements.#12916
brbzull0 wants to merge 1 commit intoapache:masterfrom
brbzull0:thread_config_test_stablity

brbzull0 commented Feb 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 2, 2026

Uh oh!

Copilot AI Mar 2, 2026

Uh oh!

Copilot AI Mar 2, 2026

Uh oh!

cmcfarlen left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	continue
	sys.stderr.write(f'Thread {t.id} disappeared while reading name.\n')
	return 1

Conversation

brbzull0 commented Feb 26, 2026

Improve resilience and diagnostics in thread_config test

Problem

Analysis

Changes

Limitations

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

cmcfarlen left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants