Skip to content

Modify URM Test Runner#459

Merged
smuppand merged 1 commit into
qualcomm-linux:mainfrom
kartnema:improve-urm-test-runner
May 18, 2026
Merged

Modify URM Test Runner#459
smuppand merged 1 commit into
qualcomm-linux:mainfrom
kartnema:improve-urm-test-runner

Conversation

@kartnema
Copy link
Copy Markdown
Contributor

@kartnema kartnema commented May 14, 2026

  • Add support to delete any stale lock files created in previous runs
  • Remove log-based result mapping (parse_and_score_log) and solely rely on the return code (0 or 1). The URM Test Runner already handles the return code, which the script can rely on. Judging pass / fail based on the logs results in some false positives.
  • Modify test skip logic, a suite should only be skipped if it is not in the approved list, of if the required configs are not found, or if the test executable itself is absent. Skipping of one or more tests part of the testsuite, should not result in an overall SKIP qualification, it should be either PASS OR FAIL.

@kartnema kartnema force-pushed the improve-urm-test-runner branch from 5717167 to fc0863e Compare May 14, 2026 07:44
# ---------- Lock (avoid concurrent runs on same host) ----------
LOCKFILE="/tmp/${TESTNAME}.lock"
# Delete stale lock file, if it exists
rm "${LOCKFILE}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not remove the lock file before flock. With flock, stale lock files are harmless because the lock is tied to the open file descriptor, not just the path. Keep the file and let flock -n decide whether another run is active. If stale directory cleanup is required for the no-flock fallback, handle only the directory fallback and only when it is provably stale. Do not remove the flock file before acquiring the lock.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that even while attempting fresh runs (i.e. flashing a new image and then running the tests through Axiom) we are running into the error that: "another userspace resource manager run" is active. Which should not be happening, since this is the first run.

Also, as mentioned, each time a CI/CD job is triggered a new image (from qli-artifacts) will be flashed on the device (and we won't be using this device directly manually), hence I do not see much use for this lock-protection code in general.

Copy link
Copy Markdown

@varunsinghal29 varunsinghal29 May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smuppand even we tried this as well.

if command -v flock >/dev/null 2>&1; then
exec 9>"$LOCKFILE"
if ! flock -n 9; then
log_warn "Another $ run is active; skipping"
echo "$TESTNAME SKIP" >"$RES_FILE"
exit 0
fi
trap 'exec 9>&-' EXIT INT TERM [use to release the lock]

still not working.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smuppand even we tried this as well.

if command -v flock >/dev/null 2>&1; then exec 9>"$LOCKFILE" if ! flock -n 9; then log_warn "Another $ run is active; skipping" echo "$TESTNAME SKIP" >"$RES_FILE" exit 0 fi trap 'exec 9>&-' EXIT INT TERM [use to release the lock]

still not working.

I agree the symptom is valid: on a fresh flashed image / first Axiom run, we should not get “another userspace resource manager run is active”.

But I do not think removing the flock file before acquiring the lock is safe. With flock, the lock is tied to the open file/inode. If another process is actively holding a lock and we remove the path, a second process can recreate the path and acquire a lock on a different inode, allowing concurrent URM runs.

For the flock path, stale lock files are harmless and should not be removed before locking.

A safer approach would be:

  1. Use flock normally without deleting the lock file.
  2. For the no-flock fallback, use a separate lock directory, for example:
    /tmp/userspace-resource-manager.lockdir
  3. If we really need stale cleanup, clean only the fallback lock directory and only when it is provably stale, for example when no matching URM runner process is active.

So I would suggest:

  • remove rm "$LOCKFILE" from the flock path
  • add debug when flock fails, such as ps output for URM/run.sh processes
  • optionally use a separate fallback lockdir for the mkdir path

This keeps concurrency protection intact while still helping debug the false “active run” case.

LOCKFILE="/tmp/${TESTNAME}.lock"
LOCKDIR="/tmp/${TESTNAME}.lockdir"
 
if command -v flock >/dev/null 2>&1; then
  exec 9>"$LOCKFILE"
  if ! flock -n 9; then
    log_warn "Another ${TESTNAME} run is active; skipping"
    log_info "Active URM-related processes:"
    ps 2>/dev/null | grep -E 'userspace-resource-manager|Urm(Component|Integration)Tests|[run.sh](http://run.sh/)' | grep -v grep || true
    echo "$TESTNAME SKIP" >"$RES_FILE"
    exit 0
  fi
else
  if ! mkdir "$LOCKDIR" 2>/dev/null; then
    log_warn "Another ${TESTNAME} run is active or stale fallback lockdir exists: $LOCKDIR"
    echo "$TESTNAME SKIP" >"$RES_FILE"
    exit 0
  fi
  trap 'rmdir "$LOCKDIR" 2>/dev/null || true' EXIT INT TERM
fi

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

fi

case $score in
case $rc in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only return codes 0 and 1 are handled. Any other non-zero return code, such as:

124 from timeout

126 command found but not executable

127 command not found

signal-based exits

will not be explicitly handled. Depending on shell behavior, this can cause run_one() to return success or an unintended status, leading to wrong PASS/FAIL accounting.

Copy link
Copy Markdown
Contributor Author

@kartnema kartnema May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The URM test framework handles most of the exceptions internally (through exception handling) and returns an aggregate status code (https://github.com/qualcomm/userspace-resource-manager/blob/main/tests/Utils/URMTests.cpp#L123), which will always be 0 or 1. I wanted to map the script behavior more closely with that of the URM Test Framework.

Also, since we are checking whether the executable exists (and whether it is executable) earlier in the script already, hence those issues won't factor in later.

In any case, i'll add a default (wildcard) to the shell script switching logic as a fallaback.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The URM test framework handles most of the exceptions internally (through exception handling) and returns an aggregate status code (https://github.com/qualcomm/userspace-resource-manager/blob/main/tests/Utils/URMTests.cpp#L123), which will always be 0 or 1. I wanted to map the script behavior more closely with that of the URM Test Framework.

Also, since we are checking whether the executable exists (and whether it is executable) earlier in the script already, hence those issues won't factor in later.

In any case, i'll add a default (wildcard) to the shell script switching logic as a fallaback.

Thanks, that makes sense. If the URM framework guarantees 0/1 for framework-handled test failures, relying on rc is fine and is cleaner than parsing logs.

I would still keep the wildcard fallback in the shell case statement, because failures outside the URM framework can still return other codes, for example timeout wrapper rc=124, shell execution failures, or signal exits. Treating those as FAIL is safer than leaving the case unmatched.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added fallback (*)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@smuppand
Copy link
Copy Markdown
Contributor

@kartnema Please include an appropriate commit message and sign off on the commit to resolve the CI failures.

@kartnema kartnema force-pushed the improve-urm-test-runner branch 3 times, most recently from b6f27ec to 99e7475 Compare May 18, 2026 08:58
@kartnema kartnema requested a review from smuppand May 18, 2026 09:51
Copy link
Copy Markdown
Contributor

@smuppand smuppand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shell Lint is still failing, so please fix the reported lint issue from the latest workflow.

if ! flock -n 9; then
log_warn "Another ${TESTNAME} run is active; skipping"
log_info "Active URM-related processes:"
ps 2>/dev/null | grep -E 'userspace-resource-manager|Urm(Component|Integration)Tests|[run.sh](http://run.sh/)' | grep -v grep || true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The debug ps grep pattern includes GitHub-rendered markdown.

Please replace it with a plain regex, for example:
ps 2>/dev/null | grep -E 'userspace-resource-manager|Urm(Component|Integration)Tests|run\.sh' | grep -v grep || true

Copy link
Copy Markdown
Contributor Author

@kartnema kartnema May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the check is still failing and suggesting to use pgrep

Can we use something like:

kartnema@hu-kartnema-hyd:/local/mnt/workspace/kartnema/images/meta-qcom/recipes-support/userspace-resource-manager/files/src/qcom-linux-testkit/Runner/suites/Performance/userspace-resource-manager$ pgrep -af 'userspace-resource-manager|Urm(Component|Integration)Tests|run.sh' || true
1168298 UrmComponentTests
kartnema@hu-kartnema-hyd:/local/mnt/workspace/kartnema/images/meta-qcom/recipes-support/userspace-resource-manager/files/src/qcom-linux-testkit/Runner/suites/Performance/userspace-resource-manager$

Does it meet the intended requirements.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback lock cleanup uses the wrong path:
LOCKDIR="/tmp/${TESTNAME}.lockdir"
but trap still does:
rmdir "$LOCKFILE"

Please change it to:
trap 'rmdir "$LOCKDIR" 2>/dev/null || true' EXIT INT TERM

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out

@kartnema kartnema force-pushed the improve-urm-test-runner branch 2 times, most recently from 429f45e to 7efe7b4 Compare May 18, 2026 12:54
@kartnema kartnema requested a review from smuppand May 18, 2026 12:55
if ! flock -n 9; then
log_warn "Another ${TESTNAME} run is active; skipping"
log_info "Active URM-related processes:"
pgrep -af 'userspace-resource-manager|Urm(Component|Integration)Tests|run.sh' || true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pgrep is used before the dependency check and is not listed in check_dependencies

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved check_dependencies above in the script, and added pgrep to the list

- Add logs to debug flock related concurrency issues, preventing URM test runs from going through.
- Remove log-based result mapping (parse_and_score_log) and solely rely on the return code (0 or 1).
  The URM Test Runner already handles the return code, which the script can rely on. Judging pass / fail
  based on the logs results in some false positives.
- Modify test skip logic, a suite should only be skipped if it is not in the approved list, of if the
  required configs are not found, or if the test executable itself is absent (or is not marked executable).
  Skipping of one or more tests part of the testsuite, should not result in an overall SKIP qualification,
  it should be either PASS OR FAIL.

Signed-off-by: Kartik Nema <kartnema@qti.qualcomm.com>
@kartnema kartnema force-pushed the improve-urm-test-runner branch from 7efe7b4 to 3d34f53 Compare May 18, 2026 17:39
@kartnema kartnema requested a review from smuppand May 18, 2026 17:41
Copy link
Copy Markdown
Contributor

@smuppand smuppand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@smuppand smuppand merged commit 2c5f843 into qualcomm-linux:main May 18, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants