Modify URM Test Runner#459
Conversation
5717167 to
fc0863e
Compare
| # ---------- Lock (avoid concurrent runs on same host) ---------- | ||
| LOCKFILE="/tmp/${TESTNAME}.lock" | ||
| # Delete stale lock file, if it exists | ||
| rm "${LOCKFILE}" |
There was a problem hiding this comment.
Do not remove the lock file before flock. With flock, stale lock files are harmless because the lock is tied to the open file descriptor, not just the path. Keep the file and let flock -n decide whether another run is active. If stale directory cleanup is required for the no-flock fallback, handle only the directory fallback and only when it is provably stale. Do not remove the flock file before acquiring the lock.
There was a problem hiding this comment.
The issue is that even while attempting fresh runs (i.e. flashing a new image and then running the tests through Axiom) we are running into the error that: "another userspace resource manager run" is active. Which should not be happening, since this is the first run.
Also, as mentioned, each time a CI/CD job is triggered a new image (from qli-artifacts) will be flashed on the device (and we won't be using this device directly manually), hence I do not see much use for this lock-protection code in general.
There was a problem hiding this comment.
@smuppand even we tried this as well.
if command -v flock >/dev/null 2>&1; then
exec 9>"$LOCKFILE"
if ! flock -n 9; then
log_warn "Another $ run is active; skipping"
echo "$TESTNAME SKIP" >"$RES_FILE"
exit 0
fi
trap 'exec 9>&-' EXIT INT TERM [use to release the lock]
still not working.
There was a problem hiding this comment.
@smuppand even we tried this as well.
if command -v flock >/dev/null 2>&1; then exec 9>"$LOCKFILE" if ! flock -n 9; then log_warn "Another $ run is active; skipping" echo "$TESTNAME SKIP" >"$RES_FILE" exit 0 fi trap 'exec 9>&-' EXIT INT TERM [use to release the lock]
still not working.
I agree the symptom is valid: on a fresh flashed image / first Axiom run, we should not get “another userspace resource manager run is active”.
But I do not think removing the flock file before acquiring the lock is safe. With flock, the lock is tied to the open file/inode. If another process is actively holding a lock and we remove the path, a second process can recreate the path and acquire a lock on a different inode, allowing concurrent URM runs.
For the flock path, stale lock files are harmless and should not be removed before locking.
A safer approach would be:
- Use flock normally without deleting the lock file.
- For the no-flock fallback, use a separate lock directory, for example:
/tmp/userspace-resource-manager.lockdir - If we really need stale cleanup, clean only the fallback lock directory and only when it is provably stale, for example when no matching URM runner process is active.
So I would suggest:
- remove
rm "$LOCKFILE"from the flock path - add debug when flock fails, such as
psoutput for URM/run.sh processes - optionally use a separate fallback lockdir for the mkdir path
This keeps concurrency protection intact while still helping debug the false “active run” case.
LOCKFILE="/tmp/${TESTNAME}.lock"
LOCKDIR="/tmp/${TESTNAME}.lockdir"
if command -v flock >/dev/null 2>&1; then
exec 9>"$LOCKFILE"
if ! flock -n 9; then
log_warn "Another ${TESTNAME} run is active; skipping"
log_info "Active URM-related processes:"
ps 2>/dev/null | grep -E 'userspace-resource-manager|Urm(Component|Integration)Tests|[run.sh](http://run.sh/)' | grep -v grep || true
echo "$TESTNAME SKIP" >"$RES_FILE"
exit 0
fi
else
if ! mkdir "$LOCKDIR" 2>/dev/null; then
log_warn "Another ${TESTNAME} run is active or stale fallback lockdir exists: $LOCKDIR"
echo "$TESTNAME SKIP" >"$RES_FILE"
exit 0
fi
trap 'rmdir "$LOCKDIR" 2>/dev/null || true' EXIT INT TERM
fi
| fi | ||
|
|
||
| case $score in | ||
| case $rc in |
There was a problem hiding this comment.
Only return codes 0 and 1 are handled. Any other non-zero return code, such as:
124 from timeout
126 command found but not executable
127 command not found
signal-based exits
will not be explicitly handled. Depending on shell behavior, this can cause run_one() to return success or an unintended status, leading to wrong PASS/FAIL accounting.
There was a problem hiding this comment.
The URM test framework handles most of the exceptions internally (through exception handling) and returns an aggregate status code (https://github.com/qualcomm/userspace-resource-manager/blob/main/tests/Utils/URMTests.cpp#L123), which will always be 0 or 1. I wanted to map the script behavior more closely with that of the URM Test Framework.
Also, since we are checking whether the executable exists (and whether it is executable) earlier in the script already, hence those issues won't factor in later.
In any case, i'll add a default (wildcard) to the shell script switching logic as a fallaback.
There was a problem hiding this comment.
The URM test framework handles most of the exceptions internally (through exception handling) and returns an aggregate status code (https://github.com/qualcomm/userspace-resource-manager/blob/main/tests/Utils/URMTests.cpp#L123), which will always be 0 or 1. I wanted to map the script behavior more closely with that of the URM Test Framework.
Also, since we are checking whether the executable exists (and whether it is executable) earlier in the script already, hence those issues won't factor in later.
In any case, i'll add a default (wildcard) to the shell script switching logic as a fallaback.
Thanks, that makes sense. If the URM framework guarantees 0/1 for framework-handled test failures, relying on rc is fine and is cleaner than parsing logs.
I would still keep the wildcard fallback in the shell case statement, because failures outside the URM framework can still return other codes, for example timeout wrapper rc=124, shell execution failures, or signal exits. Treating those as FAIL is safer than leaving the case unmatched.
There was a problem hiding this comment.
Added fallback (*)
|
@kartnema Please include an appropriate commit message and sign off on the commit to resolve the CI failures. |
b6f27ec to
99e7475
Compare
smuppand
left a comment
There was a problem hiding this comment.
Shell Lint is still failing, so please fix the reported lint issue from the latest workflow.
| if ! flock -n 9; then | ||
| log_warn "Another ${TESTNAME} run is active; skipping" | ||
| log_info "Active URM-related processes:" | ||
| ps 2>/dev/null | grep -E 'userspace-resource-manager|Urm(Component|Integration)Tests|[run.sh](http://run.sh/)' | grep -v grep || true |
There was a problem hiding this comment.
The debug ps grep pattern includes GitHub-rendered markdown.
Please replace it with a plain regex, for example:
ps 2>/dev/null | grep -E 'userspace-resource-manager|Urm(Component|Integration)Tests|run\.sh' | grep -v grep || true
There was a problem hiding this comment.
I see the check is still failing and suggesting to use pgrep
Can we use something like:
kartnema@hu-kartnema-hyd:/local/mnt/workspace/kartnema/images/meta-qcom/recipes-support/userspace-resource-manager/files/src/qcom-linux-testkit/Runner/suites/Performance/userspace-resource-manager$ pgrep -af 'userspace-resource-manager|Urm(Component|Integration)Tests|run.sh' || true
1168298 UrmComponentTests
kartnema@hu-kartnema-hyd:/local/mnt/workspace/kartnema/images/meta-qcom/recipes-support/userspace-resource-manager/files/src/qcom-linux-testkit/Runner/suites/Performance/userspace-resource-manager$
Does it meet the intended requirements.
There was a problem hiding this comment.
The fallback lock cleanup uses the wrong path:
LOCKDIR="/tmp/${TESTNAME}.lockdir"
but trap still does:
rmdir "$LOCKFILE"
Please change it to:
trap 'rmdir "$LOCKDIR" 2>/dev/null || true' EXIT INT TERM
There was a problem hiding this comment.
Thanks for pointing this out
429f45e to
7efe7b4
Compare
| if ! flock -n 9; then | ||
| log_warn "Another ${TESTNAME} run is active; skipping" | ||
| log_info "Active URM-related processes:" | ||
| pgrep -af 'userspace-resource-manager|Urm(Component|Integration)Tests|run.sh' || true |
There was a problem hiding this comment.
pgrep is used before the dependency check and is not listed in check_dependencies
There was a problem hiding this comment.
Moved check_dependencies above in the script, and added pgrep to the list
- Add logs to debug flock related concurrency issues, preventing URM test runs from going through. - Remove log-based result mapping (parse_and_score_log) and solely rely on the return code (0 or 1). The URM Test Runner already handles the return code, which the script can rely on. Judging pass / fail based on the logs results in some false positives. - Modify test skip logic, a suite should only be skipped if it is not in the approved list, of if the required configs are not found, or if the test executable itself is absent (or is not marked executable). Skipping of one or more tests part of the testsuite, should not result in an overall SKIP qualification, it should be either PASS OR FAIL. Signed-off-by: Kartik Nema <kartnema@qti.qualcomm.com>
7efe7b4 to
3d34f53
Compare
Uh oh!
There was an error while loading. Please reload this page.