Conversation
pyopencl/cache.py
Outdated
| attempts += 1 | ||
|
|
||
| if attempts > 10: | ||
| if attempts % (10/wait_time_seconds) == 0: |
There was a problem hiding this comment.
That won't work reliably. / unconditionally has a floating point result, and comparing floating point numbers to zero is unreliable.
There was a problem hiding this comment.
That won't work reliably.
/unconditionally has a floating point result, and comparing floating point numbers to zero is unreliable.
4d13b82 should fix this
|
Actually, what about don't do polling at all, i.e. no sleep. There's a lot more efficient system calls available like |
|
Thanks for the suggestion. IMO, anytime there is contention for this lock,
you're running your program poorly. (I.e. this should always be avoidable,
e.g. by setting a private cache folder.) As such, I'm not very inclined to optimize that case.
|
A bit of background on this PR: We keep seeing this timeout a lot when running multiple MPI ranks on a single node, which is a difficult situation to set private cache folders (especially when running interactively). |
| except OSError: | ||
| pass | ||
|
|
||
| wait_time_seconds = 0.05 |
There was a problem hiding this comment.
What's the motivation for this change? It seems really short. If you were running this (by mistake, say) on 1000 nodes sharing a cache directory, I think the resulting thrashing would not be a good thing.
There was a problem hiding this comment.
The specific value was taken from https://github.com/tox-dev/py-filelock/blob/a6c8fabc4192fa7a4ae19b1875ee842ec5eb4f61/src/filelock/_api.py#L113. I can't really predict what the result of this with thousands of nodes would be, but we already keep on hitting the 1 second timeout when running with 2 ranks on a single node, which I guess indicates that the existing polling time would cause minutes-long waiting times when running with thousands of ranks.
There was a problem hiding this comment.
I can see that argument. Could you add a comment justifying this choice of time-out value? (making reference to the single-node interactive "test" use case)
CC: #503.