Fix ENOENT on shared build steps from stale build-log path#208

Open
mtelvers wants to merge 1 commit into ocurrent:master from mtelvers:fix-log-stale-path


We have been seeing intermittent build failures on busy workers:

Build failed: Unix.Unix_error(Unix.ENOENT, "open", "/var/cache/obuilder/in-progress/<id>/log")
Job failed: Internal error

When a build finishes, Build_log.finish parks the log as `Readonly <path> so that late-joining tailers can still read it. However, <path> is the in-progress location, and the store renames the build directory in-progress -> result on success. A late-joining tailer (typically a sibling job sharing a build step, via db_store's Some existing branch) that reopens that path after the rename gets ENOENT.

Tailers that attach while the log is still `Open are fine, as they dup the live fd, which follows the inode through the rename. Only the reopen-by-path in the finished state is affected. I introduced this bug in 393a5ef as part of the Windows HCS work: Windows requires the fd to be closed before the rename, so finish was changed to close the fd and reopen by path instead of relying on the still-open fd.

The fix is to keep a read-only dup of the fd and have tailers read from it rather than reopening by path. The fd follows the file through the directory rename. On Windows the fd must be closed before the directory can be renamed, so it keeps reopening by path there.
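As a rough illustration of the shape of this change, here is a minimal sketch of a log state that holds a descriptor on POSIX and a path on Windows. The type and function names (`finished`, `state`, `finish`, `read_log`) are hypothetical and are not obuilder's actual definitions. The sketch uses plain blocking `Unix` calls and a shared file offset, so it is single-reader only.

```ocaml
(* Hypothetical model of the finished-log state; not obuilder's real types.
   Build with the unix library, e.g. ocamlfind ocamlopt -package unix -linkpkg *)

type finished =
  | By_fd of Unix.file_descr  (* POSIX: the descriptor survives a directory rename *)
  | By_path of string         (* Windows: fd is closed first, so only the path is kept *)

type state =
  | Open of Unix.file_descr
  | Finished of finished

(* Move a log to the finished state. On non-Windows platforms a dup of the
   descriptor is retained; on Windows the fd is closed and the path is kept. *)
let finish ~path = function
  | Finished _ as s -> s
  | Open fd ->
    if Sys.win32 then begin
      Unix.close fd;
      Finished (By_path path)
    end else begin
      let kept = Unix.dup fd in
      Unix.close fd;
      Finished (By_fd kept)
    end

(* Read everything from the start of the file behind [fd]. *)
let read_all fd =
  ignore (Unix.lseek fd 0 Unix.SEEK_SET);
  let buf = Buffer.create 256 in
  let chunk = Bytes.create 4096 in
  let rec loop () =
    match Unix.read fd chunk 0 (Bytes.length chunk) with
    | 0 -> Buffer.contents buf
    | n -> Buffer.add_subbytes buf chunk 0 n; loop ()
  in
  loop ()

(* Read the whole log, whichever state it is in. Only the By_path case
   touches the filesystem namespace, so only it can hit a stale path. *)
let read_log = function
  | Open fd | Finished (By_fd fd) -> read_all fd
  | Finished (By_path path) ->
    let fd = Unix.openfile path [Unix.O_RDONLY] 0 in
    Fun.protect ~finally:(fun () -> Unix.close fd) (fun () -> read_all fd)
```

A real implementation with concurrent tailers would need per-reader offsets (for example positional reads) rather than the shared `lseek` used here.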

This also adds a regression test (build_log / "Readable after rename") that finishes a log, renames its directory as the store does, then tails it and asserts the content is still readable. It reproduces the exact ENOENT on the pre-fix code and passes with the fix.
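The behaviour the test relies on can be reproduced outside obuilder. The following standalone sketch is not the test from this PR. It assumes a POSIX system and uses made-up directory names: it writes a log, renames its parent directory, then compares reading through a retained descriptor with reopening the old path.

```ocaml
(* Standalone illustration, POSIX only; all names here are hypothetical.
   Build with the unix library, e.g. ocamlfind ocamlopt -package unix -linkpkg *)

(* Read the whole file behind [fd], starting from offset 0. *)
let read_from_start fd =
  ignore (Unix.lseek fd 0 Unix.SEEK_SET);
  let buf = Buffer.create 64 in
  let chunk = Bytes.create 4096 in
  let rec loop () =
    match Unix.read fd chunk 0 (Bytes.length chunk) with
    | 0 -> Buffer.contents buf
    | n -> Buffer.add_subbytes buf chunk 0 n; loop ()
  in
  loop ()

(* Returns the content read via the retained fd, and the outcome of
   reopening the pre-rename path. *)
let demo () =
  let root =
    Filename.concat (Filename.get_temp_dir_name ())
      (Printf.sprintf "rename-demo-%d" (Unix.getpid ())) in
  Unix.mkdir root 0o700;
  let in_progress = Filename.concat root "in-progress" in
  let result = Filename.concat root "result" in
  Unix.mkdir in_progress 0o700;
  let log_path = Filename.concat in_progress "log" in
  let fd = Unix.openfile log_path [Unix.O_RDWR; Unix.O_CREAT; Unix.O_TRUNC] 0o600 in
  let msg = "step output\n" in
  ignore (Unix.write_substring fd msg 0 (String.length msg));
  (* Keep a dup, close the original, then rename the directory as the store would. *)
  let kept = Unix.dup fd in
  Unix.close fd;
  Unix.rename in_progress result;
  let via_fd = read_from_start kept in
  let via_path =
    try
      let fd' = Unix.openfile log_path [Unix.O_RDONLY] 0 in
      Unix.close fd';
      Ok ()
    with Unix.Unix_error (e, _, _) -> Error e
  in
  (* Clean up the scratch files. *)
  Unix.close kept;
  Unix.unlink (Filename.concat result "log");
  Unix.rmdir result;
  Unix.rmdir root;
  (via_fd, via_path)
```

On a POSIX system the descriptor read returns the written content, while the reopen by the old path fails with `Unix.ENOENT`, matching the error in the report above.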

A finished build log was parked in the Readonly state holding its
in-progress path, but the store renames the build directory into place
on success. A late-joining log tailer (e.g. a sibling build sharing a
deduplicated step) reopening that path after the rename hit
ENOENT (..., "open", ".../in-progress/<id>/log").

Retain a read-only dup of the fd in the finished state and read from it
instead of reopening by path; the fd follows the file through the
rename. On Windows the fd must be closed before the directory can be
renamed, so it keeps reopening by path there.

Add a regression test asserting a finished log stays readable after its
directory is renamed.