fix(vfs): resolve umount_lock self-deadlock and writeback page leak in sync path by fslongjin · Pull Request #1936 · DragonOS-Community/DragonOS

fslongjin · 2026-05-28T11:40:49Z

Summary

Fixes a CI-blocking deadlock introduced in #1934 where umount() hangs
indefinitely when mount propagation triggers sync_filesystem() on a
peer that shares the same non-reentrant umount_lock RwSem.

Fix self-deadlock: Move sync_filesystem() before umount_write()
(Linux generic_shutdown_super pattern); remove sync from propagation path
Fix writeback page leak: Add RAII WritebackGuard to prevent pages
from getting permanently stuck in Writeback state
Fix page reclaim contention: Remove sync_fs from page reclaim thread;
metadata sync belongs to sync()/fsync()/umount(), not the reclaimer
Optimize sync scan: Skip clean page caches via has_dirty_pages() check

Root Cause

deepcopy() shares super_block_state (containing umount_lock: RwSem)
between source and propagated mounts via Arc::clone(). The call chain:

umount()
  → sb_state.umount_write()           // WRITE acquired
  → do_umount()
    → propagate_umount()
      → umount_at_peer()
        → child.sync_filesystem()
          → sb_state.umount_read()    // same RwSem → DEADLOCK

Design

Aligns with Linux 6.6.21 generic_shutdown_super() which syncs the
filesystem before acquiring s_umount write-lock. The propagation
path (umount_tree → propagate_umount) does not re-sync each peer
since all peers share the same superblock — one top-level sync suffices.

fslongjin · 2026-05-28T11:40:59Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: abdb16692a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-28T11:44:43Z

+            // Metadata sync (flush_dirty_inodes) is the responsibility of sync()/fsync()/umount(),
+            // not the page reclaim thread — calling it here would frequently contend on io_guard
+            // and hold umount_lock.read(), blocking umount.


恢复周期性元数据刷盘

这里移除了页面回收线程中唯一的周期性 sync_fs_with_umount_read(true) 调用，但没有新增替代的后台元数据回写路径；我在仓库内检查到剩余 sync_fs 调用只来自显式 sync()/syncfs()/umount() 路径，而 ext4 的 sync_fs 正是用来刷 dirty_inodes。因此在只发生 chmod/utimens/truncate 等元数据变更、应用不显式 sync/fsync 且系统长期运行或崩溃前，这些 dirty inode 会无限期滞留内存，较本提交前每轮回收线程都会刷元数据的行为退化。建议在移出 reclaimer 前先接入独立的周期性 inode writeback。

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-28T11:44:43Z

+    // Mark root as shared so subsequent mounts inherit shared propagation.
+    if (mount(NULL, "/", NULL, MS_REC | MS_SHARED, NULL) != 0) {


创建实际 peer 挂载以覆盖传播卸载路径

这个测试只把 / 标记为 shared，但没有创建第二个 peer mount；在当前传播实现里 propagate_umount() 会把触发卸载的 parent 预先加入 processed，如果 peer 组里只有它自己，循环会直接跳过，umount_at_peer() 根本不会被调用。因此该用例即使回退到会在 umount_at_peer() 中重入 sync_filesystem() 的实现也不会复现死锁，容易让传播卸载回归漏检；需要先构造两个同 peer group 的挂载点再在其中一个下面挂载/卸载子挂载。

Useful? React with 👍 / 👎.

…n sync path The sync/syncfs refactoring introduced in DragonOS-Community#1934 caused the test-x86 CI to hang indefinitely. This patch fixes three critical issues: 1. Self-deadlock on umount_lock RwSem: MountFS::umount() acquired umount_write, then through propagate_umount → umount_at_peer called child.sync_filesystem() which attempted umount_read on the SAME RwSem (shared via Arc::clone in deepcopy). Since RwSem is non-reentrant, writer + reader on same thread = permanent sleep. Fix: Move sync_filesystem() BEFORE acquiring umount_write (aligning with Linux's generic_shutdown_super pattern), and remove redundant sync from umount_at_peer since all propagated peers share the same superblock. 2. Writeback page leak: writeback_entry() could return Err after prepare_writeback_entry set page state to Writeback, without calling finish_writeback_entry. The page would be stuck in Writeback state forever, causing any future wait_writeback_entry to sleep indefinitely. Fix: Add RAII WritebackGuard that ensures finish_writeback_entry is always called on early exit paths. 3. Over-aggressive page reclaim: The page reclaim thread was calling sync_fs_with_umount_read(true) for all mounts every 500ms, competing with user I/O on io_guard and holding umount_read which delays umount. Metadata sync is not the reclaimer's responsibility. Fix: Remove sync_fs from page reclaim thread; increase writeback interval from 500ms to 5s (matching Linux dirty_writeback_centisecs). Additionally: - Add has_dirty_pages() short-circuit to sync_inodes_of_mount() to skip clean page caches without expensive inode upgrade + Arc::ptr_eq. - Add SharedMountPropagationUmountNoDeadlock dunitest that exercises the exact deadlock scenario (MS_SHARED + concurrent sync/umount). Signed-off-by: longjin <longjin@DragonOS.org>

Signed-off-by: longjin <longjin@dragonos.org>

Drop the temporary PR 1936 review analysis document from the branch before merge. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot added the Bug fix A bug is fixed in this pull request label May 28, 2026

chatgpt-codex-connector Bot reviewed May 28, 2026

View reviewed changes

fslongjin force-pushed the fix-umount-deadlock branch 2 times, most recently from 3952707 to 45d9c0c Compare May 29, 2026 00:49

fslongjin and others added 2 commits May 29, 2026 15:23

fix(vfs): avoid writeback umount deadlock

9dfc8e6

Signed-off-by: longjin <longjin@dragonos.org>

fslongjin force-pushed the fix-umount-deadlock branch from 9111699 to 9dfc8e6 Compare May 29, 2026 07:24

chore(vfs): remove PR review fix plan document

a312482

Drop the temporary PR 1936 review analysis document from the branch before merge. Co-authored-by: Cursor <cursoragent@cursor.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(vfs): resolve umount_lock self-deadlock and writeback page leak in sync path#1936

fix(vfs): resolve umount_lock self-deadlock and writeback page leak in sync path#1936
fslongjin wants to merge 3 commits into
DragonOS-Community:masterfrom
fslongjin:fix-umount-deadlock

fslongjin commented May 28, 2026

Uh oh!

fslongjin commented May 28, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 28, 2026

Uh oh!

chatgpt-codex-connector Bot May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		// Mark root as shared so subsequent mounts inherit shared propagation.
		if (mount(NULL, "/", NULL, MS_REC \| MS_SHARED, NULL) != 0) {

Uh oh!

Conversation

fslongjin commented May 28, 2026

Summary

Root Cause

Design

Uh oh!

fslongjin commented May 28, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant