Skip to content

Commit eadcc47

Browse files
committed
fix(auto-inject): use nm for symbol lookup and fix serial thread safety
Bug 1: _resolve_symbol_addr() used GDB (~0.6s/symbol), replaced with nm-based fast path via elf_utils.get_symbols() with mtime cache. 9 symbols: 5.4s -> 0.3s (first) / ~0s (cached). Bug 2: _trigger_auto_inject() spawned a thread that directly called serial I/O, violating ThreadCheckedSerial ownership. All serial ops now routed through run_in_device_worker() to fpb-worker thread. - Add _get_elf_symbols() with lazy mtime-based cache - Route do_inject/do_unpatch through DeviceWorker - Add 7 new tests (nm resolve, cache, GDB fallback, worker timeout) - Update existing auto-inject tests for DeviceWorker mock - Add postmortem doc
1 parent f5f912e commit eadcc47

6 files changed

Lines changed: 507 additions & 123 deletions

File tree

Lines changed: 114 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,114 @@
1+
# 自动注入功能 Bug 修复复盘
2+
3+
## 问题背景
4+
5+
2026-03-10 在实际使用自动注入功能时,发现两个严重问题导致注入全部失败。
6+
7+
## Bug 1: 符号查找性能问题 — GDB 太慢
8+
9+
### 现象
10+
11+
```
12+
[INFO] lookup_symbol start: 'lv_obj_create'
13+
[INFO] lookup_symbol done: 'lv_obj_create' -> 0x2C2DD780 (0.603s)
14+
[INFO] lookup_symbol start: 'lv_obj_constructor'
15+
[INFO] lookup_symbol done: 'lv_obj_constructor' -> 0x2C2DCA94 (0.613s)
16+
...
17+
```
18+
19+
9 个函数符号查找耗时 **~5.4 秒**(每个 ~0.6s),全部通过 GDB/MI 协议的 `info address` 命令逐个查询。
20+
21+
### 根因
22+
23+
`_resolve_symbol_addr()` 仅使用 GDB session 查找符号地址。GDB 启动了完整的调试会话,加载了整个 ELF 的调试信息,每次查询都需要通过 GDB/MI 协议往返通信。
24+
25+
但自动注入场景只需要 **符号名 → 地址** 的映射,不需要调试信息、类型信息或源码定位。
26+
27+
### 修复方案
28+
29+
改用 `nm` 工具(通过已有的 `elf_utils.get_symbols()`)作为快速路径:
30+
31+
```
32+
_resolve_symbol_addr(sym_name)
33+
├── 快速路径: nm 查找(一次 subprocess 调用加载全部符号,带 mtime 缓存)
34+
│ └── 命中 → 直接返回地址
35+
└── 慢速路径: GDB fallback(仅当 nm 未找到时)
36+
```
37+
38+
- nm 一次调用加载所有符号到内存缓存(~0.3s)
39+
- 后续查找为纯内存字典查找(~0μs)
40+
- 缓存按 ELF 文件 mtime 自动失效
41+
42+
### 性能对比
43+
44+
| 方式 | 9 个符号总耗时 | 单符号耗时 |
45+
|------|---------------|-----------|
46+
| GDB (修复前) | ~5.4s | ~0.6s |
47+
| nm + 缓存 (修复后) | ~0.3s (首次) / ~0s (缓存) | ~0s |
48+
49+
## Bug 2: 串口线程安全问题 — ThreadCheckedSerial 违规
50+
51+
### 现象
52+
53+
```
54+
[ERROR] Inject failed for lv_obj_create: Failed to get device info:
55+
Serial.reset_input_buffer() called from thread 'Thread-496 (do_auto_inject)'
56+
(id=140669059319360), but owner is 'fpb-worker' (id=140669611779648)
57+
```
58+
59+
所有 8 个注入操作全部因线程违规失败。
60+
61+
### 根因
62+
63+
`_trigger_auto_inject()` 在后台线程 `do_auto_inject` 中直接调用 `fpb.enter_fl_mode()` → 串口 I/O。但串口被 `ThreadCheckedSerial` 包装,绑定到 `fpb-worker` 线程,任何其他线程的 I/O 操作都会抛出 `SerialThreadViolation`
64+
65+
调用链:
66+
67+
```
68+
文件变更回调 (watchdog 线程)
69+
└── _trigger_auto_inject()
70+
└── threading.Thread(do_auto_inject) ← 新线程
71+
└── fpb.enter_fl_mode()
72+
└── serial.reset_input_buffer() ← ❌ 非 owner 线程
73+
```
74+
75+
### 修复方案
76+
77+
将所有串口 I/O 操作通过 `run_in_device_worker()` 派发到 `fpb-worker` 线程执行:
78+
79+
```
80+
文件变更回调 (watchdog 线程)
81+
└── _trigger_auto_inject()
82+
└── threading.Thread(do_auto_inject) ← 后台线程(非串口操作)
83+
├── 文件 I/O: 检测 FPB_INJECT 标记 ← 安全
84+
└── run_in_device_worker(do_inject)
85+
└── fpb-worker 线程执行:
86+
├── enter_fl_mode() ← ✅ owner 线程
87+
├── inject_multi() ← ✅ owner 线程
88+
└── exit_fl_mode() ← ✅ owner 线程
89+
```
90+
91+
同样处理了 auto-unpatch 路径(标记移除时自动清除注入)。
92+
93+
## 修改文件
94+
95+
| 文件 | 修改内容 |
96+
|------|---------|
97+
| `fpb_inject.py` | `_resolve_symbol_addr()` 增加 nm 快速路径;新增 `_get_elf_symbols()` 带 mtime 缓存 |
98+
| `services/file_watcher_manager.py` | `do_auto_inject``do_unpatch` 通过 `run_in_device_worker()` 派发串口操作 |
99+
| `tests/test_fpb_inject.py` | 新增 `TestResolveSymbolAddr` 6 个测试(nm 快速路径、缓存失效、GDB fallback、边界条件) |
100+
| `tests/test_file_watcher_manager.py` | 更新 `TestTriggerAutoInject` mock `run_in_device_worker`;新增 worker timeout 测试 |
101+
| `tests/test_compile_inplace.py` | 更新 `test_inplace_flow_success``test_inplace_flow_auto_unpatch` mock worker |
102+
103+
## 测试验证
104+
105+
- 全部 Python 测试通过(含新增 7 个测试用例)
106+
- 覆盖率达标(≥85%)
107+
108+
## 经验教训
109+
110+
1. **串口访问必须走 DeviceWorker**`ThreadCheckedSerial` 是最后一道防线,但正确做法是在架构层面保证所有串口操作都通过 `run_in_device_worker()` 派发,而不是依赖运行时检测。
111+
112+
2. **能用离线工具就不要用在线工具**:符号查找只需要 `nm`(纯文件操作),不需要启动完整的 GDB 调试会话。选择最轻量的工具完成任务。
113+
114+
3. **后台线程 + 串口 = 必须审查**:任何新建线程中如果涉及串口操作,都需要检查是否通过 DeviceWorker 路由。

Tools/WebServer/fpb_inject.py

Lines changed: 52 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -240,24 +240,74 @@ def get_symbols(self, elf_path: str) -> Dict[str, dict]:
240240
return elf_utils.get_symbols(elf_path, self._toolchain_path)
241241

242242
def _resolve_symbol_addr(self, sym_name: str) -> Optional[int]:
243-
"""Resolve a symbol name to its address via GDB session.
243+
"""Resolve a symbol name to its address.
244+
245+
Uses nm (via elf_utils.get_symbols) for fast offline lookup.
246+
Falls back to GDB session if nm lookup fails.
244247
245248
Returns the address as int, or None if not found.
246249
"""
250+
# Fast path: use nm to resolve symbol from ELF
251+
elf_path = getattr(self.device, "elf_path", None)
252+
if elf_path and os.path.exists(elf_path):
253+
symbols = self._get_elf_symbols()
254+
if sym_name in symbols:
255+
addr = symbols[sym_name]["addr"]
256+
logger.debug(f"Symbol '{sym_name}' resolved via nm: 0x{addr:08X}")
257+
return addr
258+
259+
# Slow path: fall back to GDB session
247260
from core.state import state
248261
from core.gdb_manager import is_gdb_available
249262

250263
if not is_gdb_available(state):
251-
logger.warning("GDB not available, cannot resolve symbol address")
264+
logger.warning(
265+
f"Symbol '{sym_name}' not found via nm and GDB not available"
266+
)
252267
return None
253268

269+
logger.debug(f"Symbol '{sym_name}' not in nm cache, falling back to GDB")
254270
info = state.gdb_session.lookup_symbol(sym_name)
255271
if info is None:
256272
return None
257273

258274
addr = info["addr"] if isinstance(info, dict) else info
259275
return addr
260276

277+
def _get_elf_symbols(self) -> Dict[str, dict]:
278+
"""Get cached ELF symbols (lazy-loaded, invalidated on ELF change).
279+
280+
Returns dict: {name: {"addr": int, "sym_type": str}}
281+
"""
282+
elf_path = self.device.elf_path
283+
if not elf_path:
284+
return {}
285+
286+
# Check if cache is still valid
287+
try:
288+
current_mtime = os.path.getmtime(elf_path)
289+
except OSError:
290+
return {}
291+
292+
if (
293+
hasattr(self, "_elf_symbols_cache")
294+
and self._elf_symbols_cache_path == elf_path
295+
and self._elf_symbols_cache_mtime == current_mtime
296+
):
297+
return self._elf_symbols_cache
298+
299+
# Rebuild cache
300+
logger.info(f"Loading ELF symbols via nm: {elf_path}")
301+
t_start = time.time()
302+
symbols = elf_utils.get_symbols(elf_path, self._toolchain_path)
303+
elapsed = time.time() - t_start
304+
logger.info(f"Loaded {len(symbols)} symbols via nm in {elapsed:.3f}s")
305+
306+
self._elf_symbols_cache = symbols
307+
self._elf_symbols_cache_path = elf_path
308+
self._elf_symbols_cache_mtime = current_mtime
309+
return symbols
310+
261311
def disassemble_function(self, elf_path: str, func_name: str) -> Tuple[bool, str]:
262312
"""Disassemble a specific function from ELF file."""
263313
return elf_utils.disassemble_function(elf_path, func_name, self._toolchain_path)

0 commit comments

Comments
 (0)