fix: accumulate domain loss predictions instead of overwriting by zhpjunfei · Pull Request #523 · alibaba/TorchEasyRec

zhpjunfei · 2026-05-22T09:00:51Z

What

Fixes a bug in how PEPNet._select_domain_task_output accumulates per-domain prediction values.

Why

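new_predictions is initialized as defaultdict(list), but the old code used = to assign a single-element list, so each loop iteration overwrote the previous value. As a result, only the last (domain_index, value) pair took part in the sort and stack, which made torch.gather select the wrong results in multi-domain scenarios.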

How

Change new_predictions[tower_loss_name] = [...] to .append(...) so that the prediction values of all domains are accumulated correctly.
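The difference between the two statements can be reproduced outside the model. The snippet below is a minimal sketch, not the PEPNet code: the `items` list and the "ctr" tower name are hypothetical stand-ins for the flattened predictions, and only the `defaultdict(list)` behavior is taken from the description above.

```python
from collections import defaultdict

# Hypothetical flattened predictions: (tower_loss_name, domain_index, value).
items = [("ctr", "0", 0.1), ("ctr", "1", 0.2), ("ctr", "2", 0.3)]

overwritten = defaultdict(list)
accumulated = defaultdict(list)

for tower_loss_name, domain_index, value in items:
    # Old behavior: assignment replaces the list on every iteration.
    overwritten[tower_loss_name] = [(domain_index, value)]
    # Fixed behavior: append keeps one entry per domain.
    accumulated[tower_loss_name].append((domain_index, value))

print(overwritten["ctr"])  # [('2', 0.3)]
print(accumulated["ctr"])  # [('0', 0.1), ('1', 0.2), ('2', 0.3)]
```

With assignment, only the last domain survives, so any later stack has a single column regardless of how many domains exist.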

CLAassistant · 2026-05-22T09:00:58Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

张峻飞 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

github-actions · 2026-05-22T09:59:14Z

+                new_predictions[tower_loss_name].append(
                    (domain_index, tower_domain_loss_predict_value)
-                ]
+                )


Good catch on the overwrite. One thing worth handling in the same PR: now that the list actually accumulates across domains, the sorted(..., key=lambda x: x[0]) a few lines below (line 196) sorts on domain_index as a string (it comes from rsplit("_", 1)[1]). For ≥10 domains this gives lexicographic order — "10" sorts before "2" — which then misaligns the stacked tensor with the integer index from batch.labels[self._domain_input_name] passed to torch.gather. Before this fix the list always had one element so the ordering never mattered; this fix exposes it.

Suggest casting the index to int when appending so the subsequent sort is numeric:

Suggested change

-                new_predictions[tower_loss_name].append(
-                    (domain_index, tower_domain_loss_predict_value)
-                )
+                new_predictions[tower_loss_name].append(
+                    (int(domain_index), tower_domain_loss_predict_value)
+                )
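The ordering problem described in this comment can be checked in isolation. This is a small sketch with made-up placeholder values (`pred_0` ... `pred_11`); it only illustrates how Python sorts string indices versus integer indices, and does not reproduce the model code.

```python
# Hypothetical (domain_index, value) pairs, with the index kept as a string,
# as it would be after rsplit("_", 1)[1].
pairs = [(str(i), f"pred_{i}") for i in range(12)]

# Sorting on the string index gives lexicographic order.
lexicographic = [idx for idx, _ in sorted(pairs, key=lambda x: x[0])]

# Casting to int before sorting gives numeric order.
numeric = [idx for idx, _ in sorted(((int(i), v) for i, v in pairs), key=lambda x: x[0])]

print(lexicographic)  # ['0', '1', '10', '11', '2', '3', '4', '5', '6', '7', '8', '9']
print(numeric)        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

In the lexicographic result, position 2 holds domain 10, so an integer domain label of 2 used as a gather index would pick up the prediction for domain 10.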

github-actions · 2026-05-22T09:59:21Z

Nice, targeted fix — the defaultdict(list) + = assignment was indeed silently dropping all but the last domain.

A couple of notes worth considering:

Test gap: _select_domain_task_output is only reached through loss() / update_metric(), and the existing test_pepnet in tzrec/models/pepnet_test.py only calls the forward pass and asserts output shape. The buggy code path was unexercised, and even if it were, a shape-only assertion wouldn't have caught the overwrite. A small unit test that builds a synthetic predictions dict with ≥2 towers × ≥2 domains and verifies the gathered values match the per-row domain label would lock this in.
Latent sort bug exposed by this fix: see inline comment on the changed lines — domain_index is sorted as a string, so it breaks at ≥10 domains. Worth folding into this PR since the fix is what makes the sort path actually matter.
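A test along the lines suggested above might look like the following sketch. It does not call the real PEPNet method: `select_domain_output` is a hypothetical stand-in that follows the steps described in this thread (split the key with rsplit, accumulate per tower, sort by integer domain index, stack, then torch.gather), and the `"<tower>_<domain>"` key format, tower names, and tensor shapes are assumptions.

```python
from collections import defaultdict

import torch


def select_domain_output(predictions, domain_ids):
    """Hypothetical stand-in for the selection logic discussed in this PR."""
    grouped = defaultdict(list)
    for name, value in predictions.items():
        tower_name, domain_index = name.rsplit("_", 1)
        # Accumulate one entry per domain, with a numeric index.
        grouped[tower_name].append((int(domain_index), value))

    selected = {}
    for tower_name, pairs in grouped.items():
        pairs.sort(key=lambda x: x[0])
        # Shape [batch, num_domains]; column d holds the output for domain d.
        stacked = torch.stack([v for _, v in pairs], dim=1)
        index = domain_ids.view(-1, 1)
        selected[tower_name] = torch.gather(stacked, 1, index).squeeze(1)
    return selected


def test_select_domain_output():
    batch, num_domains = 4, 12
    domain_ids = torch.tensor([0, 2, 10, 11])  # per-row domain label
    predictions = {}
    for tower, offset in (("ctr", 0), ("cvr", 100)):
        # Insert in reverse order so the result depends on the sort.
        for d in reversed(range(num_domains)):
            predictions[f"{tower}_{d}"] = torch.full((batch,), float(offset + d))

    out = select_domain_output(predictions, domain_ids)
    # Each row must receive the value of its own domain, not another one.
    assert out["ctr"].tolist() == [0.0, 2.0, 10.0, 11.0]
    assert out["cvr"].tolist() == [100.0, 102.0, 110.0, 111.0]
```

Using 12 domains and value-based assertions means the test fails for both the overwrite (a single stacked column) and the string sort (domain 10 landing in column 2), which a shape-only assertion would miss.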

fix: accumulate domain loss predictions instead of overwriting

56819eb

tiankongdeguiji requested a review from chengaofei May 22, 2026 09:53

tiankongdeguiji added the claude-review Let Claude Review label May 22, 2026

github-actions Bot removed the claude-review Let Claude Review label May 22, 2026

github-actions Bot reviewed May 22, 2026

chengaofei approved these changes May 22, 2026

tiankongdeguiji closed this May 23, 2026

zhpjunfei wants to merge 1 commit into alibaba:master from zhpjunfei:master
