
On the text-encoding padding length in the Z-Image Omni code #1418

@PlutoQyl

Description

def _pad_with_ids(
        self,
        feat: torch.Tensor,
        pos_grid_size: Tuple,
        pos_start: Tuple,
        device: torch.device,
        noise_mask_val: Optional[int] = None,
    ):
        """Pad feature to SEQ_MULTI_OF, create position IDs and pad mask."""
        ori_len = len(feat)
        pad_len = (-ori_len) % SEQ_MULTI_OF
        total_len = ori_len + pad_len

# Process captions (call site)
for j, cap_item in enumerate(all_cap_feats[i]):
    noise_val = images_noise_mask[i][j] if j < len(images_noise_mask[i]) else 1
    cap_out, cap_pos, cap_mask, cap_len, cap_nm = self._pad_with_ids(
        cap_item,
        (len(cap_item) + (-len(cap_item)) % SEQ_MULTI_OF, 1, 1),
        (cap_cu_len, 0, 0),
        device,
        noise_val,
    )
    cap_feats_list.append(cap_out)
    cap_pos_list.append(cap_pos)

While reading the Omni source, I noticed something about the text-encoded features: the `pos_grid_size` passed into `_pad_with_ids` is already aligned to a multiple of SEQ_MULTI_OF = 32 at the call site (via `len(cap_item) + (-len(cap_item)) % SEQ_MULTI_OF`), yet inside the function `pad_len = (-ori_len) % SEQ_MULTI_OF` aligns again. This makes the returned `cap_out` and `cap_pos` have different lengths. Is this a problem? As far as I can tell, the image VAE and SigLIP features do not have this issue.
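The padding arithmetic at the center of this question can be checked in isolation. Below is a minimal sketch (not the repo's code; `pad_len` is extracted here as a hypothetical standalone helper, with SEQ_MULTI_OF = 32 as in the snippet above) showing that re-aligning an already-aligned length is a no-op, so whether the caption path actually produces a length mismatch hinges on whether the `feat` handed to `_pad_with_ids` is the same unpadded `cap_item` whose length the caller used for `pos_grid_size`:

```python
SEQ_MULTI_OF = 32  # alignment constant from the snippet above


def pad_len(ori_len: int, multiple: int = SEQ_MULTI_OF) -> int:
    """Smallest k >= 0 such that (ori_len + k) is a multiple of `multiple`."""
    return (-ori_len) % multiple


# If both the caller and _pad_with_ids start from the same raw length n,
# the grid dimension and the internal total_len coincide, because
# aligning twice adds nothing the second time.
for n in (1, 31, 32, 33, 64, 100):
    grid_dim = n + pad_len(n)    # what the caller passes as pos_grid_size[0]
    total_len = n + pad_len(n)   # what _pad_with_ids computes internally
    assert grid_dim == total_len
    assert grid_dim % SEQ_MULTI_OF == 0
    assert pad_len(grid_dim) == 0  # re-padding an aligned length is a no-op
```

This sketch does not resolve the question by itself: if the caption features were padded somewhere upstream while the position grid is derived from a different length, the two outputs could still diverge, which is exactly what the issue asks the maintainers to confirm.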
