Dear authors,
Thanks for your interesting paper!
I see you mention in your paper in section B.2 of the appendix that you train an attention pooling head for evaluation.
At the same time, I see that the model you released on Hugging Face has a clip_projector module (AttentionPoolingBlock) attached to it, and in the model's forward you set projected=True (HF code here).
What I would like to know is:
1. On what data was this released clip_projector trained?
2. If I want to evaluate your model on action recognition, should I:
   a. train the attentive probe from scratch on the target dataset, or
   b. finetune the attentive probe on the target dataset (if so, from which weights)?
3. Which setup did you use in your paper, 2.a., 2.b., or neither?
You released some action recognition code for InternVideo2, and for linear probing / attentive probing there is this --open_clip_projector parameter, which controls whether the head is finetuned or not. But it doesn't say on which data the head was trained before the finetuning.
You mention in the paper in section 4.2 (Video Classification):

> We test the model in an 'Attentive Probing' setting where the encoders are frozen and a single-layer attention pooling head is trained. Such Frozen Encoder settings can test representation's quality in an unbiased way. Our methods achieve the best results with only public data and less computation cost on these foundation tasks.
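For context, my understanding of that 'Attentive Probing' setup is something like the sketch below: the backbone stays frozen and only a single-layer attention-pooling head plus classifier is trained. All class and variable names here are my own illustration, not your actual code, so please correct me if the released clip_projector differs from this.

```python
# Minimal sketch of attentive probing, assuming a frozen token encoder.
# Names (AttentivePoolingProbe, embed_dim, etc.) are hypothetical.
import torch
import torch.nn as nn


class AttentivePoolingProbe(nn.Module):
    def __init__(self, embed_dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        # A single learnable query token pools over all patch/frame tokens.
        self.query = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, embed_dim) from the frozen encoder.
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # (batch, 1, embed_dim)
        return self.head(self.norm(pooled.squeeze(1)))


# Only the probe's parameters would receive gradients; the encoder is frozen:
# for p in encoder.parameters():
#     p.requires_grad = False
```

If this matches your setup, then my question reduces to whether the query/attention/head weights above were initialized randomly per evaluation dataset (2.a.) or loaded from the released clip_projector (2.b.).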
What would make sense to me is that you're no longer using the InternVideo2 evaluation script and are instead training the attentive probe from scratch (2.a. above). But if so, I'd like to know from which evaluation experiment the weights of the released clip_projector were obtained.
Thanks in advance for your clarifications!