Create empty genotype arrays when there are no samples#464

Merged
jeromekelleher merged 4 commits into sgkit-dev:main from
tomwhite:create-empty-genotype-arrays
Apr 11, 2026

@tomwhite
Member

@tomwhite tomwhite commented Apr 2, 2026

Fixes #463

@jeromekelleher
Member

Ah, hmm, I think I misunderstood. I think we should continue to have no call_genotype arrays when there is no GT in the input VCF by default. However, do we allow creating empty genotypes if you specify a ploidy value explicitly? Or is this only handling cases where GT is specified in the header, but there are no samples?

It's not clear to me what happens when the ploidy value that you specify disagrees with what's in the GTs. It is simplest to just raise an error for now?

In any case I think we need to spell out the semantics somewhere in the documentation, and it's probably best to do that now while we're making the change.

@tomwhite
Member Author

tomwhite commented Apr 2, 2026

> I think we should continue to have no call_genotype arrays when there is no GT in the input VCF by default.

That's inconsistent with all the other genotype fields though. If there is a GQ field defined in the header but no samples then vcf2zarr will create empty call_GQ fields.

> It's not clear to me what happens when the ploidy value that you specify disagrees with what's in the GTs. It is simplest to just raise an error for now?

Yes, that's probably best.

@jeromekelleher
Member

There are two orthogonal things here:

  1. Is GT included in the header?
  2. Are there samples present with GT data?

I think we have to deal with both? If there's no GT in the header and no samples present, we should not output call_genotype by default, I think. But we can force call_genotype to be included by specifying the ploidy value.

@tomwhite
Member Author

tomwhite commented Apr 2, 2026

I agree they are orthogonal.

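| Case | GT in header | Samples with GT | Create call_genotype |
|------|--------------|-----------------|----------------------|
| 1    | No           | No              | No                   |
| 2    | No           | Yes             | Error                |
| 3    | Yes          | No              | Yes                  |
| 4    | Yes          | Yes             | Yes                  |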

Currently for case 3 we don't create call_genotype, but I'm arguing that we should, for consistency with other format fields.
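
The four cases proposed in the table can be read as a small decision function. The following is only an illustrative sketch: `Outcome` and `call_genotype_outcome` are hypothetical names invented here, not part of vcf2zarr, and the code encodes the proposal as tabulated rather than the behaviour of any released version.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the last column of the table."""

    SKIP = "no call_genotype array"
    CREATE = "create call_genotype"
    ERROR = "error"


def call_genotype_outcome(gt_in_header: bool, samples_have_gt: bool) -> Outcome:
    """Hypothetical helper mirroring the four-case table; not vcf2zarr code."""
    if gt_in_header:
        # Cases 3 and 4: GT is declared in the header, so the array is
        # created even when no samples carry GT (case 3 gives an empty array).
        return Outcome.CREATE
    if samples_have_gt:
        # Case 2: GT data present without a header declaration.
        return Outcome.ERROR
    # Case 1: no GT in the header and none in the samples.
    return Outcome.SKIP


cases = [(False, False), (False, True), (True, False), (True, True)]
for number, (header, samples) in enumerate(cases, start=1):
    print(number, call_genotype_outcome(header, samples).name)
```

The two inputs are kept as independent booleans to reflect that they are orthogonal; ploidy is deliberately left out of this sketch.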

I think specifying ploidy is also orthogonal. In case 1 it could force creating call_genotype (although it doesn't have to), and it's needed for case 3 in order to specify diploid for example (which is why I introduced it). For case 4 we could error if it disagrees with the data, as you suggested earlier.

What do you think?

@jeromekelleher
Member

SGTM. What do we do in case 2 at the moment?

@jeromekelleher
Member

(If we're already handling case 2 sensibly I think we should leave it - experience has shown that you just have to accept malformed VCFs)

@tomwhite tomwhite force-pushed the create-empty-genotype-arrays branch from 940d62c to 9f51adc Compare April 2, 2026 16:15
@tomwhite
Member Author

tomwhite commented Apr 2, 2026

Hopefully this is a bit better. For case 1 setting ploidy doesn't have any effect. I haven't changed case 2 (or checked what it does). I've implemented case 3. And for case 4 it will error if the set ploidy is less than the maximum ploidy in the data.
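
A rough sketch of the semantics described in this comment, written as a standalone function rather than the actual implementation. Everything named here (`plan_call_genotype_shape`, its parameters, and `fallback_ploidy`) is hypothetical. In particular, falling back to a ploidy of 2 when none is given is an assumption made for illustration, and case 2 is left unmodelled because its behaviour was not changed or checked.

```python
from typing import Optional, Tuple


def plan_call_genotype_shape(
    gt_in_header: bool,
    num_variants: int,
    num_samples: int,
    max_ploidy_in_data: Optional[int],
    ploidy: Optional[int] = None,
    fallback_ploidy: int = 2,  # assumption for illustration only
) -> Optional[Tuple[int, int, int]]:
    """Return a (variants, samples, ploidy) shape, or None for no array.

    Hypothetical helper; it does not call or reproduce vcf2zarr code.
    max_ploidy_in_data is None when no sample carries GT data.
    """
    if not gt_in_header:
        if max_ploidy_in_data is not None:
            # Case 2: left as-is by the change, so not modelled here.
            raise NotImplementedError("case 2 is not modelled in this sketch")
        # Case 1: setting ploidy has no effect, no array is created.
        return None
    if num_samples == 0:
        # Case 3: create an empty array with zero samples.
        chosen = ploidy if ploidy is not None else fallback_ploidy
        return (num_variants, 0, chosen)
    # Case 4: samples with GT data are present.
    if max_ploidy_in_data is None:
        raise ValueError("case 4 requires the maximum ploidy seen in the data")
    if ploidy is not None and ploidy < max_ploidy_in_data:
        # The set ploidy is less than the maximum ploidy in the data.
        raise ValueError(
            f"ploidy {ploidy} is less than maximum ploidy {max_ploidy_in_data}"
        )
    chosen = ploidy if ploidy is not None else max_ploidy_in_data
    return (num_variants, num_samples, chosen)


print(plan_call_genotype_shape(False, 10, 0, None, ploidy=2))
print(plan_call_genotype_shape(True, 10, 0, None, ploidy=2))
print(plan_call_genotype_shape(True, 10, 3, 2))
```

What happens in case 4 when the set ploidy is greater than the maximum in the data is not stated above; the sketch assumes the set value is simply used.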

@jeromekelleher jeromekelleher added this pull request to the merge queue Apr 11, 2026
Merged via the queue into sgkit-dev:main with commit 5250446 Apr 11, 2026
13 checks passed
tomwhite added a commit to sgkit-dev/vczstore that referenced this pull request Apr 13, 2026
tomwhite added a commit to sgkit-dev/vczstore that referenced this pull request Apr 13, 2026