Skip to content

Admixture f3 and f4 statistics function.#1330

Open
KellyLBennett wants to merge 3 commits into
masterfrom
f3-f4-stats
Open

Admixture f3 and f4 statistics function.#1330
KellyLBennett wants to merge 3 commits into
masterfrom
f3-f4-stats

Conversation

@KellyLBennett

@KellyLBennett KellyLBennett commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

New branch with f3 and f4 admixture stats added including tests and example notebook. Resolves historical issue #705 . Please note this was created in VS code but due to a WSL issue new files and changes were then added to workbench to push from there.

@jonbrenas jonbrenas left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of it is just comments, but the confusion around 'ana = np.sum(acc, axis=1)' needs changing, IMHO.

)

# Locate biallelic variants and deal with missingness.
ac = acc + aca + acb

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no acd?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because D is the outgroup, its been a very long time since the stats function was written so my memory is hazy but I remember considering this point and I believe there would much less or not enough variation for analysis if the outgroup was also taken into account. We do not except them to share anywhere near as much variation as the other populations.

# Locate biallelic variants and deal with missingness.
ac = acc + aca + acb
loc_bi = allel.AlleleCountsArray(ac).is_biallelic()
ana = np.sum(acc, axis=1)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably not going to cause an error, but that's very confusing!

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please could you give some more explanation here/a suggested change. Thanks.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, ana should be np.sum(aca, axis=1), not np.sum(acc, axis=1), I believe.


# Compute f4 statistic.
blen = acc_seg.shape[0] // n_jack
f4, se, z, _, _ = allel.average_patterson_d(

@jonbrenas jonbrenas Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not an expert here, but I think D and f4 are slightly different. The code for patterson_d states that you can get f4 by ignoring the den but the average_patterson_d doesn't do that, and thus returns (as the name implies) D and not f4.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, the description is slightly off and should be revised to D.

)
def patterson_f3(
self,
recipient_query: base_params.sample_query,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the lack of symmetry between the various queries, giving them different parameter descriptions might be helpful.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although it is called recipient query here, it is basically the same (just a sample query) and is more to avoid confusion over which population to test.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a "naive" user asking for these function's help should be given more info on the assumptions associated to each query than the current:
"""
A pandas query string to be evaluated against the sample metadata, to
select samples to be included in the returned data. E.g., "country ==
'Uganda'". If the query returns zero results, a warning will be
emitted with fuzzy-match suggestions for possible typos or case
mismatches.
"""
For instance, while it makes sense that cohort D is the outgroup in f4/D, I think it is worth telling the user explicitly, in particular when it is treated differently.

mapping = np.empty((n_variants, n_alleles), dtype=np.int32)
mapping[:] = -1
# Iterate over variants.
for i in range(n_variants):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem optimised but I don't have a better approach in mind right now.


params = dict(
recipient_query=f"taxon == {taxa[0]!r}",
source1_query=f"taxon == {taxa[1]!r}",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If source1_query == source2_query, isn't it technically testing f2?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this is the only way to run tests on this function, since we do not have three taxa we can access to generate the simulated data in the repository.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you use cohort_admin2_month instead of taxa? There are at least 3, I believe.

assert isinstance(f3, np.float64)
assert isinstance(se, np.float64)
assert isinstance(z, np.float64)
assert -0.1 <= f3 <= 1

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patterson 2012 says that "The observation of a significantly negative value of f3(C; A, B) is thus evidence of complex phylogeny in C." which implies that f3 is not always >= -0.1, does it not?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, now that I think about it, it circles back to my comment on f2 being tested as f2 cannot be significantly negative.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change this to -1 to 1 to be safe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants