Admixture f3 and f4 statistics function. by KellyLBennett · Pull Request #1330 · malariagen/malariagen-data-python

KellyLBennett · 2026-06-10T14:46:15Z

New branch with f3 and f4 admixture stats added including tests and example notebook. Resolves historical issue #705 . Please note this was created in VS code but due to a WSL issue new files and changes were then added to workbench to push from there.

jonbrenas

A lot of it is just comments, but the confusion around 'ana = np.sum(acc, axis=1)' needs changing, IMHO.

jonbrenas · 2026-06-11T13:12:30Z

+        )
+
+        # Locate biallelic variants and deal with missingness.
+        ac = acc + aca + acb


Why no acd?

This is because D is the outgroup, its been a very long time since the stats function was written so my memory is hazy but I remember considering this point and I believe there would much less or not enough variation for analysis if the outgroup was also taken into account. We do not except them to share anywhere near as much variation as the other populations.

jonbrenas · 2026-06-11T13:13:55Z

+        # Locate biallelic variants and deal with missingness.
+        ac = acc + aca + acb
+        loc_bi = allel.AlleleCountsArray(ac).is_biallelic()
+        ana = np.sum(acc, axis=1)


That's probably not going to cause an error, but that's very confusing!

Please could you give some more explanation here/a suggested change. Thanks.

Well, ana should be np.sum(aca, axis=1), not np.sum(acc, axis=1), I believe.

jonbrenas · 2026-06-11T13:32:47Z

+
+        # Compute f4 statistic.
+        blen = acc_seg.shape[0] // n_jack
+        f4, se, z, _, _ = allel.average_patterson_d(


I am not an expert here, but I think D and f4 are slightly different. The code for patterson_d states that you can get f4 by ignoring the den but the average_patterson_d doesn't do that, and thus returns (as the name implies) D and not f4.

You are correct, the description is slightly off and should be revised to D.

jonbrenas · 2026-06-11T13:34:01Z

+    )
+    def patterson_f3(
+        self,
+        recipient_query: base_params.sample_query,


Given the lack of symmetry between the various queries, giving them different parameter descriptions might be helpful.

Although it is called recipient query here, it is basically the same (just a sample query) and is more to avoid confusion over which population to test.

I think a "naive" user asking for these function's help should be given more info on the assumptions associated to each query than the current:
"""
A pandas query string to be evaluated against the sample metadata, to
select samples to be included in the returned data. E.g., "country ==
'Uganda'". If the query returns zero results, a warning will be
emitted with fuzzy-match suggestions for possible typos or case
mismatches.
"""
For instance, while it makes sense that cohort D is the outgroup in f4/D, I think it is worth telling the user explicitly, in particular when it is treated differently.

jonbrenas · 2026-06-11T13:42:26Z

+    mapping = np.empty((n_variants, n_alleles), dtype=np.int32)
+    mapping[:] = -1
+    # Iterate over variants.
+    for i in range(n_variants):


This doesn't seem optimised but I don't have a better approach in mind right now.

jonbrenas · 2026-06-11T13:47:40Z

+
+    params = dict(
+        recipient_query=f"taxon == {taxa[0]!r}",
+        source1_query=f"taxon == {taxa[1]!r}",


If source1_query == source2_query, isn't it technically testing f2?

Unfortunately this is the only way to run tests on this function, since we do not have three taxa we can access to generate the simulated data in the repository.

Couldn't you use cohort_admin2_month instead of taxa? There are at least 3, I believe.

jonbrenas · 2026-06-11T13:53:05Z

+    assert isinstance(f3, np.float64)
+    assert isinstance(se, np.float64)
+    assert isinstance(z, np.float64)
+    assert -0.1 <= f3 <= 1


Patterson 2012 says that "The observation of a significantly negative value of f3(C; A, B) is thus evidence of complex phylogeny in C." which implies that f3 is not always >= -0.1, does it not?

Actually, now that I think about it, it circles back to my comment on f2 being tested as f2 cannot be significantly negative.

I can change this to -1 to 1 to be safe.

f3f4 stats

76fd926

KellyLBennett requested review from ahernank, cclarkson and jonbrenas June 10, 2026 14:46

KellyLBennett self-assigned this Jun 10, 2026

Jon Brenas added 2 commits June 10, 2026 17:49

Linting

e6df5cc

More linting

e80b4a2

jonbrenas requested changes Jun 11, 2026

View reviewed changes

Conversation

KellyLBennett commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonbrenas left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jonbrenas Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

KellyLBennett commented Jun 10, 2026 •

edited

Loading

jonbrenas Jun 11, 2026 •

edited

Loading