Admixture f3 and f4 statistics function.#1330
Conversation
jonbrenas
left a comment
There was a problem hiding this comment.
A lot of it is just comments, but the confusion around 'ana = np.sum(acc, axis=1)' needs changing, IMHO.
| ) | ||
|
|
||
| # Locate biallelic variants and deal with missingness. | ||
| ac = acc + aca + acb |
There was a problem hiding this comment.
This is because D is the outgroup, its been a very long time since the stats function was written so my memory is hazy but I remember considering this point and I believe there would much less or not enough variation for analysis if the outgroup was also taken into account. We do not except them to share anywhere near as much variation as the other populations.
| # Locate biallelic variants and deal with missingness. | ||
| ac = acc + aca + acb | ||
| loc_bi = allel.AlleleCountsArray(ac).is_biallelic() | ||
| ana = np.sum(acc, axis=1) |
There was a problem hiding this comment.
That's probably not going to cause an error, but that's very confusing!
There was a problem hiding this comment.
Please could you give some more explanation here/a suggested change. Thanks.
There was a problem hiding this comment.
Well, ana should be np.sum(aca, axis=1), not np.sum(acc, axis=1), I believe.
|
|
||
| # Compute f4 statistic. | ||
| blen = acc_seg.shape[0] // n_jack | ||
| f4, se, z, _, _ = allel.average_patterson_d( |
There was a problem hiding this comment.
I am not an expert here, but I think D and f4 are slightly different. The code for patterson_d states that you can get f4 by ignoring the den but the average_patterson_d doesn't do that, and thus returns (as the name implies) D and not f4.
There was a problem hiding this comment.
You are correct, the description is slightly off and should be revised to D.
| ) | ||
| def patterson_f3( | ||
| self, | ||
| recipient_query: base_params.sample_query, |
There was a problem hiding this comment.
Given the lack of symmetry between the various queries, giving them different parameter descriptions might be helpful.
There was a problem hiding this comment.
Although it is called recipient query here, it is basically the same (just a sample query) and is more to avoid confusion over which population to test.
There was a problem hiding this comment.
I think a "naive" user asking for these function's help should be given more info on the assumptions associated to each query than the current:
"""
A pandas query string to be evaluated against the sample metadata, to
select samples to be included in the returned data. E.g., "country ==
'Uganda'". If the query returns zero results, a warning will be
emitted with fuzzy-match suggestions for possible typos or case
mismatches.
"""
For instance, while it makes sense that cohort D is the outgroup in f4/D, I think it is worth telling the user explicitly, in particular when it is treated differently.
| mapping = np.empty((n_variants, n_alleles), dtype=np.int32) | ||
| mapping[:] = -1 | ||
| # Iterate over variants. | ||
| for i in range(n_variants): |
There was a problem hiding this comment.
This doesn't seem optimised but I don't have a better approach in mind right now.
|
|
||
| params = dict( | ||
| recipient_query=f"taxon == {taxa[0]!r}", | ||
| source1_query=f"taxon == {taxa[1]!r}", |
There was a problem hiding this comment.
If source1_query == source2_query, isn't it technically testing f2?
There was a problem hiding this comment.
Unfortunately this is the only way to run tests on this function, since we do not have three taxa we can access to generate the simulated data in the repository.
There was a problem hiding this comment.
Couldn't you use cohort_admin2_month instead of taxa? There are at least 3, I believe.
| assert isinstance(f3, np.float64) | ||
| assert isinstance(se, np.float64) | ||
| assert isinstance(z, np.float64) | ||
| assert -0.1 <= f3 <= 1 |
There was a problem hiding this comment.
Patterson 2012 says that "The observation of a significantly negative value of f3(C; A, B) is thus evidence of complex phylogeny in C." which implies that f3 is not always >= -0.1, does it not?
There was a problem hiding this comment.
Actually, now that I think about it, it circles back to my comment on f2 being tested as f2 cannot be significantly negative.
There was a problem hiding this comment.
I can change this to -1 to 1 to be safe.
New branch with f3 and f4 admixture stats added including tests and example notebook. Resolves historical issue #705 . Please note this was created in VS code but due to a WSL issue new files and changes were then added to workbench to push from there.