Conversation
|
|
Hey, good catch for the OVO column ordering. I have no idea where the bug comes from, it seems to be related to group labels for PBMC that |
|
So it seems to be due to np.argsort and np.sort not sorting the weird characters (" ", "+", "/", "#", etc) of the bulk_labels the same way. I will ship a fix sometime today or this weekend. NB: I spent a bit of time looking into the PBMC dataset and it seems rather unusual: a lot of values are negative leading to non-defined logfoldchanges, on top of which there seems to be very little value diversity in the dataset. As a result, a lot of genes end up with identical ranksum, but because illico and scanpy do not compute z-score the exact same way (mathematically equivalent but programmatically different), gene ordering is impacted. The gene ordering impacts the position of NaN values in the results, but non-NaN values do match because they are very similar. I am not sure as if this dataset is a good test case. NB2: details of the programmatic differences: U = ranksum - n_tgt * (n_tgt+ 1)/2
z = U - n_ref * n_tgt / 2scanpy does: z =
ranksum - n_tgt * (n_tgt + 1 + n_ref) / 2Those are mathematically equivalent but result in differences of the order of 1.e-9 approximately, changing gene orders when ranksums are equal. |
Interesting, good observation but glad we are aware of this now. This could be another target for scanpy 2.0. I will try a different dataset but glad to have this documented. |
|
Forget my previous message which was actually kind of off topic. This PBMC dataset was actually a very good test case.
Example: let's take NB: what is the situation with the Dask issue ? Current version is not dask-compatible. |
That should come next, I want to keep this scoped for now to the in-memory stuff. Another thing if you're doing a batch of fixes this weekend - supporting Also for your |
I'm seeing some instances of https://github.com/satijalab/seurat/blob/main/R/differential_expression.R#L1089 i.e., they calculate the mean expression with this "+1" offset so don't need the correction. So we have three different things here, I guess. I think just giving the option to match scanpy (since that seems a project goal) is good, and then we can maybe revisit what seurat does as a new parameter for scanpy 2.0.
For this, I would be open to also making this part of a scanpy 2.0 set of default i.e., using Overall, my ideal scenario would be that Short of exact matches, like I said we can just document changes. It sounds like the biggest things that would close the gap are:
I think we can patch the Overall, this is really cool and I'm super happy that things seem to be pretty close! |
|
Hey, I just released version 0.5.0rc1 which should fix the issues identified on this test case:
Regarding the numerical precision question: so far I force the Testing locally, it seems everything now runs smoothly for PMBC. Let me know how it goes with the rest of the CI. |
|
@remydubois I think the issue is that you are sorting / cutting off by In any case, I think I got a little lost-in-the-sauce. I think matching p-values and scores should be enough because scanpy can internally handle LFC when it wraps I'm pushing the results of With this in mind, I am going to push ahead with integrating BTW: https://remydubois.github.io/illico/api.html looks both incomplete and like it's leaking internals. |
|
Ok everything passing locally with just getting P-value and z-score from I assume the LFC calculation (i.e., actually calculating the means) is the less computationally intensive part? |
Ok ! That's good news to me. Because this was bugging me out a little bit: I checked out on your def test_illico(test, corr_method, exp_post_agg):
from illico.asymptotic_wilcoxon import asymptotic_wilcoxon
pbmc = pbmc68k_reduced()
pbmc = ad.AnnData(pbmc.X.copy(), obs=pbmc.obs.copy(), var=pbmc.var.copy())
# ... Rest of the test. They all pass on my machineThen, all tests passed (LFC, pvals, scores, pvals_adj, up to By the way, wouldn't it be safer for
Very happy to read it's moving forward 🚀
Yes it's neglectable, I could not measure its impact on the overall runtime. |
| tie_correct=tie_correct, | ||
| use_continuity=False, | ||
| alternative="two-sided", | ||
| use_rust=False, |
There was a problem hiding this comment.
From what @remydubois said on zulip: https://scverse.zulipchat.com/#narrow/channel/315570-random/topic/Article.20on.20speeding.20up.20computation/near/581587388 it is basically not worth the headache. Furthermore, if we were to ever adopt the codebase, it would be the numba version, not the rust version. His observations about Rust largely match ours (the performance is almost always identical to numba)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
|
@remydubois Everything seems to pass with |
|
@ilan-gold good news !
That makes sense. Does it mean any specifics for illico releases (i.e: should I release/not release some features in the v0.6.0) ? |
I think you should be pretty good releasing what you want - we can always catch up here e.g., when we do a proper assessment of dask. |
TODOs:
See: https://github.com/scverse/scanpy/actions/runs/24088645078/job/70268566419?pr=4038
ovo"column" in scores (cc @remydubois)illico(see thefilterwarningson the new test)ovrresults are so far offexp_post_aggfeature feat: allow exponentiation post agg for log-fold-change inrank_genes_groups#4037) toillico, we will add theillicobackend (or all changes have been documented)illico#4012