Skip to content

PCA tutorial#301

Open
hannesbecher wants to merge 1 commit intotskit-dev:mainfrom
hannesbecher:main
Open

PCA tutorial#301
hannesbecher wants to merge 1 commit intotskit-dev:mainfrom
hannesbecher:main

Conversation

@hannesbecher
Copy link
Copy Markdown

Hi,
here is my proposed PCA tutorial. This covers:

  • PCA on genotypes vs branches
  • haplotypes vs diploid genotypes
  • simulated data only

Looking forward to opinions/comments/requests!

HELP please: for some reason the cell outputs of this tutorial don't render when I run make on my system (but they do for all other existing python tutorials). What I might be going on? I had written this as an ipynb and then converted to md.

Cheers,
Hannes

@hannesbecher hannesbecher marked this pull request as draft March 25, 2026 14:08
@hannesbecher hannesbecher marked this pull request as ready for review March 25, 2026 15:54
@hannesbecher
Copy link
Copy Markdown
Author

OK, have figured it out. format_name in the md file has to be format_name: myst, it seems.

@hannesbecher hannesbecher mentioned this pull request Mar 25, 2026
PCA.md Outdated

```{code-cell} ipython3
# load required libraries
import msprime as msp
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer simply import msprime, which matches the rest of the documentation

fstmat
```

## PCA 'by hand'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we perhaps give a quick demo of ts.pca first, in case people just want to copy/paste the code? I know that doesn't quite fit with your logical flow, however.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that would be better tutorial-wise, simple approach first, then details

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jaja...

@hyanwong
Copy link
Copy Markdown
Member

A few comments, one minor, the other a bit more major (should we demo the built-in method quickly first?). Happy to merge either way, and we can change later.

One thing though: could you possibly squash all the commits into one? Thanks @hannesbecher !

PCA.md Outdated
demography=dmg,
random_seed=1234,
sequence_length=1e6,
recombination_rate=1e-9)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe increase to the same order as mut rate?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, makes the plots look clearer.

This command should do the trick:
```
$ R -e 'install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()'
$ R -e 'options(repos=c(CRAN="http://cran.r-project.org")); install.packages(c("reticulate", "IRkernel")); IRkernel::installspec()'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the default anyway?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failed on my Mac without that addition. I guess it is not going to do any harm?

fstmat = np.zeros([nPop,nPop])
for i in range(nPop-1):
for j in range(i+1,nPop):
fstmat[i,j] = tsm.Fst([range(i*nHap,(i+1)*nHap), range((i+1)*nHap,(i+2)*nHap)])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be value in computing branch Fst too? I assume that below you will compute branch and site PCA ...

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave this for the stats tutorial(s).

PCA.md Outdated

```{code-cell} ipython3
# haplotypes, each sample haplotype is ues by default
hapBranchPca=ts.pca(num_components=10)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nitpick, but above you used ht/gt to denote haplotypes and genotypes, so maybe use the same style for results too, htBranchPca etc.

Copy link
Copy Markdown
Author

@hannesbecher hannesbecher Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

c=np.repeat([1,2,3,4,5], [nHap] * nPop))
axs[0, 0].set_title('Haplotypes (sites)')

axs[0,0].set_ylabel("PC2")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to add PC1 axis label too?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not ... doing this in a browser;)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's beautiful as it is, GG.


The plots on the left show one dot per haplotype. These have twice as many dots as the plots on the right, which show individuals. The colours indicate from which of the five islands a haplotype or individual was sampled. As expected with low geneflow, there is some grouping by island. Feel free to re-run with higher or lower values of `migRate` to see how the separations between the island samples changes.

+++
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite like it. But happy should somebody send a PR...

## Empirical data
Here, we demonstrated using simulated data how SNPs and ARG branches lead to equivalent PCA results. For empirical data, the ancestral states of variant sites are not known a priori, which will in practice often lead to polarisation differences. That may affect the outcome of PCA.

**TODO:** Extend Tutorial to empirical data.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be simplest to demonstrate this by randomly? repolarising SNP matrices and repeating the PCA on these?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or major allele -> ancestral, etc. Random as in 50:50, did not make much difference as far as I remember from the randPedPCA project. Needs more planning.

@hannesbecher
Copy link
Copy Markdown
Author

A few comments, one minor, the other a bit more major (should we demo the built-in method quickly first?). Happy to merge either way, and we can change later.

One thing though: could you possibly squash all the commits into one? Thanks @hannesbecher !

Absolutely. I can do this with the merge button, though, correct, @hyanwong ?

Added tutorial on comparing branch and SNP-based PCA using msprime and tskit, including code examples for simulating ARGs and performing PCA.

Add PCA file to the table of contents

Modify R installation command in README

Updated R package installation command to specify CRAN repository.

Refactor PCA.md for clarity and organization

Updated kernel specifications and code formatting for clarity. Added comments and organized code cells for better readability.

Revise PCA tutorial and update Jupyter metadata

Updated Jupyter Notebook metadata and improved PCA tutorial content.

Update kernelspec display name and language in PCA.md

Update PCA.md

Co-authored-by: Gregor Gorjanc <gregor.gorjanc@gmail.com>

Update PCA.md

Co-authored-by: Gregor Gorjanc <gregor.gorjanc@gmail.com>

Update PCA.md

Co-authored-by: Gregor Gorjanc <gregor.gorjanc@gmail.com>

PR comments implemented

heading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants