Based on notes by Ida Moltke.
The purpose of this first exercise is to make sure it is clear how a coalescence tree is simulated. We will use R so a little familiarity with this language will help. First, let us try to simulate a coalescence tree for five samples by hand:
- Start by drawing on a piece of paper a small circle for each of the five samples. They should be lined up on an invisible horizontal line and you should leave enough space above the circles for drawing a tree above them (which we will do shortly). We will henceforth call these five circles "nodes" and label them 1,2,3,4,5
- Also, make a list of the node names. You can either do this by hand or you can do it in R by simply writing:
R
nodes = c(1,2,3,4,5) # make the list and call it nodes
nodes # print the list
- Sample which two nodes will coalesce first (going back in time) by randomly picking two of the nodes. You can either do this by hand or you can do it in R by typing:
nodecount = length(nodes) # save the number of nodes in the variable nodecount
tocoalesce = sample(1:nodecount, size=2) # sample 2 different nodes in node list
nodes[tocoalesce[1]] # print the first node sampled
nodes[tocoalesce[2]] # print the second node sampled
If you used R then make sure you understand what the R code does before moving on.
- Sample the time it takes before these two nodes coalesce (measured from previous
coalescence event in units of 2N) by sampling from an exponential distribution with rate equal
to "nodecount choose 2" (expressed in R as
choose(nodecount,2)) where nodecount is the number of nodes in your node list. Do this in R by typing:
coalescencerate = choose(nodecount,2) # calculate the coalescent rate
coalescencetime = rexp(1, rate=coalescencerate) # sample from exponential w. that rate
coalescencetime
Make sure you understand what the R code does before moving on.
- Now, in your piece of paper, draw a node that is the sampled amount of time further up in the tree than the currently highest node (so if the currently highest node is drawn at height T then draw the new one at height T plus the sampled coalescence time) and draw a branch from each of the nodes you sampled in step 3 to this new node indicating that these two nodes coalesce at this time.
- Next, make an updated list of the nodes that are left by removing the two nodes that coalesced and instead adding the newly drawn node that represents their common ancestor. You can call the new node the next number not used as a name yet (e.g. if this is the first coalescence event you can call it 6, if it is the second coalescence event you can call it 7 etc.). You can either do this by hand or in R. If you want to do it R you can do it as follows:
nodes <- nodes[-tocoalesce] # remove the two nodes that coalesced
nodes <- c(nodes,2*5-length(nodes)-1) # add the new node
nodes # print the new list
If you used R then make sure you understand what the R code does before moving on.
- If you only have one node left in your list of remaining nodes you are done. If not, go back to step 3.
In the end you should have a tree, which is a simulation of a coalescence tree. Try to do this a couple times until you feel like you know how it is done and understand how the coalescence process works (if after drawing a few trees still don’t understand, then feel free to ask for help!).
Doing this by hand is obviously a bit tedious. So based on the R code snippets you already got, we have built a function that allows you to do this automatically (it even makes a drawing of the tree). You can use it from the course server by typing the following in R:
R
source("/home/fernando/simulatecoalescencetrees.R")
Once you have done this you can simulate and draw trees just like you just did by hand by typing the code below, which will print out ten trees on the screen:
par (mfrow=c(2,5))
for (i in c(1:10)){
print("New Tree")
yourtree <-simtree(5) # simulate tree with 5 nodes
ct<-read.tree(text=yourtree);plot(ct,cex=1.5);add.scale.bar(y=1.2,x=0.2,cex = 2,col = "red",lcol="red",lwd=3)
print(" ")
}
You should see several trees printed out in the screen. If this doesn't happen, try downloading the R script from this github website, and then running it locally in your machine (after you cd to the folder in which you downloaded the script).
R
install.packages("ape")
source("simulatecoalescencetrees.R")
Note that the code also prints the simulated coalescence times.
Based on the results you get answer the following questions:
-
Which coalescence event takes the longest on average (the first coalescence event, the second, …, or the last)? And which event takes the shortest on average?
-
Is that what you would expect? Recall that the expectation of an exponential distribution with rate lambda is 1/lambda and the coalescence rate when there are n nodes left is equal to "n choose 2", or n!/(2!(n-2)!). One can verify (after some simplification) that this is equal to n(n-1)/2. The expected time till coalescence is the inverse of the rate of coalescence, so the expected time till coalescence is equal to 2/(n(n-1)). For instance, when there are 5 nodes left, the expected coalescent time is 2/(5(5-1))=0.1 coalescent units, or 0.1* 2N generations.
-
Which coalescence event time seems to vary the most?
-
Is that what you would expect? Recall that, if we have a random variable that follows an exponential distribution with rate lambda, then its variance is equal 1/(lambda^2).
-
Finally, imagine the following case: a researcher has estimated the structure of a tree for mtDNA from a species sampled in a single location. She obtains a tree looking as follows:
Based on the structure of the tree, i.e. two groups of related individuals separated by long branches down to the root of the tree, she concludes that there must be population subdivision with two clearly differentiated groups. Based on what you have learned from the simulations, do you agree with this conclusion?


