If you want to see this document with all the results (numbers and plots) included, click here.
You can find the slides from the lecture here.
You sequenced the genomes of four Africans and four Eurasians and got genotypes from a single chromosome from each of them (so you have genotypes of four African and four Eurasian chromosomes). Unfortunately, there’s been a mix up in the lab and you don’t know which one is which! You only know that they are labeled A, B, C, …, H. What a disaster!
Fortunately, you also have genotypes from three other individuals whose identity you know for certain: an African, a Neanderthal, and a chimpanzee. This means that you are able to compute an $f_4$ statistic which will test for evidence of Neanderthal introgression in a given sample $X$.
Can you save the day and determine which of the A, B, C, …, H samples are African and which are Eurasian based on the following $f_4$ statistic test?
Recall that only Eurasians are expected to have appreciable amounts of Neanderthal ancestry but Africans don’t.
Type R
in your terminal or (better) just use RStudio R console if you
have it.
First load the genotype table into R:
gt <- read.table(url("https://github.com/bodkan/ku-introgression2024/raw/main/genotypes_ex1.tsv"), sep = "\t", header = TRUE)
Familiarize yourself with the data by running this R command which shows information from only the first few genotypes:
head(gt)
The gt
data set is a plain R data frame where each column contains the
genotype of that individual (0
- ancestral allele, 1
- derived
allele).
How many positions of the genome do we have genotyped?
nrow(gt)
You can extract all the genotypes of a given individual by using the
$
or [[
subsetting operators of R data frames like this:
gt$African
gt[["African"]]
A useful trick for comparing two chromosomes in their entirety is to rely in the fact that R can perform vectorized operations (operations performed on multiple elements of a vector at once). For instance, if this gives us the genotypes of an African and Neanderthal chromosome:
gt[["African"]]
gt[["Neanderthal"]]
then we can find positions at which those two samples carry the same allele like this:
gt[["African"]] == gt[["Neanderthal"]] # this gives us TRUE/FALSE values
Counting those positions can be done using the sum()
function like
this:
# sum() treats TRUE as 1 and FALSE as 0, so we can sum everything up!
# -- this gives us the number of positions at which an African carries the same
# allele as the Neanderthal
sum(gt[["African"]] == gt[["Neanderthal"]])
So the answer to this task’s question can be computed as:
sum(gt[["African"]] == gt[["Chimp"]])
sum(gt[["Neanderthal"]] == gt[["Chimp"]])
sum(gt[["African"]] == gt[["Neanderthal"]])
Does this make sense from a phylogenetic point of view?
Above we computed alleles which agree between two samples.
On the other hand, this would count how many alleles are different between a African and chimpanzee chromosome:
sum(gt[["African"]] != gt[["Chimp"]]) # note the != instead of ==
Inside the sum()
function we can compose multiple logical conditions
to create more complex comparison operations using the &
operator (AND
operation in mathematical logic).
Armed with this knowledge, we can compute the BABA and ABBA counts using this bit of R code:
X <- "A"
abba <- sum(
(gt[["African"]] == gt[["Chimp"]]) & # filters for A**A sites
(gt[["African"]] != gt[["Neanderthal"]]) & # filters for A*B* sites
(gt[[X]] == gt[["Neanderthal"]]) # filters for *BB* sites
) # together then ABBA
baba <- sum(
(gt[["African"]] != gt[["Chimp"]]) & # filters for B**A sites
(gt[["African"]] == gt[["Neanderthal"]]) & # filters for B*B* sites
(gt[[X]] == gt[["Chimp"]]) # filters for *A*A sites
) # together then BABA
From these counts we can get an idea about whether one or the other are more frequently appearing in the data:
baba - abba
Finally, we can compute an
f4_value <- (baba - abba) / nrow(gt)
f4_value
Task: Are ABBA or BABA sites the only ones in our data? For instance, can you find if there are any AAAB sites for the quartet $f_4(\textrm{AFR, X; NEA, Chimp})$ ? Would those be useful for studying introgression?
X <- "A"
aaab <- sum(
(gt[["African"]] == gt[[X]]) & # filters for AA** sites
(gt[[X]] == gt[["Neanderthal"]]) & # filters for *AA* sites
(gt[["Neanderthal"]] != gt[["Chimp"]]) # filters for **AB sites
)
aaab
You know that if X
is a African, you expect to see roughly the same
count of BABA
and ABBA
site patterns, so the difference should “be
about zero”. Use the code above to compute baba
, abba
, and
f4_value
to compute the
[If you are more familiar with R, compute the counts automatically in a loop of some kind and make a figure.]
What does it mean for this test statistic to “be about zero”? What are we missing to truly use this as a statistical significance test?
If you’re not comfortable with R, feel free to run this in full and answer the questions based on the results you get:
X <- c("A", "B", "C", "D", "E", "F", "G", "H")
f4_values <- sapply(X, function(x) {
abba <- sum(
(gt[["African"]] == gt[["Chimp"]]) & # filters for A**A sites
(gt[["African"]] != gt[["Neanderthal"]]) & # filters for A*B* sites
(gt[[x]] == gt[["Neanderthal"]]) # filters for *BB* sites
) # together then ABBA
baba <- sum(
(gt[["African"]] != gt[["Chimp"]]) & # filters for B**A sites
(gt[["African"]] == gt[["Neanderthal"]]) & # filters for B*B* sites
(gt[[x]] == gt[["Chimp"]]) # filters for *A*A sites
) # together then BABA
(baba - abba) / nrow(gt)
})
data.frame(f4_values) # to print values in a neater table
plot(f4_values, xaxt = "n", xlab = "test sample", ylab = "f4(African, X; Neanderthal, Chimp)")
abline(h = 0, lty = 2, col = "red")
axis(side = 1, at = seq_along(X), labels = X)
We can see that the samples A-D are consistent with an
On the other hand, samples E-H show a much “more negative value of the
Important: In this simple example we’re missing confidence intervals
– those would allow us to do a proper statistical test to determine for
which samples we really cannot reject a null hypothesis of no gene flow
from Neanderthals. This would allow us to avoid the vague and
statistically unsatisfying talk about some value being “almost zero”,
and some other value being “much more negative” than that. The
confidence interval for a given
Real-world software such as ADMIXTOOLS computes confidence intervals using a so-called bootstrap procedure across windows along a genome.
If you want to take a closer look at how the genotype data was prepared (it was simulated!), you can see the complete code here.
Having saved the day by identifying which of the A-H samples are of African or Eurasian origin by testing which of them appear to carry evidence of Neanderthal introgression, you now want to estimate how much of their genome derives from the Neanderthals. In order to do this, you need to compute a ratio of $f_4$ values as described in the lecture.
Of course, in order to compute this $f_4$-ratio estimate, you will need
“another Neanderthal” genome! Luckily, we now have genomes of several
Neanderthals so this is not an issue and a local friendly
bioinformatician has already presciently merged your gt
genotype table
from the first exercise with the genotypes of “another Neanderthal”.
Estimate the proportion of Neanderthal ancestry in each of your A-H samples!
Type R
in your terminal or (better) just use RStudio R console if you
have it.
You will be using the same genotype table as in the previous exercise,
with one additional column called another_Neanderthal
. You can load it
again like this:
gt <- read.table(url("https://github.com/bodkan/ku-introgression2024/raw/main/genotypes_ex2.tsv"), sep = "\t", header = TRUE)
As always, verify that the format of the data set you have matches what you expect:
head(gt)
nrow(gt)
From the lecture you know that we can get an estimate for the proportion
of Neanderthal ancestry in a sample
To do this, we can take the gt
):
# we can compute the f4 values for everyone (A-H samples as well as
# "another_neanderthal") using the same code as above
X <- c("A", "B", "C", "D", "E", "F", "G", "H",
"another_Neanderthal") # <---- we added this to our loop
f4_values <- sapply(X, function(x) {
abba <- sum(
(gt[["African"]] == gt[["Chimp"]]) & # filters for A**A sites
(gt[["African"]] != gt[["Neanderthal"]]) & # filters for A*B* sites
(gt[[x]] == gt[["Neanderthal"]]) # filters for *BB* sites
) # together then ABBA
baba <- sum(
(gt[["African"]] != gt[["Chimp"]]) & # filters for B**A sites
(gt[["African"]] == gt[["Neanderthal"]]) & # filters for B*B* sites
(gt[[x]] == gt[["Chimp"]]) # filters for *A*A sites
) # together then BABA
(baba - abba) / nrow(gt)
})
# to arrive at the estimate of Neanderthal ancestry, we divide f4 values for
# samples A-H by f4 value comparing the two Neanderthals
proportions <- f4_values / f4_values["another_Neanderthal"]
data.frame(proportions) # to print values in a neater table
How much Neanderthal ancestry did you estimate in Africans vs Eurasians? Do those numbers fit what you’ve learned from the lecture?
To make the results clearer to see, let’s visualize them:
plot(proportions[-length(proportions)] * 100,
xlab = "test individual", ylab = "proportion of Neanderthal ancestry [%]",
ylim = c(0, 10))
abline(h = 3, lty = 2, col = "red")
Why did we not plot the proportion of Neanderthal ancestry in the very
last item of the proportions
variable? What does that last element of
the vector proportions
contain and why?
If you want to take a closer look at how the genotype data was prepared (it was simulated!), you can see the complete code here.
You sequenced 100 diploid genomes from a Eurasian population and are interested in estimating the time of Neanderthal introgression into the ancestors of this population. The literature suggests that the introgression happened around 55 thousand years ago – does this also apply to the population that you sequenced?
To be able to do this, you ran an inference software which gives you the exact coordinates of Neanderthal DNA tracts present in every Eurasian genome that you sequenced. This of course means that you also know the lengths of each of those tracts.
Use the distribution of the Neanderthal tract lengths in your population to estimate the time of Neanderthal introgression!
Type R
in your terminal or (better) just use RStudio R console if you
have it.
First load the table with coordinates of all Neanderthal tracts into R:
tracts <- read.table(url("https://github.com/bodkan/ku-introgression2024/raw/main/tracts.tsv"), sep = "\t", header = TRUE)
Familiarize yourself with the data by running this R command which shows information from only the first few genotypes:
head(tracts)
For how many individuals do we have information about Neanderthal DNA tracts that they carry?
length(unique(tracts$individual))
It looks like the inference software (or a helpful bioinformatician)
binned each tract according to its length (see the column bin
). What
does the distribution of Neanderthal tract lengths looks like in your
data? Knowing that recombination has acted on the introgressed
Neanderthal DNA over time, each generation, suggests that the
distribution should look exponential – do you see this in the data? To
answer this, plot the proportion of tracts in each bin.
# get the bin numbers
bins <- sort(unique(tracts$bin))
# count the tracts in each bin and compute the proportion of tracts in each bin
counts <- as.integer(table(tracts$bin))
props <- counts / sum(counts)
plot(bins, props, xlab = "Neanderthal tract length bin", ylab = "proportion of tracts")
The distribution does, indeed, look quite exponential. Next we will try to use this to date Neanderthal introgression using the information encoded in the distribution of tract lengths!
As we know, over time since admixture, recombination breaks up longer
haplotypes into shorter ones, regularly almost like a clock. And it
turns out that the distribution of tract lengths after time
where the
The
It also turns out that the expected value of this exponential
distribution plotted above (which can be
computed
simply as
Of course, you can also compute this expectation from the data, so do
this now: get an estimate of the expected length of an introgressed
fragment after (unknown) time
L <- mean(tracts$length)
L # length in units of base pairs
Taking the simple math above together and doing a little algebra, we can
express the average expected length of an introgressed fragment after
time
Because we know
where r
is the recombination rate and L
is the average length of an
introgressed Neanderthal tract (as you just computed it).
Note that this time estimate will be in units of generations, so we’ll have to multiply this quantity by the generation time (roughly 30 years for humans) to get time in years before the present.
Use this simple equation
r <- 1e-8 # crossovers per bp per generation
L <- mean(tracts$length) # average tract length after time t
t <- 1 / (r * L)
t
# convert the time of introgression to years before present assuming
# generation time of 30 years
t * 30
You should get a value will be quite close to ~55 thousand years ago, an estimate which is often found in the literature as the time when Neanderthals and anatomically modern humans interbred!
As a last sanity check, if we use this time to compute the rate of
exponential decay
plot(bins, props, xlab = "Neanderthal tract length bin", ylab = "proportion of tracts")
Let’s try if we can plot the theoretical exponential decay from the estimated time of admixture.
First, because the exponential tract decay you plotted above shows the
decay of the length of introgressed tracts in bins, the
# compute the average length in each bin
average_bins <- aggregate(length ~ bin, data = tracts, FUN = mean)
# compute the difference between bins to get "average bin step size"
bin_step <- mean(diff(average_bins$length))
bin_step
With this, we can overlay the theoretical exponential decay expectation across the empirical distribution of tract lengths in different bins:
r <- 1e-8 # crossovers per bp per generation
t <- 1800 # time of admixture (in generations) we computed above
# plot the empirical proportions of tracts in each bin
plot(bins, props, xlab = "Neanderthal tract length bin", ylab = "proportion of tracts")
# overlay theoretical exponential density curve assuming recombination rate r
# and time of introgression t generations ago
lambda <- r * bin_step * t
y <- dexp(bins, rate = lambda)
lines(bins, y, col = "red", lty = 2)
If you want to take a closer look at how the tracts data was prepared (it was simulated!), you can see the complete code here.
This work is licensed under a Creative Commons Attribution 4.0 International License.