Genotype imputation: a simple yet powerful method to infer the missing genome
Riding the wave of progress in efficiency and falling computational costs described by Moore's law, whole-genome sequencing (WGS) and whole-exome sequencing (WES) are becoming increasingly accessible to research and industry applications, but their price, still in the thousands of dollars, prevents massive adoption. However, because a very large proportion of the genome does not vary between people, the question remains whether it makes sense to sequence the whole genome or exome at all, or rather to focus on single nucleotide polymorphisms (SNPs), the single-base positions where variability in the general population is greatest. This is the main driving force behind SNP genotyping arrays, which are massively parallel, fast, and cost-efficient; genotyping a sample on an array costs roughly a tenth of the price of WGS or WES. Unfortunately, there is no free lunch. While an array gives us genetic information on about 500,000 to 1,000,000 SNPs per person, there are many more that were not part of the panel and are therefore unaccounted for.
This is where genotype imputation comes in.
Scientists devised a way to estimate the remaining SNPs using the information from the SNPs that were genotyped. But to understand this concept, let's quickly delve into the biology of genetic inheritance.
DNA recombination and mutation rate
Our genetic material is diploid (2n), which means each of our cells contains one haplotype from our father and one from our mother. During meiosis, which occurs in the testes or ovaries, DNA recombination starts at various sites along each chromosome when a strand from one haplotype invades the other, mixing the genetic material inherited from both of our parents. Starting with one cell, we are left with four cells, each containing a haploid (n) set of genetic material. I encourage you to watch the following video to fully understand the beauty of this process.
It would be way too simple to assume recombination occurs uniformly throughout the genome. In honor of Thomas Hunt Morgan, who famously researched fruit flies, a unit of genetic distance called the centimorgan was put into use. A centimorgan (cM), or map unit (m.u.), is a unit for measuring genetic linkage. It does not correspond to a fixed physical distance along the genome but is instead calibrated so that one centimorgan corresponds to a 1 % chance of recombination between a pair of loci in a single generation.
The centimorgan, it turns out, is a very important metric when imputing the genome. If two SNPs are very close together in terms of centimorgans, the likelihood of a recombination event separating them is small. Conversely, if the loci are much further apart, the likelihood that a recombination lands between them is correspondingly larger.
Therefore, if we know the SNP at locus A and wish to infer the SNP at locus B, a distance of X centimorgans away, we can probabilistically calculate the chance that the two SNPs come from the same reference haplotype.
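To make this quantitative, one standard way to turn genetic distance into a recombination probability is Haldane's map function, which models crossovers as a Poisson process along the chromosome. Note that this is a classical genetics formula I am using for illustration; it is not necessarily the exact mapping used by any particular imputation tool:

$$r = \frac{1}{2}\left(1 - e^{-2d}\right)$$

Here $d$ is the genetic distance in Morgans ($d = X/100$ for a distance of $X$ centimorgans) and $r$ is the probability that an odd, and therefore observable, number of crossovers separates the two loci. For small distances $r \approx d$, which recovers the intuition that 1 cM corresponds to roughly a 1 % chance of recombination.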
What is a reference haplotype?
A reference haplotype is a haplotype for which we have the full genomic sequence. Reference haplotypes are often collected in panels, the most notable of which is the Haplotype Reference Consortium (HRC) panel, which contains 64,976 haplotypes. Think of them like cars in a scrapyard: there are numerous cars to choose from, but in the end everyone picks only two, with the additional freedom to use spare parts from several different cars.
To continue with the car analogy, genotype imputation works like this. If the left door on one of your two cars is red, then the likelihood of the other door being red, even if we haven't seen it, is also very high. This is analogous to a short genetic distance. On the other hand, having seen both doors, you still have little information about the color or type of the exhaust system; the only thing that can give you a hint may be the brand of the car. This corresponds to a large genetic distance, where the chance of recombination is high. Note that I am still not looking at the other car. That is because each haplotype is considered independent of the other, unless your parents are related.
How to build a mathematical model on top of this?
Unlike many other domains, handling genetic data is very intensive both computationally and in terms of memory. With that in mind, there are several requirements such a model needs to meet:
- The time complexity must not scale exponentially with data size.
- The memory footprint must not exceed common RAM capacities.
- The model needs to be fast and cheap to run, ideally several orders of magnitude cheaper than genotyping the sample on an array in the first place.
A hidden Markov model (HMM) is an elegant and common solution that fits all the criteria. Li and Stephens pioneered the use of HMMs in the area of genotype imputation, and ever since, their model has been the foundation on which subsequent authors have built newer models to achieve the same goal.
To fully understand the mechanics of the model, watch this series of videos by NormalizedNerd where he clearly explains the mathematics behind the model and how the time and memory complexity neatly scale down using dynamic programming.
In general, an HMM is composed of hidden states and observable states. Hidden states have transition probabilities (i.e., what is the probability of landing on hidden state B when starting from A), while the observables have emission probabilities. One of its major features is that the likelihood of a hidden state occurring is conditioned only on the previous state, which greatly reduces the computational cost. In this setting, the hidden state at each marker is the reference haplotype we are copying from, and the observables are the alleles we actually genotyped. Emission probabilities account for mutations, while transition probabilities account only for recombination. Because most SNPs are biallelic, we will abbreviate the alleles as being either 0 or 1, regardless of their position. The algorithm then works like this:
- First, we start at the first genotyped allele in a given window. Imputation is usually performed in windows of fixed base-pair length due to memory constraints.
- When we reach the next locus, one that has not been genotyped, we calculate the probability of recombination between the previous and the current marker, adjusted for the size of the reference panel (Eq. 1).
- We use the forward-backward algorithm to compute, for every locus, the posterior probability of each hidden state (i.e., of copying from each reference haplotype) given the whole sequence of observed alleles.
- We obtain posterior allele probabilities by summing the state probabilities over all reference haplotypes that carry the same allele and choosing the allele with the highest posterior probability (Eq. 2, Browning et al., 2021). A minimal code sketch of these steps follows below.
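To make the recipe above concrete, here is a minimal, self-contained sketch of a Li-Stephens-style forward-backward imputation of a single haplotype in Python. Treat it as an illustration of the technique, not as the actual implementation of Beagle or any other tool: the switch-probability parameterization (with an assumed effective population size `n_e`), the mismatch rate `theta`, and all names are my own choices standing in for Eq. 1 and Eq. 2.

```python
import numpy as np

def impute_haplotype(obs, ref, cm_pos, n_e=10_000, theta=1e-3):
    """Impute missing alleles of one haplotype with a Li-Stephens-style HMM.

    obs    : length-M array of observed alleles (0/1), -1 where missing
    ref    : (H, M) array of reference haplotypes (0/1)
    cm_pos : length-M array of marker positions in centimorgans
    n_e    : assumed effective population size (illustrative value)
    theta  : allele mismatch probability (mutation / genotyping error)
    """
    H, M = ref.shape

    # Emission: probability of the observed allele when copying from each
    # reference haplotype; uninformative (all ones) at missing markers.
    def emission(j):
        if obs[j] < 0:
            return np.ones(H)
        return np.where(ref[:, j] == obs[j], 1.0 - theta, theta)

    # Transition: probability that a recombination switches the copied
    # haplotype between adjacent markers, shrunk by the panel size H
    # (one common parameterization, standing in for the article's Eq. 1).
    d = np.diff(cm_pos) / 100.0                    # distances in Morgans
    rho = 1.0 - np.exp(-4.0 * n_e * d / H)

    # Forward pass, rescaled at every marker to avoid numerical underflow.
    fwd = np.empty((M, H))
    fwd[0] = emission(0) / H
    fwd[0] /= fwd[0].sum()
    for j in range(1, M):
        fwd[j] = emission(j) * ((1.0 - rho[j - 1]) * fwd[j - 1]
                                + rho[j - 1] / H * fwd[j - 1].sum())
        fwd[j] /= fwd[j].sum()

    # Backward pass, same rescaling trick.
    bwd = np.empty((M, H))
    bwd[M - 1] = 1.0
    for j in range(M - 2, -1, -1):
        tmp = emission(j + 1) * bwd[j + 1]
        bwd[j] = (1.0 - rho[j]) * tmp + rho[j] / H * tmp.sum()
        bwd[j] /= bwd[j].sum()

    # Posterior probability of copying each reference haplotype, then
    # posterior allele probabilities by summing over haplotypes carrying
    # allele 1 at each marker (the Eq. 2 step described above).
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)
    p_alt = (post * ref.T).sum(axis=1)             # P(allele = 1) per marker
    imputed = np.where(obs < 0, (p_alt > 0.5).astype(int), obs)
    return imputed, p_alt

# Toy usage: four reference haplotypes, five markers, marker 2 missing.
# A small n_e is used here only because the toy panel is tiny.
ref = np.array([[0, 0, 0, 0, 0],
                [1, 1, 1, 1, 1],
                [0, 0, 1, 1, 1],
                [1, 1, 0, 0, 0]])
obs = np.array([1, 1, -1, 1, 1])
cm_pos = np.array([0.0, 0.05, 0.10, 0.15, 0.20])
imputed, p_alt = impute_haplotype(obs, ref, cm_pos, n_e=100)
print(imputed)   # -> [1 1 1 1 1]; the best-matching haplotype carries a 1
```

Note how the forward and backward recursions only ever touch the adjacent marker, which is exactly the Markov property mentioned earlier: the cost is O(M x H) rather than exponential, which is what makes the requirements listed above attainable.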
Caveats to watch out for
The most important problem when imputing the genome is that low-frequency variants are imputed very poorly. This stems from the fact that their relationship with neighboring markers is much more stochastic, i.e., determined more by chance. Browning et al. suggest a minor allele frequency (MAF) of no less than 5 % or, in absolute terms, a minor allele count (MAC) of 1000; a sketch of such a filter follows below.
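If you want to apply such a cutoff yourself, a simple post-imputation filter might look like the following sketch. The function name and default thresholds are merely illustrative, taken from the numbers quoted above:

```python
import numpy as np

def filter_rare_variants(genotypes, maf_min=0.05, mac_min=1000):
    """Keep only variants that pass a MAF or MAC threshold.

    genotypes : (n_samples, n_variants) matrix of allele counts (0, 1, or 2)
    Returns the filtered matrix and a boolean mask of retained variants.
    """
    n = genotypes.shape[0]
    alt = genotypes.sum(axis=0)                    # alt-allele count per variant
    freq = alt / (2 * n)                           # alt-allele frequency
    maf = np.minimum(freq, 1 - freq)               # minor allele frequency
    mac = np.minimum(alt, 2 * n - alt)             # minor allele count
    keep = (maf >= maf_min) | (mac >= mac_min)     # either criterion suffices
    return genotypes[:, keep], keep
```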
Another problem, not pointed out before, is the choice of centimorgans as the metric of genetic recombination. Their main flaw is that they capture only an odd number of crossover events between two loci: an even number of crossovers restores the original combination of markers and is therefore invisible (as when the Holliday junctions do not resolve in a crossover).