Unpacking GREML (#3): Genomic Relationship Matrices

In part two of Unpacking GREML, I described the idea of ridge regression for solving for random SNP effects ($\beta_1…\beta_p$) in the following mixed linear model:

$$ \begin{equation} \mathbf{y = X\beta + e} \end{equation} $$ $$ \begin{equation} \hat{\beta} = \mathbf{(X^TX + \lambda I)^{-1}X^Ty} \end{equation} $$

Indeed, this approach (known as SNP-BLUP) is how you would calculate SNP effects. In GREML, however, we do something slightly different. This is because the variance estimate you would obtain for the above model would explain the variance of allelic substitution effects. You can imagine allelic substitution effects as the change in phenotypic value that occurs when an allele at a locus is switched from one to another, but all other variables are held constant. We aren’t actually interested in the variance of these effects, we are interested in the additive genetic variance: how genetic differences between individuals lead to phenotypic variation.

I think the difference between these variances is best illustrated by thinking about allele frequencies. The variance of allelic substitution effects is unaffected by the frequency of the alleles in the population. Regardless of whether you have a lot of allele $A_1$ in the population, or little of $A_1$ in the population, the effect of substituting $A_1$ for $A_2$ is always the same if the locus acts additively (this is not true if there is dominance, but this series ignores dominance). However, if $A_1$ is a very rare allele, the amount of phenotypic variation explained by its locus will be low, compared to if the frequency of $A_1$ was 0.5, where the amount of phenotypic variation explained by this locus would be maximal for its allelic substitution effect. What we are actually interested in is the former, because we want to understand the effect of additive genetic variation on the phenotype in the population of interest. This depends on more than just the allelic substitution effects - it also depends on how the alleles are distributed to individuals in the population.

So, in GREML our mixed model actually looks like:

$$ \begin{equation} \mathbf{y = Zu + e} \end{equation} $$

Where $\mathbf{u}$ is the ($n$ x 1) vector of breeding values and $\mathbf{Z}$ is a ($n$ x $n$) matrix relating phenotypic records to their breeding values (assuming, as I have done throughout this series, that each individual has exactly one record). This model is referred to as GBLUP. It is important to note that SNP-BLUP and GBLUP are essentially equivalent: you can still get breeding values from SNP-BLUP and you can still get SNP effects from GBLUP, using equations which I will not cover here.

Unlike SNP effects, which we assume to be distributed MVN(0, $\mathbf{\sigma^2_gI}$), breeding values are assumed to be distributed MVN(0, $\mathbf{\sigma^2_gG}$), where $\mathbf{G}$ is a genomic relationship matrix (GRM). Our calculation of breeding values is performed using:

$$ \begin{equation} \hat{u} = \mathbf{(Z^TZ + \lambda G^{-1})^{-1}Z^Ty} \end{equation} $$

Note how this changes how the shrinkage parameter $\lambda$ is applied to the breeding values for different individuals depending on the value in $\mathbf{G}$. This suggests that however we construct $\mathbf{G}$ will have an effect on the outcome of our model.

Numerous ways have been proposed for the calculation of $\mathbf{G}$. I’ll discuss three here: the GRM used in VanRaden model 1, the GRM used in VanRaden model 2 (which is implemented in GCTA), and the GRM used in LDAK. VanRaden models are GBLUP models proposed for animal breeding value prediction. GCTA and LDAK are both examples of software that carry out GREML analyses.

First though, it is good to understand what we want $\mathbf{G}$ to be. The idea of a GRM is to be a description of the relatedness of individuals in the sample at all of the loci that have an effect on the phenotype of interest. In reality, we don’t know what loci have an effect on the trait, so our GRM is just an approximation made using all of the genetic data that we have available to us. Thus, by using a GRM we assume that the trait is contributed to by infinite, small effect causal variants.

Recall the matrix $\mathbf{X}$ that was used in the SNP-BLUP examples. This is a ($n$ x $p$) matrix relating individuals to the number of alternative alleles they carry at genotyped loci. The matrix $\mathbf{X}$ is typically centred so that each SNP has mean 0 (i.e. the mean of column $j$ is subtracted from every element of column $j$). $\mathbf{G = XX^T}$ thus gives us an ($n$ x $n$) matrix, where two individuals who share more alleles will have a larger $G_{ij}$. The more rare alleles an individual carries, the greater their diagonal value $G_{ii}$ will be.

In the GRM used by VanRaden 1, $\mathbf{G}$ is also scaled by a constant that depends on the allele frequencies at all included loci:

$$ \begin{equation} \mathbf{G_{1}} = \frac{\mathbf{XX^T}}{(\sqrt{\sum_{j}{2p_j(1-p_j)}})^2} \end{equation} $$

This scaling means that the variance in the GRM represents the additive genetic variance, as long as the assumptions of Hardy-Weinberg equilibrium and linkage equilibrium hold. The use of this GRM means that SNPs with low minor allele frequencies (MAF) are shrunk the most. However, this does not necessarily reflect biology. Causal variants are often deleterious, and selection keeps deleterious variants at low frequencies. What’s more, the larger the variant’s effect on the trait, the stronger the effect of selection, and thus we might expect variants with larger effects to have smaller MAFs.

The GRM used in VanRaden 2 corrects for this by dividing each $j$th column of $\mathbf{X}$ by $\sqrt{2p_j(1 - p_j)}$ to produce $\mathbf{X^*}$. So unlike in $\mathbf{G_1}$, each locus is scaled individually depending on its MAF. Then, making the assumption that every variant contributes equally to the genetic variance, the GRM is divided by the total number of variants that were used to construct it, $m$:

$$ \begin{equation} \mathbf{G_2} = \frac{\mathbf{X^*X^{*T}}}{m} \end{equation} $$

For this to be true, the effects of rarer variants must be larger for these loci to contribute the same proportion of the genetic variance as a common variant. This GRM is used in the GREML analyses performed by the Genome-wide Complex Trait Analysis (GCTA) software.

Neither of the VanRaden models consider the influence of linkage disequilibrium (LD) on their heritability estimates. When using genotype data, we are typically picking up variance introduced by causal variants through genotyped variants that are in LD with them. It is possible that more than one of the genotyped variants represent signal from a single causal variant. This becomes problematic when some causal variants are disproportionately over-represented by genotyped variants than others (for example, because they are in a genomic region with high levels of LD). This could bias our heritability estimates.

A solution to this was proposed by Speed et al. in 2012 and is implemented in their software Linkage Disequilibrium Adjusted Kinship (LDAK). Each column $j$ of $\mathbf{X^*}$ is multiplied by a weight, $\sqrt{w_j}$, dependent on the LD of variant $j$ with every other genotyped variant. This weighting ensures that the heritability estimate is not biased by varying LD between causal and genotyped variants in the data.

I have described the very basics of both GCTA and LDAK approaches to GRMs, but both groups have developed further variations on their original approaches in attempts to further reduce bias in GREML heritability estimation. Both approaches are used in the literature: there is no community-wide consensus on which is best. Uncertainty surrounding the unbiasedness of GREML approaches arises because we have to make numerous assumptions when constructing any GRM: normality, polygenecity, inverse MAF-effect size relationship, and so on. The consequences of breaches of these assumptions on heritability estimates is hotly debated and a subject of continuing research.

GREML can be a tricky subject to understand by just diving into the relevant papers. My hope is that this short series has introduced you to some of the main concepts and terms you will encounter whilst reading about GREML; how these fit together; and most importantly, despite its age and prolific use, the fact that GREML is still evolving.

References

VanRaden, P. M. (2008) ‘Efficient methods to compute genomic predictions’, Journal of Dairy Science, 91(11), pp. 4414–4423. doi: 10.3168/jds.2007-0980.

Speed, D. et al. (2012) ‘Improved heritability estimation from genome-wide SNPs’, American Journal of Human Genetics. Elsevier, 91(6), pp. 1011–1021. doi: 10.1016/j.ajhg.2012.10.010.