Unpacking GREML (#2): Ridge Regression and Restricted Maximum Likelihood
In part two of my unpacking of GREML, I describe the two features of the approach that overcome the $p$ > $n$ problem that is commonly encountered with genetic data: restricted maximum likelihood (REML) and ridge regression.
First, I’ll begin by discussing the $p$ > $n$ problem more formally. In a regression, what we are trying to do is solve for $\beta_1 … \beta_p$ in the following equation:
$$ \begin{equation} \mathbf{y = X\beta + e} \\ \end{equation} $$ $$ \begin{align} \begin{bmatrix} y_{1}\\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \vdots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{bmatrix} \begin{bmatrix} \beta_{1}\\ \beta_{2} \\ \vdots \\ \beta_{p} \end{bmatrix} + \begin{bmatrix} e_{1}\\ e_{2} \\ \vdots \\ e_{n} \end{bmatrix} \end{align} $$Most of the time when we run simple regressions, this equation solved by obtaining least squares estimates for $\beta_1 … \beta_p$. With a single predictor, this is essentially finding the best fit line through the data: the line that minimises the distance between the data points and the line. Remember that because we have many predictors, we are considering $p$-dimensions, and the line is instead a hyperplane described by $p$ $\beta$ parameters. I won’t discuss how we come to this solution (see Hastie, Tibshirani & Friedman, 2009 if you are interested!), but to get the $\beta$ parameters that correspond to the best fit line we use:
$$ \begin{equation} \hat{\beta} = \mathbf{(X^TX)^{-1}X^Ty} \end{equation} $$If you try to perform this calculation using a dataset with p »> n you’ll run into some problems. You can try it yourself in R using a simulated dataset with 5 observations and 10 SNPs:
pheno <- data.frame("y"=numeric(5))
pheno[,] <- rnorm(5,mean=0.,sd=1.)
design_matrix <- data.frame(row.names = c("obs1", "obs2", "obs3", "obs4", "obs5"))
for (iii in 1:10) {
snpcolname <- paste("snp_", iii, sep = "")
design_matrix[,snpcolname] <- sample(0:2, 5, replace = T)
}
design_matrix <- as.matrix(design_matrix)
pheno <- as.matrix(pheno)
# error arises when solving using 10 predictors
solve(t(design_matrix) %*% design_matrix) %*% t(design_matrix) %*% pheno
# we can solve fine when we reduce the number of predictors to 4
fixed_dm <- design_matrix[,1:4]
solve(t(fixed_dm) %*% fixed_dm) %*% t(fixed_dm) %*% pheno
You will see that R gives you an error when trying to compute $\mathbf{(X^TX)^{-1}}$ when using 10 predictors: “system is computationally singular”. This is telling us that the inverse of $\mathbf{X^TX}$ doesn’t exist. Because $\mathbf{X}$ has more columns than rows, its columns can inevitably be expressed as linear combinations of each other. In other words, you can express columns in terms of other columns. This means that there is more than one unique solution to $\beta$, and introduces correlations between SNPs that prevent us from making any inferences from $\beta$.
Now, if you’re a biologist with little or no knowledge of linear algebra, the last paragraph might have been a spew of unfamiliar terms. This is not a linear algebra lesson, so I am not going to explain them in more depth. Instead, let me give a heuristic description of why correlations between SNPs is bad for your analysis. Remember that the question we are truly asking at the heart of a regression is “what is the relationship between a SNP and a phenotype?” If we were doing this in the lab, we’d want to hold every possible variable constant, and just change the genotype at the SNP of interest. This would be the best way to test whether the genotype and phenotype were associated! Now imagine if the genotype at our SNP of interest was always correlated with a genotype at another SNP. Which SNP is driving our association? We don’t know because of confounding! Thus, it is inappropriate to use least squares estimate when p »> n because it introduces correlations between SNPs. This should be your main takeaway.
Thus, we take on a different approach to solve $\beta$: ridge regression. This both provides us with a unique solution and prevents overfitting. We just add a constant to the diagonal of $\mathbf{XX^T}$ to make it non-singular:
$$ \begin{equation} \hat{\beta} = \mathbf{(X^TX + \lambda I)^{-1}X^Ty} \end{equation} $$$\lambda$ is referred to as the shrinkage parameter. It is referred to as so because it shrinks estimates of $\beta$ towards 0. The equation for lambda in GREML is:
$$ \begin{equation} \lambda = \frac{\sigma^2_e}{\sigma^2_g} \end{equation} $$So that smaller genetic variances result in more shrinkage but this penalty is applied evenly across all SNPs.
We have not yet discussed how we obtain any of the variance estimates such as $\sigma^2_e$ and $\sigma^2_g$. This is where restricted maximum likelihood (REML) estimation comes in. But first, we will discuss standard maximum likelihood (ML). In ML we assume that our data follows some distribution. So for example, in GREML we assume that the residual effects follow a multivariate normal distribution with mean 0. This distribution has some variance, $\sigma^2_e$, which we want to estimate. The ML estimate finds the value of $\sigma^2_e$ that best corresponds to the observed data given our assumption that the data follows a multivariate normal distribution with mean 0.
However, ML estimates are inherently downwardly biased. This occurs because standard ML estimation ignores fixed effects in the model. Let’s step away from GREML for a moment to better understand this. Imagine we have some data that we’re saying is normally distributed, and we want to estimate the mean $\mu$ and the variance $\sigma^2$ of this data. You’re probably familiar with the fact that we would calculate the variance like so:
$$ \begin{equation} \sigma^2 = \frac{\Sigma(x_i - \bar{x})^2}{d.f.} = \frac{\Sigma(x_i - \bar{x})^2}{n - 1} \end{equation} $$But the ML estimate of variance is larger than this:
$$ \begin{equation} \sigma^2 = \frac{\Sigma (x_i - \bar{x})^2}{n} \end{equation} $$Why this discrepancy? ML does not account for the degree of freedom lost to estimating the mean. Extrapolating this out to GREML where we may have numerous fixed effects (e.g. age, sex) that we need to estimate, we can see how our variance estimates are downwardly biased through the unaccounted for loss of degrees of freedom to fixed effect estimation.
REML provides us with a way to obtain an unbiased estimate of variance components in the model, even when we have numerous fixed effect predictors. Crudely speaking, it does this by removing the influence of fixed effects from the data $\mathbf{y}$, and calculating ML estimates based on this transformed data.
When we are estimating random effects, we use variance estimates obtained through REML to estimate $\beta$. Using REML variance estimates is said to give the best linear unbiased prediction (or BLUP) for a random effect (in this case a SNP effect, dubbed SNP-BLUP). We can also estimate the genetic effect associated with an individual (called an estimate breeding value, $\hat{u}$) from such a model using the SNP effects, which may be of value in animal breeding.
$$ \begin{equation} \mathbf{\hat{u} = X\hat{\beta}} \end{equation} $$However, we are generally not interested in the BLUPs of specific SNPs when we are performing heritability estimation. Tests for significant effects of SNPs on traits is performed through genome wide association studies (GWAS), where SNP effects are treated as fixed. We tend only to be interested in the value of $\sigma^2_g$, because this is used to calculate the heritability, $h^2$.
Understanding the SNP-BLUP approach I have described should help you easily grasp GBLUP, which underlies GREML. I will describe GBLUP and the consequences of its genomic relationship matrices in part three.
References
Duchateau, D. L.; Janssen, P. & Rowlands, J. (1998) Linear Mixed Models: An Introduction with Applications to Veterinary Research, 1st ed., Kenya: International Livestock Research Institute.
Hastie, T.; Tibshirani, R. & Friedman, J. (2009) The Elements of Statistical Learning, 2nd ed., New York: Springer.