Vector Calculus in Least Squares Regression
In my blog post Least Squares Regression as an Orthogonal Projection, I described by example how least squares regression can be interpreted as an orthogonal projection, inspired by the ‘Linear Algebra’ chapter of Mathematics for Machine Learning. Now I have forged through the chapter ‘Vector Calculus’, which has made me curious about the role calculus has to play in this simple machine learning algorithm.
In my previous example I introduced five sheep who produced different quantities of wool. I will revisit these sheep in this post. First, let’s only consider the sheep’s wool production (the (5 x 1) vector $\mathbf{y}$) and their grass intake (the (5 x 1) vector $\mathbf{x_1}$). As sheep farmers, we want to predict how much wool our sheep will produce given how much grass they have eaten. Mathematically what we are saying is that we think there is some function that relates grass intake to wool production: $f(\mathbf{x_1}) = \mathbf{y}$. Let’s assume that the relationship between grass intake and wool production is linear, so this problem is suitable for linear regression. By choosing to use linear regression, what we are now saying is that we think there is some function:
$$ \begin{equation} f(\mathbf{a}, \mathbf{x_1}) = \mathbf{y} = a_0 + a_1\mathbf{x_1} \end{equation} $$where $\mathbf{a}$ is a vector of regression coefficients that help us describe the relationship between grass intake and wool production. In reality, there is unlikely to be a value of $\mathbf{a}$ that satisfies this equation exactly. This makes sense: wool production is dependent on more than just grass intake. Instead, we look for the value of $\mathbf{a}$ that gets us as close to $\mathbf{y}$ as possible. In other words, we know that we cannot accurately predict a sheep’s wool production from its grass intake alone, but we can try to get an estimate of a sheep’s wool production, $\hat{\mathbf{y}}$, from its grass intake. The question is now, what values of $\mathbf{a}$ get us the best prediction?
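To make this concrete, here is a tiny sketch of the model in R. The coefficient values I pass in are placeholders made up purely for illustration, not fitted values:

# The model: predicted wool production is a linear function of grass intake
predict_wool <- function(a_0, a_1, grass) a_0 + a_1 * grass
# Example call with made-up placeholder coefficients
predict_wool(a_0 = 1, a_1 = 0.5, grass = c(25.2, 60.4, 82.5, 100.1, 120))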
To answer this we need to think about how we define a “good” estimate of wool production. It’s probably the estimate of wool production that deviates the least from the actual wool production of the sheep! In other words, the best estimate $\hat{\mathbf{y}}$ is the one that minimises $\lVert\mathbf{y} - \hat{\mathbf{y}}\rVert^2 = \sum^N_{n = 1}{(y_n - \hat{y}_n)^2}$, the sum of squared differences between the two. This can be rewritten as a loss (or cost) function, a function that evaluates error:
$$ \begin{equation} L(\mathbf{a}) = \lVert\mathbf{y} - f(\mathbf{a}, \mathbf{x_1})\rVert^2 \end{equation} $$A low value of $L(\mathbf{a})$ indicates low error, and vice versa. So how do we find values of $\mathbf{a}$ that minimise this loss function? This is where calculus comes in. In high school, you may have been asked to find the minima of a function. For a function with a single input and a single output, minima occur at points where the gradient is 0. Our loss function actually has two inputs, $a_0$ and $a_1$, and a single output, $L(\mathbf{a})$, so we can imagine it as a surface in 3D space. The idea of a gradient generalises to multi-dimensional space: the gradient of a function that takes a vector as input, like $L(\mathbf{a})$, is itself a vector, where the nth component of the gradient vector is the partial derivative of the function with respect to the nth input. This means that the gradient of $L(\mathbf{a})$ is:
$$ \begin{equation} \nabla L(\mathbf{a}) = \begin{bmatrix} \frac{\partial{L}}{\partial{a_0}} \\ \frac{\partial{L}}{\partial{a_1}} \end{bmatrix} \end{equation} $$To find the minimum of this function we must find the point where this gradient is equal to the zero vector (and do some additional maths to prove that this point is a minimum and not a maximum or saddle point, but I won’t go into that here). This is the same as saying that all of the partial derivatives must be equal to 0. Thus, we can find the minimum by setting each partial derivative to zero and solving. I will not go into the derivation, but it can be shown that these partial derivatives are:
$$ \begin{equation} \frac{\partial{L}}{\partial{a_0}} = -2\sum^N_{n = 1}{(y_n-(a_1x_n+a_0))} = 0 \end{equation} $$ $$ \begin{equation} \frac{\partial{L}}{\partial{a_1}} = -2\sum^N_{n = 1}{x_n(y_n-(a_1x_n+a_0))} = 0 \end{equation} $$The factors of $-2$ do not affect where these expressions equal zero, so we can drop them, set the remaining sums to zero, and solve for the coefficient $a_1$. We get an equation I got very excited about:
$$ \begin{equation} a_1 = \frac{\sum^N_{n = 1}{(x_n - \bar{x})(y_n - \bar{y})}}{\sum^N_{n = 1}{(x_n - \bar{x})^2}} \end{equation} $$When I learned linear regression at uni, I was introduced to the idea of the corrected sum of squares of $\mathbf{x}$ (i.e. $S_{xx}$), which is the denominator in equation 6, and the corrected sum of products (i.e. $S_{xy}$), which is the numerator in equation 6. I was taught to calculate a regression slope using $\frac{S_{xy}}{S_{xx}}$. Actually seeing where these equations come from is incredibly satisfying, and I wish I had been introduced to this in class. If like me you’re eager to see this derivation from loss function to classroom equation, I highly recommend this video by jbstatistics, who explains it with wonderful clarity.
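Before moving on to the intercept, here is a quick numerical sanity check of the calculus route, written as a minimal sketch in R. It builds the loss function from equation 2 and hands it to base R’s general-purpose optimiser, optim() (which defaults to the Nelder-Mead method), to search for the minimising coefficients directly. The data are the same sheep values used in the worked example further down:

# Wool production (y) and grass intake (x) for the five sheep
wool <- c(10, 20, 30, 40, 50)
grass <- c(25.2, 60.4, 82.5, 100.1, 120)
# The loss function from equation 2: the sum of squared differences between
# actual wool production and the model's predictions, as a function of a = (a_0, a_1)
loss <- function(a) sum((wool - (a[1] + a[2] * grass))^2)
# Minimise the loss numerically, starting from a_0 = 0, a_1 = 0
fit <- optim(par = c(0, 0), fn = loss)
fit$par  # the minimising a_0 and a_1, matching the closed-form solution below
         # (up to the optimiser's tolerance)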
We can solve for $a_0$ in terms of $a_1$, which we have already obtained:
$$ \begin{equation} a_0 = \bar{y} - a_1\bar{x} \end{equation} $$Let’s do this for our sheep. We get a value of 0.427 for $a_1$ and -3.155 for $a_0$, which are the same values that R’s lm() function gives us for $a_1$ and $a_0$:
wool <- c(10, 20, 30, 40, 50)
grass <- c(25.2, 60.4, 82.5, 100.1, 120)
# Corrected sum of squares of grass (S_xx) and corrected sum of products (S_xy)
s_xx <- sum((grass - mean(grass))^2)
s_xy <- sum((wool - mean(wool)) * (grass - mean(grass)))
# The slope, from equation 6
a_1 <- s_xy / s_xx
print(a_1)
>>> 0.4270327
a_0 <- mean(wool) - a_1 * mean(grass)
print(a_0)
>>> -3.15482
lm(wool ~ grass)
>>> Call:
>>> lm(formula = wool ~ grass)
>>>
>>> Coefficients:
>>> (Intercept) grass
>>> -3.155 0.427
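As one more sanity check on the calculus, we can plug the fitted coefficients back into the partial derivatives (equations 4 and 5, without the constant $-2$) and confirm that they come out as zero. Continuing the R session above:

# Equations 4 and 5 evaluated at the fitted coefficients (ignoring the constant -2);
# both should be zero, up to floating-point error
sum(wool - (a_1 * grass + a_0))
sum(grass * (wool - (a_1 * grass + a_0)))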
In this example I have only considered grass, but in my previous blog post, I also considered the influence of sleep and water on my imaginary sheep. So what about when we want to do a multiple regression with more than one predictor? $f(\mathbf{a}, \mathbf{x_1})$ becomes $f(\mathbf{a}, \mathbf{X})$, where $\mathbf{X}$ is an ($n$ x $p$) matrix with one row per sheep and one column per predictor, and the vector $\mathbf{a}$ now includes an extra coefficient for every additional predictor. The gradient is still a vector of partial derivatives - there are just more partial derivatives to take. You then solve for each coefficient with the partial derivatives set to 0. As a model for thinking about regression, I think that this calculus perspective is less intuitive when generalised to many dimensions than the geometric approach I wrote about previously. Nevertheless, this was a good thought experiment for me to explore as I begin my journey into learning about optimisation methods.
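As a rough sketch of what this looks like in R: stacking the predictors into a design matrix $\mathbf{X}$ (with a column of ones for the intercept), setting the gradient of the loss to zero leads to the normal equations $\mathbf{X}^T\mathbf{X}\mathbf{a} = \mathbf{X}^T\mathbf{y}$, which can be solved directly. The sleep and water values below are invented purely for illustration; they are not the numbers from my previous post:

# Hypothetical sleep and water measurements, invented for this example
sleep <- c(7.1, 6.3, 8.0, 5.5, 7.7)
water <- c(2.0, 3.1, 2.5, 4.0, 3.6)
# Design matrix: a column of 1s for the intercept, then one column per predictor
X <- cbind(intercept = 1, grass, sleep, water)
# Setting the gradient of the loss to zero gives the normal equations
# t(X) %*% X %*% a = t(X) %*% wool, which solve() handles directly
a <- solve(t(X) %*% X, t(X) %*% wool)
a
# lm() agrees with the coefficients from the normal equations
lm(wool ~ grass + sleep + water)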
References
Deisenroth, M. P.; Faisal, A. A.; & Ong, C. S. 2020, Mathematics for Machine Learning, Cambridge University Press.