Ah, linear regression. Gets looked down on sometimes these days, in the world of AI and super cool algorithms, but it does exactly what it’s supposed to.

We’re going to go over the main aspects of the theory behind this straightforward but effective modelling technique.

So, what is linear regression?

In short, it models the linear relationship between a dependent variable and one or more independent variables. The dependent variable is the response, or outcome variable: our target. In linear regression, this is a continuous, scalar variable.

Simple linear regression is the most basic form, where we have one independent variable and one dependent variable. Multiple linear regression is when there is more than one independent, or explanatory, variable.

Simple Linear Regression

The best way to think about simple linear regression is as a line of best fit between two variables. We can explore this with an example.

We have our independent variable, which we call X, and our dependent variable, which we call Y.

Both are continuous variables, and we think there is some relationship between them. Our hypothesis is:

If we increase X, we will also see Y increase.

Let’s visualise this in a chart. We have plotted n observations of X and Y values together.

On the X-axis, we have our independent variable, and on the Y-axis, our dependent variable. Each point of our dataset is represented by a blue dot.

We can see that the relationship here is generally linear; the dots all form a line that is slightly flatter than 45 degrees.

This is a great time for linear regression then! We want to ask the question:

What is the relationship between these two variables?

The Simple Linear Regression Equation

We need a way of going from X to Y. This is linear regression, so we have a linear equation. We can input X into that equation, and it will give us a predicted value for Y.

Linear regression is great because it’s intuitive: the equation makes sense when we look at it. Let’s look at the chart above again.

When X is equal to 0, the Y values sit around -40. This is known as the intercept. It is what we would predict Y to be when X is 0.

When X increases to 60, Y looks to be around 40. These two observations help us build the parts of our equation.

Because Y increases in proportion to X, we know that X needs to be multiplied by some constant: the slope of the line.

Using these two observations, we can build the equation of the simple linear relationship as

$$Y = a + bX$$

Here, a (the intercept) and b (the slope) are known as the parameters of our model. We want to estimate them as best we can.
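As a rough sketch of how the two chart readings turn into an intercept and a slope, the snippet below does the arithmetic. The numbers are the approximate values quoted above, and the `predict` helper is a name chosen here for illustration. These are eyeballed guesses, not fitted estimates.

```python
# Approximate readings from the chart described above (eyeballed, not fitted).
x0, y0 = 0, -40   # Y sits around -40 when X is 0
x1, y1 = 60, 40   # Y looks to be around 40 when X is 60

a = y0                       # intercept: the value of Y when X is 0
b = (y1 - y0) / (x1 - x0)    # slope: change in Y per unit change in X


def predict(x, a, b):
    """Return the point on the line Y = a + b * X for a given x."""
    return a + b * x


print(f"a = {a}, b = {b:.2f}")
print(f"Y at X = 45 is roughly {predict(45, a, b):.1f}")
```

Reading two points off a chart only gives a starting guess; it says nothing yet about how well the line fits the rest of the data.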

Let’s draw a line of best fit over the chart we have.

As you can see, the dashed black line goes through the middle of our thick mass of observations. It looks like a good fit, right?! Right. But there’s a lot more to this. Our line fits the data well, but not perfectly. It’s impossible for a straight line to perfectly fit all of these points. Therefore we actually want to optimise our equation so that it is a best fit overall.

We haven’t even started considering outliers yet!

So, we need a methodology to find the equation that best fits our data, and we also want the actual values of a and b, so that we can input a value of X and get back a predicted value of Y. We use a slightly different name for the predicted value: a Y with a ‘hat’ on. Our estimates of a and b get hats too, so our equation for predicted Y is

$$\hat{Y} = \hat{a} + \hat{b}X$$

We don’t have a little hat over the X term because we’re using actual values of it to predict Y.

To find the best estimates, we need to look at the difference between what is true and what is predicted.

Residuals – the key to fitting a regression line

A residual is the difference, for a single observation, between the actual value and our predicted value. Residuals are also known as error values.

So, the equation for calculating the residual for observation i is:

$$e_i = y_i - \hat{y}_i$$

Note that we use lowercase letters when we’re dealing with individual observations.

We can rewrite this equation to show the relationship between our model prediction and the actual value:

$$y_i = \hat{y}_i + e_i$$

Now, let’s replace our Y-hat with our estimated model equation:

$$Y = \hat{a} + \hat{b}X + e$$

For an individual observation this becomes

$$y_i = \hat{a} + \hat{b}x_i + e_i$$
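To make the residual concrete, here is a minimal sketch that computes a prediction and a residual for each observation, taking the residual as actual minus predicted. The data and the parameter values are made up for illustration; they do not come from the chart.

```python
# Hypothetical toy data, made up purely for illustration.
xs = [0, 10, 20, 30, 40]
ys = [-38.0, -27.0, -9.0, 4.0, 21.0]

# Assumed parameter values for the sketch; these are not fitted estimates.
a_hat, b_hat = -40.0, 1.5

# Predicted value for each observation: y_hat_i = a_hat + b_hat * x_i
y_hats = [a_hat + b_hat * x for x in xs]

# Residual for each observation: e_i = y_i - y_hat_i
residuals = [y - y_hat for y, y_hat in zip(ys, y_hats)]

for x, y, y_hat, e in zip(xs, ys, y_hats, residuals):
    # Each actual value is the prediction plus its residual.
    assert y == y_hat + e
    print(f"x={x:>2}  actual={y:>6.1f}  predicted={y_hat:>6.1f}  residual={e:>5.1f}")
```

Points above the line get a positive residual and points below it get a negative one.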

We can measure the total error for the model as the sum of the squared individual residuals. We square the residuals because we care about the size of each error, not its sign. If we simply added up errors that are evenly distributed around 0, they’d cancel each other out. When we square the individual error values, we end up with a non-negative value that we want to minimise. In mathematical terms that looks like

$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{a} - \hat{b}x_i)^2$$

This measure of fit of the model is known as the Residual Sum of Squares (RSS). Choosing the parameters that minimise it is known as least squares estimation.
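The sketch below shows why the squaring matters. On a made-up dataset the raw residuals add up to zero even though the line misses most of the points, while the sum of squares does not. The function name, data and parameter values are all assumptions for illustration.

```python
def residual_sum_of_squares(xs, ys, a_hat, b_hat):
    """Sum of squared residuals for the line a_hat + b_hat * x."""
    return sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, made up purely for illustration.
xs = [0, 10, 20, 30, 40]
ys = [-38.0, -27.0, -9.0, 4.0, 20.0]

# Assumed parameter values for the sketch; these are not fitted estimates.
a_hat, b_hat = -40.0, 1.5

residuals = [y - (a_hat + b_hat * x) for x, y in zip(xs, ys)]

# Positive and negative residuals cancel when simply added together.
print("sum of residuals:", sum(residuals))

# Squared residuals cannot cancel, so the misses show up in the total.
print("RSS:", residual_sum_of_squares(xs, ys, a_hat, b_hat))
```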

Estimating a and b

We know the model equation, we know our measure of fit. We now just need to estimate the parameters of our model.

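In simple terms, the estimate of b is the covariance of X and Y divided by the variance of X:

$$\hat{b} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}$$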

We will cover the definitions of covariance and variance in more detail in a separate post, but for now: variance measures how spread out a set of values is around its average, and covariance measures how two variables vary together.

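When we apply this to our equation, and expand it out in terms of the individual observations:

$$\hat{b} = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$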

Now this looks frightening, but it’s really just a lot of sums, maybe a few times they’re squared too. Separate it out, plug the values in and you’ll arrive at an estimate for b.
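Here is one way the slope estimate could be computed, written as covariance divided by variance. The helper names and the toy data are assumptions for illustration. Both helpers divide by n; dividing both by n - 1 instead would give the same slope, because the factor cancels in the ratio.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def variance(xs):
    """How spread out the values are around their mean (dividing by n)."""
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / len(xs)


def covariance(xs, ys):
    """Joint variability of two equally long sequences (dividing by n)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)


def estimate_slope(xs, ys):
    """Least squares estimate of b: Cov(X, Y) / Var(X)."""
    var_x = variance(xs)
    if var_x == 0:
        raise ValueError("X has no variation, so the slope is undefined")
    return covariance(xs, ys) / var_x


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
print(estimate_slope(xs, ys))
```

Because the toy points sit exactly on a line, the estimate recovers that line’s slope; with noisy data it would give the slope of the best-fitting line instead.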

Once we have that, estimating a is actually way, way easier. The equation for that looks like this:

$$\hat{a} = \bar{Y} - \hat{b}\bar{X}$$

The bar above the Y and X represents the average of those variables.

Again, we can extend that out to become

$$\hat{a} = \frac{1}{n}\sum_{i=1}^{n} y_i - \hat{b}\,\frac{1}{n}\sum_{i=1}^{n} x_i$$

Plug the values in, calculate a few averages, and we have our linear regression model equation!
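Putting the two estimates together, a minimal sketch of the whole fitting step might look like this. The function name and the data are made up for illustration, and the slope is computed in its sum-of-deviations form, which is algebraically the same as covariance over variance.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a_hat, b_hat) for the least squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b_hat = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    a_hat = y_bar - b_hat * x_bar   # intercept from the two averages and the slope
    return a_hat, b_hat


# Hypothetical toy data, made up purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a_hat, b_hat = fit_simple_linear_regression(xs, ys)
print(f"predicted Y = {a_hat:.2f} + {b_hat:.2f} * X")
```

One consequence of the intercept formula is that the fitted line always passes through the point made up of the two averages.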

Measuring the performance of our model

Finally, we’re going to boil down into a single value how ‘good’ our model is at predicting Y given X.

We already have the residual sum of squares, which gives us a summary value for errors produced by our model.

We’ll use this in conjunction with one last new term: total sum of squares (TSS). This one is pretty easy. It’s defined as the sum over our data of the squared differences between each observed Y value and the mean of all observed Y values. It looks like this:

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{Y})^2$$
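As a quick sketch, the total sum of squares needs only the observed Y values and no model at all. The function name and data below are assumptions for illustration.

```python
def total_sum_of_squares(ys):
    """Sum of squared differences between each y value and the mean of y."""
    y_bar = sum(ys) / len(ys)
    return sum((y - y_bar) ** 2 for y in ys)


# Hypothetical toy data, made up purely for illustration.
ys = [2, 4, 5, 4, 5]
print(total_sum_of_squares(ys))
```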

Intuitively, we know that the RSS is the error of our model, which we want to minimise. It’s also one part of the TSS: the part of the variation in Y that the model leaves unexplained.

If we divide the RSS by the TSS, we get the proportion of variance that our model fails to explain. So, if we take that value away from 100%, we get the proportion of variance that our model accounts for. We call this metric the R-Squared, and the formula for that becomes

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

Now we have an easy summary measure for assessing the fit of our model.
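To tie the two sums together, here is a small sketch that computes R-squared from RSS and TSS. The function name and data are made up for illustration, and the parameter values are the least squares estimates worked out by hand for this particular toy dataset.

```python
def r_squared(xs, ys, a_hat, b_hat):
    """Return 1 - RSS / TSS for the line a_hat + b_hat * x."""
    y_bar = sum(ys) / len(ys)
    rss = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - y_bar) ** 2 for y in ys)
    return 1 - rss / tss


# Hypothetical toy data and the least squares estimates for it.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a_hat, b_hat = 2.2, 0.6

print(f"R-squared = {r_squared(xs, ys, a_hat, b_hat):.2f}")
```

A value close to 1 means the line accounts for most of the variation in Y, while a value close to 0 means it does little better than always predicting the average.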

From start to finish, simple linear regression!

So there we have it. We’ve started with two continuous variables, X and Y. We’ve looked at their relationship visually, and built up our understanding of this to produce an equation that allows us to predict Y given a value of X. This equation has two parameters, a and b, that we can estimate using the data we have. Finally, there is a simple way to assess the performance of the model, R-squared.
In the next post, we’ll extend this to having multiple independent variables to predict a single output variable: multiple linear regression.