So, you think you know linear regression? I have likely run thousands of regression models as a researcher and data analyst. However, a recent job interview made me think about some finer details of linear regression. As a consequence, I wanted to brush up on the topic. The outcome is a series of posts on linear regression. This is the first part and provides the foundations of linear regression.

### What is linear regression?

Linear regression describes the nature and strength of linear relationships among variables. Here, we assume the simplest case: two continuous variables. **Continuous variables** can take on any value between their minimum and maximum values.

If the data can only take on specific values, we speak of a **discrete variable**. For guidance on selecting the right regression model (depending on the kind of variables you have), check out this guide by Statistics By Jim.

Let's return to our simple case with two continuous variables. We call one variable \(X\) and the other one \(Y\). By convention, \(Y\) is the variable that we want to predict. It is called the **response** or **dependent variable**, and \(X\) is called the **predictor** or **independent variable**.

The goal of regression is to find a formula (i.e., mathematical model) that defines \(Y\) as a function of \(X\). In other words, **regression** estimates the relationship between our input variable \(X\) and our response variable \(Y\). Once we have a formula describing the relationship of \(X\) and \(Y\), we can also predict future outcomes (e.g., conduct a **forecast** using new values of \(X\)).

First, let's plot the relationship of \(X\) and \(Y\) in a scatterplot. Note that this tutorial uses dummy data to keep the focus on the mechanics, not the findings.
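As a minimal sketch of how such dummy data could be generated and plotted (the variable names, noise level, and coefficients here are assumptions for illustration, not the actual data behind the figures):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dummy data: a noisy linear relationship between X and Y
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)                    # predictor X
y = -0.3 + 0.9 * x + rng.normal(0, 1.0, 50)  # response Y with random noise

plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatterplot of X and Y")
plt.show()
```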

### How does linear regression work?

Linear regression describes the relationship of two variables. We can visualize such a relationship by a straight line.

Mathematically, a line is described as \(\hat{Y} = \beta_0 + \beta_1 X\).

No need to get scared. Let's go through the formula piece by piece:

- \(\hat{Y}\) is the predicted value at the given value of \(X\) (i.e., what we want to determine)
- \(\beta_1\) is the slope of the line (i.e., steepness of the line)
- \(\beta_0\) is the intercept of the line (i.e., the predicted value of \(\hat{Y}\) when \(X = 0\))
- \(X\) is the value of the predictor variable

Note that \(\beta_0\) and \(\beta_1\) are called the coefficients or parameters of the regression model. But how do we find the parameters? They are estimated from the data by defining a criterion on how we want to draw the line.

For instance, we could try to draw a straight line that goes through as many data points as possible. Or we could draw a line that has an approximately equal number of data points above and below.

However, the most common criterion is to minimize the **sum of squared errors (SSE)**. Before getting into the details on what that means, let's look at such a regression line.

In the image, we see a blue regression line through the data. Some data points are closer to the line, others farther away. Some are below, others above. The extent by which each data point deviates from the line is called the residual error (\(\epsilon_i\)). In the above image, the \(\epsilon_i\) for each data point is shown as a vertical, gray line.

The residual error is calculated by \(\epsilon_i = Y_i - \hat{Y_i}\), where \(Y_i\) is the actual data point at the \(i^{th}\) observation, and \(\hat{Y_i}\) is the predicted value of the \(i^{th}\) observation. The predicted values \(\hat{Y_i}\) are shown as pink circles in our graph. You can think of \(\hat{Y_i}\) as the value that \(Y_i\) should have according to our model if there was no error. In turn, it means that \(\epsilon_i\) is the part of each \(Y_i\) that cannot be explained by the model.

So, in order to find our line, we do the following: calculate the squared difference between each actual data point and its predicted value (i.e., \((Y_i-\hat{Y_i})^2\)). Then sum these up over all \(Y_i\) to obtain the sum of squared errors (\(SSE\)). In math speak, that comes out to \(SSE = \sum_i(Y_i-\hat{Y_i})^2 = \sum_i \epsilon_i^2\).
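For a simple regression with one predictor, the coefficients that minimize the \(SSE\) have a closed form: \(\beta_1 = \mathrm{cov}(X, Y)/\mathrm{var}(X)\) and \(\beta_0 = \bar{Y} - \beta_1 \bar{X}\). A minimal sketch on hypothetical dummy data (the data itself is an assumption for illustration):

```python
import numpy as np

# Hypothetical dummy data (illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = -0.3 + 0.9 * x + rng.normal(0, 1.0, 50)

# Closed-form OLS estimates that minimize the SSE:
# beta1 = cov(X, Y) / var(X), beta0 = mean(Y) - beta1 * mean(X)
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x       # predicted values
residuals = y - y_hat           # the epsilon_i
sse = np.sum(residuals ** 2)    # sum of squared errors
print(beta0, beta1, sse)
```

The same estimates can be obtained with `np.polyfit(x, y, 1)` or any statistics library; the closed form just makes the criterion explicit.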

To add to the confusion, the sum of squared errors (\(SSE\)) is also known as the sum of squared residuals (\(SSR\)), the residual sum of squares (\(RSS\) or \(SS_{res}\)), or the sum of squared estimate of errors (also abbreviated \(SSE\)).

As mentioned above, we can fit different linear regression models to the data. That means you can define what 'best-fit' means for a 'best-fit' line. **Ordinary least squares (OLS)** minimizes the \(SSE\), but that makes it sensitive to outliers: a single data point that strongly deviates from all other data points may strongly influence the model. The culprit is that we square the errors, which means large errors receive more weight than small ones. By optimizing a different criterion (also called a **loss function**), you will end up with a different 'best-fit' and possibly a 'better', more robust, model to describe your data.
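This sensitivity is easy to demonstrate: fit the same dummy data with and without a single extreme point and compare the slopes (a sketch on assumed data, not a robust-regression recipe):

```python
import numpy as np

# Hypothetical dummy data (illustration only)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = -0.3 + 0.9 * x + rng.normal(0, 0.5, 30)

# OLS fit on the clean data
slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Add one extreme outlier at the edge of the X range and refit
x_out = np.append(x, 10.0)
y_out = np.append(y, 50.0)   # far above the rest of the data
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(slope_clean, slope_out)  # the slope shifts noticeably
```

Loss functions that grow more slowly than the square (e.g., the Huber loss used in robust regression) down-weight such extreme points.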

### How to interpret a linear regression model?

The formula of the regression line (\(\hat Y = -0.331 + 0.925*X\)) gives us some information about the relationship of \(X\) and \(Y\). First of all, it is linear (no quadratic terms, etc.). Our intercept estimate (\(\beta_0\)) is -0.331. That means that when \(X\) is zero, the predicted \(Y\) is -0.331. The line's slope (\(\beta_1\)) tells us that every one-unit change in \(X\) is associated with an average 0.925-unit change in \(Y\).
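With the fitted coefficients in hand, prediction is just plugging a new \(X\) into the line:

```python
# Coefficients of the fitted line from the text
beta0, beta1 = -0.331, 0.925

def predict(x):
    """Predicted Y (i.e., Y-hat) for a given X under the fitted line."""
    return beta0 + beta1 * x

print(predict(2.0))  # -0.331 + 0.925 * 2.0 = 1.519
```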

This brings us to the end of the linear regression basics. In the next post, we'll look at the assumptions that underlie linear regression, see how to test them, learn when it's (not) OK to ignore them, and transform your data to make it comply with the linear regression assumptions.