What is Linear Regression?
Linear regression is one of the simplest predictive models. It draws a straight line through your data that best represents the relationship between your input features and the output value.
The equation is one you've seen in maths class:

y = mx + c

Where:
- y = the value you're predicting
- x = the input feature
- m = the slope (how much y changes for each unit of x)
- c = the intercept (the value of y when x is 0)
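A minimal sketch of fitting that line with NumPy. The data values here are made up purely for illustration (hours studied vs. test score):

```python
import numpy as np

# Made-up data: hours studied (x) vs. test score (y)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 55, 61, 64, 70], dtype=float)

# np.polyfit with degree 1 returns the best-fit slope m and intercept c
m, c = np.polyfit(x, y, 1)

# Predict y for a new x using y = mx + c
y_pred = m * 3.5 + c
print(f"slope m = {m:.2f}, intercept c = {c:.2f}, prediction = {y_pred:.2f}")
```

For this data the fit comes out to m = 4.5 and c = 46.9, so a student studying 3.5 hours is predicted to score 62.65.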
How Do We Know if the Line is Good?
We use evaluation metrics to measure how far off our predictions are from the actual values.
Mean Absolute Error (MAE)
The average of absolute differences between predicted and actual values. Easy to interpret — it's in the same units as your data.
Mean Squared Error (MSE)
The average of squared differences. Penalises large errors more heavily than MAE.
Root Mean Squared Error (RMSE)
The square root of MSE. Brings the error back to the original units while still penalising large errors.
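The three error metrics are easy to compute by hand. A small sketch with hypothetical predicted and actual values:

```python
import numpy as np

# Hypothetical actual vs. predicted values
actual = np.array([10.0, 12.0, 15.0, 18.0])
predicted = np.array([11.0, 11.0, 16.0, 14.0])

errors = predicted - actual            # [1, -1, 1, -4]
mae = np.mean(np.abs(errors))          # (1 + 1 + 1 + 4) / 4 = 1.75
mse = np.mean(errors ** 2)             # (1 + 1 + 1 + 16) / 4 = 4.75
rmse = np.sqrt(mse)                    # sqrt(4.75) ≈ 2.18

print(mae, mse, rmse)
```

Notice how the single large error (−4) dominates MSE and RMSE but not MAE: that is the "penalises large errors more heavily" behaviour in action.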
R² (R-Squared)
R² measures how much of the variance in y the model explains:
- R² = 1.0 → the model explains all the variance (perfect fit)
- R² = 0.0 → the model explains none of the variance (no better than guessing the mean)
- R² < 0 → the model is worse than just predicting the mean
Calculated as: R² = 1 − (RSS / TSS)
Where RSS is the residual sum of squares and TSS is the total sum of squares.
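The R² formula can be checked directly from its definition. A sketch with made-up numbers:

```python
import numpy as np

# Made-up actual vs. predicted values
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 7.0, 9.0])

rss = np.sum((actual - predicted) ** 2)      # residual sum of squares = 0.5
tss = np.sum((actual - actual.mean()) ** 2)  # total sum of squares = 20.0

r2 = 1 - rss / tss                           # 1 - 0.025 = 0.975
print(r2)
```

If the model always predicted the mean (6.0), RSS would equal TSS and R² would be 0 — the "no better than guessing the mean" baseline.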
Checking Correlation First
Before building a regression model, check which features are correlated with your target:
```python
print(df.corr())
```

Features with high correlation to the target are good candidates for your model.
Multiple Linear Regression
When you use more than one input feature, it's called multiple linear regression. The equation extends to:

y = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ
Each feature gets its own coefficient. A negative coefficient means that feature has a negative association with the target — as it increases, the prediction decreases.
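A sketch of fitting two features at once using NumPy's least-squares solver. The data is synthetic, generated from known coefficients so you can see the fit recover them:

```python
import numpy as np

# Two hypothetical features, x1 and x2
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
# Target generated from y = 1 + 2*x1 + 3*x2
y = np.array([9.0, 8.0, 19.0, 18.0, 26.0])

# Add a column of ones so least squares also fits the intercept b0
X_design = np.column_stack([np.ones(len(X)), X])
b0, b1, b2 = np.linalg.lstsq(X_design, y, rcond=None)[0]
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

The solver recovers b₀ = 1, b₁ = 2, b₂ = 3 — each feature's own coefficient, exactly as the equation describes.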
Polynomial Regression
What if the relationship isn't a straight line? Polynomial regression fits a curve instead:

y = a + b₁x + b₂x² + b₃x³ + ...
You're essentially creating extra features (x², x³, etc.) from your original feature to capture non-linear patterns.
Use polynomial regression when:
- The scatter plot shows a curved relationship
- A linear model has poor R² despite reasonable correlations
- You understand that higher-degree polynomials risk overfitting
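The "extra features" idea can be sketched with NumPy: fitting a degree-2 polynomial is just linear regression on x and x². The data below is an exact quadratic, invented for illustration:

```python
import numpy as np

# Made-up curved data, generated from y = 2 + 0.5*x + 1.5*x^2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 + 0.5 * x + 1.5 * x ** 2

# Degree-2 fit; coefficients come back highest power first
coeffs = np.polyfit(x, y, 2)   # ≈ [1.5, 0.5, 2.0]
print(coeffs)
```

Because the data is an exact quadratic, the fit recovers the generating coefficients. On real, noisy data a higher degree will always reduce training error — that is exactly the overfitting risk noted above.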
From Simple to Multiple
| Type | Features | Equation |
|---|---|---|
| Simple Linear | 1 input | y = mx + c |
| Multiple Linear | 2+ inputs | y = b₀ + b₁x₁ + b₂x₂ + ... |
| Polynomial | 1+ input with powers | y = a + b₁x + b₂x² + ... |
The Designer's Takeaway
Linear regression is the foundation. Even if you never implement one yourself, understanding it helps you:
- Interpret dashboards and analytics in ML-powered products
- Know when a prediction is just a "best-fit line" and not magic
- Design appropriate confidence indicators in UIs that show predictions