Linear Regression from Scratch in Python

28 September 2017 — Written by Mubaris NK
#Python#Tutorial#Machine Learning

Linear Regression is one of the easiest algorithms in machine learning. In this post we will explore this algorithm and we will implement it using Python from scratch.

As the name suggests this algorithm is applicable for Regression problems. Linear Regression is a Linear Model. Which means, we will establish a linear relationship between the input variables(X) and single output variable(Y). When the input(X) is a single variable this model is called Simple Linear Regression and when there are mutiple input variables(X), it is called Multiple Linear Regression.

Simple Linear Regression

We discussed that Linear Regression is a simple model. Simple Linear Regression is the simplest model in machine learning.

Model Representation

In this problem we have an input variable - X and one output variable - Y. And we want to build linear relationship between these variables. Here the input variable is called Independent Variable and the output variable is called Dependent Variable. We can define this linear relationship as follows:

Y=β0+β1XY = \beta_0 + \beta_1X

The β1\beta_1 is called a scale factor or coefficient and β0\beta_0 is called bias coefficient. The bias coeffient gives an extra degree of freedom to this model. This equation is similar to the line equation y=mx+by = mx + b with m=β1m = \beta_1 (Slope) and b=β0b = \beta_0 (Intercept). So in this Simple Linear Regression model we want to draw a line between X and Y which estimates the relationship between X and Y.

But how do we find these coefficients? That's the learning procedure. We can find these using different approaches. One is called Ordinary Least Square Method and other one is called Gradient Descent Approach. We will use Ordinary Least Square Method in Simple Linear Regression and Gradient Descent Approach in Multiple Linear Regression in post.

Ordinary Least Square Method

Earlier in this post we discussed that we are going to approximate the relationship between X and Y to a line. Let's say we have few inputs and outputs. And we plot these scatter points in 2D space, we will get something like the following image.

Linear Regression

And you can see a line in the image. That's what we are going to accomplish. And we want to minimize the error of our model. A good model will always have least error. We can find this line by reducing the error. The error of each point is the distance between line and that point. This is illustrated as follows.


And total error of this model is the sum of all errors of each point. ie.

D=i=1mdi2D = \sum_{i=1}^{m} d_i^2

did_i - Distance between line and ith point.

mm - Total number of points

You might have noticed that we are squaring each of the distances. This is because, some points will be above the line and some points will be below the line. We can minimize the error in the model by minimizing DD. And after the mathematics of minimizing DD, we will get;

β1=i=1m(xixˉ)(yiyˉ)i=1m(xixˉ)2\beta_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2} β0=yˉβ1xˉ\beta_0 = \bar{y} - \beta_1\bar{x}

In these equations xˉ\bar{x} is the mean value of input variable X and yˉ\bar{y} is the mean value of output variable Y.

Now we have the model. This method is called Ordinary Least Square Method. Now we will implement this model in Python.

Y=β0+β1XY = \beta_0 + \beta_1X β1=i=1m(xixˉ)(yiyˉ)i=1m(xixˉ)2\beta_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2} β0=yˉβ1xˉ\beta_0 = \bar{y} - \beta_1\bar{x}


We are going to use a dataset containing head size and brain weight of different people. This data set has other features. But, we will not use them in this model.. This dataset is available in this Github Repo. Let's start off by importing the data.

# Importing Necessary Libraries
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)

# Reading Data
data = pd.read_csv('headbrain.csv')
(237, 4)
Gender Age Range Head Size(cm^3) Brain Weight(grams)
0 1 1 4512 1530
1 1 1 3738 1297
2 1 1 4261 1335
3 1 1 3777 1282
4 1 1 4177 1590

As you can see there are 237 values in the training set. We will find a linear relationship between Head Size and Brain Weights. So, now we will get these variables.

# Collecting X and Y
X = data['Head Size(cm^3)'].values
Y = data['Brain Weight(grams)'].values

To find the values β1\beta_1 and β0\beta_0, we will need mean of X and Y. We will find these and the coeffients.

# Mean X and Y
mean_x = np.mean(X)
mean_y = np.mean(Y)

# Total number of values
m = len(X)

# Using the formula to calculate b1 and b2
numer = 0
denom = 0
for i in range(m):
    numer += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
b1 = numer / denom
b0 = mean_y - (b1 * mean_x)

# Print coefficients
print(b1, b0)
0.263429339489 325.573421049

There we have our coefficients.

BrainWeight=325.573421049+0.263429339489HeadSizeBrain Weight = 325.573421049 + 0.263429339489 * Head Size

That is our linear model.

Now we will see this graphically.

# Plotting Values and Regression Line

max_x = np.max(X) + 100
min_x = np.min(X) - 100

# Calculating line values x and y
x = np.linspace(min_x, max_x, 1000)
y = b0 + b1 * x

# Ploting Line
plt.plot(x, y, color='#58b970', label='Regression Line')
# Ploting Scatter Points
plt.scatter(X, Y, c='#ef5423', label='Scatter Plot')

plt.xlabel('Head Size in cm3')
plt.ylabel('Brain Weight in grams')


This model is not so bad. But we need to find how good is our model. There are many methods to evaluate models. We will use Root Mean Squared Error and Coefficient of Determination((R^2) Score).

Root Mean Squared Error is the square root of sum of all errors divided by number of values, or Mathematically,

RMSE=i=1m1m(yi^yi)2RMSE = \sqrt{\sum_{i=1}^{m} \frac{1}{m} (\hat{y_i} - y_i)^2}

Here yi^\hat{y_i} is the ith predicted output values. Now we will find RMSE.

# Calculating Root Mean Squares Error
rmse = 0
for i in range(m):
    y_pred = b0 + b1 * X[i]
    rmse += (Y[i] - y_pred) ** 2
rmse = np.sqrt(rmse/m)

Now we will find R2R^2 score. R2R^2 is defined as follows,

SSt=i=1m(yiyˉ)2SS_t = \sum_{i=1}^{m} (y_i - \bar{y})^2 SSr=i=1m(yiyi^)2SS_r = \sum_{i=1}^{m} (y_i - \hat{y_i})^2 R21SSrSStR^2 \equiv 1 - \frac{SS_r}{SS_t}

SStSS_t is the total sum of squares and SSrSS_r is the total sum of squares of residuals.

R2R^2 Score usually range from 0 to 1. It will also become negative if the model is completely wrong. Now we will find R2R^2 Score.

ss_t = 0
ss_r = 0
for i in range(m):
    y_pred = b0 + b1 * X[i]
    ss_t += (Y[i] - mean_y) ** 2
    ss_r += (Y[i] - y_pred) ** 2
r2 = 1 - (ss_r/ss_t)

0.63 is not so bad. Now we have implemented Simple Linear Regression Model using Ordinary Least Square Method. Now we will see how to implement the same model using a Machine Learning Library called scikit-learn

The scikit-learn approach

scikit-learn is simple machine learning library in Python. Building Machine Learning models are very easy using scikit-learn. Let's see how we can build this Simple Linear Regression Model using scikit-learn.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Cannot use Rank 1 matrix in scikit learn
X = X.reshape((m, 1))
# Creating Model
reg = LinearRegression()
# Fitting training data
reg =, Y)
# Y Prediction
Y_pred = reg.predict(X)

# Calculating RMSE and R2 Score
mse = mean_squared_error(Y, Y_pred)
rmse = np.sqrt(mse)
r2_score = reg.score(X, Y)


You can see that this exactly equal to model we built from scratch, but simpler and less code.

Now we will move on to Multiple Linear Regression.

Multiple Linear Regression

Multiple Linear Regression is a type of Linear Regression when the input has multiple features ((variables)).

Model Representation

Similar to Simple Linear Regression, we have input variable(X) and output variable(Y). But the input variable has nn features. Therefore, we can represent this linear model as follows;

Y=β0+β1x1+β1x2+...+βnxnY = \beta_0 + \beta_1x_1 + \beta_1x_2 + ... + \beta_nx_n

xix_i is the ith feature in input variable. By introducing x0=1x_0 = 1, we can rewrite this equation.

Y=β0x0+β1x1+β1x2+...+βnxnY = \beta_0x_0 + \beta_1x_1 + \beta_1x_2 + ... + \beta_nx_n x0=1x_0 = 1

Now we can convert this eqaution to matrix form.

Y=βTXY = \beta^TX


β=[β0β1β2..βn]T\beta = \begin{bmatrix}\beta_0 & \beta_1 & \beta_2 & .. & \beta_n\end{bmatrix}^T


X=[x0x1x2..xn]TX = \begin{bmatrix}x_0 & x_1 & x_2 & .. & x_n\end{bmatrix}^T

We have to define the cost of the model. Cost bascially gives the error in our model. Y in above equation is the our hypothesis (approximation). We are going to define it as our hypothesis function.

hβ(x)=βTxh_\beta(x) = \beta^Tx

And the cost is,

J(β)=12mi=1m(hβ(x(i))y(i))2J(\beta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\beta(x^{\textrm{(i)}}) - y^{\textrm{(i)}})^2

By minimizing this cost function, we can get find β\beta. We use Gradient Descent for this.

Gradient Descent

Gradient Descent is an optimization algorithm. We will optimize our cost function using Gradient Descent Algorithm.

Step 1

Initialize values β0\beta_0, β1\beta_1,..., βn\beta_n with some value. In this case we will initialize with 0.

Step 2

Iteratively update,

βj:=βjαβjJ(β)\beta_j := \beta_j - \alpha\frac{\partial}{\partial \beta_j} J(\beta)

until it converges.

This is the procedure. Here α\alpha is the learning rate. This operation βjJ(β)\frac{\partial}{\partial \beta_j} J(\beta) means we are finding partial derivate of cost with respect to each βj\beta_j. This is called Gradient.

Read this if you are unfamiliar with partial derivatives.

In step 2 we are changing the values of βj\beta_j in a direction in which it reduces our cost function. And Gradient gives the direction in which we want to move. Finally we will reach the minima of our cost function. But we don't want to change values of βj\beta_j drastically, because we might miss the minima. That's why we need learning rate.

Gradient Descent

The above animation illustrates the Gradient Descent method.

But we still didn't find the value of βjJ(β)\frac{\partial}{\partial \beta_j} J(\beta). After we applying the mathematics. The step 2 becomes.

βj:=βjα1mi=1m(hβ(x(i))y(i))xj(i)\beta_j := \beta_j - \alpha\frac{1}{m}\sum_{i=1}^m (h_\beta(x^{(i)})-y^{(i)})x_{j}^{(i)}

We iteratively change values of βj\beta_j according to above equation. This particular method is called Batch Gradient Descent.


Let's try to implement this in Python. This looks like a long procedure. But the implementation is comparitively easy since we will vectorize all the equations. If you are unfamiliar with vectorization, read this post

We will be using a student score dataset. In this particular dataset, we have math, reading and writing exam scores of 1000 students. We will try to find a predict the score of writing exam from math and reading scores. You can get this dataset from this Github Repo. That's we have 2 features(input variables). Let's start by importing our dataset.

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)
from mpl_toolkits.mplot3d import Axes3D

data = pd.read_csv('student.csv')
(1000, 3)
Math Reading Writing
0 48 68 63
1 62 81 72
2 79 80 78
3 76 83 79
4 59 64 62

We will get scores to an array.

math = data['Math'].values
read = data['Reading'].values
write = data['Writing'].values

# Ploting the scores as scatter plot
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(math, read, write, color='#ef1234')


Now we will generate our X, Y and β\beta.

m = len(math)
x0 = np.ones(m)
X = np.array([x0, math, read]).T
# Initial Coefficients
B = np.array([0, 0, 0])
Y = np.array(write)
alpha = 0.0001

We'll define our cost function.

def cost_function(X, Y, B):
    m = len(Y)
    J = np.sum(( - Y) ** 2)/(2 * m)
    return J
inital_cost = cost_function(X, Y, B)

As you can see our initial cost is huge. Now we'll reduce our cost prediocally using Gradient Descent.

Hypothesis: hβ(x)=βTxh_\beta(x) = \beta^Tx

Loss: (hβ(x)y)(h_\beta(x)-y)

Gradient: (hβ(x)y)xj(h_\beta(x)-y)x_{j}

Gradient Descent Updation: βj:=βjα(hβ(x)y)xj)\beta_j := \beta_j - \alpha(h_\beta(x)-y)x_{j})

def gradient_descent(X, Y, B, alpha, iterations):
    cost_history = [0] * iterations
    m = len(Y)
    for iteration in range(iterations):
        # Hypothesis Values
        h =
        # Difference b/w Hypothesis and Actual Y
        loss = h - Y
        # Gradient Calculation
        gradient = / m
        # Changing Values of B using Gradient
        B = B - alpha * gradient
        # New Cost Value
        cost = cost_function(X, Y, B)
        cost_history[iteration] = cost
    return B, cost_history

Now we will compute final value of β\beta

# 100000 Iterations
newB, cost_history = gradient_descent(X, Y, B, alpha, 100000)

# New Values of B

# Final Cost of new B
[-0.47889172  0.09137252  0.90144884]

We can say that in this model,

Swriting=0.47889172+0.09137252Smath+0.90144884SreadingS_{writing} = -0.47889172 + 0.09137252 * S_{math} + 0.90144884 * S_{reading}

There we have final hypothesis function of our model. Let's calculate RMSE and R2R^2 Score of our model to evaluate.

# Model Evaluation - RMSE
def rmse(Y, Y_pred):
    rmse = np.sqrt(sum((Y - Y_pred) ** 2) / len(Y))
    return rmse

# Model Evaluation - R2 Score
def r2_score(Y, Y_pred):
    mean_y = np.mean(Y)
    ss_tot = sum((Y - mean_y) ** 2)
    ss_res = sum((Y - Y_pred) ** 2)
    r2 = 1 - (ss_res / ss_tot)
    return r2

Y_pred =

print(rmse(Y, Y_pred))
print(r2_score(Y, Y_pred))

We have very low value of RMSE score and a good R2R^2 score. I guess our model was pretty good.

Now we will implement this model using scikit-learn.

The scikit-learn Approach

scikit-learn approach is very similar to Simple Linear Regression Model and simple too. Let's implement this.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# X and Y Values
X = np.array([math, read]).T
Y = np.array(write)

# Model Intialization
reg = LinearRegression()
# Data Fitting
reg =, Y)
# Y Prediction
Y_pred = reg.predict(X)

# Model Evaluation
rmse = np.sqrt(mean_squared_error(Y, Y_pred))
r2 = reg.score(X, Y)


You can see that this model is better than one which we have built from scratch by a small margin.

That's it for Linear Regression. I assume, so far you have understood Linear Regression, Ordinary Least Square Method and Gradient Descent.

All the datasets and codes are available in this Github Repo.

More Resources

  1. Linear Regression Notes by Andrew Ng
  2. A First Course in Machine Learning by Chapman and Hall/CRC - Chapter 1
  3. Machine Learning in Action by Peter Harrington - Chapter 8
  4. Machine Learning Course by Andrew Ng(Coursera) - Week 1

Let me know if you found any errors.

© 2017-2025 Mubaris