One of the simplest supervised machine learning tools used in data science, linear regression finds the best-fitting line through a set of data points in a two-dimensional space. Once this line is defined, it can be used to predict where other data points would lie in that space, whether they share the characteristics of the original data set or stand apart from it.
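As a minimal sketch of the idea (the data values here are illustrative, not taken from the tutorial), the best-fit slope and intercept can be computed with the classic least-squares formulas and then used to predict the y position of a new point:

```python
from statistics import mean

# Illustrative 2-D data points
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 6, 5, 6]

# Least-squares formulas for the best-fit line y = m*x + b
xy = [x * y for x, y in zip(xs, ys)]
xx = [x * x for x in xs]
m = (mean(xs) * mean(ys) - mean(xy)) / (mean(xs) ** 2 - mean(xx))
b = mean(ys) - m * mean(xs)

# Predict where a new point with the same characteristics would lie
predict_x = 7
predict_y = m * predict_x + b
print(m, b, predict_y)
```

This is the same calculation the tutorial code performs, just on a tiny hand-made data set.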
To explore linear regression in depth and program it in Python, this complete series of tutorials by Harrison Kinsley, a.k.a. Sentdex, covers the details of linear regression, programming tricks, and example uses of linear regression.
The following videos collect the linear regression tutorials from the Machine Learning with Python series (parts 2 to 12) by Sentdex on YouTube. The final code can also be obtained from Sentdex's PythonProgramming website, in the corresponding linear regression tutorial, along with more examples, details on the code, and links to other key concepts and Python functions.
Note that beyond linear regression, data sets can also be analyzed to fit a linear regression line in multi-dimensional space, or a polynomial regression line in two- or multi-dimensional space. For a complete presentation of linear regression, with mathematical formulas and extensions, check the Wikipedia page on Linear Regression.
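As a hedged illustration of the polynomial case (not covered by the tutorial code below; data values and the use of NumPy's polyfit are assumptions here), a higher-degree curve can be fitted to two-dimensional data in just a few lines:

```python
import numpy as np

# Illustrative data following a roughly quadratic trend
xs = np.array([0, 1, 2, 3, 4, 5], dtype=np.float64)
ys = np.array([1.1, 2.0, 5.2, 9.8, 17.1, 26.0], dtype=np.float64)

# Least-squares fit of a degree-2 polynomial;
# coefficients are returned highest power first
coeffs = np.polyfit(xs, ys, deg=2)
poly = np.poly1d(coeffs)

# Evaluate the fitted curve at a new x value
print(poly(6))
```

The same call with `deg=1` reduces to ordinary linear regression, which is one way to see polynomial regression as a generalization of the straight-line case.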
To go further in each tutorial video, check the comments of the video on YouTube and the corresponding tutorial on the PythonProgramming website.
from statistics import mean
import numpy as np
import random
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')

def create_dataset(hm, variance, step=2, correlation=False):
    """Generate hm noisy y values, optionally trending up ('pos') or down ('neg')."""
    val = 1
    ys = []
    for i in range(hm):
        y = val + random.randrange(-variance, variance)
        ys.append(y)
        if correlation and correlation == 'pos':
            val += step
        elif correlation and correlation == 'neg':
            val -= step
    xs = [i for i in range(len(ys))]
    return np.array(xs, dtype=np.float64), np.array(ys, dtype=np.float64)

def best_fit_slope_and_intercept(xs, ys):
    """Least-squares slope and intercept of the best-fit line."""
    m = (((mean(xs) * mean(ys)) - mean(xs * ys)) /
         ((mean(xs) * mean(xs)) - mean(xs * xs)))
    b = mean(ys) - m * mean(xs)
    return m, b

def coefficient_of_determination(ys_orig, ys_line):
    """R squared: 1 minus the ratio of regression error to mean-line error."""
    y_mean_line = [mean(ys_orig) for y in ys_orig]
    squared_error_regr = sum((ys_line - ys_orig) * (ys_line - ys_orig))
    squared_error_y_mean = sum((y_mean_line - ys_orig) * (y_mean_line - ys_orig))
    print(squared_error_regr)
    print(squared_error_y_mean)
    r_squared = 1 - (squared_error_regr / squared_error_y_mean)
    return r_squared

xs, ys = create_dataset(40, 40, 2, correlation='pos')
m, b = best_fit_slope_and_intercept(xs, ys)
regression_line = [(m * x) + b for x in xs]

r_squared = coefficient_of_determination(ys, regression_line)
print(r_squared)

plt.scatter(xs, ys, color='#003F72', label='data')
plt.plot(xs, regression_line, label='regression line')
plt.legend(loc=4)
plt.show()