Multiple linear regression is a supervised learning algorithm used to predict a single dependent variable from multiple independent variables. It works by fitting a linear equation that best represents the relationship between the dependent variable and the independent variables.
The equation for multiple linear regression is:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
Here, y is the dependent variable, b0 is the intercept (constant term), b1 to bn are the coefficients, and x1 to xn are the independent variables.
The primary objective of multiple linear regression is to minimize the difference between the predicted values and the actual values. Each such difference is called an error (or residual), and ordinary least squares chooses the coefficients that minimize the sum of the squared errors across all observations.
Before applying multiple linear regression, we need to check the assumptions of the model. Here are some of the assumptions of multiple linear regression:
Linearity: the relationship between the dependent variable and the independent variables should be linear.
Independence: the observations should be independent of each other.
Homoscedasticity: the variance of the errors should be constant across all levels of the independent variables.
Normality: the errors should follow a normal distribution.
No multicollinearity: the independent variables should not be highly correlated with each other. One common check, using variance inflation factors, is sketched after this list.
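One way to check the no-multicollinearity assumption is to compute a variance inflation factor (VIF) for each feature. Here is a minimal sketch using statsmodels, assuming the independent variables live in a pandas DataFrame named features (a hypothetical name):
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X_vif = add_constant(features)  # add an intercept column so the VIFs are computed correctly
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif)  # rule of thumb: a VIF above roughly 5-10 signals problematic multicollinearity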
In this section, we'll implement multiple linear regression in Python using scikit-learn, a popular machine learning library. Here's a step-by-step guide:
Import the necessary libraries:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
Load the dataset:
data = pd.read_csv('data.csv')
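Note that 'data.csv' is a placeholder for your own file. Before modelling, a quick sanity check of the data is worthwhile, for example:
print(data.head())        # preview the first few rows
print(data.isna().sum())  # count missing values per column; LinearRegression cannot handle NaNs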
Split the dataset into training and testing sets:
X = data.iloc[:, :-1].values  # all columns except the last are the features
y = data.iloc[:, -1].values   # the last column is the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)  # hold out 20% of the rows for testing
Train the model:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
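After fitting, the learned parameters correspond to b0 and b1 to bn in the equation above, and can be inspected directly:
print(regressor.intercept_)  # b0, the constant term
print(regressor.coef_)       # b1 to bn, one coefficient per feature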
Make predictions:
y_pred = regressor.predict(X_test)
Evaluate the model:
print(r2_score(y_test, y_pred))
This will give you the R2 score, a measure of how well the model fits the data. An R2 of 1 indicates a perfect fit, 0 means the model does no better than always predicting the mean of y, and on held-out test data the score can even be negative if the model is worse than that baseline.
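To make the metric concrete, R2 equals 1 minus the ratio of the residual sum of squares to the total sum of squares. A short sketch using the arrays from the steps above:
ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares around the mean
print(1 - ss_res / ss_tot)                      # matches r2_score(y_test, y_pred)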
In linear regression, there are a few hyperparameters that can be set to control the behavior of the model.
The fit_intercept hyperparameter controls the constant term b0 in the linear equation. The intercept determines the value of the dependent variable when all the independent variables are zero.
By default, fit_intercept is set to True in scikit-learn's LinearRegression class. If you want to omit the intercept, for example because your data is already centered, pass fit_intercept=False when creating the LinearRegression object.
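For example, continuing with the training data from the steps above:
regressor = LinearRegression(fit_intercept=False)  # force the fit through the origin
regressor.fit(X_train, y_train)
print(regressor.intercept_)  # 0.0, since no intercept is estimated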
Normalization is a technique used to scale the input features to a similar range. This is useful when the features have different scales, since it keeps the magnitude of one feature from dominating the others.
Older versions of scikit-learn exposed a normalize parameter on LinearRegression (False by default), but it was deprecated in version 1.0 and removed in 1.2. The recommended approach is to scale the features explicitly, for example with StandardScaler inside a pipeline.
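A minimal sketch of that approach:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# standardize each feature to zero mean and unit variance before fitting the regression
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R2 on the test set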
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. In linear regression, there are two types of regularization: L1 and L2 regularization.
L1 regularization adds the sum of the absolute values of the coefficients (alpha * (|b1| + ... + |bn|)) to the loss function, while L2 regularization adds the sum of the squares of the coefficients (alpha * (b1^2 + ... + bn^2)), where alpha controls the strength of the penalty.
Note that we cover L1 and L2 regularization as separate models on the Lasso and Ridge pages, respectively.
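As a quick preview, both are available in scikit-learn with the same fit/predict interface as LinearRegression; the alpha values below are illustrative, not tuned:
from sklearn.linear_model import Lasso, Ridge

ridge = Ridge(alpha=1.0)  # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1)  # L1 penalty: can drive some coefficients exactly to zero
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)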