Logistic Regression (LoR) is a popular statistical technique used to model the relationship between a categorical dependent variable and one or more independent variables. It is commonly used for classification tasks, where the goal is to predict the probability that an observation belongs to a particular class based on the values of the independent variables.
The basic idea behind LoR is to fit a logistic function to the data, which takes the form:
P(Y=1) = 1 / (1 + e^(-z))
where P(Y=1) is the probability of the dependent variable Y being equal to 1, z = b0 + b1*x1 + ... + bn*xn is a linear combination of the independent variables, and e is the base of the natural logarithm. The logistic function maps any real-valued z to a value between 0 and 1, which is exactly what is needed for predicted probabilities in classification tasks.
The parameters of the logistic function are estimated using a maximum likelihood approach, which involves finding the values of the parameters that maximize the likelihood of the observed data given the model. This can be done using numerical optimization techniques such as gradient descent or Newton's method.
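To make the estimation step concrete, here is a minimal sketch of fitting a binary logistic regression by plain gradient descent on the average negative log-likelihood, using NumPy. This is for illustration only, with an arbitrary learning rate and iteration count; scikit-learn's built-in optimizers are far more robust.

import numpy as np

def sigmoid(z):
    # the logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    # gradient descent on the average negative log-likelihood (y must contain 0/1 labels)
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a ones column for the intercept
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)            # current predicted probabilities
        grad = X.T @ (p - y) / len(y)    # gradient of the negative log-likelihood
        beta -= lr * grad
    return beta  # beta[0] is the intercept, beta[1:] are the coefficients

The simple update rule works because the gradient of the logistic log-likelihood reduces to X.T @ (p - y), the difference between predicted probabilities and observed labels weighted by the features.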
Now let's take a look at some example code for implementing LoR in Python using the scikit-learn library. For this example, we will be using the famous iris dataset, which consists of 150 samples of iris flowers, with 50 samples from each of three different species.
First, we need to import the necessary libraries and load the data:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :2]  # we will only use the first two features (sepal length and width) for simplicity
y = iris.target
Next, we need to split the data into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # hold out 20% for testing; random_state makes the split reproducible
Now we can create an instance of the LogisticRegression class and fit it to the training data:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()  # default hyperparameters: penalty='l2', C=1.0, solver='lbfgs'
clf.fit(X_train, y_train)
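The estimated parameters of the logistic function, the coefficients and intercepts that make up z, are exposed as attributes on the fitted model, so you can inspect what was learned:

print("Coefficients:", clf.coef_)     # one row of coefficients per class for this multiclass problem
print("Intercepts:", clf.intercept_)  # one intercept per class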
We can now use the trained model to make predictions on the testing data:
y_pred = clf.predict(X_test)
Finally, we can evaluate the performance of the model using various metrics such as accuracy, precision, recall, and F1 score:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_pred, average='weighted'))
print("F1 score:", f1_score(y_test, y_pred, average='weighted'))
And that's it! You now have a basic understanding of how to implement LoR in Python using scikit-learn.
Now for a bit more practical discussion.
In addition to the parameters of the logistic function that are estimated from the data, logistic regression also has several hyperparameters that must be set before training the model. Here are some of the most important ones in scikit-learn (a tuning sketch follows the list):
penalty: This hyperparameter determines the regularization method used to prevent overfitting. The default value is 'l2', which stands for L2 regularization, but 'l1' (L1 regularization) and 'elasticnet' (a combination of L1 and L2 regularization) are also available. Note that not every solver supports every penalty; for example, 'elasticnet' requires the 'saga' solver.
C: This hyperparameter controls the strength of the regularization. A smaller value of C will increase the amount of regularization, while a larger value will decrease it. The default value is 1.0.
solver: This hyperparameter determines the algorithm used to optimize the logistic regression objective function. The default value is 'lbfgs', but 'liblinear', 'newton-cg', 'sag', and 'saga' are also available.
max_iter: This hyperparameter sets the maximum number of iterations for the optimization algorithm. The default value is 100.
multi_class: This hyperparameter determines the method used to handle multiclass classification problems. The default value is 'auto', but 'ovr' (one-vs-rest) and 'multinomial' (softmax regression) are also available.
class_weight: This hyperparameter allows you to weight the classes differently to handle imbalanced datasets. The default value is None, but you can pass 'balanced' to weight classes inversely to their frequencies, or specify a dictionary of weights for each class.
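Choosing these values by hand is error-prone, so in practice hyperparameters are usually tuned with cross-validation. As promised above, here is a minimal sketch using scikit-learn's GridSearchCV over C and penalty; the grid values are arbitrary examples, not recommendations:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1.0, 10.0],  # a few regularization strengths to try
    'penalty': ['l1', 'l2'],      # both are supported by the 'liblinear' solver
}
search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best hyperparameters:", search.best_params_)

By default GridSearchCV refits the best model on the full training set, so search can be used directly for prediction afterwards.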
Once you have trained your logistic regression model, it's important to evaluate its performance on a test set. Here are some common metrics used to evaluate classification models:
Accuracy: The proportion of correct predictions out of all predictions made.
Precision: The proportion of true positives (correctly predicted positive instances) out of all instances predicted as positive.
Recall: The proportion of true positives out of all actual positive instances.
F1 score: The harmonic mean of precision and recall, which balances both metrics.
In addition to these metrics, you may also want to plot a confusion matrix to visualize the performance of the model across different classes.
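For example, scikit-learn can compute the confusion matrix and plot it directly from the predictions (the plotting helper assumes matplotlib is installed):

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

print(confusion_matrix(y_test, y_pred))  # rows are true classes, columns are predicted classes
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()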
Here's an example of how to train a logistic regression model with custom hyperparameters:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', max_iter=1000, multi_class='ovr')
clf.fit(X_train, y_train)
In this example, we're using L1 regularization with a regularization strength of C=0.1 (recall that smaller C means stronger regularization), the 'liblinear' solver (one of the solvers that supports L1) with a maximum of 1000 iterations, and the one-vs-rest method for handling multiclass classification.
Once the model is trained, we can evaluate its performance using the same metrics as before:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_pred, average='weighted'))
print("F1 score:", f1_score(y_test, y_pred, average='weighted'))
And that's it! By tuning the hyperparameters of the logistic regression model, you can improve its performance on your specific classification task.