Data splitting is a crucial step in machine learning that involves dividing a dataset into training and testing sets. This is important because it allows us to evaluate the performance of our models on new, unseen data. In this article, we'll discuss why data splitting is important, how to split data, and provide examples of how to implement data splitting in Python.
Why is Data Splitting Important?
Data splitting is important because it allows us to evaluate the performance of our machine learning models on new, unseen data. If we trained and evaluated a model on the same data, the score would only tell us how well the model fits data it has already seen, not how well it generalizes. Evaluating on the training data also masks overfitting, where an overly complex model performs well on the training data but poorly on new data.
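To make this concrete, here is a minimal sketch of the problem. It uses a synthetic dataset from scikit-learn's make_classification (so it runs without any CSV file) and an unconstrained decision tree, which can memorize its training data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained tree can fit the training set almost perfectly
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print('Train accuracy:', model.score(X_train, y_train))  # typically ~1.0
print('Test accuracy:', model.score(X_test, y_test))     # noticeably lower

The gap between the two scores is only visible because we held data out; without a test set we would have reported the near-perfect training accuracy.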
How to Split Data
There are several ways to split data, including simple random sampling, stratified sampling, and time-based splitting. In simple random sampling, we randomly select a portion of the data to use for training and the remainder for testing. In stratified sampling, we ensure that the class proportions are preserved in both the training and testing sets. In time-based splitting, we split the data at a point in time, for example using the earliest 80% of the data for training and the most recent 20% for testing (splits between 70/30 and 80/20 are common).
Examples of Data Splitting in Python
Let's take a look at some examples of how to split data in Python using scikit-learn, a popular machine learning library.
Simple Random Sampling:
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the dataset and separate the features from the target column
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In this example, we use the train_test_split function to randomly split the data into training and testing sets. The test_size parameter specifies the proportion of data to use for testing, and the random_state parameter ensures that the random splitting is reproducible.
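As a quick sanity check (using the same df as above), you can confirm that the split sizes match the requested proportion:

# Roughly 80% of the rows should land in the training set
print(len(X_train), len(X_test))
print(len(X_test) / len(X))  # approximately 0.2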
Stratified Sampling:
from sklearn.model_selection import StratifiedShuffleSplit
import pandas as pd

df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Generate a single stratified split that holds out 20% of the rows
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(X, y):
    # split() yields positional indices, so use iloc to select the rows
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
In this example, we use the StratifiedShuffleSplit class to ensure that the class proportions are preserved in both the training and testing sets. The n_splits parameter specifies the number of splits to generate, and the test_size parameter specifies the proportion of data to use for testing.
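To verify the stratification (again using the same df), compare the class frequencies in the full dataset and in both splits; the proportions should be nearly identical:

# Class proportions should match across the original data and both splits
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))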
Time-based Splitting:
import pandas as pd

# Assumes the CSV has a 'date' column; parse it and use it as the index
df = pd.read_csv('data.csv', parse_dates=['date'], index_col='date')
df = df.sort_index()  # make sure the rows are in chronological order
X = df.drop('target', axis=1)
y = df['target']

# Rows before the split date form the training set; the rest form the testing set
split_date = '2022-01-01'
X_train = X[X.index < split_date]
y_train = y[y.index < split_date]
X_test = X[X.index >= split_date]
y_test = y[y.index >= split_date]
In this example, we split the data chronologically rather than randomly. The DataFrame's index must hold the dates, which is why we parse the (assumed) 'date' column and set it as the index when loading the CSV; every row before the split date goes into the training set, and every row on or after it goes into the testing set. This mirrors how the model will be used in practice: trained on past data and evaluated on future data.
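A quick way to confirm that the split is strictly chronological is to print the date range covered by each set:

# The training set should end before the testing set begins
print('Train:', X_train.index.min(), 'to', X_train.index.max())
print('Test: ', X_test.index.min(), 'to', X_test.index.max())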