Data Preprocessing Using Data Reduction Techniques In Python
Datasets nowadays are very detailed; including more features makes a model more complex, and the model may overfit the data. Some features can be noise and potentially damage the model. By removing those unimportant features, the model may generalize better.
We will apply several feature selection methods to the same data set and compare their performance; see the scikit-learn documentation for details on each method.
The dataset used for carrying out data reduction is the Iris dataset, available in the sklearn.datasets library.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold, RFE, SelectFromModel, SelectKBest, f_classif, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.datasets import load_iris
The data have four features. To test the effectiveness of different feature selection methods, we add some noise features to the data set.
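A minimal sketch of how such noise features might be added (the random seed and the `hstack` approach are illustrative choices, not necessarily the author's exact code):

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the four original Iris features
X, y = load_iris(return_X_y=True)

# Append four columns of random noise so the feature selectors
# have something to discard (the seed is an arbitrary choice)
rng = np.random.RandomState(42)
noise = rng.normal(size=(X.shape[0], 4))
X_noisy = np.hstack([X, noise])

print(X_noisy.shape)  # (150, 8): 4 informative + 4 noise features
```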
The dataset now has 8 features: 4 are the original, informative features and the other 4 are noise.
We select features based only on information from the training set, not the whole data set. We hold out part of the data as a test set to evaluate the feature selection and the model, so that no information from the test set is seen while we conduct feature selection and train the model.
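For example, the hold-out split could look like this (the 70/30 ratio and random seed are assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
X = np.hstack([X, rng.normal(size=(X.shape[0], 4))])  # add 4 noise features

# Hold out a test set; feature selection is fitted on the
# training portion only, so no test-set information leaks in
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)  # (105, 8) (45, 8)
```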
Principal Component Analysis (PCA)
We can speed up the fitting of a machine learning algorithm by changing the optimization algorithm, but a more common way is to use Principal Component Analysis (PCA). PCA is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss.
For many machine learning applications, it helps to be able to visualize your data. Visualizing two- or three-dimensional data is not that challenging, but the Iris dataset is four-dimensional. We will use PCA to reduce that 4-dimensional data to 2 or 3 dimensions so that you can plot it and, hopefully, understand the data better.
So, now let's execute PCA for visualization on the Iris dataset.
The data frame after using StandardScaler
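A sketch of the standardization step, assuming the data comes straight from load_iris (`as_frame=True` requires scikit-learn >= 0.23). PCA is sensitive to the scale of the inputs, so each feature is standardized to zero mean and unit variance first:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)

# Standardize each feature to zero mean and unit variance
x = StandardScaler().fit_transform(iris.data)
scaled_df = pd.DataFrame(x, columns=iris.feature_names)
print(scaled_df.head())
```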
PCA Projection to 2D
The original data has four columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original four-dimensional data into two dimensions. The new components are just the two main dimensions of variation.
4 columns are converted to 2 principal columns
Concatenating the DataFrames along axis = 1; finalDf is the final DataFrame before plotting the data.
Concatenating target column into a data frame
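Putting the projection and the concatenation together, a sketch might look like the following. Mapping the integer targets to the 'Iris-...' species names is an assumption made to match the labels used in the plotting code:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
x = StandardScaler().fit_transform(iris.data)

# Project the four standardized features onto two principal components
pca = PCA(n_components=2)
principalDf = pd.DataFrame(
    pca.fit_transform(x),
    columns=['principal component 1', 'principal component 2'])

# Attach the class labels; the integer targets are mapped to the
# species names used when plotting (an assumed naming convention)
target = iris.target.map(
    {0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}).rename('target')
finalDf = pd.concat([principalDf, target], axis=1)
print(finalDf.head())
```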
Now, let's visualize the data frame; execute the following code:
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=10)
ax.set_ylabel('Principal Component 2', fontsize=10)
ax.set_title('2 component PCA', fontsize=15)
targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color, s=50)
ax.legend(targets)
2D representation of dataframe
PCA Projection to 3D
The original data has 4 columns (sepal length, sepal width, petal length, and petal width). This section projects the original four-dimensional data into 3 dimensions. The new components are just the three main dimensions of variation.
Obtaining 3 principal component columns
Now let’s visualize a 3D graph,
Generating 3D graph
Obtained 3D graph
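A sketch of the 3-component projection and the 3D scatter plot (axis labels, colors, and figure size are illustrative choices):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
x = StandardScaler().fit_transform(iris.data)

# Three principal components instead of two
pca = PCA(n_components=3)
components = pca.fit_transform(x)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection='3d')  # needs matplotlib >= 3.2
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title('3 component PCA')
# Color each point by its integer class label
ax.scatter(components[:, 0], components[:, 1], components[:, 2],
           c=iris.target, s=50)
plt.show()
```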
Variance Threshold
Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so our data is not affected here.
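Since the Iris features all have non-zero variance, the sketch below appends an artificial constant column just to show the selector removing it:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

X, y = load_iris(return_X_y=True)

# Add a constant (zero-variance) column purely for demonstration
X_demo = np.hstack([X, np.ones((X.shape[0], 1))])

selector = VarianceThreshold()  # default threshold=0.0 drops constant features
X_reduced = selector.fit_transform(X_demo)
print(X_demo.shape, '->', X_reduced.shape)  # (150, 5) -> (150, 4)
```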
Univariate Feature Selection
- Univariate feature selection works by selecting the best features based on univariate statistical tests.
- We compare each feature to the target variable to see whether there is a statistically significant relationship between them.
- When we analyze the relationship between one feature and the target variable, we ignore the other features. That is why it is called ‘univariate’.
- Each feature has its test score.
- Finally, all the test scores are compared, and the features with top scores will be selected.
Scikit-learn provides several univariate scoring functions:
- f_classif, also known as the ANOVA F-test, scores each feature against the categorical target.
- chi2 selects the features with the highest chi-squared statistic relative to the classes; the data must contain only non-negative features, such as booleans or frequencies (e.g., term counts in document classification).
Most scoring functions come in 2 types:
- for classification (e.g., f_classif, mutual_info_classif)
- for regression (e.g., f_regression, mutual_info_regression)
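A sketch of univariate selection with SelectKBest and the ANOVA F-test on the 8-feature (4 real + 4 noise) data; the noise generation mirrors the earlier setup and is an assumption:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 4))])  # 4 noise features

# Score each feature independently with the ANOVA F-test
# and keep the 4 with the highest scores
selector = SelectKBest(score_func=f_classif, k=4)
selector.fit(X_noisy, y)
print(selector.scores_)       # per-feature test scores
print(selector.get_support()) # True for the selected features
```

The four informative Iris features score far higher than the noise columns, so the mask keeps exactly the original features.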
Recursive Feature Elimination
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of selected features is eventually reached.
RFE using Random Forest Classifier
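A sketch of RFE driven by a RandomForestClassifier's feature_importances_, again on the 4-real-plus-4-noise data (seeds and the number of features to keep are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 4))])  # 4 noise features

# Recursively drop the least important feature (ranked by the
# forest's feature_importances_) until 4 features remain
rfe = RFE(estimator=RandomForestClassifier(random_state=0),
          n_features_to_select=4)
rfe.fit(X_noisy, y)
print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # 1 = selected; higher = eliminated earlier
```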
In summary, we have seen how to use different feature selection methods on the same data and evaluated their performances.