# Data Preprocessing Using Data Reduction Techniques In Python

Modern datasets are often very detailed; including more features makes a model more complex and may cause it to overfit the data. Some features can be noise and potentially damage the model. By removing those unimportant features, the model may generalize better.

We will apply several feature selection methods to the same dataset and compare their performance, using scikit-learn throughout.

**Dataset Used**

The dataset used for carrying out data reduction is ‘Iris’, available in the sklearn.datasets module.

```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import (VarianceThreshold, RFE, SelectFromModel,
                                       SelectKBest, f_classif, chi2,
                                       mutual_info_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.datasets import load_iris
```

Load dataset

The data has four features. To test the effectiveness of different feature selection methods, we add some noise features to the dataset.

The dataset now has 8 features: 4 are informative and the other 4 are noise.
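A minimal sketch of how the noise features might be added (the random seed and the standard-normal noise distribution are assumptions, not from the original):

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the 4-feature Iris data
iris = load_iris()
X, y = iris.data, iris.target

# Append 4 columns of random noise so the dataset has 8 features:
# the first 4 are informative, the last 4 are pure noise
rng = np.random.RandomState(42)
noise = rng.normal(size=(X.shape[0], 4))
X = np.hstack([X, noise])

print(X.shape)  # (150, 8)
```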

We only select features based on the information from the training set, not on the whole data set. We should hold out part of the entire data set as a test set to evaluate the feature selection and model performance. Thus the information from the test set cannot be seen while we conduct feature selection and train the model.

Splitting the Dataset
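One way to hold out a test set before any feature selection is performed (the 70/30 split ratio and random state are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the data; feature selection and model fitting
# will use only the training split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
```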

# Principal Component Analysis (PCA)

We can speed up the fitting of a machine learning algorithm by changing the optimization algorithm, but a more common way is to use Principal Component Analysis (PCA). PCA is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss.

For a lot of machine learning applications, it helps to be able to visualize your data. Visualizing two- or three-dimensional data is not that challenging, but the Iris dataset is four-dimensional. We will use PCA to reduce that 4-dimensional data into 2 or 3 dimensions so that you can plot and hopefully understand the data better.

So, now let’s execute **PCA for visualization** on Iris Dataset

Load Library

The data frame after using StandardScaler
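A sketch of the standardization step, assuming the DataFrame columns follow sklearn's Iris feature names (the original article's column names may differ):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Standardize each feature to zero mean and unit variance before PCA
x = StandardScaler().fit_transform(df)
scaled_df = pd.DataFrame(x, columns=iris.feature_names)
```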

**PCA Projection to 2D**

The original data has four columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original data, which is four-dimensional, into two dimensions. The new components are simply the two main dimensions of variation.

4 columns are converted to 2 principal columns
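A minimal sketch of the 2D projection (the component column names match those used in the plotting code later in the article):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
x = StandardScaler().fit_transform(iris.data)

# Project the 4 standardized features onto 2 principal components
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(
    principalComponents,
    columns=['principal component 1', 'principal component 2'])
```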

Concatenating DataFrame along axis = 1. finalDf is the final DataFrame before plotting the data.

Concatenating target column into a data frame
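A sketch of the concatenation step. Note that sklearn's `load_iris` uses target names like `'setosa'`, while the plotting code in the article uses labels like `'Iris-setosa'` (the original data may have been loaded from a CSV with those labels):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
x = StandardScaler().fit_transform(iris.data)
principalDf = pd.DataFrame(
    PCA(n_components=2).fit_transform(x),
    columns=['principal component 1', 'principal component 2'])

# Attach the class labels so each point can be colored by species
target = pd.Series([iris.target_names[i] for i in iris.target], name='target')
finalDf = pd.concat([principalDf, target], axis=1)
```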

Now, let’s visualize the data frame by executing the following code:

```python
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=10)
ax.set_ylabel('Principal Component 2', fontsize=10)
ax.set_title('2 component PCA', fontsize=15)
targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()
```

2D representation of dataframe

**PCA Projection to 3D**

The original data has 4 columns (sepal length, sepal width, petal length, and petal width). This section projects the original data, which is four-dimensional, into 3 dimensions. The new components are just the three main dimensions of variation.

Obtaining 3 principal component columns
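A minimal sketch of the 3D projection, analogous to the 2D case:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
x = StandardScaler().fit_transform(iris.data)

# Project the 4 standardized features onto 3 principal components
pca = PCA(n_components=3)
components = pca.fit_transform(x)
print(components.shape)  # (150, 3)
```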

Now let’s visualize a 3D graph,

Generating 3D graph
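A sketch of one way to draw the 3D scatter plot with matplotlib's 3D projection (figure size, colors, and the non-interactive backend are assumptions):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove for on-screen display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
components = PCA(n_components=3).fit_transform(
    StandardScaler().fit_transform(iris.data))

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection='3d')
for label, color in zip(range(3), ['r', 'g', 'b']):
    mask = iris.target == label
    ax.scatter(components[mask, 0], components[mask, 1], components[mask, 2],
               c=color, s=50, label=iris.target_names[label])
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.legend()
fig.savefig('pca_3d.png')
```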

Obtained 3D graph

**Variance Threshold**

*Variance Threshold* is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so it isn’t affected here.

variance threshold
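A sketch on the 8-feature noisy dataset (the noise construction and random seed are assumptions carried over from the earlier sketch):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

# Rebuild the 8-feature dataset: 4 real Iris features + 4 noise columns
iris = load_iris()
rng = np.random.RandomState(42)
X = np.hstack([iris.data, rng.normal(size=(iris.data.shape[0], 4))])

# The default threshold=0.0 removes only zero-variance features;
# none of the 8 features is constant, so all of them survive
selector = VarianceThreshold()
X_selected = selector.fit_transform(X)
print(X_selected.shape)  # (150, 8)
```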

**Univariate Feature Selection**

- Univariate feature selection works by selecting the best features based on univariate statistical tests.
- We compare each feature to the target variable to see a statistically significant relationship between them.
- When we analyze the relationship between one feature and the target variable, we ignore the other features. That is why it is called ‘univariate’.
- Each feature has its test score.
- Finally, all the test scores are compared, and the features with top scores will be selected.

1. *f_classif*

Also known as the ANOVA F-test; it computes the ANOVA F-value between each feature and the target.

ANOVA Test
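A sketch of univariate selection with the ANOVA F-test on the noisy data, fit only on the training split (seeds and split ratio are assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

iris = load_iris()
rng = np.random.RandomState(42)
X = np.hstack([iris.data, rng.normal(size=(iris.data.shape[0], 4))])
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.3, random_state=0)

# Score each feature with the ANOVA F-test and keep the 4 best;
# the 4 informative features score far higher than the noise
selector = SelectKBest(f_classif, k=4).fit(X_train, y_train)
print(selector.get_support())
```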

2. *chi2*

This score can be used to select the features with the highest values for the test chi-squared statistic from data, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

chi2 test
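A sketch of chi-squared selection. Because chi2 accepts only non-negative features, the noise columns in this sketch are drawn from a uniform [0, 1) distribution — an assumption; the original article's noise may have been constructed differently:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

iris = load_iris()
rng = np.random.RandomState(42)
# chi2 requires non-negative features, so use uniform [0, 1) noise here
X = np.hstack([iris.data, rng.uniform(size=(iris.data.shape[0], 4))])
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.3, random_state=0)

# Keep the 4 features with the highest chi-squared statistics
selector = SelectKBest(chi2, k=4).fit(X_train, y_train)
print(selector.get_support())
```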

3. *mutual_info_classif*

Mutual information estimation comes in two variants:

- *mutual_info_classif* for classification targets
- *mutual_info_regression* for regression targets

mutual_info_classif Test
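A sketch of selection by mutual information on the noisy data (seeds and split ratio are assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

iris = load_iris()
rng = np.random.RandomState(42)
X = np.hstack([iris.data, rng.normal(size=(iris.data.shape[0], 4))])
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.3, random_state=0)

# Estimate mutual information between each feature and the class label,
# then keep the 4 highest-scoring features
selector = SelectKBest(mutual_info_classif, k=4).fit(X_train, y_train)
X_train_selected = selector.transform(X_train)
```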

**Recursive Feature Elimination**

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained either through a *coef_* attribute or through a *feature_importances_* attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of selected features is eventually reached.

RFE using Random Forest Classifier
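A sketch of RFE with a random forest as the external estimator, using its *feature_importances_* attribute to prune features (forest size, seeds, and split ratio are assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

iris = load_iris()
rng = np.random.RandomState(42)
X = np.hstack([iris.data, rng.normal(size=(iris.data.shape[0], 4))])
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.3, random_state=0)

# Recursively drop the least important feature (as ranked by the
# forest's feature_importances_) until 4 features remain
estimator = RandomForestClassifier(n_estimators=100, random_state=0)
rfe = RFE(estimator, n_features_to_select=4).fit(X_train, y_train)
print(rfe.support_)

# Evaluate on the held-out test split
score = rfe.score(X_test, y_test)
```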

In summary, we have seen how to use different feature selection methods on the same data and evaluated their performances.