Data Preprocessing Using Data Reduction Techniques In Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import (VarianceThreshold, RFE, SelectFromModel,
                                       SelectKBest, f_classif, chi2,
                                       mutual_info_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.datasets import load_iris

Principal Component Analysis (PCA)
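The plotting code in this section expects a DataFrame named finalDf with the columns 'principal component 1', 'principal component 2', and 'target', but the step that builds it is not shown. The following is a minimal sketch of one way to produce it, assuming the Iris data comes from scikit-learn's load_iris, that features are standardized before PCA, and that species names get an 'Iris-' prefix so they match the labels used in the plot. These choices are assumptions, not necessarily the original author's steps.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Standardize each feature to zero mean and unit variance (assumed step;
# PCA is sensitive to feature scale).
X = StandardScaler().fit_transform(iris.data)

# Project the 4 original features onto the first 2 principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(X)

finalDf = pd.DataFrame(
    components,
    columns=['principal component 1', 'principal component 2'],
)
# Assumed label format: prefix the species names so they read
# 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'.
finalDf['target'] = ['Iris-' + iris.target_names[i] for i in iris.target]

# Fraction of the total variance captured by each retained component.
print(pca.explained_variance_ratio_)
```

The explained variance ratio gives a quick check on how much information is lost by keeping only two components.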

# finalDf: DataFrame with the two principal components and a 'target' column
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=10)
ax.set_ylabel('Principal Component 2', fontsize=10)
ax.set_title('2 component PCA', fontsize=15)
targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    # Boolean mask selecting the rows that belong to this species
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color,
               s=50)
ax.legend(targets)
plt.show()

Univariate Feature Selection

  • Univariate feature selection picks the best features based on univariate statistical tests.
  • Each feature is compared with the target variable to check whether there is a statistically significant relationship between them.
  • While the relationship between one feature and the target variable is analyzed, all other features are ignored. That is why the method is called ‘univariate’.
  • Each feature receives its own test score.
  • Finally, the test scores are compared, and the features with the highest scores are selected.
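
The scoring and ranking steps described above can be sketched by hand. This example assumes the Iris dataset and the f_classif scoring function (an ANOVA F-test); the value k = 2 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris = load_iris()
X, y = iris.data, iris.target

# One F-statistic and one p-value per feature; each feature is tested
# against the target on its own, ignoring the other features.
scores, p_values = f_classif(X, y)

for name, score, p in zip(iris.feature_names, scores, p_values):
    print(f"{name}: F = {score:.1f}, p = {p:.2e}")

# Compare the scores and keep the k highest-scoring features.
k = 2  # illustrative choice
top_idx = np.argsort(scores)[::-1][:k]
top_features = [iris.feature_names[i] for i in top_idx]
print(top_features)
```

Because every feature is scored in isolation, this approach is fast, but it cannot detect features that are only informative in combination with others.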
  The scoring function is chosen according to the type of task:
  1. for classification (for example, f_classif)
  2. for regression
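
As a rough illustration of the two cases, the sketch below uses scikit-learn's SelectKBest with f_classif for a classification problem and with f_regression for a regression problem. The use of f_regression, the diabetes dataset, and the values of k are assumptions made for this example; the text above does not name a regression scoring function.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.feature_selection import SelectKBest, f_classif, f_regression

# Classification: categorical target, scored with the ANOVA F-test.
X_cls, y_cls = load_iris(return_X_y=True)
selector_cls = SelectKBest(score_func=f_classif, k=2)
X_cls_reduced = selector_cls.fit_transform(X_cls, y_cls)

# Regression: continuous target, scored with a univariate linear
# regression F-test (assumed choice of scoring function).
X_reg, y_reg = load_diabetes(return_X_y=True)
selector_reg = SelectKBest(score_func=f_regression, k=5)
X_reg_reduced = selector_reg.fit_transform(X_reg, y_reg)

print(X_cls.shape, "->", X_cls_reduced.shape)
print(X_reg.shape, "->", X_reg_reduced.shape)

# Boolean mask marking which of the original columns were kept.
print(selector_cls.get_support())
```

Using a classification scoring function on a continuous target (or the reverse) gives misleading scores, so the scoring function should always match the task.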




