Dimension Reduction in Python

The longer we were in it, the smaller it seemed to get. – William Beebe

Introduction

In the real world, datasets can comprise thousands of observations and hundreds of features. In data science and machine learning, datasets with many columns or features are called high-dimensional. While more information is generally better, a large number of features often leads to complex models that overfit the training data; and when models overfit, they perform poorly on test data. That is precisely where dimension reduction comes in handy.

Dimension reduction is the removal of features that contribute little or nothing to model predictions. The removed features either contribute no variance to the prediction, or so little that ignoring them has little or no impact on the model's evaluation metric (such as accuracy, ROC AUC or RMSE). Such features add "noise" to the model, which increases the chances of overfitting. For instance, in a classification problem, if reducing the feature set from 60 features to five does not change the model's accuracy, then the 55 dropped features contributed nothing of significance to the model.

Dimension reduction is therefore an important step in data preprocessing, where it is applied for feature selection. It is also applied after feature extraction and the creation of new features to minimize duplication. In addition, it has the following benefits:

      • A less complex dataset
      • Lower chances of overfitting
      • Faster computation, since the dataset has been simplified
      • Less disk space required
      • Insight into which features matter most for predictions

In this article, we will discuss dimension reduction using the speed-dating dataset retrieved from Kaggle. The data was collected from 8378 individuals on a speed-dating platform, with questionnaires administered at different stages of the process. We will use the variables in the dataset to determine whether we can predict if a date was a match or not. Using the "not so clean" dataset, I have created a Kaggle notebook with step-by-step preprocessing of the dataset: checking and converting data types, removing unwanted characters, dropping missing values, and demonstrating different dimension reduction techniques. You can access it on my Kaggle profile here.

    Dimension reduction techniques

    Dropping duplicate features

Dropping duplicate features is the easiest way to reduce the dimensions of a dataset. Duplicate features may already be present in the dataset, or they may arise after feature engineering. In our speed-dating dataset, we recreate the age-difference variable by calculating the difference between the individual's age and their partner's age. This leaves us with other age-related variables that carry similar information but are of less importance, and these are dropped from the dataset, as sketched below.
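
A rough sketch of this step; the file name speeddating.csv and the column names age, age_o and d_age are assumptions that may differ in your copy of the dataset:

import pandas as pd

# Load the (assumed) cleaned dataset
df = pd.read_csv('speeddating.csv')

# Recreate the age difference from the individual's age and the partner's age
df['age_difference'] = (df['age'] - df['age_o']).abs()

# Drop the pre-existing age-difference column that now duplicates this information
df = df.drop(columns=['d_age'])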

The dataset also contains variables that store information from numeric columns as bins/groups. For example, float variables such as funny, ambitious, attractive, sincere and intelligence are converted into groups showing the range of each value and stored in other variables whose names start with the prefix d_. Essentially, these variables carry the same information as the variables they were derived from. Since models prefer data in numeric format, these grouped features are dropped as duplicates. Removing such features from the dataset is a simple method of dimension reduction, as shown below.
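
A hedged sketch, assuming the binned duplicates are exactly the columns whose names start with the d_ prefix:

# Collect every column whose name starts with the 'd_' prefix
binned_cols = [col for col in df.columns if col.startswith('d_')]

# Drop the binned duplicates and keep the original numeric columns
df = df.drop(columns=binned_cols)
print(f'Dropped {len(binned_cols)} binned duplicate columns')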

After removing the duplicated features, we instantiate different models and test their prediction accuracy. An xgboost classifier has the highest accuracy of 0.875. We will use this as the baseline against which further dimension reduction is measured.
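
A minimal sketch of how such a baseline might be computed; the target column name match, the test_size and the random_state are assumptions:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Assumed target column: 'match' (1 = the date was a match, 0 = it was not)
X = df.drop(columns=['match'])
y = df['match']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline xgboost classifier trained on the full feature set
baseline = xgb.XGBClassifier()
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)
accuracy = np.sum(y_pred == y_test) / len(y_pred)
print(f'Baseline accuracy: {accuracy:.3f}')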

    Feature selection with Random Forests

Random forest is an ensemble method that uses decision trees as its base estimators for both regression and classification problems. Each estimator is trained on a bootstrap sample of the same size as the training set, and this additional randomization reduces error. In classification problems, like ours, the final prediction is made by majority voting across the trees, while in regression it is computed by averaging.

Tree-based methods, including the random forest classifier, can measure the importance of each feature and report it as a fraction of the total importance. These values are accessed after fitting the training data, using the .feature_importances_ attribute on the model.

from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and rank the features by importance
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_importance = pd.Series(rf.feature_importances_, index = rf.feature_names_in_)
rf_importance = rf_importance.sort_values(ascending = False)
rf_importance.head()

attractive_o          0.050226
like                  0.045524
shared_interests_o    0.037946
funny_o               0.037564
funny_partner         0.037016

According to the random forest classifier, the features above have the greatest contribution to the prediction. Feature selection can be done by keeping only the features whose contribution exceeds a certain threshold. For example, in our speed-date matching problem, we selected features with a contribution greater than 1%, which resulted in the selection of 50 out of 70 features and an improved accuracy of 0.880, up from 0.875 when all 70 features were used.

# Keep only the features contributing more than 1% of the importance
xg_cl = xgb.XGBClassifier()
top_features = rf_importance[rf_importance > 0.01].index
xg_cl.fit(X_train[top_features], y_train)
y_pred = xg_cl.predict(X_test[top_features])
accuracy = (np.sum(y_pred == y_test))/(len(y_pred))
print(f'Dropping features from {len(rf_importance)} to {len(top_features)} improves our accuracy to {accuracy}')

    Dropping features from 70 to 50 improves our accuracy to 0.880

    Feature selection with Extreme Gradient Boosting

Extreme Gradient Boosting, or xgboost for short, is a modelling technique popular for its speed, performance and scalability. It can train on very large datasets by harnessing all the CPU cores of modern computers, and it is used in both regression and classification problems.
     
Xgboost can be used for feature selection by outputting feature importance values from a trained model. To leverage the performance and efficiency of the algorithm, the dataset is converted to a DMatrix format, which is used for training. The .get_score() method is then called on the trained model to give the contribution of each feature as an F score; the higher the score, the more important the feature.

# Convert the training data to xgboost's optimized DMatrix format
dating_matrix = xgb.DMatrix(data = X_train, label = y_train)
params = {'objective':'binary:logistic'}

# Train with the native API, then pull per-feature scores from the booster
xgb_clf = xgb.train(dtrain = dating_matrix, params = params, num_boost_round = 10)
xgb_features = pd.DataFrame(xgb_clf.get_score(), index = ['score']).T
xgb_features['score'].sort_values(ascending=False).head()

attractive_o           28.0
guess_prob_liked       22.0
shared_interests_o     21.0
funny_partner          21.0
pref_o_intelligence    20.0

The features above are the top five selected by xgboost. To evaluate our model, we select features with an F score greater than 3, which results in the selection of 47 features. Evaluating the model with these features gives an accuracy of 0.871.

xgb_cl = xgb.XGBClassifier()
xgb_top_features = xgb_features[xgb_features['score'] > 3].index
xgb_cl.fit(X_train[xgb_top_features], y_train)
y_pred = xgb_cl.predict(X_test[xgb_top_features])
accuracy = (np.sum(y_pred == y_test))/(len(y_pred))
print(f'Dropping features from 70 to {len(xgb_top_features)} results in an accuracy of {accuracy}')

Dropping features from 70 to 47 results in an accuracy of 0.871

    Dimension Reduction with Recursive Feature Elimination – RFE

RFE is a feature selection algorithm that works with any model able to output feature importances or coefficients. It fits the model, drops the weakest features, and repeats the process until the specified number of features remains.

The desired model is passed to the RFE algorithm together with the number of features wanted. The algorithm then fits the model and makes predictions using only the selected features. In this case, using the xgboost classifier and keeping 15 features, down from 70, still gives an accuracy of 0.867, compared with 0.875 for the full feature set: not a big difference given such a massive reduction in features.

from sklearn.feature_selection import RFE

# Recursively eliminate features until only 15 remain
rfe = RFE(estimator = xgb.XGBClassifier(), n_features_to_select = 15)
rfe.fit(X_train, y_train)
y_pred = rfe.predict(X_test)
accuracy = (np.sum(y_pred == y_test))/(len(y_pred))
print(f'Model accuracy is {accuracy}')

    Model accuracy is 0.867

After fitting the algorithm to the training data, the .support_ attribute gives a boolean mask: True for features that were selected and False for those that were not.

    mask = rfe.support_
    xgb_top_15_features = X_train.columns[mask].tolist()
    print(f'\n Top 15 features in the xgb classifier \n {xgb_top_15_features}')

    Top 15 features in the xgb classifier

['pref_o_intelligence', 'attractive_o', 'funny_o', 'shared_interests_o', 'attractive_important', 'intellicence_important', 'attractive', 'attractive_partner', 'funny_partner', 'hiking', 'movies', 'expected_num_matches', 'like', 'guess_prob_liked', 'met']

    Selecting features from multiple models using RFE

This involves passing different classifiers, one by one, to RFE to generate True/False masks over the features. We can select 30 features with each classifier and then combine the masks to keep only the features selected by every classifier passed to the RFE algorithm.

# RFE with a random forest: keep the 30 strongest features
rfe = RFE(estimator = RandomForestClassifier(), n_features_to_select = 30)
rfe.fit(X_train, y_train)
mask_rfe = rfe.support_

We then repeat the process with other classifiers in the RFE model; in our case we add a logistic regression classifier and the xgboost classifier, producing the masks mask_log and mask_xgb. Out of the 30 features selected by each model, 11 are selected by all of them. A sketch of this repetition is shown below.
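
The sketch below mirrors the random forest step above; LogisticRegression from scikit-learn is used here, and raising max_iter so it converges on unscaled data is my assumption rather than something from the original notebook:

from sklearn.linear_model import LogisticRegression

# Same procedure with a logistic regression classifier
rfe_log = RFE(estimator = LogisticRegression(max_iter = 1000), n_features_to_select = 30)
rfe_log.fit(X_train, y_train)
mask_log = rfe_log.support_

# And again with an xgboost classifier
rfe_xgb = RFE(estimator = xgb.XGBClassifier(), n_features_to_select = 30)
rfe_xgb.fit(X_train, y_train)
mask_xgb = rfe_xgb.support_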

# Count how many models selected each feature; keep features chosen by all three
mask_all_models = np.sum([mask_rfe, mask_log, mask_xgb], axis=0)
mask = mask_all_models == 3
most_important_features = X_train.columns[mask]
most_important_features.shape

    (11,)

Keeping only the features chosen by all three classifiers results in an accuracy of 0.866, again not far from the 0.875 obtained with all the features.

    xgb_cl = xgb.XGBClassifier(objective = 'binary:logistic')
    xgb_cl.fit(X_train[most_important_features],y_train)
    y_pred = xgb_cl.predict(X_test[most_important_features])
    accuracy = (np.sum(y_pred == y_test))/(len(y_pred))
    print(f'The accuracy of 11 features selected by all models is {accuracy}')

    The accuracy of 11 features selected by all models is 0.866

    Dimension Reduction with Principal Component Analysis – PCA

PCA is a dimension-reduction technique that transforms correlated variables into a set of uncorrelated components and reports the fraction of the total variance carried by each component through the explained_variance_ratio_ attribute. To reduce the dataset to a specific number of components, you pass that number to the n_components argument when instantiating PCA. Since the right number is usually not known in advance, PCA is first instantiated without the argument, and the resulting variance ratios are used to choose a good value for n_components.

from sklearn.decomposition import PCA

# PCA is unsupervised, so only the features are needed for fitting
pca = PCA()
pca.fit(X_train)
variance = pca.explained_variance_ratio_

A plot of the explained variance ratio is a good indicator of the ideal number of components to use in PCA. The number of components is picked at the elbow, where the explained variance drops off abruptly. In this case the value is 11, so we will use 11 components in PCA to transform the data before fitting.

[Figure: explained variance ratio per principal component]
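
A hedged sketch of how such an elbow plot can be produced with matplotlib; the styling choices here are mine, not from the original notebook:

import matplotlib.pyplot as plt

# Scree/elbow plot: explained variance ratio for each principal component
plt.plot(range(1, len(variance) + 1), variance, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.title('Explained variance ratio')
plt.show()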

    Reducing the dataset to 11 features using PCA results in a drop in accuracy from 0.875 to 0.851.

# Project the data onto the first 11 principal components
pca = PCA(n_components = 11)
xgb_cl = xgb.XGBClassifier()

X_train_transformed = pca.fit_transform(X_train)
X_test_transformed = pca.transform(X_test)

xgb_cl.fit(X_train_transformed, y_train)
y_pred = xgb_cl.predict(X_test_transformed)
accuracy = (np.sum(y_pred == y_test))/(len(y_pred))
print(f'The accuracy of 11 components is {accuracy}')

    The accuracy of 11 components is 0.851

    Conclusion

Dimension reduction is a powerful technique that brings a high-dimensional dataset down to a more manageable size by dropping unnecessary features. This minimizes overfitting and results in a leaner dataset that trains faster and requires less disk space. Dimension reduction is highly applicable to datasets with many features, and it is a necessary step that allows the selection of the features that contribute meaningfully to the model.

While the techniques for reducing or selecting features are varied, it is up to you to choose the most appropriate one for your situation, whether the goal is to reduce the dimensionality of the dataset or to improve the accuracy of the model. The choice is yours!

     
