Published Paper


A Robust Feature Selection Framework for Effective Processing of Machine Learning Datasets

Samera Uga Otor
Nigeria
Page: 299-316
Published on: 2023 September

Abstract

Most machine learning datasets are riddled with noise, outliers, redundant features, and blank entries. Before learning models can process such datasets and produce good results, the data must be properly prepared using preprocessing techniques such as data cleansing, feature selection, and feature engineering. This study therefore developed a feature selection framework. The framework defines, as a pipeline, a list of datasets; feature selection score functions for regressors and classifiers, such as Chi-square, ANOVA, and Pearson’s correlation; and regressors and classifiers such as Decision Tree, Multilayer Perceptron Neural Network, K-nearest neighbor, and Random Forest. The framework is designed to choose between regression predictive modeling and classification predictive modeling based on the data type of the output variable. It also allows the number of datasets, feature selection score functions, regressors, and classifiers to be increased or reduced as desired.
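The choice between regression and classification described above can be illustrated with a minimal sketch. It assumes scikit-learn and a simple rule (floating-point targets are treated as regression, everything else as classification); the abstract does not specify the exact rule, and the helper names `infer_task` and `build_pipeline` are hypothetical, not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor


def infer_task(y):
    """Assumed rule: float targets mean regression, anything else classification."""
    y = np.asarray(y)
    return "regression" if np.issubdtype(y.dtype, np.floating) else "classification"


def build_pipeline(y, k=5):
    """Pair a score function with a matching model, based on the target's data type."""
    task = infer_task(y)
    if task == "regression":
        score_func, model = f_regression, DecisionTreeRegressor(random_state=0)
    else:
        score_func, model = f_classif, DecisionTreeClassifier(random_state=0)
    pipe = Pipeline([
        ("select", SelectKBest(score_func=score_func, k=k)),  # keep the k best-scoring features
        ("model", model),
    ])
    return task, pipe


# Synthetic stand-in data; the paper's datasets are not used here.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)
task, pipe = build_pipeline(y, k=4)
scores = cross_val_score(pipe, X, y, cv=3)  # accuracy per fold for a classifier
print(task, scores.mean())
```

Additional score functions or models could be swapped in by extending the two branches, which mirrors the abstract's point that the lists can be increased or reduced as desired.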

The framework was tested using the CIC-DDoS2019, XIIoTID, DDoS-SDN, and DoS/DDoS-MQTT-IoT datasets. The datasets were subjected to several preprocessing techniques for data cleansing, which included filling not-a-number values, infinity values, special characters, and empty values, and converting negative values to positive values as needed, as well as several feature engineering procedures, such as label imputers, encoders, and scalers. The datasets were then evaluated to obtain the features with the best scores for each dataset as either a classification or a regression problem. Furthermore, to test for feature stability, the datasets were evaluated using recursive feature elimination (RFE). Results show that for the CIC-DDoS2019 and XIIoTID datasets, f-classif selected the best features, with an accuracy of 99% to 100%. For the DDoS-SDN dataset, f-regression with the Random Forest regressor selected the best features, with an MSE of 0.0005 and an R² of 0.998. For the DoS/DDoS-MQTT-IoT dataset, the mutual info regressor with Random Forest selected the best features, with an MSE of 0.0128 and an R² of 94%. For feature stability, the features that were selected consistently are reported for researchers who intend to use these datasets in further research.
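The cleansing and stability steps can be sketched as follows, again assuming scikit-learn and NumPy. The abstract does not state the fill value or how consistency was determined, so this sketch assumes that missing and infinite entries are filled with 0 and that a feature counts as consistent when both a score-based selector and RFE keep it. The `clean` helper and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif


def clean(X):
    """Fill NaN and +/-inf with 0 (assumed fill value) and make negatives positive."""
    X = np.asarray(X, dtype=float)
    X = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0)
    return np.abs(X)


# Synthetic stand-in data with a few damaged entries.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(120, 6))
X_raw[0, 2] = np.nan
X_raw[1, 3] = np.inf
X = clean(X_raw)

# Label depends on the first two cleaned features only.
signal = X[:, 0] + X[:, 1]
y = (signal > np.median(signal)).astype(int)

# Score-based selection (ANOVA F-test) and wrapper-based selection (RFE).
kbest = SelectKBest(score_func=f_classif, k=3).fit(X, y)
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=3).fit(X, y)

# Assumed notion of stability: features kept by both methods.
consistent = np.flatnonzero(kbest.get_support() & rfe.get_support())
print("consistent feature indices:", consistent)
```

Intersecting the two selections is only one possible reading of "consistent features"; comparing selections across resampled folds would be another.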

 
