Imbalanced Big Data Classification using Feature Selection Under-Sampling

Ch. Sarada; M. SathyaDevi

Abstract

Imbalanced learning is the classification problem in which the number of observations in one class far exceeds the number in another. Various sampling approaches have been proposed for binary and multi-class imbalanced classification. Binary imbalanced classification involves two classes, a majority class and a minority class, whereas multi-class imbalanced classification involves more than two classes. Among the conventional approaches, under-sampling is regarded as the better sampling technique. However, existing approaches may not work well in a Big Data environment, because considering all the features can degrade the performance of the system. In this work, a novel method is presented that takes into account only the essential features and also handles the massive data volumes typical of a Big Data environment. The proposed system uses a Feature Selection Under-Sampling technique to resample the data. Feature selection is a vital step because it not only reduces the dimensionality of the data but also helps the classifier run faster and can improve accuracy. On top of this, a Support Vector Machine (SVM) classifier is used to build the model and test the data. The proposed system is implemented on the MapReduce framework, integrated with the statistical analysis tool R.
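The resampling idea can be illustrated with a minimal single-machine sketch in Python. It is not the authors' implementation: the abstract does not state the feature-ranking criterion, the order of the two steps, or how they are distributed over MapReduce, so the choices below are assumptions. Features are ranked by the spread of the per-class means divided by the overall standard deviation, the top k are kept, and each class is then randomly under-sampled to the size of the smallest class. The function names (score_features, select_top_features, random_undersample, feature_selection_undersample) are hypothetical, and the SVM training step is omitted.

```python
import random
import statistics
from collections import defaultdict


def score_features(rows, labels):
    """Score each feature by (max class mean - min class mean) / overall std.

    This ranking criterion is an assumption made for illustration only.
    """
    classes = set(labels)
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        overall_sd = statistics.pstdev(column)
        class_means = [
            statistics.fmean(row[j] for row, y in zip(rows, labels) if y == c)
            for c in classes
        ]
        spread = max(class_means) - min(class_means)
        # A constant feature carries no class information, so it scores zero.
        scores.append(spread / overall_sd if overall_sd > 0 else 0.0)
    return scores


def select_top_features(rows, labels, k):
    """Return the indices of the k highest-scoring features, in column order."""
    scores = score_features(rows, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def random_undersample(rows, labels, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, y in zip(rows, labels):
        by_class[y].append(row)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels


def feature_selection_undersample(rows, labels, k, seed=0):
    """Keep the top-k features, then balance the classes by under-sampling."""
    keep = select_top_features(rows, labels, k)
    reduced = [[row[j] for j in keep] for row in rows]
    balanced_rows, balanced_labels = random_undersample(reduced, labels, seed)
    return keep, balanced_rows, balanced_labels
```

The balanced, reduced rows returned by feature_selection_undersample could then be passed to any SVM implementation. Selecting features before resampling means the ranking sees all of the majority-class data, which is one possible design choice rather than something the abstract prescribes.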