A Comparative Study For Imbalanced Data Techniques Of Classification Algorithms

Dede Brahma Arianto; Siti Nurrahmasita

doi:10.53866/jimi.v5i4.949

Authors

Dede Brahma Arianto Universitas Faletehan
Siti Nurrahmasita Universitas Syiah Kuala

DOI:

https://doi.org/10.53866/jimi.v5i4.949

Keywords:

Imbalanced Data, Machine Learning, Classification, RUS, SMOTETomek, SMOTE

Abstract

A central challenge in machine learning is imbalanced class distribution: minority classes are underrepresented, which biases the predictions of classification algorithms such as K-Nearest Neighbors (KNN), Naive Bayes, and Support Vector Machine (SVM) toward the majority class. This study addresses the problem by applying Random Undersampling (RUS), the Synthetic Minority Oversampling Technique (SMOTE), and a hybrid approach, SMOTETomek. Using the NHANES dataset, it evaluates how effectively these methods reduce bias and improve classification performance. The hybrid sampling technique performed best, increasing sensitivity to the minority classes and producing more balanced predictions. Models were evaluated with accuracy, precision, recall, and F1-score; SVM achieved the highest accuracy, 98.8%, after hyperparameter tuning. The study also highlights the importance of hyperparameter optimization, including C and gamma for SVM, the value of k for KNN, and the smoothing factor for Gaussian Naive Bayes, in improving model reliability. These findings underline the value of effective data preprocessing and model optimization when dealing with imbalanced datasets. Applying these approaches supports more accurate data analysis and provides useful insights for decision-making and policies that address imbalanced cases.
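To make the resampling ideas in the abstract concrete, here is a minimal sketch of random undersampling and a SMOTE-style oversampler in plain Python. It is not the authors' implementation: the function names `random_undersample` and `smote_like_oversample`, the toy data, and the choice of `k=2` are assumptions made for illustration. The Tomek-link cleaning step that turns SMOTE into SMOTETomek is not shown.

```python
import math
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    target = min(Counter(y).values())
    keep = []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_like_oversample(X, y, minority, k=2, seed=0):
    """Add synthetic minority points by interpolating toward a near neighbor."""
    rng = random.Random(seed)
    pts = [x for x, lab in zip(X, y) if lab == minority]
    majority_size = max(Counter(y).values())
    X_new, y_new = list(X), list(y)
    for _ in range(majority_size - len(pts)):
        base = rng.choice(pts)
        # k nearest minority neighbors of the base point (excluding itself)
        neighbors = sorted((p for p in pts if p is not base),
                           key=lambda p: math.dist(p, base))[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> other
        X_new.append([b + gap * (o - b) for b, o in zip(base, other)])
        y_new.append(minority)
    return X_new, y_new


# Hypothetical toy data: 8 majority samples (class 0), 3 minority samples (class 1).
X = [[float(i), float(i % 3)] for i in range(8)]
X += [[10.0, 10.0], [11.0, 10.5], [10.5, 11.0]]
y = [0] * 8 + [1] * 3

X_rus, y_rus = random_undersample(X, y)
X_sm, y_sm = smote_like_oversample(X, y, minority=1)
print(Counter(y_rus))  # both classes reduced to 3 samples
print(Counter(y_sm))   # both classes raised to 8 samples
```

Undersampling discards majority samples, while the SMOTE-style step keeps all original data and adds interpolated minority points. In practice, a maintained library such as imbalanced-learn, which provides `RandomUnderSampler`, `SMOTE`, and `SMOTETomek`, would likely be preferable to hand-written resampling.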

