A Comparative Study For Imbalanced Data Techniques Of Classification Algorithms
DOI:
https://doi.org/10.53866/jimi.v5i4.949
Keywords:
Imbalanced Data, Machine Learning, Classification, RUS, SMOTETomek, SMOTE
Abstract
A main challenge in machine learning is imbalanced class distribution: minority classes are underrepresented, which biases the predictions of classification algorithms such as K-Nearest Neighbors (KNN), Naive Bayes, and Support Vector Machine (SVM) toward the majority class. This study addresses the problem by applying Random Undersampling (RUS), the Synthetic Minority Oversampling Technique (SMOTE), and the hybrid approach SMOTETomek. Using the NHANES dataset, it evaluates how effectively these methods reduce bias and improve classification performance. The hybrid sampling technique performed best: it increased sensitivity to the minority classes and produced more balanced predictions. The models were evaluated with accuracy, precision, recall, and F1-score; SVM achieved the highest accuracy, 98.8%, after hyperparameter tuning. The study also emphasizes the importance of hyperparameter optimization, including C and gamma for SVM, the value of k for KNN, and the smoothing factor for Gaussian Naive Bayes, in improving model reliability. These findings underscore the value of effective data preprocessing and model optimization when working with imbalanced datasets. Applying these approaches supports more accurate data analysis and provides useful input for decision-making and policies that deal with imbalanced cases.
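The two basic resampling ideas named in the abstract can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names `random_undersample` and `smote_like_oversample`, the toy data, and the parameters `k` and `seed` are all assumptions made for illustration. The Tomek-link cleaning step of SMOTETomek is omitted. In practice, the imbalanced-learn library provides `RandomUnderSampler`, `SMOTE`, and `SMOTETomek` for this purpose.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def random_undersample(X, y, seed=0):
    """RUS sketch: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, n_min):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def smote_like_oversample(X, y, k=3, seed=0):
    """SMOTE-style sketch: add synthetic rows to smaller classes by interpolating
    between a row and one of its k nearest neighbours from the same class."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_max = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [r for r, l in zip(X, y) if l == label]
        if len(rows) < 2:
            continue  # interpolation needs at least two rows of the class
        for _ in range(n_max - count):
            base = rng.choice(rows)
            neighbours = sorted(
                (r for r in rows if r is not base),
                key=lambda r: sq_dist(base, r),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 9 majority rows (class 0) and 3 minority rows (class 1).
X = [
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4],
    [0.4, 0.2], [0.1, 0.4], [0.3, 0.1], [0.4, 0.4],
    [2.0, 2.0], [2.2, 2.1], [2.1, 2.3],
]
y = [0] * 9 + [1] * 3

X_rus, y_rus = random_undersample(X, y)
X_sm, y_sm = smote_like_oversample(X, y)

print("original:    ", sorted(Counter(y).items()))
print("undersampled:", sorted(Counter(y_rus).items()))
print("oversampled: ", sorted(Counter(y_sm).items()))
```

Undersampling discards majority rows, so it loses information when the minority class is very small. Oversampling keeps every original row and adds interpolated ones, at the cost of training on synthetic points.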
License
Copyright (c) 2025 Dede Brahma Arianto, Siti Nurrahmasita

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.