Feature Extraction and Stacking Model Analysis for Gallstone Disease Prediction based on UCI Dataset
DOI:
https://doi.org/10.6919/ICJE.202511_11(11).0005
Keywords:
Gallstone Disease; Ensemble Learning; Feature Extraction; Stacking Model; Principal Component Analysis; Biomedical Data Mining; Predictive Modeling.
Abstract
Gallstone disease (GD) is a prevalent, multifactorial hepatobiliary disorder whose etiology involves metabolic, genetic, and environmental factors. Accurate early diagnosis is crucial to prevent complications such as cholecystitis, pancreatitis, and bile duct obstruction. However, conventional diagnostic procedures, including ultrasonography and computed tomography, are costly, operator-dependent, and not always feasible for population-level screening. In this study, a machine learning framework is proposed to predict gallstone occurrence using the UCI Gallstone Disease Dataset. The methodology integrates data preprocessing, normalization, principal component analysis (PCA), multi-strategy feature selection, and a stacking ensemble learning architecture. After data cleaning and Z-score normalization, PCA was employed to reduce redundancy and extract latent diagnostic features explaining over 95% of the total variance. Three feature selection strategies, namely mutual information (MI), L1-regularized logistic regression, and random forest (RF) importance, were integrated to select the most discriminative clinical features. A stacking model was implemented with RF, XGBoost (XGB), and Support Vector Machine (SVM) as base learners and Logistic Regression (LR) as the meta-classifier. The ensemble demonstrated strong generalization performance. In cross-validation it achieved an area under the ROC curve (AUC) of 0.8789 ± 0.0283 and Accuracy = 0.8120 ± 0.0405; on the held-out test set it achieved AUC = 0.9102, Accuracy = 0.8125, Recall = 0.8438, Specificity = 0.7812, and Matthews correlation coefficient (MCC) = 0.6262. These findings indicate that the proposed approach balances sensitivity and specificity, offering a practical computational model for gallstone risk screening and diagnosis.
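The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic (generated with `make_classification`, not the UCI table), all hyperparameters are arbitrary, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost to keep the example dependency-free, and the three-way feature selection step is omitted because the abstract does not say how the three rankings were combined.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the clinical feature table (assumption, not the UCI data).
X, y = make_classification(
    n_samples=320, n_features=30, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Base learners; gradient boosting is a stand-in for XGBoost (assumption).
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# Logistic regression meta-classifier trained on out-of-fold base predictions.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    stack_method="predict_proba",
)

model = Pipeline(
    [
        ("scale", StandardScaler()),      # Z-score normalization
        ("pca", PCA(n_components=0.95)),  # keep components covering >= 95% variance
        ("stack", stack),
    ]
)
model.fit(X_train, y_train)

# Predicted probability of the positive class for each held-out sample.
test_proba = model.predict_proba(X_test)[:, 1]
```

Wrapping the scaler and PCA in the same `Pipeline` as the classifier means both are fitted on training data only, which avoids leaking test-set statistics into the preprocessing.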
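As a worked check on the reported test metrics, the sketch below computes accuracy, recall, specificity, and MCC from confusion-matrix counts. The abstract does not give the counts; the values used here (TP = 27, FN = 5, TN = 25, FP = 7, i.e. a 64-sample test set with 32 cases per class) are an assumption, chosen only because they are one set of counts consistent with the published figures. The function name `binary_metrics` is likewise illustrative.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return accuracy, recall, specificity and MCC from confusion counts."""
    total = tp + fn + tn + fp
    # MCC denominator: product of the four marginal totals, square-rooted.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),            # sensitivity
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts (not stated in the source), consistent with the abstract.
metrics = binary_metrics(tp=27, fn=5, tn=25, fp=7)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, recall is 27/32 = 0.84375, specificity is 25/32 = 0.78125, accuracy is 52/64 = 0.8125, and MCC is 640 / sqrt(34 · 32 · 32 · 30) ≈ 0.6262, which agrees with the reported values to four decimal places.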
License
Copyright (c) 2025 International Core Journal of Engineering

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.




