Feature Extraction and Stacking Model Analysis for Gallstone Disease Prediction based on UCI Dataset

Authors

  • Yaya Li
  • Wenjing Zhang
  • Lin Wang
  • Xinquan Di
  • Zezhen Wang

DOI:

https://doi.org/10.6919/ICJE.202511_11(11).0005

Keywords:

Gallstone Disease; Ensemble Learning; Feature Extraction; Stacking Model; Principal Component Analysis; Biomedical Data Mining; Predictive Modeling.

Abstract

Gallstone disease (GD) is a prevalent, multifactorial hepatobiliary disorder whose etiology involves metabolic, genetic, and environmental factors. Accurate early diagnosis is crucial to prevent complications such as cholecystitis, pancreatitis, and bile duct obstruction. However, conventional diagnostic procedures, including ultrasonography and computed tomography, are costly, operator-dependent, and not always feasible for population-level screening. This study proposes a machine learning framework for predicting gallstone occurrence using the UCI Gallstone Disease Dataset. The methodology integrates data preprocessing, normalization, principal component analysis (PCA), multi-strategy feature selection, and a stacking ensemble learning architecture. After data cleaning and Z-score normalization, PCA was employed to reduce redundancy and extract latent diagnostic features explaining over 95% of the total variance. Three feature selection strategies were combined to select the most discriminative clinical features: mutual information (MI), L1-regularized logistic regression, and random forest (RF) importance. A stacking model was implemented with RF, XGBoost (XGB), and Support Vector Machine (SVM) as base learners and Logistic Regression (LR) as the meta-classifier. In cross-validation, the ensemble achieved AUC = 0.8789 ± 0.0283 and Accuracy = 0.8120 ± 0.0405. On the held-out test set, it achieved AUC = 0.9102, Accuracy = 0.8125, Recall = 0.8438, Specificity = 0.7812, and MCC = 0.6262. These findings indicate that the proposed approach balances sensitivity and specificity, offering a practical computational model for gallstone risk screening and diagnosis.
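
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it runs on synthetic data from scikit-learn's make_classification rather than the UCI dataset, it substitutes scikit-learn's GradientBoostingClassifier for XGBoost so the example needs only one library, it omits the multi-strategy feature selection step (the abstract does not say how the three rankings are merged), and all hyperparameters and helper names (build_stacking_pipeline, evaluate) are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    matthews_corrcoef,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_stacking_pipeline(seed=0):
    """Z-score scaling -> PCA (>= 95% variance) -> stacking ensemble."""
    base_learners = [
        ("rf", RandomForestClassifier(n_estimators=100, random_state=seed)),
        # Assumption: gradient boosting stands in for XGBoost here.
        ("gb", GradientBoostingClassifier(random_state=seed)),
        # probability=True lets the SVM feed class probabilities to the meta-learner.
        ("svm", SVC(probability=True, random_state=seed)),
    ]
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),  # meta-classifier
        cv=5,  # out-of-fold base predictions are used to train the meta-classifier
    )
    return make_pipeline(
        StandardScaler(),
        # A float n_components keeps enough components to exceed that variance share.
        PCA(n_components=0.95, svd_solver="full"),
        stack,
    )


def evaluate(model, X_test, y_test):
    """Compute the metrics named in the abstract on a held-out set."""
    proba = model.predict_proba(X_test)[:, 1]
    pred = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_test, proba),
        "accuracy": accuracy_score(y_test, pred),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": matthews_corrcoef(y_test, pred),
    }


# Synthetic stand-in data (assumption): sizes are illustrative only.
X, y = make_classification(
    n_samples=320, n_features=30, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = build_stacking_pipeline().fit(X_train, y_train)
metrics = evaluate(model, X_test, y_test)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Placing the scaler and PCA inside the pipeline means both are fitted on training data only, which avoids leaking test-set statistics into the model. The numbers printed here come from synthetic data and are not comparable to the results reported in the abstract.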

Published

2025-11-22

Section

Articles

How to Cite

Li, Y., Zhang, W., Wang, L., Di, X., & Wang, Z. (2025). Feature Extraction and Stacking Model Analysis for Gallstone Disease Prediction based on UCI Dataset. International Core Journal of Engineering, 11(11), 51-58. https://doi.org/10.6919/ICJE.202511_11(11).0005