Diabetes Classification Model

A machine learning model that predicts diabetes risk from health indicators. It is an XGBoost classifier trained on the CDC Diabetes Health Indicators dataset, with SMOTE used to handle class imbalance.

Model Details

  • Architecture: XGBoost Classifier with SMOTE oversampling
  • Dataset: CDC Diabetes Health Indicators (50,737 samples, 21 features)
  • Task: Binary classification (Diabetes / No Diabetes)
  • Frameworks: scikit-learn, XGBoost, imbalanced-learn

Performance

| Metric               | Value  |
|----------------------|--------|
| Accuracy             | 85.51% |
| Precision (Diabetes) | 47.40% |
| Recall (Diabetes)    | 30.08% |
| F1-Score (Diabetes)  | 36.80% |
| ROC-AUC              | 82.04% |
| MCC                  | 30.04% |
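
As a quick consistency check on the figures above, the F1-score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. This is only a small arithmetic sketch using the published values; it does not re-run the evaluation.

```python
# Reported metrics for the Diabetes class, copied from the values above.
precision = 0.4740
recall = 0.3008

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 from precision and recall: {f1:.2%}")  # prints 36.80%
```

The recomputed value matches the reported F1-Score of 36.80%.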

Features

The model uses 21 health indicator features:

  1. HighBP - High blood pressure (0/1)
  2. HighChol - High cholesterol (0/1)
  3. CholCheck - Cholesterol check in past 5 years (0/1)
  4. BMI - Body Mass Index (continuous)
  5. Smoker - Smoked at least 100 cigarettes in lifetime (0/1)
  6. Stroke - Ever had a stroke (0/1)
  7. HeartDiseaseorAttack - Coronary heart disease or myocardial infarction (0/1)
  8. PhysActivity - Physical activity in past 30 days (0/1)
  9. Fruits - Consume fruit 1+ times per day (0/1)
  10. Veggies - Consume vegetables 1+ times per day (0/1)
  11. HvyAlcoholConsump - Heavy alcohol consumption (0/1)
  12. AnyHealthcare - Have any kind of health care coverage (0/1)
  13. NoDocbcCost - Could not see doctor because of cost (0/1)
  14. GenHlth - General health (1-5 scale)
  15. MentHlth - Days of poor mental health (0-30)
  16. PhysHlth - Days of poor physical health (0-30)
  17. DiffWalk - Difficulty walking or climbing stairs (0/1)
  18. Sex - Gender (0=Female, 1=Male)
  19. Age - Age category (1-13)
  20. Education - Education level (1-6)
  21. Income - Income level (1-8)
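
The ranges in the list above can be checked before a record is passed to the model. The following is a minimal sketch: `FEATURE_RANGES` and `validate_features` are hypothetical names introduced here for illustration and are not part of the model repository, and the non-negative bound on BMI is an assumption, since the list only says it is continuous.

```python
# Allowed (low, high) ranges copied from the feature list; None means no upper bound.
BINARY = (0, 1)
FEATURE_RANGES = {
    "HighBP": BINARY, "HighChol": BINARY, "CholCheck": BINARY,
    "BMI": (0, None),  # continuous; the non-negative bound is an assumption
    "Smoker": BINARY, "Stroke": BINARY, "HeartDiseaseorAttack": BINARY,
    "PhysActivity": BINARY, "Fruits": BINARY, "Veggies": BINARY,
    "HvyAlcoholConsump": BINARY, "AnyHealthcare": BINARY, "NoDocbcCost": BINARY,
    "GenHlth": (1, 5), "MentHlth": (0, 30), "PhysHlth": (0, 30),
    "DiffWalk": BINARY, "Sex": BINARY,
    "Age": (1, 13), "Education": (1, 6), "Income": (1, 8),
}


def validate_features(record):
    """Return a list of problems found in a feature dict (empty if it is valid)."""
    problems = []
    for name, (low, high) in FEATURE_RANGES.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
            continue
        value = record[name]
        if value < low or (high is not None and value > high):
            problems.append(f"{name}={value} outside range ({low}, {high})")
    # Flag keys that are not among the 21 documented features.
    for name in record:
        if name not in FEATURE_RANGES:
            problems.append(f"unknown feature: {name}")
    return problems


record = {
    "HighBP": 1, "HighChol": 1, "CholCheck": 1, "BMI": 28.5, "Smoker": 0,
    "Stroke": 0, "HeartDiseaseorAttack": 0, "PhysActivity": 1, "Fruits": 1,
    "Veggies": 1, "HvyAlcoholConsump": 0, "AnyHealthcare": 1, "NoDocbcCost": 0,
    "GenHlth": 2, "MentHlth": 0, "PhysHlth": 0, "DiffWalk": 0, "Sex": 0,
    "Age": 9, "Education": 5, "Income": 6,
}
print(validate_features(record))  # prints []
```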

Usage

import pickle
import numpy as np

# Load model
with open("diabetes_classifier.pkl", "rb") as f:
    model = pickle.load(f)

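# Make prediction
# Assumption: the pickled object applies the StandardScaler step itself;
# if it is a bare classifier, scale the features the same way before predicting.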
# Features: [HighBP, HighChol, CholCheck, BMI, Smoker, Stroke, HeartDiseaseorAttack,
#            PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare,
#            NoDocbcCost, GenHlth, MentHlth, PhysHlth, DiffWalk, Sex, Age,
#            Education, Income]
sample = np.array([[1, 1, 1, 28.5, 0, 0, 0, 1, 1, 1, 0, 1, 0, 2, 0, 0, 0, 0, 9, 5, 6]])
prediction = model.predict(sample)
probability = model.predict_proba(sample)

print(f"Prediction: {'Diabetes' if prediction[0] == 1 else 'No Diabetes'}")
print(f"Probability: {probability[0][1]:.4f}")
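
The probability printed above can also be turned into a label with a cutoff other than the default. For a binary classifier, `predict` typically corresponds to a cutoff of about 0.5; a lower cutoff usually raises recall on the diabetes class at the cost of precision. The sketch below assumes a hypothetical helper `classify` and an illustrative cutoff of 0.3; neither comes from the model repository, and a suitable cutoff would need to be chosen on validation data.

```python
def classify(p_diabetes, threshold=0.5):
    """Map a predicted diabetes probability to a label using a chosen cutoff."""
    return "Diabetes" if p_diabetes >= threshold else "No Diabetes"


# Illustrative probability, standing in for probability[0][1] from the usage code.
p = 0.35

print(classify(p))                  # default cutoff: No Diabetes
print(classify(p, threshold=0.3))   # lower, hypothetical cutoff: Diabetes
```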

Training Details

  • Preprocessing: StandardScaler for feature normalization
  • Class Imbalance: SMOTE (Synthetic Minority Over-sampling Technique)
  • Train/Test Split: 80/20 stratified split
  • Cross-validation: 5-fold stratified CV for ensemble models
  • Hyperparameters:
    • n_estimators: 200
    • max_depth: 5
    • learning_rate: 0.05
    • subsample: 0.8
    • colsample_bytree: 0.8
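
To illustrate what SMOTE does during training, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified pure-Python illustration of the core idea, not the imbalanced-learn implementation used for this model; the function name `smote_like_sample` and the toy data are hypothetical.

```python
import random


def smote_like_sample(minority, k=5, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour.

    Requires at least two minority points.
    """
    if rng is None:
        rng = random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    # Pick one of the k nearest neighbours at random.
    neighbour = rng.choice(others[:k])
    # Move a random fraction of the way from the base towards the neighbour.
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy minority-class points with two features each (illustrative values only).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
rng = random.Random(0)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print(point)
```

Because each synthetic point lies on a segment between two real minority points, oversampling adds no information beyond the existing minority samples; it only rebalances what the classifier sees during training.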

Limitations

  • The dataset is based on CDC BRFSS survey data and may not generalize to all populations
  • Class imbalance remains a challenge: only about 14% of cases are positive, and recall on the diabetes class is 30.08% even with SMOTE
  • The model should not replace professional medical diagnosis
  • Features are self-reported survey responses

Citation

@dataset{cdc_diabetes_health_indicators,
  title = {CDC Diabetes Health Indicators},
  author = {naabiil},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/naabiil/CDC_Diabetes_Health_Indicators}
}

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
