Diabetes Classification Model

A machine learning model that predicts diabetes risk from health indicators. It is an XGBoost classifier trained on the CDC Diabetes Health Indicators dataset, with SMOTE oversampling to handle class imbalance.

Model Details

Architecture: XGBoost Classifier with SMOTE oversampling
Dataset: CDC Diabetes Health Indicators (50,737 samples, 21 features)
Task: Binary classification (Diabetes / No Diabetes)
Frameworks: scikit-learn, XGBoost, imbalanced-learn

Performance

Metric	Value
Accuracy	85.51%
Precision (Diabetes)	47.40%
Recall (Diabetes)	30.08%
F1-Score (Diabetes)	36.80%
ROC-AUC	0.8204
MCC	0.3004
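
As a quick consistency check on the table above, the F1 score for the Diabetes class can be recomputed from the reported precision and recall. This is only arithmetic on the published figures, not a re-evaluation of the model.

```python
# Reported values for the positive (Diabetes) class, taken from the table above.
precision = 0.4740
recall = 0.3008

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 from precision and recall: {f1:.4f}")
```

The result agrees with the reported F1-Score of 36.80% to the precision shown.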

Features

The model uses 21 health indicator features:

HighBP - High blood pressure (0/1)
HighChol - High cholesterol (0/1)
CholCheck - Cholesterol check in past 5 years (0/1)
BMI - Body Mass Index (continuous)
Smoker - Have you smoked 100+ cigarettes in your life (0/1)
Stroke - Ever had a stroke (0/1)
HeartDiseaseorAttack - Coronary heart disease or myocardial infarction (0/1)
PhysActivity - Physical activity in past 30 days (0/1)
Fruits - Consume fruit 1+ times per day (0/1)
Veggies - Consume vegetables 1+ times per day (0/1)
HvyAlcoholConsump - Heavy alcohol consumption (0/1)
AnyHealthcare - Have any kind of health care coverage (0/1)
NoDocbcCost - Could not see doctor because of cost (0/1)
GenHlth - General health (1-5 scale)
MentHlth - Days of poor mental health (0-30)
PhysHlth - Days of poor physical health (0-30)
DiffWalk - Difficulty walking or climbing stairs (0/1)
Sex - Gender (0=Female, 1=Male)
Age - Age category (1-13)
Education - Education level (1-6)
Income - Income level (1-8)
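
Because the model takes a plain numeric array, the column order matters. The sketch below builds one input row from a dictionary of named values. It assumes the model expects the features in the order listed above, which matches the order given in the Usage comment; `FEATURE_ORDER` and `to_feature_row` are hypothetical helper names, not part of the released model.

```python
# Assumed column order: the same order as the feature list above.
FEATURE_ORDER = [
    "HighBP", "HighChol", "CholCheck", "BMI", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GenHlth",
    "MentHlth", "PhysHlth", "DiffWalk", "Sex", "Age", "Education", "Income",
]


def to_feature_row(record):
    """Return the values of `record` as a list of floats in FEATURE_ORDER."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return [float(record[name]) for name in FEATURE_ORDER]


# Illustrative respondent (same values as the sample in the Usage section).
example = {
    "HighBP": 1, "HighChol": 1, "CholCheck": 1, "BMI": 28.5, "Smoker": 0,
    "Stroke": 0, "HeartDiseaseorAttack": 0, "PhysActivity": 1, "Fruits": 1,
    "Veggies": 1, "HvyAlcoholConsump": 0, "AnyHealthcare": 1, "NoDocbcCost": 0,
    "GenHlth": 2, "MentHlth": 0, "PhysHlth": 0, "DiffWalk": 0, "Sex": 0,
    "Age": 9, "Education": 5, "Income": 6,
}
row = to_feature_row(example)
print(len(row), row[:4])
```

Wrapping the result as `np.array([row])` gives the 2-D shape that `predict` and `predict_proba` expect.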

Usage

import pickle
import numpy as np

# Load model
with open("diabetes_classifier.pkl", "rb") as f:
    model = pickle.load(f)

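# Make a prediction for a single respondent.
# Note: training used StandardScaler (see Training Details). If the pickled object
# does not bundle the scaler, apply the same scaling to the inputs before predicting.
# Feature order: [HighBP, HighChol, CholCheck, BMI, Smoker, Stroke, HeartDiseaseorAttack,
#                 PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare,
#                 NoDocbcCost, GenHlth, MentHlth, PhysHlth, DiffWalk, Sex, Age,
#                 Education, Income]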
sample = np.array([[1, 1, 1, 28.5, 0, 0, 0, 1, 1, 1, 0, 1, 0, 2, 0, 0, 0, 0, 9, 5, 6]])
prediction = model.predict(sample)
probability = model.predict_proba(sample)

print(f"Prediction: {'Diabetes' if prediction[0] == 1 else 'No Diabetes'}")
print(f"Probability of diabetes: {probability[0][1]:.4f}")
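
Since recall on the Diabetes class is low at the default decision rule, one option is to apply a custom cut-off to the positive-class probability instead of calling `predict`. The sketch below shows the idea on made-up probabilities; the helper `classify_with_threshold` and the 0.3 cut-off are illustrative assumptions, not tuned values from this model.

```python
def classify_with_threshold(positive_probs, threshold=0.5):
    """Return 1 (Diabetes) where the positive-class probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in positive_probs]


# Hypothetical positive-class probabilities, e.g. model.predict_proba(X)[:, 1].
probs = [0.12, 0.35, 0.48, 0.81]

default_labels = classify_with_threshold(probs)                # cut-off of 0.5
lenient_labels = classify_with_threshold(probs, threshold=0.3)  # flags more cases

print(default_labels)
print(lenient_labels)
```

A lower cut-off generally raises recall at the cost of precision, so any chosen value should be validated on held-out data.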

Training Details

Preprocessing: StandardScaler for feature normalization
Class Imbalance: SMOTE (Synthetic Minority Over-sampling Technique)
Train/Test Split: 80/20 stratified split
Cross-validation: 5-fold stratified CV for ensemble models
Hyperparameters:
- n_estimators: 200
- max_depth: 5
- learning_rate: 0.05
- subsample: 0.8
- colsample_bytree: 0.8
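
The training setup described above could be assembled roughly as follows. This is a minimal sketch, not the original training script: the step order (scaling before SMOTE), the `random_state` values, and the function names `build_pipeline` and `train` are assumptions. It uses the imbalanced-learn `Pipeline`, which applies SMOTE only while fitting.

```python
# Hyperparameters as listed in Training Details.
XGB_PARAMS = {
    "n_estimators": 200,
    "max_depth": 5,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}


def build_pipeline(random_state=42):
    """Scale features, oversample the minority class, then fit XGBoost."""
    # Imported here so the parameter dict can be inspected without the ML stack installed.
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from xgboost import XGBClassifier

    return Pipeline([
        ("scaler", StandardScaler()),
        ("smote", SMOTE(random_state=random_state)),
        ("clf", XGBClassifier(**XGB_PARAMS, random_state=random_state)),
    ])


def train(X, y, random_state=42):
    """Fit on a stratified 80% split and return the pipeline with the held-out 20%."""
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    pipeline = build_pipeline(random_state)
    pipeline.fit(X_train, y_train)  # SMOTE resamples the training data only
    return pipeline, X_test, y_test
```

Keeping the scaler inside the pipeline means a pickled pipeline would carry the scaling step with it, and the held-out split is never oversampled.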

Limitations

The dataset is based on CDC BRFSS survey data and may not generalize to all populations
Class imbalance remains a challenge: only about 14% of cases are positive, and recall on the Diabetes class is 30.08%
Model should not replace professional medical diagnosis
Features are self-reported survey responses
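
To put the imbalance in context, the reported accuracy can be compared with a trivial classifier that always predicts "No Diabetes". The calculation below uses the rounded 14% positive rate from the list above, so the baseline is approximate.

```python
positive_rate = 0.14        # approximate share of diabetes cases, from the list above
reported_accuracy = 0.8551  # accuracy from the Performance table

# A classifier that always predicts "No Diabetes" is right on every negative case.
baseline_accuracy = 1 - positive_rate

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
print(f"Reported accuracy:                {reported_accuracy:.2%}")
print(f"Difference:                       {reported_accuracy - baseline_accuracy:+.2%}")
```

The two figures are close, which suggests accuracy alone says little about this model; the class-specific precision, recall, ROC-AUC, and MCC are more informative here.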

Citation

@dataset{cdc_diabetes_health_indicators,
  title = {CDC Diabetes Health Indicators},
  author = {naabiil},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/naabiil/CDC_Diabetes_Health_Indicators}
}

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern
