Diabetes Classification Model
A machine learning model that predicts diabetes risk from health indicators. It is an XGBoost classifier trained on the CDC Diabetes Health Indicators dataset, with SMOTE used to handle class imbalance.
Model Details
- Architecture: XGBoost Classifier with SMOTE oversampling
- Dataset: CDC Diabetes Health Indicators (50,737 samples, 21 features)
- Task: Binary classification (Diabetes / No Diabetes)
- Framework: scikit-learn, XGBoost, imbalanced-learn
Performance
| Metric | Value |
|---|---|
| Accuracy | 85.51% |
| Precision (Diabetes) | 47.40% |
| Recall (Diabetes) | 30.08% |
| F1-Score (Diabetes) | 36.80% |
| ROC-AUC | 82.04% |
| MCC | 30.04% |
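As a quick consistency check on the table above, the F1-score can be recomputed from the reported precision and recall. This is only a sketch of the arithmetic; the variable names are illustrative and not taken from the original training code.

```python
# Reported values for the Diabetes class, expressed as fractions.
precision = 0.4740
recall = 0.3008

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.4f}")  # F1 = 0.3680
```

The result agrees with the reported F1-score of 36.80%.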
Features
The model uses 21 health indicator features:
- HighBP - High blood pressure (0/1)
- HighChol - High cholesterol (0/1)
- CholCheck - Cholesterol check in past 5 years (0/1)
- BMI - Body Mass Index (continuous)
- Smoker - Has smoked at least 100 cigarettes in lifetime (0/1)
- Stroke - Ever had a stroke (0/1)
- HeartDiseaseorAttack - Coronary heart disease or myocardial infarction (0/1)
- PhysActivity - Physical activity in past 30 days (0/1)
- Fruits - Consume fruit 1+ times per day (0/1)
- Veggies - Consume vegetables 1+ times per day (0/1)
- HvyAlcoholConsump - Heavy alcohol consumption (0/1)
- AnyHealthcare - Have any kind of health care coverage (0/1)
- NoDocbcCost - Could not see doctor because of cost (0/1)
- GenHlth - General health (1-5 scale)
- MentHlth - Days of poor mental health (0-30)
- PhysHlth - Days of poor physical health (0-30)
- DiffWalk - Difficulty walking or climbing stairs (0/1)
- Sex - Gender (0=Female, 1=Male)
- Age - Age category (1-13)
- Education - Education level (1-6)
- Income - Income level (1-8)
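Because the model takes a positional array, the order of the 21 features matters. The sketch below shows one way to turn a dictionary of named values into a correctly ordered row and to check the ranges listed above. The helper `to_feature_row` and the `RANGES` table are hypothetical names introduced for illustration; they are not part of the model repository.

```python
# Column order taken from the feature list above.
FEATURE_ORDER = [
    "HighBP", "HighChol", "CholCheck", "BMI", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GenHlth",
    "MentHlth", "PhysHlth", "DiffWalk", "Sex", "Age", "Education", "Income",
]

# Allowed ranges for the scale and day-count features, per the list above.
RANGES = {
    "GenHlth": (1, 5), "MentHlth": (0, 30), "PhysHlth": (0, 30),
    "Age": (1, 13), "Education": (1, 6), "Income": (1, 8),
}


def to_feature_row(record):
    """Hypothetical helper: order named values into one model input row."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise ValueError(f"missing features: {missing}")
    for name, (low, high) in RANGES.items():
        if not low <= record[name] <= high:
            raise ValueError(f"{name}={record[name]} is outside [{low}, {high}]")
    return [float(record[name]) for name in FEATURE_ORDER]


# Example record; the values are illustrative only.
record = {
    "HighBP": 1, "HighChol": 1, "CholCheck": 1, "BMI": 28.5, "Smoker": 0,
    "Stroke": 0, "HeartDiseaseorAttack": 0, "PhysActivity": 1, "Fruits": 1,
    "Veggies": 1, "HvyAlcoholConsump": 0, "AnyHealthcare": 1, "NoDocbcCost": 0,
    "GenHlth": 2, "MentHlth": 0, "PhysHlth": 0, "DiffWalk": 0, "Sex": 0,
    "Age": 9, "Education": 5, "Income": 6,
}
row = to_feature_row(record)
```

The returned list can be wrapped as `[row]` to form a single-sample 2D input.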
Usage
import pickle

import numpy as np

# Load the trained model. Only unpickle files from a source you trust.
with open("diabetes_classifier.pkl", "rb") as f:
    model = pickle.load(f)

# Build one sample with the features in this order:
# [HighBP, HighChol, CholCheck, BMI, Smoker, Stroke, HeartDiseaseorAttack,
#  PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare,
#  NoDocbcCost, GenHlth, MentHlth, PhysHlth, DiffWalk, Sex, Age,
#  Education, Income]
sample = np.array([[1, 1, 1, 28.5, 0, 0, 0, 1, 1, 1, 0, 1, 0, 2, 0, 0, 0, 0, 9, 5, 6]])

# Training used StandardScaler. If the pickle holds only the classifier and
# not the scaler, apply the same fitted scaler to `sample` before predicting.
prediction = model.predict(sample)
probability = model.predict_proba(sample)

print(f"Prediction: {'Diabetes' if prediction[0] == 1 else 'No Diabetes'}")
print(f"Probability: {probability[0][1]:.4f}")
Training Details
- Preprocessing: StandardScaler for feature standardization (zero mean, unit variance)
- Class Imbalance: SMOTE (Synthetic Minority Over-sampling Technique)
- Train/Test Split: 80/20 stratified split
- Cross-validation: 5-fold stratified CV for ensemble models
- Hyperparameters:
  - n_estimators: 200
  - max_depth: 5
  - learning_rate: 0.05
  - subsample: 0.8
  - colsample_bytree: 0.8
Limitations
- The dataset is based on CDC BRFSS survey data and may not generalize to all populations
- Class imbalance remains a challenge: only about 14% of cases are positive, and recall on the Diabetes class is 30.08% even with SMOTE
- Model should not replace professional medical diagnosis
- Features are self-reported survey responses
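To put the class imbalance point in numbers, the sketch below compares the reported accuracy with a majority-class baseline and rebuilds an approximate confusion matrix from the reported precision and recall. It assumes the approximate 14% positive rate also holds for the evaluation split, which the source does not state, so the derived values are estimates only.

```python
import math

# Reported figures as fractions. The 14% prevalence is approximate and is
# assumed to apply to the evaluation split.
prevalence = 0.14
precision = 0.4740
recall = 0.3008

# A classifier that always predicts "No Diabetes" is right on every negative.
baseline_accuracy = 1 - prevalence

# Rebuild the confusion-matrix cells as fractions of all samples.
tp = recall * prevalence
fp = tp / precision - tp
fn = prevalence - tp
tn = 1 - prevalence - fp

# Accuracy and Matthews correlation coefficient implied by those cells.
accuracy = tp + tn
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Implied accuracy: {accuracy:.4f}")
print(f"Implied MCC: {mcc:.4f}")
```

Under this assumption the implied accuracy and MCC land close to the reported 85.51% and 30.04%, while the majority-class baseline sits at about 86%. Accuracy alone therefore says little here, and recall, F1, and MCC are the more informative metrics.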
Citation
@dataset{cdc_diabetes_health_indicators,
title = {CDC Diabetes Health Indicators},
author = {naabiil},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/naabiil/CDC_Diabetes_Health_Indicators}
}
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern