Predicting Mental Well-being via Physical & Lifestyle Metrics - EDA & Modeling
Quick Links & Live Demo
- Hugging Face Space - Open in Hugging Face
- Analysis Notebook - View IPYNB File
- Classification Model - Use this model
- Regression Model - Use this model
- Presentation - Watch Project Overview
- Project Roadmap - Jump to Workflow
Project Overview
This project provides an analytical framework to examine how physiological indicators, clinical health data, and daily habits relate to mental health outcomes using the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) dataset. The development process follows a clear progression from exploratory data analysis and feature engineering to the construction of regression and classification models. The result is a comprehensive system that goes beyond simple observations to provide two production-ready models exported for deployment.
Research Question
How do physiological indicators and lifestyle habits influence an individual's Mental Well-being?
To address this question, the project is structured around three integrated objectives:
- Exploratory Data Analysis (EDA): To identify initial patterns and determine which physical, clinical, and lifestyle variables hold the strongest relationship with reported mental distress.
- Regression Modeling: Predicting the numerical poor mental health Score (ranging from 0 to 30 days). We will treat this score as a continuous target variable to establish a model, identifying the impact of physiological and lifestyle features.
- Classification Modeling: Segmenting individuals into distinct risk categories derived from their mental health scores to evaluate how effectively physical and clinical metrics can distinguish between different levels of distress.
Motivation and Significance
This research is driven by an interest in understanding how our broader physical state—including clinical diagnoses, physiological indicators, and lifestyle choices—shapes our mental well-being. By analyzing these multi-faceted connections, the study aims to identify the specific factors that serve as significant indicators of an individual's mental health status.
Each phase of the project contributes to this understanding. The exploratory data analysis serves to uncover hidden correlations and validate health patterns within the large-scale dataset. The regression analysis provides a quantitative method to measure the specific impact of these factors and understand broader trends in mental well-being. The classification model then translates these findings into a practical tool for risk assessment, identifying the thresholds where health data can effectively signal the need for mental health intervention.
Project Workflow
Raw Dataset (445,132 rows × 40 features)
↓
Part 1: Data Cleaning & Exploratory Analysis (EDA)
- Deduplication and removal of non-predictive features
- Categorical binning and normalization of long-form descriptions
- Handling missing values and inconsistency filtering
- Outlier Treatment
- Descriptive statistics and correlation analysis
- Research-driven visualization: 6 specific questions addressing health and lifestyle behaviors
- Final Cleaned Dataset: 302,472 rows × 27 features
↓
Part 2: Baseline Regression Modeling
- Standard 80/20 train-test split
- Implementation of baseline Linear Regression
- Performance evaluation (MAE, RMSE, R²)
- Residual analysis and initial feature importance mapping
↓
Part 3: Advanced Feature Engineering & Clustering
- Data scaling, encoding, and Log transformations
- Generating polynomial and interaction features from top coefficients
- Unsupervised clustering: K-Means vs. DBSCAN via Silhouette scores
- PCA-based visualization and cluster profiling
- Final Engineered Dataset: 44 features
↓
Part 4: Model Optimization & Selection
- Comparative training: Linear Regression (re-trained), Random Forest, and Gradient Boosting
- Evaluation against the baseline performance
- Selection of top performer: Gradient Boosting
↓
Part 5: Regression-to-Classification Strategy
- Target distribution analysis (62% zero-value concentration)
- Defining business-rule thresholds for mental health risk tiers
- Visualizing class balance and category distribution
↓
Part 6: Classification & Final Evaluation
- Training Logistic Regression, Random Forest, and Gradient Boosting
- Evaluating performance via Precision, Recall, F1-score, and Support
- Final Selection: Random Forest (optimized for Recall sensitivity)
↓
Final Output: Two Deployment-Ready Models (Regression & Classification)
Data Source and Description
The data utilized in this project was acquired from the Personal Key Indicators of Heart Disease Kaggle repository, initially curated to analyze personal key indicators of heart disease. However, rather than using the author's pre-processed dataset, this project utilizes the raw, unmodified Behavioral Risk Factor Surveillance System (BRFSS) survey file provided within the same repository. The BRFSS is a prominent annual telephone survey conducted by the CDC. Because this raw dataset contains a vast, unfiltered array of general health, clinical, and behavioral metrics, it provided the flexibility to shift the analytical focus away from heart disease. This inherent versatility naturally established the foundation for the core research question: How do physiological indicators and lifestyle habits influence an individual's Mental Well-being?
Dataset Specifications
- Total dataset size: 445,132 records.
- Categorical variables: 34 features (object).
- Numerical variables: 6 features (float64).
Feature Categorization
The 40 raw features provide a comprehensive view of the respondents' health profiles. They are organized into the following logical domains:
| Category | Description | Features |
|---|---|---|
| Administrative & Demographics | Personal background and location information. | State, Sex, AgeCategory, RaceEthnicityCategory |
| General Health & Physiological | Self-reported overall health and physical metrics. | GeneralHealth, PhysicalHealthDays, MentalHealthDays, HeightInMeters, WeightInKilograms, BMI |
| Clinical Diagnoses | Historical and current medical conditions and chronic diseases. | HadHeartAttack, HadAngina, HadStroke, HadAsthma, HadSkinCancer, HadCOPD, HadDepressiveDisorder, HadKidneyDisease, HadArthritis, HadDiabetes |
| Functional & Mobility | Difficulties related to physical, sensory, and cognitive daily tasks. | DeafOrHardOfHearing, BlindOrVisionDifficulty, DifficultyConcentrating, DifficultyWalking, DifficultyDressingBathing, DifficultyErrands |
| Lifestyle & Behaviors | Daily habits and personal routines impacting health. | PhysicalActivities, SleepHours, SmokerStatus, ECigaretteUsage, AlcoholDrinkers |
| Medical Care & Prevention | Interaction with healthcare services and preventive measures. | LastCheckupTime, ChestScan, HIVTesting, FluVaxLast12, PneumoVaxEver, TetanusLast10Tdap, HighRiskLastYear, CovidPos, RemovedTeeth |
Target Variable Definition
The focal point of this analysis is the MentalHealthDays feature, which is subsequently renamed to poor_mental_health_days during the data refinement phase to improve code readability. This variable records the respondent's answer regarding how many days during the past 30 days their mental health was not good, encompassing factors like stress, depression, and emotional problems.
The structure of this variable makes it an ideal target for a comprehensive machine learning pipeline. During the exploratory data analysis phase, its numerical distribution enables deep statistical correlation testing against the categorical health metrics. For the baseline and advanced regression modeling, its continuous numerical range (spanning from 0 to 30) allows the algorithms to predict specific numeric outcomes based on health indicators. Finally, this continuous nature provides the flexibility needed for the classification phase, allowing us to establish logical business-rule thresholds to convert the numeric scores into distinct mental health risk categories.
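Since the classification phase hinges on converting this continuous count into categories, a minimal sketch of the idea follows. The tier labels and cut points here are hypothetical illustrations, not the project's actual business-rule thresholds.

```python
import pandas as pd

# Hypothetical tiers for illustration; the project defines its own
# business-rule thresholds later in the pipeline.
days = pd.Series([0, 2, 5, 14, 30], name="poor_mental_health_days")
tiers = pd.cut(days, bins=[-1, 0, 13, 30], labels=["none", "moderate", "high"])
```

Because `pd.cut` uses right-closed intervals, these bins map 0 days to "none", 1–13 days to "moderate", and 14–30 days to "high".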
Exploratory Data Analysis (EDA)
1. Initial Data Assessment
A diagnostic review of the raw dataset was performed to identify potential integrity issues, noise, and data quality gaps. The assessment revealed several critical areas requiring attention:
- Redundancy: Identification of 157 duplicate records within the raw data.
- Missing Value Density: Significant gaps were identified across multiple features, particularly in lifestyle and physiological variables.
View detailed null count table
| Feature | Null Count |
|---|---|
| AlcoholDrinkers | 42,691 |
| BMI | 42,352 |
| ECigaretteUsage | 31,835 |
| SmokerStatus | 31,636 |
| DifficultyErrands | 21,968 |
| DifficultyConcentrating | 20,559 |
| DifficultyWalking | 20,327 |
| DifficultyDressingBathing | 20,229 |
| BlindOrVisionDifficulty | 17,882 |
| DeafOrHardOfHearing | 16,967 |
| RaceEthnicityCategory | 13,736 |
| PhysicalHealthDays | 10,900 |
| MentalHealthDays (Target) | 9,037 |
| AgeCategory | 8,438 |
| SleepHours | 5,413 |
| HadAngina | 4,379 |
| HadHeartAttack | 3,039 |
| HadDepressiveDisorder | 2,786 |
| HadArthritis | 2,607 |
| HadCOPD | 2,193 |
| HadKidneyDisease | 1,900 |
| HadAsthma | 1,747 |
| HadStroke | 1,531 |
| GeneralHealth | 1,176 |
| PhysicalActivities | 1,069 |
| HadDiabetes | 1,061 |
2. Data Cleaning Process
The data refinement phase was designed to transform the noisy raw survey data into a high-signal dataset suitable for machine learning.
Step 1: Structural Refinement & Feature Removal
Non-predictive features and redundant markers were removed to reduce dimensionality and focus on relevant mental health predictors.
View removed features and logic
| Category | Features Removed | Reason for Removal |
|---|---|---|
| Administrative | State | Geographic data is outside the scope of this physiological/behavioral study. |
| Clinical Specifics | HadSkinCancer, ChestScan, CovidPos, HIVTesting | Low correlation with the target variable, or represents acute rather than chronic health states. |
| Preventive Care | LastCheckupTime, FluVaxLast12, PneumoVaxEver, TetanusLast10Tdap, HighRiskLastYear | These represent healthcare access rather than direct psychological predictors. |
| Redundant Metrics | RemovedTeeth, HeightInMeters, WeightInKilograms | RemovedTeeth provided low signal; height and weight are encapsulated in the BMI feature. |
Step 2: Missing Value Strategy & Target Integrity
A rigorous strategy was applied to handle missing information while preserving the clinical integrity of the dataset.
View imputation and handling details
- Deduplication: Removal of all 157 identified duplicate rows.
- Target Variable Protection: All 9,037 records with missing values in the `MentalHealthDays` column were dropped to ensure the model is trained only on verified labels.
- Smart Imputation:
  - `AgeCategory` and `SleepHours`: Imputed using the median to maintain central tendency.
  - `GeneralHealth` and `PhysicalActivities`: Imputed using the mode (most frequent value) due to the low percentage of missingness.
  - Records with missing values in other physiological indicators were removed to prevent bias and ensure clinical accuracy.
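A minimal pandas sketch of this strategy on a toy frame (raw column names are kept, but the data and resulting medians are illustrative):

```python
import pandas as pd

# Toy stand-in for the raw survey frame.
df = pd.DataFrame({
    "SleepHours": [7.0, None, 6.0, 8.0],
    "GeneralHealth": ["Good", None, "Good", "Good"],
    "MentalHealthDays": [0.0, 5.0, None, 2.0],
})

# Protect the target: drop rows whose label is missing.
df = df.dropna(subset=["MentalHealthDays"])

# Median imputation for numerical, mode imputation for categorical.
df["SleepHours"] = df["SleepHours"].fillna(df["SleepHours"].median())
df["GeneralHealth"] = df["GeneralHealth"].fillna(df["GeneralHealth"].mode()[0])
```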
Step 3: Categorical Binning & Normalization
Categorical features often contained long-form text or non-uniform descriptions (e.g., complex RaceEthnicityCategory labels).
View normalization process
These were mapped and converted into concise, uniform categories to improve model interpretability and prepare for encoding. This ensured that all categorical inputs have consistent labels across the entire dataset.
Step 4: Logic Validation & Contradiction Filtering
To ensure data reliability, records containing logical impossibilities or "survey noise" were identified and removed.
View outlier and contradiction stats
- Sleep Outliers: 662 records reporting >14 hours of sleep.
- Health Contradictions: 7,820 records where individuals reported "Great Health" but also recorded 25+ physically sick days.
- Demographic Inconsistencies: 849 records of individuals aged ≤25 reporting chronic elderly-onset conditions (Heart Attack/Stroke/COPD).
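Filters of this kind reduce to boolean masks. A sketch of the first two rules on a toy frame, with column names and the health label assumed for illustration:

```python
import pandas as pd

# Toy frame; thresholds mirror the rules above (>14 sleep hours is noise,
# top-tier health with 25+ sick days is contradictory).
df = pd.DataFrame({
    "SleepHours": [7, 16, 8, 6],
    "GeneralHealth": ["Good", "Good", "Excellent", "Fair"],
    "PhysicalHealthDays": [0, 2, 28, 5],
})

sleep_ok = df["SleepHours"] <= 14
consistent = ~((df["GeneralHealth"] == "Excellent")
               & (df["PhysicalHealthDays"] >= 25))
df = df[sleep_ok & consistent]   # keep only logically plausible records
```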
Step 5: Final Target Analysis, Renaming & Reordering
View final renaming and column order
All column names were standardized to snake_case for improved code readability (e.g., `MentalHealthDays` to `poor_mental_health_days`). Finally, the dataset was logically reordered for clarity:
- Demographics: `sex`, `age_category`, `race_ethnicity`.
- Body & Lifestyle: `bmi`, `general_health`, `sleep_hours`, `physical_activities`, `smoker_status`, `e_cigarette_usage`, `alcohol_drinkers`.
- General Physical Metrics: `poor_physical_health_days`.
- Medical History: `had_heart_attack` through `had_diabetes`.
- Disabilities: `deaf_or_hard_of_hearing` through `difficulty_errands`.
- Target Variable: `poor_mental_health_days`.
An initial distribution analysis revealed a significant "Right Tail" in the target variable, which guided the modeling strategy.
3. Outlier Treatment
To ensure high data quality and model stability, an advanced anomaly detection phase was performed on the numerical features: bmi, sleep_hours, and poor_physical_health_days.
Methodology: Isolation Forest
The Isolation Forest algorithm was selected for this task due to its superior ability to detect anomalies in a multi-dimensional space. Unlike standard univariate methods (such as IQR) that only identify outliers on a per-feature basis, Isolation Forest evaluates how features interact. A specific data point might appear normal in isolation but becomes an outlier when its combination with other variables is statistically improbable within the dataset's overall structure.
A 5% contamination threshold was strategically applied. In large-scale, self-reported telephone surveys like the BRFSS, a wide spectrum of diverse data and high variance are naturally expected. This 5% limit acts as a balanced ceiling—it is aggressive enough to purge extreme, high-leverage outliers that would otherwise skew the model, while being conservative enough to preserve the massive sample size required for robust analysis.
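A minimal sketch of this step with scikit-learn's IsolationForest. The 5% contamination ceiling matches the text; the three synthetic columns merely stand in for `bmi`, `sleep_hours`, and `poor_physical_health_days`:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-ins for bmi, sleep_hours, poor_physical_health_days.
rng = np.random.default_rng(42)
X = rng.normal(loc=[28, 7, 3], scale=[5, 1, 6], size=(1000, 3))

iso = IsolationForest(contamination=0.05, random_state=42)  # 5% ceiling
labels = iso.fit_predict(X)   # -1 = anomaly, 1 = inlier
X_clean = X[labels == 1]      # keep only the inliers
```

Because the forest scores each row against the joint distribution of all three columns, a row can be flagged even when each of its values looks normal in isolation.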
View outlier detection results and impact
- Outliers Detected: 15,920 records were identified as anomalies and removed from the dataset.
- Biomedical Plausibility: A primary impact of this process was seen in the `bmi` feature: the maximum value was reduced from 97.65 (a clinically extreme and statistically rare value that likely represents a data-entry error) to a more plausible maximum of 54.39. This ensures the model focuses on representative physiological trends rather than extreme noise.
- Final Dataset Dimensions: 302,472 records × 27 features.
4. Descriptive Statistics
This section provides a high-level statistical overview of the cleaned dataset, offering insights into the distribution of categorical health indicators and the central tendencies of numerical physiological metrics.
Categorical Features Summary
- Key Findings:
- Demographic Profile: The dataset is well-represented with a slight majority of female respondents (154,084).
- Health Perception: Most participants perceive their status as "Very Good" (107,603), indicating a generally healthy baseline for the surveyed population.
- Clinical Indicators: While most chronic conditions show a "No" majority, approximately 19% of the population reported having a Depressive Disorder, which serves as a significant feature for predicting mental well-being.
View detailed categorical summary table
| Feature | Count | Unique | Top | Freq |
|---|---|---|---|---|
| `sex` | 302,472 | 2 | female | 154,084 |
| `race_ethnicity` | 302,472 | 5 | white | 228,656 |
| `general_health` | 302,472 | 5 | very good | 107,603 |
| `physical_activities` | 302,472 | 2 | yes | 238,933 |
| `smoker_status` | 302,472 | 2 | no | 268,007 |
| `e_cigarette_usage` | 302,472 | 2 | no | 286,689 |
| `alcohol_drinkers` | 302,472 | 2 | yes | 168,108 |
| `had_heart_attack` | 302,472 | 2 | no | 287,253 |
| `had_angina` | 302,472 | 2 | no | 285,291 |
| `had_stroke` | 302,472 | 2 | no | 291,049 |
| `had_asthma` | 302,472 | 2 | no | 259,432 |
| `had_copd` | 302,472 | 2 | no | 282,231 |
| `had_depressive_disorder` | 302,472 | 2 | no | 244,610 |
| `had_kidney_disease` | 302,472 | 2 | no | 289,916 |
| `had_arthritis` | 302,472 | 2 | no | 203,135 |
| `had_diabetes` | 302,472 | 4 | no | 254,420 |
| `deaf_or_hard_of_hearing` | 302,472 | 2 | no | 276,837 |
| `blind_or_vision_difficulty` | 302,472 | 2 | no | 288,705 |
| `difficulty_concentrating` | 302,472 | 2 | no | 273,202 |
| `difficulty_walking` | 302,472 | 2 | no | 264,080 |
| `difficulty_dressing_bathing` | 302,472 | 2 | no | 294,920 |
| `difficulty_errands` | 302,472 | 2 | no | 285,994 |
Numerical Features Summary
- Key Findings:
- Body Composition: The average BMI of 28.21 falls within the overweight category, with a median of 27.41, suggesting a consistent trend across the population.
- Target Distribution: Both physical and mental health days exhibit a median of 0, confirming that while most respondents report no distress, a specific subset experiences significant challenges (up to 30 days).
- Sleep Habits: Sleep patterns are remarkably stable, with a mean of 7.03 hours, aligning with general health recommendations.
View detailed numerical summary table
| Feature | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|
| `age_category` | 52.91 | 18.06 | 18.00 | 40.00 | 55.00 | 70.00 | 80.00 |
| `bmi` | 28.21 | 5.73 | 12.02 | 24.19 | 27.41 | 31.45 | 54.39 |
| `sleep_hours` | 7.03 | 1.22 | 1.00 | 6.00 | 7.00 | 8.00 | 14.00 |
| `poor_physical_health_days` | 3.11 | 6.97 | 0.00 | 0.00 | 0.00 | 2.00 | 30.00 |
| `poor_mental_health_days` | 3.49 | 7.07 | 0.00 | 0.00 | 0.00 | 3.00 | 30.00 |
Feature Correlation Analysis
Key Findings:
- Strongest Relative Positive Correlation: The strongest positive relationship identified was between `poor_physical_health_days` and `poor_mental_health_days`, reinforcing the intrinsic link between physical ailment and psychological distress.
- Sleep Impact (Negative Correlation): A correlation of -0.13 was found between `sleep_hours` and `poor_mental_health_days`, indicating that fewer hours of sleep are associated with an increase in reported poor mental health days.
5. Research-Driven Visualizations
This phase of the Exploratory Data Analysis (EDA) moves beyond basic statistics to uncover the complex interactions between lifestyle, physical health, and psychological states. To systematically explore these factors, we have structured our analysis around six core questions, each examining the relationship between the target variable and key factors within the dataset.
Q1: How do BMI and Poor Physical Health Days impact Mental Well-being across different sleep levels?
- Insight: Sleep acts as a "risk multiplier." In the Short Sleep group (<6h), the dark pink regions (high mental distress) are overwhelmingly dominant. Chronic lack of sleep appears to make the individual significantly more vulnerable to the negative mental impacts of high BMI and poor physical health.
- Conclusion: Sufficient sleep functions as a biological "shock absorber." Even when BMI or physical health metrics are suboptimal, individuals in the regular and long sleep categories maintain significantly lower average levels of mental distress compared to those in the short sleep group.
Q2: How do habits like exercise, smoking, and alcohol consumption interact within General Health categories to shape Mental Well-being?
- Insight: General health perception is the primary "anchor" for mental well-being. However, the lack of physical activity creates a visible "darkening" of distress levels across all health categories, especially within the Poor health segment.
- Conclusion: There is a clear "Cumulative Risk Path." The highest mental distress scores are found at the intersection of Poor Health → No Physical Activity → Smoker (Yes). Interestingly, alcohol consumption in the Excellent/Very Good categories does not always trigger a dramatic spike in distress, suggesting its impact may be context-dependent.
Q3: How does the distribution of diminished Mental Well-being evolve across age groups for different genders?
- Insight: A consistent gender gap exists across the entire lifespan. Women (pink) report significantly higher average days of poor mental health than men (dark blue). This is most pronounced in early adulthood (age 20), where young women average ~8 days of distress per month compared to ~5 days for men.
- Conclusion: Mental distress follows a clear linear downward trend as individuals age. While the younger population bears the highest psychological burden, seniority appears to be associated with greater emotional stability and fewer reported "bad days" for both genders.
Q4: How do specific physical ailments and chronic conditions impact Mental Well-being?
- Insight: The "Topographic Peaks" of distress (reaching ~16%) are clearly localized among individuals diagnosed with Asthma or those who have suffered a Heart Attack. Conversely, the "Healthy" baseline (back row) remains in a low-distress valley (9-10%).
- Conclusion: There is a significant Comorbidity Gap. Transitioning from a "Healthy" status to being "Diagnosed" with a chronic condition increases the frequency of mental distress days by 5% to 7% on average. This mountain-vs-plain structure confirms that physical ailments are not just bodily issues but primary drivers of psychological health.
Q5: How do ethnicity and physical functional difficulties interact to influence Mental Well-being?
- Insight: Functional disabilities (e.g., Difficulty Concentrating or Errands) act as a heavier "psychological weight" than sensory ones (Vision/Hearing). These functional impairments push a large portion of the population toward the upper limit of the scale (15–30 days of distress), significantly altering the density of the violin plots.
- Conclusion: Resilience patterns vary across ethnic groups. The White group consistently shows a wider "base" at 0 days across various disabilities. This suggests that even under physical limitations, this group may have higher access to support systems or resources that mitigate the immediate impact of a disability on mental health.
Q6: How does a formal depressive disorder diagnosis correlate with self-reported Mental Well-being?
- Insight: A formal clinical diagnosis completely collapses the "Zero Baseline." For undiagnosed individuals, there is a massive concentration at 0 days; however, for those with a Depressive Disorder diagnosis, this concentration disappears, and data points disperse across the entire 0-30 day spectrum.
- Conclusion: The poor_mental_health_days feature serves as a highly reliable measurement tool, accurately reflecting the medical reality of the respondents. The visible "heaping" effect (spikes at 5, 10, 15, 20, and 30 days) highlights a common psychological tendency in subjective reporting to round estimates to familiar intervals.
EDA Summary: The Path to Modeling
The Exploratory Data Analysis has successfully transformed a raw dataset of 445k records into a clean, high-signal foundation of 302,472 observations. Through this multi-dimensional analysis, we have identified that Physical Health Days, General Health Perception, and Sleep Duration are among the most potent predictors of our target variable.
With these insights, we move forward to the Modeling Phase, where we will leverage these behavioral and clinical indicators to build predictive systems capable of identifying individuals at risk of high mental distress.
Baseline Regression Modeling
Regression Goal
The primary objective of this phase is to establish a performance benchmark for predicting the number of poor mental health days per month. By implementing a simple, unoptimized Linear Regression, we aim to quantify the linear relationships between our 26 features and the target variable, providing a "baseline" to measure the effectiveness of more complex models later in the pipeline.
Model Setup & Partitioning
Following a methodical data science workflow, the initial setup was kept straightforward to ensure reproducibility:
- Feature Selection: All 26 available features from the cleaned dataset were utilized to capture global interactions.
- Train-Test Split: A standard 80/20 split was applied using simple random sampling.
- Reproducibility: A fixed random seed (42) was used to ensure consistent results across different environments.
- Implementation: The model was trained using the LinearRegression class from scikit-learn with default parameters.
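The setup described above can be sketched on synthetic stand-in data (the real run uses the 26 cleaned features and the actual target):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 1.5]) + rng.normal(scale=0.5, size=500)

# Standard 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)  # default parameters
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
r2 = r2_score(y_test, pred)
```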
Performance Evaluation
The initial results from our baseline model confirm that while there is a discernible signal in the data, the relationship between physiological traits and mental health is far from linear.
| Metric | Score | Interpretation |
|---|---|---|
| MAE | 3.6770 | On average, our predictions are off by ~3.7 days. |
| MSE | 34.2292 | Mean Squared Error. |
| RMSE | 5.8506 | Standard deviation of the prediction errors. |
| R² Score | 0.3157 | The model explains approximately 31.6% of the variance. |
Contextual Note: An R² of roughly 0.32 reflects the inherent complexity of our target variable. Predicting human emotions and mental states from physical indicators is far noisier than predicting well-defined physical or financial quantities. These results strongly suggest that the data contains non-linear patterns that a simple linear regression cannot capture.
Residual & Prediction Analysis
To understand where the model fails, we analyzed the error patterns through diagnostic visualizations.
- Under-prediction Trends: In the Actual vs. Predicted plot (left), we observe a clear horizontal trend. Even when respondents report a full month (30 days) of distress, the model tends to predict only 15–20 days. It regresses toward the median, struggling to account for extreme cases.
- Error Distribution: The Residuals Distribution (right) shows a narrow peak around zero, but with a significant right-skewed tail. This confirms that the model consistently under-predicts the severity of mental distress for high-risk individuals.
Linear Regression Coefficients
By examining the model’s coefficients, we can identify which factors the baseline model perceives as the primary drivers of mental well-being.
- Primary Risk Factors (Turquoise):
- Clinical Indicators: A formal diagnosis of a Depressive Disorder and Difficulty Concentrating are the strongest predictors of increased distress days.
- Health Perception: Self-identifying as being in Poor or Fair general health serves as a high-signal warning for the model.
- Protective Factors (Pink):
- Gender: Being Male is the most significant statistical factor in reducing the predicted number of distressed days.
- Sleep Hygiene: As Sleep Hours increase, predicted distress days decrease, capturing a direct and intuitive biological link.
- Evidence of Model Limitations: Certain variables, such as "Difficulty Walking," appear on the protective side (pink). This logical contradiction suggests that the data contains complex, non-linear interactions that a linear baseline cannot effectively untangle.
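This style of coefficient-driven interpretation can be sketched as follows. The feature names and data are illustrative stand-ins with known effects, not the project's fitted model; a noiseless fit recovers the planted coefficients exactly:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative features with planted effects (+3, -1, -2).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 3)),
                 columns=["had_depressive_disorder", "sleep_hours", "sex_male"])
y = 3.0 * X["had_depressive_disorder"] - 1.0 * X["sleep_hours"] - 2.0 * X["sex_male"]

model = LinearRegression().fit(X, y)
coefs = pd.Series(model.coef_, index=X.columns).sort_values()
# Positive coefficients flag risk factors; negative ones flag protective factors.
```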
Next Steps: Moving Beyond the Baseline
The baseline phase has achieved its goal: it mapped the linear landscape of our data and demonstrated its limitations. The relatively low R² and the clear under-prediction of high-distress cases justify our next strategic move: Feature Engineering and the implementation of Non-Linear Ensemble Models to capture the true complexity of human mental health.
Advanced Feature Engineering & Clustering
The baseline results indicated that linear models struggle to capture the complex, non-linear nature of mental health data. To address this, we implemented a sophisticated data engineering pipeline designed to provide the models with richer contextual intelligence.
1. Feature Transformation & Interaction Synthesis
- Target Log Transformation: To handle the significant right-skewed distribution of `poor_mental_health_days` (0–30 days), we applied a log transformation. This stabilizes the variance and allows models to better differentiate between varying levels of distress.
- Encoding & Scaling: All 22 categorical features were transformed. Binary features were mapped to 0/1, while multi-class variables (like `general_health`) were expanded via One-Hot Encoding, resulting in 34 initial features. All numerical data were standard-scaled to prevent unit bias.
- Interaction & Polynomial Synthesis: Using the top coefficients from our baseline, we engineered interaction terms between the most impactful variables: Difficulty Concentrating, Depression Diagnosis, Sleep Duration, and Poor General Health. Each combination was added as a new feature, bringing the dataset to 40 columns.
2. Unsupervised Clustering
A significant emphasis was placed on identifying hidden sub-populations within the data. We performed a rigorous comparison between different clustering paradigms to find the most effective segmentation:
A. The Search for Optimal Groups
We first utilized the Elbow Method (integrated with K-Means training) to observe the inertia trend across different 'k' values. While the curve showed a general improvement, it lacked a definitive "break" point. To resolve the ambiguity between 3 or 4 groups and to benchmark our approach, we employed the Silhouette Score to compare K-Means (k=3, k=4) against DBSCAN.
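The silhouette comparison can be sketched on synthetic blobs. The k values match the text; the data and the DBSCAN parameters are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated synthetic blobs stand in for the engineered feature space.
X, _ = make_blobs(n_samples=600, centers=[[0, 0], [12, 0], [0, 12]],
                  cluster_std=1.0, random_state=42)

scores = {}
for k in (3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[f"kmeans_k{k}"] = silhouette_score(X, labels)

db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
if len(set(db_labels)) > 1:   # silhouette needs at least 2 distinct labels
    scores["dbscan"] = silhouette_score(X, db_labels)
```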
B. Reconciling Low Scores with Visual Evidence
Although the absolute Silhouette scores are low (near zero), this is expected in high-dimensional, large-scale social surveys where boundaries between human behaviors are naturally fluid. We chose to move forward with K-Means (k=3) as it yielded the highest relative score. To validate this, we projected the 40-dimensional space into 2D using PCA (Principal Component Analysis).
- PCA Analysis: The visualization reveals a surprisingly clear structural distinction.
- The low Silhouette score likely stems from the partial overlap between Cluster 0 and Cluster 1.
- However, Cluster 2 emerges as a completely isolated "island" of outliers, representing the most extreme health cases.
- The visual clarity of these three groupings provided high confidence that the model successfully captured distinct risk profiles despite the statistical noise.
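The projection step itself is compact; a sketch with a random 40-column matrix standing in for the engineered features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 40-column matrix as a stand-in for the engineered feature space.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 40))

pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X)                    # 2D coordinates for the scatter plot
explained = pca.explained_variance_ratio_.sum()  # variance captured by the 2 components
```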
3. Cluster Profiling: The "Human Story" Behind the Clusters
To understand who these clusters represent, we performed an anomalous feature detection for each group, measuring how much each feature deviates from the Global Average in terms of Standard Deviations (SD).
Cluster 1: The Healthy Mainstream (The Pink Mass)
- Profile: Represents the majority of the population.
- Characteristics: All values are remarkably close to the global average (e.g., Depression: -0.12 SD).
- PCA Role: This is the dense "anchor" of the dataset, centered at the 0,0 axis.
Cluster 0: Mental & Cognitive Struggle (The Turquoise Shift)
- Profile: Individuals facing specific psychological and cognitive burdens.
- Characteristics: Significant spikes in Concentration Issues (+3.06 SD) and Depression (+2.06 SD).
- The Multiplier Effect: The engineered interaction of `Depression x Concentration` shows a massive +6.28 SD, highlighting the heavy cumulative weight of these combined conditions.
Cluster 2: Extreme Multi-morbidity Risk (The Purple Outliers)
- Profile: High-risk individuals suffering from a "perfect storm" of physical and mental illness.
- Characteristics: The interaction between `Poor General Health x Concentration Issues` reaches a staggering +18.70 SD, while `Poor Health x Depression` stands at +7.31 SD.
- Lifestyle: This group reports significantly lower physical activity (-0.87 SD) and poor sleep hygiene.
- PCA Role: These are the distant points on the far right, isolated from the rest of the population due to their extreme health profiles.
4. Integration into the Final Dataset
To leverage these insights for the modeling phase, we integrated the clustering results back into the main dataset:
- One-Hot Cluster Encoding: Added three binary columns: `Cluster_Mental_Cognitive_Issue`, `Cluster_Mainstream_Healthy`, and `Cluster_Extreme_Risk`.
- Distance to Centroid: Added a `Distance_to_Centroid` feature to quantify the intensity of a respondent's profile (how far they are from the "average" member of their group).
Final Engineered Dataset Dimensions: 302,472 records × 44 features.
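A minimal sketch of how these two feature families could be derived from a K-Means fit — the feature matrix here is a small synthetic stand-in, and the cluster-to-name mapping is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled feature matrix
rng = np.random.default_rng(42)
feature_cols = [f"f{i}" for i in range(5)]
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=feature_cols)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

# One-hot cluster encoding (column names mirror those in the write-up;
# the label-to-profile mapping below is assumed for illustration)
names = {0: "Cluster_Mental_Cognitive_Issue",
         1: "Cluster_Mainstream_Healthy",
         2: "Cluster_Extreme_Risk"}
for k, col in names.items():
    X[col] = (labels == k).astype(int)

# Distance from each row to its own cluster's centroid
own_centroid = kmeans.cluster_centers_[labels]
X["Distance_to_Centroid"] = np.linalg.norm(
    X[feature_cols].to_numpy() - own_centroid, axis=1)

print(X.shape)
```

A larger `Distance_to_Centroid` marks a respondent as an atypical member of their assigned profile.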
Baseline Conclusion: This phase transformed abstract data points into categorized health profiles. While these new features provide the upcoming ensemble models with critical "contextual intelligence", they also highlight the difficulty of predicting mental health: even with high-dimensional clusters and interactions, the subtle nuances of human experience remain a challenge to model with perfect accuracy.
Model Competition & Evaluation
In this phase, we evaluated multiple modeling approaches to find the optimal solution for our prediction task. We re-trained the initial Linear Regression model using the newly engineered 44-feature dataset to isolate the direct impact of our data processing. Simultaneously, we trained Random Forest and Gradient Boosting models to capture the non-linear complexities identified during the exploratory phase.
Performance Benchmarking
We evaluated the models using a standard 80/20 train-test split. The primary focus was placed on MAE (Mean Absolute Error) as it provides the most interpretable metric for our goal: predicting the actual number of mental health days.
| Model Name | MAE (Days) | RMSE (Days) | R² Score |
|---|---|---|---|
| Initial Baseline (Linear) | 3.6770 | 5.8506 | 0.3157 |
| Linear Regression (Engineered) | 3.0812 | 6.2635 | 0.2157 |
| Random Forest (SKlearn) | 3.1311 | 6.2497 | 0.2192 |
| Gradient Boosting | 3.0391 | 6.2161 | 0.2275 |
Analyzing the Results: The Trade-off
At first glance, the decrease in R² and the slight increase in RMSE compared to the initial baseline might seem counterintuitive. However, this is a common outcome when dealing with Log-Transformed targets and highly subjective survey data:
- MAE Improvement: Our primary achievement is the significant reduction in MAE (from 3.67 to 3.03 days). This means that on average, our predictions are now much closer to the real experience of the majority of respondents.
- The Impact of Log Transformation: By applying a log transformation, the model was trained to minimize errors in the log-scale, which effectively de-emphasized the extreme outliers in the 30-day "Right Tail." While this reduces the R² (which is sensitive to total variance explained), it creates a model that is far more accurate for the "typical" respondent.
- Human Subjectivity: Predicting "feelings" in increments of days is inherently noisy. A model that achieves a lower MAE is more reliable for real-world application than one that chases a higher R² by overfitting to extreme outliers.
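The log-target workflow discussed above can be illustrated on synthetic, zero-inflated data: the model is trained on `log1p(days)` and predictions are inverted with `expm1` so that MAE is reported in days. The data-generating process here is invented for demonstration only:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic right-skewed target mimicking "poor mental health days" (0-30)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = np.clip(rng.poisson(np.exp(X[:, 0] + 1)), 0, 30)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit on the log scale, then invert predictions back to days
model = GradientBoostingRegressor(random_state=0)
model.fit(X_tr, np.log1p(y_tr))
pred_days = np.expm1(model.predict(X_te))

print(f"MAE: {mean_absolute_error(y_te, pred_days):.2f} days")
```

Because the loss is minimized on the log scale, errors on the long 30-day tail are down-weighted relative to errors near zero — which is exactly the MAE-versus-R² trade-off described above.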
Key Predictors of Mental Distress
To understand what drives our model's decisions, we visualized the Feature Importance of the winning Gradient Boosting model.
- Primary Predictors: A formal diagnosis of
Depressive Disorderand the respondent'sAge Categoryemerged as the most significant indicators of mental distress days. - Top Predictors: Other influential factors include
physical healthstatus anddifficulty concentrating. - Engineering Impact: Notably, our engineered Cluster labels and Interaction terms (e.g., Depressive x Concentration) appear among the top predictors, confirming that the additional context provided by our data engineering phase successfully assisted the model in navigating the data's complexity.
Final Model Selection & Usage
The Winner: Gradient Boosting Regressor
The Gradient Boosting model was selected as the final model for this project. It demonstrated the best ability to learn from its own errors iteratively, resulting in the lowest MAE across all tests. It successfully balanced the impact of clinical history with subtle lifestyle factors and the unsupervised health profiles we identified during clustering.
How to Use This Model
Since this repository contains the finalized regression model, you can integrate it into your own Python environment by following these steps:
- Download: Navigate to the Files and versions tab and download `regression_model.pkl`.
- Upload: Add the file to your local or cloud working directory.
- Load: Use the following snippet to load the model:
```python
import pickle

# Load the winning model
model_path = "regression_model.pkl"
with open(model_path, "rb") as f:
    model = pickle.load(f)
```
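For a self-contained illustration of the full pickle round trip, a toy `LinearRegression` stands in below for the real 44-feature model — input rows for `predict` must match the feature count and column order used at training time:

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Train and save a toy model (stand-in for the real regression model)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 4)), rng.normal(size=100)
with open("regression_model.pkl", "wb") as f:
    pickle.dump(LinearRegression().fit(X, y), f)

# Load it back and predict on a new row with the same feature layout
with open("regression_model.pkl", "rb") as f:
    model = pickle.load(f)
pred = model.predict(rng.normal(size=(1, 4)))
print(pred.shape)
```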
Transitioning from Regression to Classification
While the regression model provided a granular view of the drivers behind mental health distress, we decided to enhance the project's clinical utility by introducing a Classification perspective. By shifting from predicting specific days to identifying risk levels, we can transform the model into a strategic screening tool focused on Risk Identification.
This dual approach allows us to not only understand the intensity of distress but also to effectively distinguish between individuals who are stable and those who are beginning to experience mental health challenges.
Data Distribution & Strategy Rationale
Before training our classifiers, we analyzed the target variable's distribution to determine the most effective split strategy.
Statistical Findings:
- Mathematical Median: 0.0
- Samples with exactly 0 days: 150,850 (62.34%)
- Samples with 1 or more days: 91,127 (37.66%)
Strategic Choice: Business-Rule Threshold
While a Median Split is often preferred to avoid outlier influence, our Probability Density Profile shows that over 62% of participants reported 0 days of poor mental health. This places the median within the "Zero Group," making a standard median split mathematically non-functional (as both sides of the split would fall into the same category).
We therefore implemented a Business Rule Threshold:
- Class 0 (High Well-being): 0 days reported.
- Class 1 (Coping / Challenges): 1+ days reported.
This choice results in a relatively balanced distribution (approx. 62/38), which is well suited for training our classifiers.
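The threshold logic can be demonstrated on a synthetic, zero-inflated target (the ~62% zero share is simulated here, not drawn from the real BRFSS data):

```python
import numpy as np
import pandas as pd

# Simulated target: ~62% of respondents report 0 poor-mental-health days
rng = np.random.default_rng(1)
days = np.where(rng.random(10_000) < 0.62, 0,
                rng.integers(1, 31, size=10_000))

# A median split is non-functional: the median sits inside the zero group
assert np.median(days) == 0.0

# Business-rule threshold instead: Class 0 = 0 days, Class 1 = 1+ days
target = pd.Series((days > 0).astype(int), name="Mental_Health_Class")
print(target.value_counts(normalize=True).round(2))
```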
Class Balancing & Evaluation Strategy
The resulting distribution shows that 62.3% of the participants are categorized as having High Well-being (0 days), while 37.7% fall into the Coping / Challenges (1+ days) group. Although the classes are not extremely skewed, the "Coping" group is under-represented.
Prioritizing Recall for Risk Sensitivity
Due to this imbalance, Accuracy will not be our primary evaluation metric. A naive model could achieve 62.3% accuracy by simply predicting "High Well-being" for all instances, failing to identify those experiencing challenges.
Therefore, our evaluation will prioritize the Recall score for the "Coping" class.
- Impact over Precision: This ensures the model is sensitive enough to capture individuals facing mental health difficulties.
- False Negatives: In this context, a "False Negative" (missing someone in need) is considered more critical than a "False Positive" (flagging someone for additional check-up who is actually stable).
By focusing on Recall, we ensure that our model acts as a robust safety net for risk detection.
Train & Eval Classification Models
In this stage, we trained three classification models to identify individuals at risk of mental health distress. The evaluation process is designed to find a model that provides a balance between overall accuracy and high sensitivity to the target population.
Performance Evaluation Metrics
To assess the effectiveness of our classifiers, we utilized three primary metrics:
Accuracy: Measures the proportion of correct predictions out of total predictions.
Note: While easy to understand, accuracy can be misleading if classes are imbalanced.
Precision: Answers the question: "When the model predicts positive, how often is it correct?"
Note: Focuses on the accuracy of positive flags (minimizing False Positives).
Recall: Answers the question: "Out of all actual positives, how many did we find?"
Note: Focuses on completeness (minimizing False Negatives).
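All three metrics are available in scikit-learn; the tiny hand-made label arrays below are purely illustrative (1 = "Coping / Challenges", 0 = "High Well-being"):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative ground truth and predictions
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")   # 0.625
print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # 0.600
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # 0.750
```

Note how this toy classifier scores higher on recall than on precision: it catches 3 of the 4 positives, at the cost of two false alarms — the trade-off we deliberately accept in a screening context.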
1. Logistic Regression
2. Gradient Boosting
3. Random Forest

*(Per-model confusion matrices and metric tables are rendered in the analysis notebook; the headline figures are compared below.)*
Comparative Analysis of Results
The evaluation shows distinct trade-offs between the three models based on the prioritized metrics:
- Random Forest: This model achieved the highest Recall (0.56) and identified the largest number of individuals in distress (TP: 12,676). Crucially, it produced the lowest number of False Negatives (10,045), making it the most sensitive model in our testing. While its overall accuracy is slightly lower than the others, its ability to capture the target class is superior.
- Gradient Boosting: While this model demonstrated the highest Accuracy (0.75) and Precision (0.73), it proved to be overly cautious. It missed approximately 600 more individuals in distress (FN: 10,630) compared to the Random Forest, which is a significant gap in a welfare context.
- Logistic Regression: Functioning as our baseline, this model showed the weakest performance in identifying individuals in need, yielding the lowest Recall (0.52) and the highest number of False Negatives (10,945).
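As a quick sanity check, the Random Forest's recall can be reproduced from the reported confusion-matrix counts:

```python
# Recall = TP / (TP + FN), using the Random Forest counts reported above
rf_recall = 12_676 / (12_676 + 10_045)
print(round(rf_recall, 2))  # 0.56
```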
Final Model Selection & Usage
The Winner: Random Forest Classifier
The Random Forest model was selected as the final champion for this project. In the context of mental health and screening, Recall is the primary priority. We prioritize minimizing False Negatives (missing people in need) over maintaining high precision. Random Forest provides the most effective safety net by ensuring the highest percentage of people facing challenges are detected and placed on the radar of support systems.
How to Use This Model
Since this repository contains the finalized classification model, you can integrate it into your own Python environment by following these steps:
- Download: Navigate to the Files and versions tab and download `classification_model.pkl`.
- Upload: Add the file to your local or cloud working directory.
- Load: Use the following snippet to load the model:
```python
import pickle

# Load the winning model
model_path = "classification_model.pkl"
with open(model_path, "rb") as f:
    model = pickle.load(f)
```
Final summary
This research project represents a comprehensive attempt to map the invisible architecture of mental well-being through physiological and lifestyle metrics. By moving from raw, self-reported survey data to specialized predictive models, we have uncovered critical insights into how our daily physical reality shapes our internal state.
- The Nonlinear Reality of Mental Health: The initial performance of the Linear Baseline served as a statistical indication that human distress does not follow a simple, incremental path. Mental health is a result of complex interactions. As discovered through feature interaction synthesis, when factors like chronic depression and cognitive difficulty overlap, the impact on well-being doesn't just add up—it multiplies. The engineering phase successfully captured these non-linearities, as evidenced by the significant reduction in error metrics.
- The Power of "Hidden Profiles" (Clustering): Through unsupervised clustering, we moved beyond individual variables to identify distinct human profiles. We successfully isolated high-risk groups, such as the "Extreme Multi-morbidity" segment—individuals whose health indicators deviate significantly from the norm. Recognizing these profiles allows for a more nuanced approach than traditional clinical categories, helping us see the person behind the symptoms.
- From Measurement to Prevention: The implementation of both Regression and Classification models provides a dual-layered analytical framework:
- Precision in Intensity: The regression model achieved a refined Mean Absolute Error (MAE) of approximately 3 days. In the context of subjective human emotion, this represents a meaningful level of predictive accuracy, allowing for a granular understanding of the magnitude of distress.
- Sensitivity in Risk: The classification model, optimized for Recall, acts as a sensitive safety net. It ensures that the "Coping" population is identified and placed on the radar of support systems, prioritizing the prevention of False Negatives.
- The Bottom Line: By integrating physiological data with behavioral patterns, we have built a framework that bridges the gap between raw data and actionable insights. Together, these models demonstrate an ability to both measure the intensity of distress and identify critical risk thresholds. This synergy could potentially assist health systems in moving from reactive treatment toward proactive intervention, offering a framework to help identify individuals in need before they reach a point of crisis.
Project Limitations
While the models provide valuable insights, several limitations should be considered:
- Correlation vs. Causality: This analysis is based on observational data and identifies significant statistical associations. However, it does not establish direct causality.
- Data Scope & Missing Features: The dataset lacks several critical dimensions that could significantly refine predictive accuracy. This includes qualitative indicators (such as recent major life events or family medical history) and broader demographic/lifestyle context (such as marital status, socioeconomic background, and daily habits). Integrating these features is essential for reducing bias and improving the model's precision.
- Statistical Uncertainty: The current predictive models operate below a 95% confidence threshold. Therefore, results should be interpreted as indicative trends and general patterns rather than definitive diagnostic or predictive tools.
Notebook & Libraries
To view the complete data analysis, cleaning process, and visualizations in the official IPYNB file, click the button below:
The analysis and modeling were performed using the following Python environment:
```python
# System and utility
import os
import random
import warnings

# Data manipulation
import numpy as np
import pandas as pd

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Machine Learning - Preprocessing, Models & Evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import (
    RandomForestRegressor, RandomForestClassifier,
    GradientBoostingRegressor, GradientBoostingClassifier,
)
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, ConfusionMatrixDisplay
)

# Model serialization
import pickle
```
Lia Prop | May 2026