📦 Loan Default Prediction – Credit Risk EDA & Modeling
Author: Uri Sivan
Assignment: Assignment #2 – Classification, Regression, Clustering & Evaluation
Dataset: Loan Default Dataset – Kaggle
Repository: Uris001/loan-default-risk-predictor
🎥 Presentation
📋 Project Overview
This project builds a full end-to-end machine learning pipeline to predict loan default risk using a real-world mortgage dataset of approximately 148,000 loans. The pipeline progresses from raw data through exploratory analysis, feature engineering, unsupervised clustering, regression modeling, and multi-class classification – ending with two production-ready models exported for deployment.
Research Question:
Given loan application data available at origination time, can we accurately predict which loans will default – and assign each loan to a meaningful risk tier (Low / Medium / High)?
Why this matters: In mortgage lending, a single missed default costs the lender the full outstanding principal plus legal, servicing, and provisioning costs. A model that correctly flags high-risk loans at origination time can prevent billions in portfolio losses – but only if it is built without data leakage and is grounded in real financial logic.
🗺️ Full Project Workflow
Raw Dataset (148,670 rows × 28 features)
        ↓
Part 2: EDA
├── Column cleanup and renaming
├── Missingness co-occurrence analysis (heatmap before imputation)
├── Domain-grounded imputation – 8 columns, 4 different strategies
├── Invalid value detection and removal
├── Outlier detection – IQR analysis + log transforms on 4 monetary cols
├── Duplicate removal
├── Descriptive statistics + correlation heatmap
├── Univariate analysis – 7 numeric + 12 categorical features
├── Chi-square + Cramér's V – all 17 categorical features
└── 5 bivariate research questions with dedicated visualizations
        ↓
Part 3: Baseline Linear Regression
├── 34 model features (after one-hot encoding of 27 cleaned columns)
├── 80/20 stratified split (SEED=42)
├── MAE=0.3227, RMSE=0.3944, R²=0.1555, AUC=0.693
└── Feature importance via coefficients
        ↓
Part 4: Feature Engineering
├── 10 new features (3 ratio + 7 binary interaction flags)
├── ColumnTransformer pipeline (StandardScaler + OneHotEncoder + passthrough)
├── PCA: 9 numeric → 5 orthogonal components (98.0% variance)
├── K-Means K=4 (elbow method) + t-SNE + PCA visualization
└── Final: 54-feature matrix, zero leakage
        ↓
Part 5: Three Improved Regression Models
├── Linear Regression (engineered) – AUC 0.809 (+16.7% over baseline)
├── Ridge Regression – AUC 0.809
├── Gradient Boosting – AUC 0.881 – WINNER
├── ROC + Precision-Recall curves, confusion matrices, feature importance
└── best_regression_model.pkl → HuggingFace
        ↓
Part 7: Regression → Classification
├── Business rule thresholds: 0.20 / 0.40
├── 3 classes: Low Risk (9.4% DR) / Medium Risk (26.7% DR) / High Risk (61.8% DR)
└── 52.4pp spread validates threshold quality
        ↓
Part 8: Three Classification Models
├── Random Forest (300 trees, class_weight=balanced)
├── XGBoost (tuned via RandomizedSearchCV) – WINNER
├── K-Nearest Neighbors (K=15, distance weights)
├── Classification reports + confusion matrices + threshold analysis
└── best_model_xgboost.pkl → HuggingFace
        ↓
Upload notebook + README + models → HuggingFace
Record presentation → Add link to README
📁 Repository Contents
| File | Description |
|---|---|
| `Uri_Sivan_Assignment_2.ipynb` | Full notebook – all parts with outputs |
| `best_model_xgboost.pkl` | Winning classification model (XGBoost tuned) |
| `best_regression_model.pkl` | Winning regression model (Gradient Boosting) |
| `README.md` | This file |
| `plots/cleaning_summary.png` | Data cleaning waterfall chart |
| `plots/ltv_dti_heatmap.png` | Q1 – LTV × DTI compound risk |
| `plots/income_loan_scatter.png` | Q2 – Income vs loan amount scatter |
| `plots/default_by_decile.png` | Q2 – Default rate by decile |
| `plots/age_region_heatmap.png` | Q3 – Age × region interaction |
| `plots/credit_bureau_vs_loan_type.png` | Q4 – Credit bureau leakage proof |
| `plots/age_gender_default.png` | Q5 – Age × gender default rates |
| `plots/cluster_profiles.png` | K-Means cluster default rates |
| `plots/roc_curves.png` | ROC + PR curves – all regression models |
| `plots/confusion_matrices_part5.png` | Confusion matrices – Part 5 |
| `plots/feature_imprtance_part_5.png` | Feature importance – Part 5 models |
| `plots/feature_engineering_impact.png` | Before/after engineering comparison |
| `plots/three_models_evaluation.png` | Classification model comparison |
| `plots/confusion_matrices_part8.png` | Confusion matrices – Part 8 |
| `plots/feature_importance_part_8.png` | Feature importance – Part 8 models |
| `plots/threshold_analysis.png` | Threshold analysis – XGBoost |
📊 Dataset Description
| Property | Value |
|---|---|
| Source | Kaggle – Loan Default Dataset |
| Raw size | 148,670 rows × 28 features |
| After cleaning | 146,829 rows × 27 features |
| Baseline model input | 34 features (post one-hot encoding) |
| Engineered model input | 54 features |
| Target | Status – binary (0 = Repaid, 1 = Defaulted) |
| Class distribution | 75.66% repaid / 24.34% defaulted |
| Geography | US mortgage market |
| Time period | 2019 |
📖 Raw Feature Dictionary – All 28 Original Columns
The raw dataset contains 28 columns across four categories. Eight were excluded before modeling (see Section 2.4 for justification).
Loan Characteristics
| Feature | Type | Description | Kept |
|---|---|---|---|
| `loan_amount` | Numeric | Total loan disbursed in USD | ✅ (log-transformed) |
| `loan_to_value_ratio` | Numeric | Loan amount ÷ property value × 100 | ✅ |
| `term_months` | Numeric | Loan duration in months (360=30yr, 180=15yr, etc.) | ✅ (bucketed) |
| `loan_type` | Categorical | Type 1 / Type 2 / Type 3 – mortgage product category | ✅ |
| `loan_limit` | Categorical | Conforming / Non-conforming (agency limit compliance) | ✅ |
| `loan_purpose` | Categorical | p1 / p2 / p3 / p4 – undocumented codes | ❌ (no codebook) |
| `negative_amortization` | Binary | Yes/No – balance can grow over time | ✅ |
| `interest_only_flag` | Binary | Yes/No – interest-only payment period | ✅ |
| `lump_sum_payment_flag` | Binary | Yes/No – large irregular payment option | ✅ |
| `business_or_commercial` | Binary | Yes/No – loan for business purpose | ✅ |
Property Characteristics
| Feature | Type | Description | Kept |
|---|---|---|---|
| `property_value` | Numeric | Appraised value of collateral property in USD | ✅ (log-transformed) |
| `occupancy_type` | Categorical | Primary residence / Investment / Secondary home | ✅ |
| `total_units` | Categorical | 1U / 2U / 3U / 4U – number of units in property | ✅ |
| `construction_type` | Categorical | sb (site-built) – 100% single value | ❌ (zero variance) |
| `secured_by` | Categorical | home – 99.9% single value | ❌ (zero variance) |
| `security_type` | Categorical | direct – 99.9% single value | ❌ (zero variance) |
Borrower Characteristics
| Feature | Type | Description | Kept |
|---|---|---|---|
| `income` | Numeric | Monthly gross income of primary applicant in USD | ✅ (log-transformed) |
| `debt_to_income_ratio` | Numeric | Total monthly debt payments ÷ gross monthly income | ✅ |
| `credit_score` | Numeric | Credit score 500–900 – creditworthiness measure | ❌ (r=0.003 with target) |
| `age_group` | Categorical | <25 / 25-34 / 35-44 / 45-54 / 55-64 / 65-74 / >74 | ✅ |
| `gender` | Categorical | Male / Female / Joint / Sex Not Available | ✅ |
| `credit_bureau` | Categorical | CIB / CRIF / EQUI / EXP – bureau used for credit check | ❌ (leakage – EQUI=100% default) |
| `coapplicant_credit_bureau` | Categorical | Same as credit_bureau for co-applicant | ❌ (same leakage) |
| `credit_worthiness` | Categorical | l1 / l2 – lender's internal risk classification | ❌ (leakage – set post-underwriting) |
Loan Pricing and Process
| Feature | Type | Description | Kept |
|---|---|---|---|
| `interest_rate` | Numeric | Loan interest rate in % – set by lender post-approval | ❌ (leakage) |
| `interest_rate_spread` | Numeric | Rate minus benchmark rate – derived from interest_rate | ❌ (inherits leakage) |
| `upfront_charges` | Numeric | Origination fee charged at closing in USD | ❌ (data artifact – 0% default in no-fee segment) |
| `approved_in_advance` | Binary | Yes/No – pre-approval before property selection | ✅ |
| `submission_channel` | Categorical | Retail / Broker / Direct – how application was submitted | ✅ |
| `region` | Categorical | North / North-East / Central / South | ✅ |
| `open_credit_flag` | Binary | Yes/No – open credit line exists | ❌ (Cramér's V < 0.01) |
Identifiers (always excluded)
| Feature | Type | Description |
|---|---|---|
| `loan_id` | ID | Unique loan identifier – no predictive value |
| `year` | Numeric | Single value (2019) – zero variance |
Feature Count Summary
| Stage | Features | Notes |
|---|---|---|
| Raw dataset | 28 | All original columns including IDs |
| After dropping IDs + zero-variance | 21 | loan_id, year, construction_type, secured_by, security_type, loan_purpose removed |
| After excluding leakage + no-signal | 13 raw columns | credit_score, interest_rate, interest_rate_spread, credit_worthiness, credit_bureau, coapplicant_credit_bureau, upfront_charges, open_credit_flag removed |
| After one-hot encoding (baseline) | 34 model features | Categorical expansion for Part 3 |
| After feature engineering | 54 model features | 10 new features + PCA + cluster features |
🔍 Part 2: Exploratory Data Analysis
2.1 Initial Column Audit and Cleanup
Before any analysis, every column was audited for informativeness:
- Renamed all 28 columns to readable `snake_case` names
- Dropped 5 zero-variance / identifier columns – these carry zero predictive value: `loan_id` (unique ID), `year` (single value: 2019), `construction_type` (99.9% `sb`), `secured_by` (99.9% `home`), `security_type` (99.9% `direct`)
- Dropped `loan_purpose` – codes p1–p4 with no codebook available. Including undocumented codes as features would embed unknown biases into the model.
- Relabeled all categorical codes to readable strings (`cf` → `conforming`, `pr` → `primary_residence`, `pre` → `yes`, etc.)
2.2 Missingness Analysis – Co-occurrence Heatmap First
Before imputing a single value, a missingness co-occurrence heatmap was computed across all columns. This revealed that interest_rate, interest_rate_spread, upfront_charges, and debt_to_income_ratio are missing on the same rows – corresponding to applications that did not reach final funding. This is structural missingness, not random. Understanding this pattern drove the imputation strategy.
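The co-occurrence check described above can be sketched by correlating missingness indicator columns. A minimal sketch with a toy frame (the column names match the renamed dataset, but the values here are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame: two structural-missingness columns plus one random gap
df = pd.DataFrame({
    "interest_rate":        [1.0, np.nan, 2.0, np.nan],
    "interest_rate_spread": [0.2, np.nan, 0.3, np.nan],
    "upfront_charges":      [500.0, np.nan, 700.0, np.nan],
    "term_months":          [360.0, 360.0, np.nan, 180.0],
})

# 1 where a value is missing, 0 where present
miss = df.isna().astype(int)

# Correlation of missingness indicators: a value near +1.0 means two
# columns are missing on exactly the same rows (structural, not random)
co = miss.corr()
print(co.loc["interest_rate", "interest_rate_spread"])
```

Plotting `co` with a heatmap library then gives the co-occurrence picture described above.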
| Column | Missing N | % | Strategy | Justification |
|---|---|---|---|---|
| `term_months` | 41 | 0.03% | Drop rows | Random clerical gaps |
| `negative_amortization` | 121 | 0.08% | Drop rows | Independent missingness |
| `age_group` + `submission_channel` | 200 | 0.13% | Drop rows | Co-occurring on same 200 rows |
| `approved_in_advance` | 908 | 0.61% | Drop rows | Independent missingness |
| `loan_limit` | 3,344 | 2.25% | Mode imputation | 91% conforming – safe to fill |
| `income` | 10,410 | 7.0% | 2D binning: loan decile × credit score band | Preserves income-leverage relationship |
| `property_value` + LTV | 15,131 | 10.2% | Back-derivation from median LTV by loan decile | Keeps both columns mechanically consistent |
| `debt_to_income_ratio` | 24,121 | 16.2% | 2D binning: credit band × income decile | Uses strongest predictors, no leakage |
| `interest_rate` | 36,439 | 24.5% | 2D binning: credit band × LTV band | Structural missingness on non-funded applications |
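The 2D-binning strategy in the table can be sketched as a grouped median fill with a global-median fallback. The band column names and values below are illustrative stand-ins, not the notebook's exact helper:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "credit_band": rng.choice(["low", "mid", "high"], size=200),
    "income_decile": rng.integers(0, 10, size=200),
    "debt_to_income_ratio": rng.uniform(10, 45, size=200),
})
# knock out ~16% of DTI, mimicking the real missingness rate
df.loc[rng.random(200) < 0.16, "debt_to_income_ratio"] = np.nan

# 2D binning: median DTI within each credit_band x income_decile cell
cell_median = df.groupby(["credit_band", "income_decile"])["debt_to_income_ratio"].transform("median")
df["debt_to_income_ratio"] = df["debt_to_income_ratio"].fillna(cell_median)

# fall back to the global median for any cell that was entirely missing
df["debt_to_income_ratio"] = df["debt_to_income_ratio"].fillna(df["debt_to_income_ratio"].median())

assert df["debt_to_income_ratio"].isna().sum() == 0
```

The same pattern applies to `income` (loan decile × credit band) and `interest_rate` (credit band × LTV band).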
2.3 Invalid Values and Outlier Treatment
Invalid values converted to NaN (then imputed):
- `income == 0` – mechanically impossible for a funded mortgage
- `loan_to_value_ratio > 150` – division artifact, not a real loan
- `interest_rate == 0` – unfunded applications
Rows dropped:
- `income < $1,000/month` – 538 rows. Below $1,000 monthly income cannot sustain any mortgage payment.
- `LTV > 150` – 33 rows after NaN conversion.
Outlier detection – IQR analysis on all numeric columns:
| Column | Skewness (raw) | Treatment | Skewness (after) |
|---|---|---|---|
| `loan_amount` | 1.8 | `log1p` transform | 0.12 |
| `property_value` | 4.6 | `log1p` transform | −0.04 |
| `income` | 18.0 | `log1p` transform | 0.16 |
| `upfront_charges` | 2.1 | `log1p` transform | 0.09 |
| `term_months` | – | Bucketed: 30yr (82%), 15yr (9%), 20yr (4%), 25yr (2%), other (4%) | – |
| `loan_to_value_ratio` | 0.3 | Retained – near-normal | – |
| `debt_to_income_ratio` | 0.8 | Retained – acceptable | – |
Log transforms reduced skewness by 90%+ across all four monetary columns.
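The skew reduction from `log1p` can be demonstrated on synthetic lognormal "income" data (the distribution parameters here are illustrative, chosen only to produce a heavy right tail like the raw monetary columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# lognormal incomes: heavily right-skewed, like the raw monetary columns
income = pd.Series(rng.lognormal(mean=8.5, sigma=1.0, size=10_000))

raw_skew = income.skew()
log_skew = np.log1p(income).skew()
print(f"raw skew={raw_skew:.2f}, log1p skew={log_skew:.2f}")

# the transform pulls the long right tail in dramatically
assert abs(log_skew) < abs(raw_skew)
```

`log1p` (log of 1 + x) is used instead of a plain log so that zero values, if any survive cleaning, remain finite.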
2.4 Feature Exclusions – Eight Columns Removed With Evidence
| Feature | Evidence | Reason |
|---|---|---|
| `credit_score` | Pearson r = 0.003; near-uniform 500–900 distribution | Pre-publication filtering removed the predictive range |
| `interest_rate` | Set by lender post-approval | Post-origination leakage – encodes the outcome |
| `interest_rate_spread` | Mechanically derived from `interest_rate` | Inherits leakage from parent column |
| `credit_worthiness` | Lender's internal risk tier (l1/l2) | Set after underwriting – leakage |
| `credit_bureau` | Cramér's V = 0.5929; EQUI = 100% default across ALL loan types | Q4 proves this is post-default label assignment |
| `coapplicant_credit_bureau` | Same mechanism as `credit_bureau` | Same leakage concern |
| `upfront_charges` | No-fee segment: exactly 0.0% default (N=20,582) | Structurally impossible – data artifact |
| `open_credit_flag` | Cramér's V = 0.0096 | Below noise threshold |
2.5 Data Cleaning Summary
| Step | Rows After | Change |
|---|---|---|
| Raw dataset | 148,670 | – |
| Drop zero-variance + loan_purpose | 148,670 | Columns only |
| Drop term_months nulls | 148,629 | −41 |
| Drop negative_amortization nulls | 148,508 | −121 |
| Drop age_group / submission nulls | 148,308 | −200 |
| Drop approved_in_advance nulls | 147,400 | −908 |
| Impute loan_limit (mode) | 147,400 | 0 rows |
| Impute income (2D binning) | 147,400 | 0 rows |
| Drop income < $1,000 | 146,862 | −538 |
| Impute property_value / LTV (back-derive) | 146,862 | 0 rows |
| Drop LTV > 150 | 146,829 | −33 |
| Impute DTI + interest_rate (2D binning) | 146,829 | 0 rows |
| Final clean dataset | 146,829 | −1,841 total (1.2%) |
2.6 Descriptive Statistics and Correlation Analysis
Key findings from the correlation analysis:
- `loan_amount_log` ↔ `property_value_log`: r = 0.85 – strongest multicollinearity pair
- `loan_amount_log` ↔ `income_log`: r = 0.66 – second strongest
- `loan_to_value_ratio` ↔ `Status`: r = +0.12 – strongest raw numeric predictor
- `income_log` ↔ `Status`: r = −0.18 – strongest protective numeric predictor
- `credit_score` ↔ `Status`: r = +0.003 – confirms exclusion decision
2.7 Univariate Analysis – Key Findings
Numeric features – default rate by quintile:
| Feature | Bottom Quintile DR | Top Quintile DR | Direction |
|---|---|---|---|
| `loan_amount` | 29.8% | 22.4% | Inverse (larger = safer) |
| `property_value` | 31.5% | 19.1% | Strong inverse |
| `income` | 36.8% | 19.5% | Strongest inverse |
| `loan_to_value_ratio` | 13.6% (<60%) | 22.5% (90%+) | Non-linear peak at 75–90% |
| `debt_to_income_ratio` | lower | 28–43% band = peak | Non-linear |
Categorical features – Cramér's V ranking (all 17 features tested):
| Feature | Cramér's V | Status |
|---|---|---|
| `credit_bureau` | 0.5929 | Excluded – leakage |
| `lump_sum_payment_flag` | 0.1894 | Retained |
| `negative_amortization` | 0.1523 | Retained |
| `coapplicant_credit_bureau` | 0.1446 | Excluded – leakage |
| `submission_channel` | 0.1198 | Retained |
| `loan_type` | 0.0885 | Retained |
| `open_credit_flag` | 0.0096 | Excluded – no signal |
2.8 Five Bivariate Research Questions
Q1 – Does leverage (LTV) × debt burden (DTI) create compound risk?
"Is the combination of high LTV and high DTI more dangerous than either alone?"
Finding: The peak default rate (68.9%) occurs at LTV Q3 (75–90%) × DTI Q2–Q4 – not at the maximum values of either variable. LTV Q5 (highest leverage) defaults less than Q3 because very high LTV loans required mortgage insurance and stricter underwriting that pre-screened the worst borrowers. The compound risk interaction is non-linear and invisible to any model that treats LTV and DTI as independent additive predictors.
Modeling implication:
Created `is_compound_risk` – a binary flag for the 75–90% LTV AND DTI mid-band zone.
This single cell reaches 3× the dataset average default rate.
Q2 – Does loan-to-income ratio outperform absolute income or loan amount?
"Is affordability stress – not income or loan size – the real driver?"
Finding: Defaulters earn ~32% less but borrow only ~10% less than repaid borrowers. The risk gradient runs diagonally along the loan-to-income ratio, not along either axis independently. The decile plot confirms: income has the strongest and most consistent monotonic gradient (36.8% → 19.5%), loan amount is weaker and shallower, and LTI is near-flat for deciles 0–6 but spikes to 35.2% at decile 9 – a tail-risk feature.
Modeling implication:
Engineered `lti_ratio_log`. Created the `is_extreme_lti` binary flag for the top LTI decile only.
Raw LTI as a continuous predictor was discarded – its signal concentrates entirely in the tail.
Q3 – Do age and geography interact to create localized hotspots?
"Are young or elderly borrowers in specific regions disproportionately risky?"
Finding: The North-East region contains two structural extreme cells: under-25 at 50.0% default and over-74 at 44.7% default – both more than double the dataset average. The North region is consistently the safest across all age groups (19.7%–28.1%). The individual Cramér's V values for age (0.049) and region (0.048) are modest, but their interaction creates cells with 2× the dataset average – signal invisible to any model using only main effects.
Modeling implication:
Created `is_northeast_under25` and `is_northeast_over74` binary flags.
Used North as the reference (lowest-risk) category in one-hot encoding.
Q4 – Which credit bureau × loan type combinations are most dangerous?
"Do specific credit bureau and loan type combinations reveal leakage?"
Finding: Three bureaus (CIB, CRIF, EXP) show realistic moderate default rates (13%–26%) across all loan types. The EQUI bureau shows a 100.0% default rate across every single loan type without exception. A perfect 100% default rate, uniform across all product types and borrower profiles, cannot be a risk signal – it is forensic evidence of post-default label assignment. This explains the anomalous Cramér's V of 0.5929 for credit_bureau – it was not signal, it was leakage.
Modeling implication:
`credit_bureau` and `coapplicant_credit_bureau` excluded from all models.
`loan_type` retained – the 12-point spread (13% vs 25%) is real product-driven variation.
Q5 – Do gender and applicant type affect default risk across the life cycle?
"Do joint applicants systematically outperform individual borrowers at every age?"
Finding: Joint applicants have the lowest default rate at every single age group (17.5%–24.4%). Male applicants show the steepest age-related increase (30.6% at <25 → 34.5% at >74). All four groups follow a U-shaped age pattern with the trough at 35–44 – peak earning years. The joint-male gap widens with age: ~7 points at <25 → ~10 points at >74.
Modeling implication:
Created the `is_joint_prime_age` flag for joint applicants aged 35–54 – the safest identifiable demographic segment.
📉 Part 3: Baseline Linear Regression
Goal: Establish a reproducible, leakage-free performance floor before any feature engineering. Every subsequent model must beat this benchmark.
Feature count: After cleaning and one-hot encoding, the 27 remaining raw columns expand to 34 model features (categorical columns encode into multiple binary columns).
Design decisions:
- 34 model features: log-transformed monetary + bounded numeric + one-hot categoricals
- 80/20 stratified split: preserves the 24.34% default rate in both sets
- `random_state=42`: all results are fully reproducible
- `StandardScaler` fit on train only: zero test set leakage
- `LinearRegression()` with default parameters: no regularization, no tuning
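These design decisions can be sketched end to end. The data below is a synthetic stand-in (the real notebook uses the 34-feature matrix and the cleaned `Status` target), but the split, scaling, and fitting pattern is the same:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(size=1000) > 1).astype(int)  # roughly 24% positives

# 80/20 stratified split preserves the default rate in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# scaler fit on train only – the test set never influences the fit
scaler = StandardScaler().fit(X_tr)
model = LinearRegression().fit(scaler.transform(X_tr), y_tr)
scores = model.predict(scaler.transform(X_te))  # continuous scores, thresholded at 0.5

assert abs(y_tr.mean() - y_te.mean()) < 0.01    # stratification held
```

Fitting the scaler before the split, or on the full dataset, would be exactly the kind of test-set leakage the baseline avoids.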
Results:
| Metric | Train | Test | Gap |
|---|---|---|---|
| MAE | 0.3223 | 0.3227 | 0.0004 |
| MSE | 0.1552 | 0.1555 | 0.0003 |
| RMSE | 0.3939 | 0.3944 | 0.0005 |
| R² | 0.1575 | 0.1555 | 0.0020 |
| ROC-AUC | – | 0.693 | – |
| F1 (Default) | – | 0.244 | – |
| Accuracy | – | 77.8% | – |
| FNR | – | 57.1% | – |
Key observations:
- No overfitting – train/test gap ≤ 0.002 across all metrics
- R² = 0.1555 – explains 15.6% of default variance. Real signal exists; 84.4% remains unexplained
- FNR = 57.1% – misses more than half of actual defaults. Not deployable.
- Score distributions overlap heavily – both classes peak at ~0.25
Top coefficient features:
| Feature | Coefficient | Direction |
|---|---|---|
| `lump_sum_payment_flag_yes` | +0.5251 | Risk-increasing |
| `negative_amortization_yes` | +0.1836 | Risk-increasing |
| `term_category_25yr` | +0.1784 | Risk-increasing |
| `loan_limit_non_conforming` | +0.1027 | Risk-increasing |
| `occupancy_type_primary_residence` | −0.1125 | Protective |
| `property_value_log` | −0.08 | Protective |
| `income_log` | −0.07 | Protective |
Key finding: Loan product type features dominate – not borrower financial metrics. The type of mortgage selected predicts default more strongly than income, LTV, or DTI. This directly shaped Part 4 feature engineering.
⚙️ Part 4: Feature Engineering
Feature engineering was the single most impactful step in the entire pipeline – more impactful than any model choice. Every feature below is directly traceable to a specific EDA finding.
4.1 Ten New Features
| Feature | Type | EDA Source | Default Rate Signal |
|---|---|---|---|
| `lti_ratio_log` | Continuous | Q2: risk runs along LTI diagonal | Tail spikes to 35.2% at decile 9 |
| `loan_to_property` | Continuous | Alternative leverage, independent of LTV imputation | Complements LTV |
| `monthly_debt_est` | Continuous | DTI × income / 100 – absolute debt burden | Magnitude, not just ratio |
| `is_extreme_lti` | Binary flag | Q2: top decile spike | 35.2% vs 23% baseline |
| `is_compound_risk` | Binary flag | Q1: LTV 75–90% AND DTI mid-band | Up to 68.9% default |
| `is_25yr_term` | Binary flag | Term analysis: 56.4% default | 2× any other term category |
| `is_northeast_under25` | Binary flag | Q3: North-East × under-25 | 50.0% default |
| `is_northeast_over74` | Binary flag | Q3: North-East × over-74 | 44.7% default |
| `is_joint_prime_age` | Binary flag | Q5: joint applicants aged 35–54 | 17.5–19.7% default |
| `is_exotic_product` | Binary flag | Baseline top-3 coefficients | Consolidates neg_amort + interest_only + lump_sum |
4.2 Scikit-Learn ColumnTransformer Pipeline
All transformations fit on train only, applied to test. Zero leakage.
ColumnTransformer
├── StandardScaler → 8 numeric features (mean=0, std=1)
├── OneHotEncoder → 14 categorical features (drop_first=True) → 30 columns
└── passthrough → 7 binary flags (already 0/1, no scaling needed)
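The three-branch transformer above can be sketched on a toy frame (one column per branch here; the real pipeline feeds 8 numeric, 14 categorical, and 7 binary-flag columns):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "income_log": [10.2, 11.1, 9.8, 10.5],
    "region": ["North", "South", "Central", "North"],
    "is_compound_risk": [0, 1, 0, 1],
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["income_log"]),         # mean=0, std=1
    ("cat", OneHotEncoder(drop="first"), ["region"]),  # drop_first avoids the dummy trap
    ("flags", "passthrough", ["is_compound_risk"]),    # already 0/1 – no scaling
])

# fit on train only in the real pipeline; here we fit on the toy frame
X = pre.fit_transform(df)
print(X.shape)  # (4, 4): 1 scaled + 2 one-hot (3 regions - 1) + 1 flag
```

Because the transformer is a single object, calling `fit` on the train split and `transform` on the test split guarantees no test statistics leak into scaling or encoding.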
4.3 PCA – Compressing Correlated Numeric Features
loan_amount_log, property_value_log, and income_log correlate at 0.66–0.85 – severe enough to inflate coefficient variance in linear models. PCA compresses the 9 numeric features into 5 orthogonal components that carry 98.0% of the original variance while eliminating multicollinearity.
| Component | Variance | Cumulative | What it captures |
|---|---|---|---|
| PC1 | 34.7% | 34.7% | Wealth – loan amount, property value, income move together |
| PC2 | 24.7% | 59.4% | Affordability stress – LTI, loan-to-property, monthly debt |
| PC3 | 19.1% | 78.5% | Leverage – LTV and DTI capture collateral and debt burden |
| PC4 | 10.1% | 88.6% | Residual orthogonal variation |
| PC5 | 9.4% | 98.0% | Residual orthogonal variation |
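A minimal sketch of this compression on synthetic data (three deliberately correlated "wealth" columns plus six independent ones, mimicking the 0.66–0.85 correlation block; the variance numbers will differ from the table above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
base = rng.normal(size=(5000, 1))
wealth = base + 0.3 * rng.normal(size=(5000, 3))  # three correlated columns
other = rng.normal(size=(5000, 6))
X = np.hstack([wealth, other])                    # 9 numeric features, as in the notebook

pca = PCA(n_components=5).fit(X)
explained = pca.explained_variance_ratio_
print(explained.sum().round(3))  # fraction of variance kept by 5 components

# PCA scores are orthogonal by construction – multicollinearity is gone
Z = pca.transform(X)
corr = np.corrcoef(Z, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
assert np.abs(off_diag).max() < 1e-8
```

`explained_variance_ratio_` is exactly the "Variance" column of the table; its cumulative sum is the "Cumulative" column.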
4.4 K-Means Clustering – Borrower Segmentation
K=4 selected via the elbow method – the rate of inertia reduction flattens most noticeably between K=4 and K=5, and four clusters produce four financially interpretable borrower segments.
Clusters were validated with two dimensionality reduction methods: PCA projects the global variance structure and confirms the segments occupy different regions; t-SNE reveals local neighborhood coherence and confirms the clusters are not arbitrary partitions.
| Cluster | N (Train) | Default Rate | Mean Dist | Financial Profile |
|---|---|---|---|---|
| 2 | 19,498 | 13.8% | 2.244 | Low LTV, high income, low LTI – conservative borrowers with strong repayment capacity |
| 3 | 18,959 | 19.5% | 2.390 | Below-average risk achieved through diverse financial profiles – the most internally varied cluster |
| 0 | 44,039 | 25.2% | 1.539 | Typical mortgage borrower – standard product, moderate leverage, closest to the portfolio average |
| 1 | 34,967 | 31.8% | 1.896 | High LTI and LTV combined with exotic product flags – the primary target for risk intervention |
The 18-point spread (13.8% → 31.8%) confirms the segmentation captures real financial structure, not statistical noise. Cluster 1 represents 30% of the training portfolio at 31.8% default.
Cluster features added to the model:
- `cluster_id` (one-hot, 3 columns) – discrete segment membership; which of the four borrower archetypes this loan most closely resembles
- `cluster_dist` – Euclidean distance to centroid; a high distance signals an atypical loan within its segment, which carries different risk than a central member
- `cluster_default_rate` – excluded: this encodes the average `Status` value of each cluster computed from training labels – indirect target leakage that would inflate all downstream metrics
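The `cluster_id` and `cluster_dist` features can be sketched as follows (two synthetic blobs stand in for borrower segments, and the elbow-method loop over candidate K values is omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# two well-separated blobs standing in for borrower segments
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(300, 4)),
    rng.normal(loc=4.0, scale=0.5, size=(300, 4)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
cluster_id = km.labels_                 # one-hot encoded downstream

# cluster_dist: distance from each loan to its OWN centroid.
# km.transform returns distances to ALL centroids; pick each row's own.
all_dist = km.transform(X)
cluster_dist = all_dist[np.arange(len(X)), cluster_id]

# a point's own centroid is always its nearest centroid
assert np.allclose(cluster_dist, all_dist.min(axis=1))
```

In the real pipeline the K-Means model is fit on the training split only, and `cluster_dist` for test loans is computed against those frozen centroids.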
4.5 Feature Engineering Impact – Isolated Proof
The same Linear Regression model, same hyperparameters, same stratified split – only the feature matrix changed:
| Stage | Features | ROC-AUC | R² | F1 (Default) |
|---|---|---|---|---|
| Raw features (Part 3) | 34 | 0.693 | 0.1555 | 0.244 |
| Engineered features (Part 4) | 54 | 0.809 | 0.2539 | 0.519 |
| Gain | +20 | +0.116 | +0.098 (+63%) | +0.275 |
AUC improved by 16.7%, R² by 63%, and F1 on the default class more than doubled – all with zero model change. This is the strongest possible evidence that feature engineering drove performance, not model selection.
4.6 Final Feature Matrix
| Category | Count | Source |
|---|---|---|
| Numeric (scaled) | 8 | StandardScaler |
| Binary flags | 7 | EDA interaction flags |
| One-hot encoded | 30 | OneHotEncoder (14 categoricals) |
| PCA components | 5 | Numeric compression |
| Cluster features | 4 | K-Means (id × 3 + dist) |
| Total | 54 | All fit on train only |
📈 Part 5: Three Improved Regression Models
All three are genuine regression models outputting continuous default probability scores in [0, 1]. R² is the primary regression metric. Classification metrics (AUC, F1, Accuracy) are derived by applying a 0.5 threshold to the scores. All trained on the same 54-feature matrix, same split, same seed.
Model 1 – Linear Regression (Engineered Features)
Same OLS architecture as the Part 3 baseline, retrained on the 54-feature matrix. Any improvement over Part 3 isolates the contribution of feature engineering alone.
| Metric | Train | Test |
|---|---|---|
| MAE | 0.2808 | 0.2781 |
| MSE | 0.1385 | 0.1374 |
| RMSE | 0.3721 | 0.3707 |
| R² | 0.2480 | 0.2539 |
| ROC-AUC | – | 0.8094 |
| F1 (Default) | – | 0.5187 |
| Accuracy | – | 80.63% |
| FNR | – | 57.1% |
| FPR | – | 7.2% |
Model 2 – Ridge Regression
L2-regularized linear regression (α=1.0). Handles multicollinearity in the correlated PCA + ratio feature block by shrinking unstable coefficients toward zero.
| Metric | Train | Test |
|---|---|---|
| MAE | 0.2788 | 0.2782 |
| MSE | 0.1382 | 0.1371 |
| RMSE | 0.3718 | 0.3702 |
| R² | 0.2480 | 0.2537 |
| ROC-AUC | – | 0.8093 |
| F1 (Default) | – | 0.5188 |
| Accuracy | – | 80.62% |
| FNR | – | 57.1% |
| FPR | – | 7.3% |
Model 3 – Gradient Boosting Regressor – WINNER
Sequential tree ensemble minimizing regression loss on the binary 0/1 target. Outputs continuous scores directly. Captures non-linear feature interactions natively – no explicit interaction engineering needed.
| Metric | Train | Test |
|---|---|---|
| MAE | 0.1998 | 0.1991 |
| MSE | 0.0928 | 0.0924 |
| RMSE | 0.3046 | 0.3040 |
| R² | 0.4948 | 0.4967 |
| ROC-AUC | – | 0.8807 |
| F1 (Default) | – | 0.7223 |
| Accuracy | – | 88.63% |
| FNR | – | 39.3% |
| FPR | – | 2.4% |
No overfitting: Train R² = 0.4948, Test R² = 0.4967 – test marginally exceeds train.
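A minimal sketch of this "regression on a 0/1 target" setup on synthetic data with a deliberately non-linear interaction (the zone boundaries and probabilities below are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(3000, 6))
# non-linear "compound risk": default probability spikes only when two
# features fall in a joint band – invisible to a purely additive model
in_zone = (X[:, 0] > 0.5) & (X[:, 0] < 1.5) & (X[:, 1] > 0)
p = np.where(in_zone, 0.70, 0.15)
y = (rng.random(3000) < p).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# regression on the 0/1 target -> continuous default-probability scores
gbr = GradientBoostingRegressor(random_state=42).fit(X_tr, y_tr)
scores = gbr.predict(X_te)

# classification metrics come from thresholding the scores at 0.5
auc = roc_auc_score(y_te, scores)
pred = (scores >= 0.5).astype(int)
assert auc > 0.55  # the trees recover the interaction a linear fit would miss
```

The sequential tree splits pick up the joint band automatically, which is exactly the mechanism credited below for GBR's lead over the linear models.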
Full Comparison Table
| Model | MAE | RMSE | R² | ROC-AUC | F1 (Default) | Accuracy | FNR | FPR |
|---|---|---|---|---|---|---|---|---|
| Baseline LR (Part 3) | 0.3227 | 0.3944 | 0.1555 | 0.693 | 0.244 | 77.8% | 57.1% | 7.2% |
| Linear Reg (Engineered) | 0.2781 | 0.3707 | 0.2539 | 0.809 | 0.519 | 80.6% | 57.1% | 7.2% |
| Ridge Regression | 0.2782 | 0.3702 | 0.2537 | 0.809 | 0.519 | 80.6% | 57.1% | 7.3% |
| Gradient Boosting Regressor | 0.1991 | 0.3040 | 0.4967 | 0.881 | 0.722 | 88.6% | 39.3% | 2.4% |
Confusion matrix highlights:
| Model | False Negatives | False Positives | FNR | FPR |
|---|---|---|---|---|
| Linear Reg (Engineered) | 4,083 | 1,605 | 57.1% | 7.2% |
| Ridge Regression | 4,080 | 1,611 | 57.1% | 7.3% |
| Gradient Boosting Regressor | 2,806 | 533 | 39.3% | 2.4% |
GBR catches 1,277 more actual defaults while generating 1,072 fewer false alarms – reducing both error types simultaneously, which only happens with genuinely better discrimination.
What these numbers mean:
Feature engineering alone – same OLS model, same hyperparameters – lifted R² from 0.1555 to 0.2539 (+63%) and improved AUC by 16.7%. This is the most important finding in the regression section: the quality of features contributed more than any model change.
Ridge adding near-zero improvement over Linear Regression confirms that PCA upstream had already resolved the multicollinearity concern – regularization was solving a problem that no longer existed.
Gradient Boosting Regressor achieves R²=0.497 because loan default is fundamentally non-linear. The compound risk zone (LTV 75–90% AND DTI mid-band) reaching 68.9% default cannot be expressed as a sum of independent feature contributions – it requires a model that captures multiplicative interactions. GBR discovers these automatically through sequential tree splits. The confusion matrix confirms genuine discrimination improvement: 1,277 fewer missed defaults and 1,072 fewer false alarms at the same time – reducing both error types at once only happens with better underlying signal, not threshold adjustment.
Feature Importance – Part 5
Key findings:
- `is_compound_risk` is the dominant feature in Ridge Regression (+0.44 coefficient) – consistent with being #1 in GBR feature importances
- Linear Regression shows `loan_to_property` with an inflated coefficient (~1.2×10⁷) due to scale – a visualization artifact from the passthrough path, not a modeling problem. Ridge regularizes this correctly.
- GBR top features: `is_compound_risk` (0.29) > `loan_to_value_ratio` (0.12) > `loan_to_property` (0.11). Both `cluster_dist` and `is_25yr_term` appear – confirming clustering and the 25yr term flag added real signal.
Winner Declaration
Winner: Gradient Boosting Regressor
R² nearly doubles from the best linear model (0.254 → 0.497). The model explains 49.7% of default variance – versus 25.4% for Linear and Ridge. Every metric improves simultaneously. Exported as best_regression_model.pkl.
Why Ridge ≈ Linear Regression: Regularization helps when multicollinearity is severe or features outnumber observations. Neither is critical here – 54 features, 117,463 rows, and PCA already compressed the correlated numeric block. Ridge added stability but not predictive improvement.
Why GBR dominates: Sequential boosting concentrates each tree on the hardest residual cases. Non-linear compound risk interactions (LTV × DTI) are discovered automatically. Trees are scale-invariant – no standardization artifacts.
💾 Part 6: Winning Regression Model Export
File: best_regression_model.pkl | R²: 0.4967 | AUC: 0.881 | Accuracy: 88.6%
π·οΈ Part 7: Regression β Classification
| Class | Label | Threshold | N (Train) | Train % | True Default Rate |
|---|---|---|---|---|---|
| 0 | Low Risk | score < 0.20 | 65,268 | 55.6% | 9.4% |
| 1 | Medium Risk | 0.20 β€ score < 0.40 | 27,880 | 23.7% | 26.7% |
| 2 | High Risk | score β₯ 0.40 | 24,315 | 20.7% | 61.8% |
52.4pp spread validates threshold quality. Imbalance ratio 2.68:1 β corrected
with class_weight='balanced'.
| Error Type | Consequence | Cost |
|---|---|---|
| False Negative | Approved → defaults → principal loss | 5–10× higher |
| False Positive | Rejected β missed revenue | Opportunity cost |
Primary metric: Macro F1 | Secondary: Recall on Class 2 (High Risk)
Why business rule thresholds – not statistical splits:
Median split collapses three operationally distinct tiers into two, losing the ability to differentiate standard review from enhanced scrutiny loans. Quantile binning forces equal class sizes regardless of risk distribution, producing classes with no financial meaning. The 0.20 / 0.40 thresholds were chosen because the resulting true default rates (9.4% / 26.7% / 61.8%) span 52.4 percentage points – validating that the regression scores carry real financial signal. The score distribution confirms this: the 0.40 threshold cleanly separates the long right tail of stressed borrowers from the main distribution, which is why High Risk captures a 61.7% true default rate while representing only 20.7% of the portfolio. Each tier maps directly to a lending action: Low Risk to streamlined approval, Medium Risk to standard review, High Risk to enhanced scrutiny or manual underwriting.
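As a sketch, the 0.20 / 0.40 bucketing rule reduces to a two-cut `np.digitize` (function and variable names here are illustrative, not the project's code):

```python
import numpy as np

def assign_tier(scores: np.ndarray) -> np.ndarray:
    """Map clipped regression scores to tiers: 0=Low, 1=Medium, 2=High."""
    # Right-open bins: score < 0.20 -> 0, 0.20 <= score < 0.40 -> 1,
    # score >= 0.40 -> 2
    return np.digitize(scores, bins=[0.20, 0.40])

print(assign_tier(np.array([0.05, 0.19, 0.20, 0.39, 0.40, 0.85])))
# -> [0 0 1 1 2 2]
```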
Part 8: Classification Models
Three Models
| Model | Architecture | Key Parameters |
|---|---|---|
| Random Forest | 300 independent trees | max_depth=12, balanced |
| XGBoost (tuned) | Sequential boosted trees | RandomizedSearchCV, 20 iter, 3-fold CV |
| K-Nearest Neighbors | Distance-based | K=15, distance weights, Euclidean |
XGBoost Tuning Results
| Model | Macro F1 | Accuracy | ROC-AUC |
|---|---|---|---|
| XGBoost (default) | 0.9463 | 0.9507 | 0.9953 |
| XGBoost (tuned) | 0.9662 | 0.9696 | 0.9982 |
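The tuning protocol (20 sampled configurations, 3-fold CV, macro F1) looks roughly like the sketch below. It is demonstrated on sklearn's `GradientBoostingClassifier` with synthetic data so it runs standalone; the project ran the same search settings over `XGBClassifier`, and the parameter ranges here are illustrative assumptions, not the project's grid.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy 3-class problem standing in for the engineered loan matrix
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

param_dist = {                      # illustrative ranges, not the project's grid
    "n_estimators": randint(50, 150),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.01, 0.3),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20, cv=3, scoring="f1_macro",   # the search settings reported above
    random_state=42, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, f"best macro F1: {search.best_score_:.3f}")
```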
Note on high metrics: Labels are derived from regression scores on the same feature matrix – classifiers learn to replicate tier assignments. The operationally meaningful validation is the true default rate within each predicted tier:
| Predicted Class | N Loans | True Default Rate |
|---|---|---|
| Low Risk | 15,788 | 8.7% |
| Medium Risk | 7,361 | 26.4% |
| High Risk | 6,217 | 61.7% |
52.8pp spread – the model is deployable. Loans flagged as High Risk default at 61.7%, 2.5× the portfolio average; Low Risk loans at 8.7% are safe for auto-approval.
Why the classification metrics appear near-perfect:
The labels were derived from the regression model's predicted scores on the same feature matrix the classifiers train on, so the classifiers are learning to replicate a deterministic bucketing rule, not predicting raw defaults from scratch. Near-perfect replication of a deterministic threshold is expected and is not overfitting. The true default rate table above is the operationally correct validation: it measures whether the risk assignments align with real financial outcomes. A 52.8pp spread between the Low and High Risk tiers, with High Risk defaulting at 61.7% versus a 24.3% portfolio average, confirms the model is deployable.
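The per-tier validation itself is a simple groupby over the held-out split; a toy sketch (array contents are illustrative, not the project's results):

```python
import numpy as np
import pandas as pd

y_pred_tier = np.array([0, 0, 0, 1, 1, 2, 2, 2])     # predicted risk tiers (toy)
y_true_default = np.array([0, 0, 1, 0, 1, 1, 1, 0])  # actual default flags (toy)

# Count and true default rate within each predicted tier
tier_table = (
    pd.DataFrame({"tier": y_pred_tier, "default": y_true_default})
    .groupby("tier")["default"]
    .agg(n_loans="size", true_default_rate="mean")
)
print(tier_table)
```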
Why XGBoost beat Random Forest: Sequential error correction focuses each tree on the loans previous trees got wrong – the hard Medium/High boundary cases where the risk is ambiguous. Random Forest averages 300 independent trees and cannot iteratively concentrate on difficult cases. For this specific boundary problem, sequential learning wins.
Why both beat KNN: In 54 dimensions, Euclidean distances between all points converge toward the same value, so nearest neighbors become geometrically meaningless. Tree models build explicit split rules using one feature at a time, remaining valid in high-dimensional spaces where KNN memorizes without generalizing.
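The distance-concentration claim is easy to check numerically: the relative contrast between the farthest and nearest neighbor collapses as dimensionality grows. A quick sketch on uniform random points:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim: int, n: int = 500) -> float:
    """(d_max - d_min) / d_min over Euclidean distances from a random query."""
    points = rng.uniform(size=(n, dim))
    query = rng.uniform(size=dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

print(f"relative contrast, 2 dims:  {relative_contrast(2):.2f}")
print(f"relative contrast, 54 dims: {relative_contrast(54):.2f}")  # far smaller
```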
Evaluation Results
Threshold Analysis
Optimal threshold for Class 2: 0.42 (not the default 0.50). At 0.42, F1 is maximized for the High Risk class. A lender should deploy at 0.42 – given that false negatives cost 5–10× more than false positives, this is the operationally correct operating point.
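A sketch of how such an operating point is found: sweep the Class 2 probability cutoff and keep the F1-maximizing value. `proba_high` and `y_true` below are synthetic stand-ins for the model's High Risk probabilities and the binary high-risk labels.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 1000)                                     # toy labels
proba_high = np.clip(0.5 * y_true + rng.uniform(0, 0.6, 1000), 0, 1)  # toy scores

# Sweep candidate cutoffs and score each with F1 on the positive class
thresholds = np.arange(0.05, 0.95, 0.01)
f1s = [f1_score(y_true, (proba_high >= t).astype(int)) for t in thresholds]
best_t = float(thresholds[int(np.argmax(f1s))])
print(f"best threshold: {best_t:.2f} (F1={max(f1s):.3f})")
```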
Feature Importance – Classification Models
is_compound_risk ranks #1 in both Random Forest (0.23) and XGBoost (0.27).
Convergence across two structurally different model families confirms this is real
signal – not a model-specific artifact. cluster_dist appears in both top-20
rankings – K-Means segmentation added genuine atypicality signal.
Winner – XGBoost (Tuned)
Winner: XGBoost – best_model_xgboost.pkl
Wins on every operational metric. Sequential error correction targets the hard boundary cases between Medium and High Risk. KNN degrades in 54 dimensions – the curse of dimensionality makes nearest neighbors meaningless.
Final Evaluation – Key Results
| Milestone | Metric | Value |
|---|---|---|
| Baseline Linear Regression | R² | 0.1555 |
| Baseline Linear Regression | AUC | 0.693 |
| After Feature Engineering (same model) | R² | 0.2539 (+63%) |
| After Feature Engineering (same model) | AUC | 0.809 (+16.7%) |
| Ridge Regression | R² | 0.2537 |
| Gradient Boosting Regressor | R² | 0.4967 |
| Gradient Boosting Regressor | AUC | 0.881 |
| Gradient Boosting Regressor | Accuracy | 88.6% |
| Gradient Boosting Regressor | FPR | 2.4% |
| Regression → Classification spread | Class 0 vs 2 DR | 9.4% vs 61.8% (+52.4pp) |
| K-Means cluster spread | Low vs High DR | 13.8% vs 31.8% (+18pp) |
| XGBoost – Low Risk tier | True DR | 8.7% |
| XGBoost – Medium Risk tier | True DR | 26.4% |
| XGBoost – High Risk tier | True DR | 61.7% |
| XGBoost – Tier spread | Low vs High DR | 52.8pp |
Bonus Work
- t-SNE alongside PCA for cluster validation
- Business rule thresholding with financial domain justification
- Interactive Plotly visualizations (LTV×DTI heatmap + cluster profiles)
- RandomizedSearchCV hyperparameter tuning on XGBoost
- Two-panel threshold analysis – optimal deployment point (0.42)
- ColumnTransformer pipeline – production-ready ML engineering
- Comprehensive README with full feature dictionary and embedded visuals
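The ColumnTransformer pattern from the list above, in minimal form (column names and toy data are illustrative placeholders, not the project's schema); fitting the pipeline on the training split alone guarantees the transformers never see test data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({               # toy stand-in for the training split
    "loan_amount": [100_000, 250_000, 180_000, 320_000],
    "ltv": [0.60, 0.85, 0.70, 0.90],
    "region": ["north", "south", "south", "north"],
    "default": [0, 1, 0, 1],
})

# Scale numeric columns, one-hot encode categoricals
pre = ColumnTransformer([
    ("num", StandardScaler(), ["loan_amount", "ltv"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
model = Pipeline([("pre", pre), ("gbr", GradientBoostingRegressor(random_state=42))])

# Fit on train only: scaler means and encoder categories are learned
# from training data, never from the test set.
model.fit(train.drop(columns="default"), train["default"])
preds = model.predict(train.drop(columns="default"))
print(preds.shape)
```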
How to Load and Use the Models
```python
import pickle

import numpy as np

with open("best_model_xgboost.pkl", "rb") as f:
    clf_model = pickle.load(f)
with open("best_regression_model.pkl", "rb") as f:
    reg_model = pickle.load(f)

# Both models expect the 54-feature engineered matrix from Part 4 (X_new below)
y_score = np.clip(reg_model.predict(X_new), 0, 1)  # continuous default score
y_class = clf_model.predict(X_new)                 # 0 / 1 / 2 risk tier

class_map = {0: "Low Risk", 1: "Medium Risk", 2: "High Risk"}
risk_labels = [class_map[c] for c in y_class]
```
Requirements

```
pandas>=1.3
numpy>=1.21
scikit-learn>=1.0
xgboost>=1.5
matplotlib>=3.4
seaborn>=0.11
plotly>=5.0
scipy>=1.7
```
Assignment Structure
| Part | Description | Key Output |
|---|---|---|
| Part 2 | EDA – 11 subsections, 20+ plots | Cleaned 146,829-row dataset |
| Part 3 | Baseline Linear Regression (34 features) | R²=0.1555, AUC=0.693 |
| Part 4 | Feature engineering + PCA + clustering | 54-feature matrix |
| Part 5 | Linear Reg + Ridge + GBR | GBR winner: R²=0.497, AUC=0.881 |
| Part 6 | Export regression winner | best_regression_model.pkl |
| Part 7 | Regression → Classification | 3 tiers, 52.4pp spread |
| Part 8 | RF + XGBoost + KNN | XGBoost winner |
Key Design Decisions
| Decision | Justification |
|---|---|
| Exclude `upfront_charges` | 0% default in no-fee segment – data artifact |
| Exclude `credit_score` | Pearson r = 0.003 |
| Exclude `interest_rate` | Post-approval pricing → leakage |
| Exclude `credit_bureau` | EQUI = 100% default → post-default assignment |
| Remove `cluster_default_rate` | Target encoding = leakage |
| Use Ridge not Lasso | Feature elimination not desired – all features interpretable |
| Use GBR not GBC | Regression task requires continuous score output with R² |
| Stratified split | Preserves 24.34% class rate |
| Fit transformers on train only | Zero test-set information |
| Deploy at threshold 0.42 | Optimal Class 2 F1 – not the default 0.50 |
| Recall > Precision | False negatives cost 5–10× more |
| Macro F1 as primary metric | Equal penalty for ignoring any risk class |
Assignment #2 – Data Science Program | May 2026