📊 LinkedIn Job Posting Engagement Analysis

Which LinkedIn job posting characteristics predict candidate engagement (views), and how well can engagement be predicted or classified using only posting-level features?

Personal motivation: As someone in entrepreneurship, understanding which job posting features attract candidates is directly relevant to future hiring decisions.


📹 Presentation Video

<video src="https://huggingface.co/datasets/YOUR_USERNAME/YOUR_REPO/resolve/main/presentation.mp4" controls style="max-width:720px;"></video>


🚀 Interactive Dashboard

👉 Open the LinkedIn Job Engagement Dashboard

Tab                      Description
─────────────────────────────────────────────────────────────────────
🎯 Engagement Predictor  Enter posting details → get predicted views + High/Normal classification in real time
📊 EDA Dashboard         All 5 EDA findings as interactive charts
ℹ️ About                 Feature groups, model details, limitations

📋 Dataset at a Glance

Property                   Value
─────────────────────────────────────────────────────────────────────
Source                     LinkedIn Job Postings: arshkon/linkedin-job-postings (Kaggle)
Original size              123,850 rows × 49 columns
Working sample             30,000 rows · random_state=42
After join with companies  30,000 rows × 40 columns
After cleaning             29,572 rows × 51 columns (in df_model)
Train / Test split         23,657 / 5,915 (80/20, random_state=42)
Regression target          log_views = log1p(views), log-transformed to handle right skew
Classification target      high_engagement: top 25% of training views (threshold derived from training set only)

⚠️ Scope & Limitations

LinkedIn's ranking algorithm, sponsored status, and company follower counts likely drive most of the variance in views, and none of them is observable in this dataset. Models use posting-level features only. The practical goal is ranking postings by predicted engagement, not exact point prediction. Results show associations, not causal relationships.


🗂️ Repository Files

File                               Description
─────────────────────────────────────────────────────────────────────
notebook.ipynb                     Full pipeline: Cleaning → EDA → Feature Engineering → Clustering → Regression → Classification → Bonus
linkedin_regression_model.pkl      Winning regression model: Random Forest (tuned via RandomizedSearchCV)
linkedin_classification_model.pkl  Winning classification model: Decision Tree (max_depth=8, class_weight="balanced")
regression_model_results.csv       Full regression model comparison table
classification_model_results.csv   Full classification model comparison table

🧹 Data Cleaning Pipeline

7 steps from 123,850 raw rows to a clean, leakage-free modelling matrix:

Step 1: Reproducible sampling
        123,850 rows → sample(n=30,000, random_state=42)
        Joined with companies.csv on company_id (left join, rows preserved)
        Result: 30,000 rows × 40 columns

Step 2: Duplicate & missing target removal
        Removed duplicate rows
        Dropped rows where views is NaN or negative
        Result: 29,572 usable rows

Step 3: Date parsing
        listed_time, original_listed_time, expiry, closed_time → parsed to datetime
        Extracted: posting_year, posting_month, posting_dayofweek, posting_weekend

Step 4: Missing value analysis & column dropping
        Threshold: >70% missing → drop
        Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%),
                 remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%)

Step 5: Leakage columns excluded
        expiry, applies → removed (post-publication outcomes)
        views → kept as target only, never as a feature

Step 6: Salary imputation strategy
        has_salary_info = 1 if salary present, else 0
        salary_midpoint computed from min/max salary where available
        Missing salary → imputed inside sklearn Pipeline on training data only

Step 7: Log transformation of target
        Raw views: mean=14.9, std=98.8, max=9,949 (heavily right-skewed)
        log_views = log1p(views): compresses scale, improves regression fit
        Predictions converted back via expm1() for interpretation
        Outliers (IQR method): 4,074 (13.8%), kept rather than removed
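
The log transform in Step 7 and the IQR outlier count can be sketched as follows. This is a minimal illustration rather than the notebook's code: the helper names are hypothetical, the view counts are made up, and the 1.5 × IQR fence is an assumption, since the multiplier is not stated above.

```python
import numpy as np


def log_transform_views(views):
    """Return log1p(views) as a float array."""
    return np.log1p(np.asarray(views, dtype=float))


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumption)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up, right-skewed view counts standing in for the real target
views = np.array([0, 1, 2, 3, 4, 5, 7, 12, 40, 9949])

log_views = log_transform_views(views)
recovered = np.expm1(log_views)       # inverse transform back to raw views
outliers = iqr_outlier_mask(views)    # flagged, but kept in the data

print("round trip ok:", np.allclose(recovered, views))
print("outliers flagged:", int(outliers.sum()))
```

Flagging without dropping mirrors the choice above: the extreme postings stay in the training data, and the log scale limits their influence.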

🔍 EDA: 5 Research Questions

Note on notebook ordering: Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Presented below in order of business impact.


💰 Q2: Salary Transparency vs Views

No salary info   ████████████░░░░░░░░░░░░░  ~12 avg views   (70.1% of postings)
Has salary info  ████████████████████████░  ~21 avg views   (29.9% of postings)
                                             +74.3% lift ✓

Only 8,562 of 29,572 postings (29.9%) disclose salary. Postings that disclose salary average 74.3% more views, which makes disclosure the lowest-cost change associated with higher engagement in this dataset.
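
The lift figure is a ratio of group means. A minimal sketch, assuming a DataFrame with the has_salary_info and views columns described above; the salary_lift helper and the toy numbers are illustrative only.

```python
import pandas as pd


def salary_lift(df):
    """Percent difference in mean views: postings with salary info vs without."""
    means = df.groupby("has_salary_info")["views"].mean()
    return (means.loc[1] / means.loc[0] - 1.0) * 100.0


# Toy data (made up): group means are 12 and 21 views
toy = pd.DataFrame({
    "has_salary_info": [0, 0, 0, 1, 1],
    "views": [10, 12, 14, 20, 22],
})

lift = salary_lift(toy)
print(f"{lift:.1f}% lift")
```

Because views are heavily right-skewed, comparing group medians alongside the means is a reasonable robustness check.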


📝 Q3: Description Length vs Views

< 100 words    ██████░░░░░░░░░░░░░░  ~8 avg views   (may signal an incomplete posting)
100–250 words  █████████░░░░░░░░░░░  ~13 avg views
250–500 words  ████████████████████  ~24 avg views  PEAK ★ (sweet spot)
500–750 words  ████████████████░░░░  ~18 avg views
> 1000 words   ███████░░░░░░░░░░░░░  ~10 avg views  (may overwhelm candidates)

The relationship is non-linear, with a sweet spot at 250–500 words. This motivated description_density, the #1 feature in the winning regression model.
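
The length buckets can be reproduced with a binned group-by. This is a sketch under stated assumptions: the column names follow the feature list in this card, the 750–1000 bucket is added only to make the bins contiguous (it is not reported above), and the data is made up.

```python
import numpy as np
import pandas as pd

# Edges follow the buckets reported above; the 750-1000 bucket is an assumption
EDGES = [0, 100, 250, 500, 750, 1000, np.inf]
LABELS = ["<100", "100-250", "250-500", "500-750", "750-1000", ">1000"]


def mean_views_by_length(df):
    """Average views per description-length bucket (left-closed intervals)."""
    buckets = pd.cut(
        df["description_word_count"], bins=EDGES, labels=LABELS, right=False
    )
    return df.groupby(buckets, observed=True)["views"].mean()


# Toy postings (made up)
toy = pd.DataFrame({
    "description_word_count": [50, 120, 300, 400, 600, 1500],
    "views": [8, 13, 20, 28, 18, 10],
})

by_length = mean_views_by_length(toy)
print(by_length)
```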


📅 Q4: Day of Week vs Views

Monday    ████████████████████  39 avg views  ★ best day  (n=1,837)
Tuesday   █████████████░░░░░░░  25 avg views
Wednesday ███████████░░░░░░░░░  22 avg views
Thursday  █████████░░░░░░░░░░░  18 avg views
Friday    ████░░░░░░░░░░░░░░░░   7 avg views  ✗ worst day (n=10,076)
Saturday  ██████████████░░░░░░  28 avg views  (weekend, n=2,116 total, noisier)
Sunday    ██████████████░░░░░░  28 avg views  (weekend, noisier)

Counterintuitive finding: weekend postings average ~28 views, higher than every weekday except Monday, but the weekend sample is small (2,116 observations in total), so these estimates are unreliable. Monday shows the highest average at 39 views, although its sample (n=1,837) is also small. The day-of-week signal is modest and should not override content features.


💼 Q1: Work Type vs Views

Contract    ████████████████████  29.97 avg views  median: 7.0
Internship  █████████████████░░░  25.71 avg views  median: 5.0
Full-time   ████████░░░░░░░░░░░░  13.70 avg views  median: 4.0  ← 80% of volume
Other       ███████░░░░░░░░░░░░░  11.27 avg views  median: 4.0
Part-time   ██████░░░░░░░░░░░░░░   9.59 avg views  median: 4.0

Contract and Internship roles show the highest engagement. However, Full-time dominates volume (23,674 of 29,572 postings = 80%). Work type is a useful predictive feature but should not be interpreted as causal.


🎓 Q5: Seniority Level vs Views

Entry-level  ████████████████████  18 avg views  n=792
Senior-level ████████████░░░░░░░░  16 avg views  n=3,577
Other/Mid    ██████████░░░░░░░░░░  15 avg views  n=25,203

Entry vs Senior: +12.4% more views
Entry vs Other:  +18.9% more views

A plausible explanation is a supply-side effect: more candidates qualify for junior roles, so the pool is larger. Under that reading, is_entry_role carries predictive signal because it proxies for candidate pool size rather than intrinsic desirability.


🔥 Feature Correlation with log(views+1)

Feature                      Corr    Direction   Note
─────────────────────────────────────────────────────────────────────
desc_salary_interaction      +0.18   ↑ views     strongest linear correlate
has_salary_info              +0.14   ↑ views     salary transparency
salary_log                   +0.12   ↑ views     salary level
description_density          +0.10   ↑ views     content quality
description_word_count       +0.08   ↑ views     description length
is_software_role             +0.08   ↑ views     tech role demand
is_data_role                 +0.07   ↑ views     data role demand
is_entry_role                +0.06   ↑ views     larger candidate pool
posting_weekend              -0.04   ↓ views     small negative signal
is_senior_role               -0.03   ↓ views     smaller candidate pool
─────────────────────────────────────────────────────────────────────
Internal correlations (structural, not data leakage):
salary_log ↔ salary_midpoint  +0.96  log transform of same variable
desc_wc ↔ desc_density        +0.55  density uses length in formula
is_software ↔ is_data         +0.35  often co-occur in job titles
is_senior ↔ is_entry          -0.28  mutually exclusive by construction

Most features show weak linear correlation, and no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting), which capture non-linear interactions and feature combinations.
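
A correlation column like the one above can be computed with pandas. The sketch below uses synthetic data and a hypothetical helper name; only the column names come from this card.

```python
import numpy as np
import pandas as pd


def target_correlations(df, target="log_views"):
    """Pearson correlation of each feature with the target, strongest first."""
    features = df.drop(columns=[target])
    corr = features.corrwith(df[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic data: has_salary_info carries signal, posting_weekend is pure noise
rng = np.random.default_rng(42)
n = 500
has_salary = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "has_salary_info": has_salary,
    "posting_weekend": rng.integers(0, 2, size=n),
    "log_views": 1.0 * has_salary + rng.normal(scale=0.5, size=n),
})

corr = target_correlations(toy)
print(corr.round(2))
```

Pearson correlation only measures linear association, so a weak value here does not rule out the non-linear effects that tree-based models can pick up.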

🌡️ Correlation Heatmap (feature-to-feature + target)

                          log   desc  has   sal   desc  is_   is_   is_   post  is_
                          views dens  sal   log   wc    soft  data  entr  wknd  snr
──────────────────────────────────────────────────────────────────────────────────────
log_views              │  1.00  0.10  0.14  0.12  0.08  0.08  0.07  0.06 -0.04 -0.03
description_density    │  0.10  1.00  0.02  0.04  0.55  0.01  0.01 -0.01  0.00  0.00
has_salary_info        │  0.14  0.02  1.00  0.72  0.03  0.06  0.07 -0.03 -0.01 -0.02
salary_log             │  0.12  0.04  0.72  1.00  0.04  0.05  0.06 -0.02 -0.01 -0.01
description_word_count │  0.08  0.55  0.03  0.04  1.00  0.01  0.01 -0.01  0.00  0.00
is_software_role       │  0.08  0.01  0.06  0.05  0.01  1.00  0.35 -0.08  0.00 -0.05
is_data_role           │  0.07  0.01  0.07  0.06  0.01  0.35  1.00 -0.06  0.00 -0.04
is_entry_role          │  0.06 -0.01 -0.03 -0.02 -0.01 -0.08 -0.06  1.00  0.01 -0.28
posting_weekend        │ -0.04  0.00 -0.01 -0.01  0.00  0.00  0.00  0.01  1.00  0.00
is_senior_role         │ -0.03  0.00 -0.02 -0.01  0.00 -0.05 -0.04 -0.28  0.00  1.00
──────────────────────────────────────────────────────────────────────────────────────
Key structural correlations:
  salary_log ↔ has_salary_info  +0.72  same underlying signal, different form
  desc_wc    ↔ desc_density     +0.55  density formula uses word count
  is_software ↔ is_data         +0.35  frequently co-occur in job titles
  is_entry   ↔ is_senior        -0.28  mutually exclusive flags

The heatmap shows no severe multicollinearity: the highest inter-feature correlation shown (salary_log ↔ has_salary_info at 0.72) reflects two forms of the same signal rather than a data problem. All correlations with log_views are weak, which motivates the move to non-linear tree-based models.


⚙️ Feature Engineering: 30 Total Features (Including 6 Cluster Dummies)

Group            Features
─────────────────────────────────────────────────────────────────────
Text length      title_length, title_word_count, description_length, description_word_count
Text structure   description_density ★, title_desc_ratio
Salary           salary_midpoint, salary_range, has_salary_info, salary_log
Role keywords    is_senior_role, is_entry_role, is_software_role, is_data_role, is_manager_role, is_sales_role, is_marketing_role, is_remote_text
Interactions     desc_salary_interaction ★, senior_salary, weekend_remote, title_desc_word_interaction, salary_density_interaction, salary_description_interaction, title_density_interaction
Clustering       cluster_0, cluster_1, cluster_2, cluster_3, cluster_4, cluster_5

Missing value strategy:

  • Columns with >70% missing → dropped
  • Salary → has_salary_info flag + salary_midpoint where available; remaining NaN imputed inside sklearn Pipeline on training data only
  • Remaining numeric → SimpleImputer(strategy="median") inside Pipeline
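
The strategy above can be sketched as a column filter followed by an imputing Pipeline. This is a minimal illustration, not the notebook's code: drop_sparse_columns is a hypothetical helper, LinearRegression is only a placeholder estimator, and the data is made up. Because the imputer is a Pipeline step, its medians are learned from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


def drop_sparse_columns(df, threshold=0.70):
    """Drop columns whose share of missing values exceeds the threshold."""
    missing_share = df.isna().mean()
    return df.loc[:, missing_share <= threshold]


# Toy training frame (made up); closed_time is 80% missing and gets dropped
train = pd.DataFrame({
    "salary_midpoint": [50_000, np.nan, 70_000, np.nan, 90_000],
    "description_word_count": [120, 300, 450, 800, 260],
    "closed_time": [np.nan, np.nan, np.nan, np.nan, 1.0],
})
y_train = np.array([1.2, 2.0, 2.9, 1.5, 2.4])

X_train = drop_sparse_columns(train)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # medians learned in fit()
    ("regress", LinearRegression()),               # placeholder estimator
])
model.fit(X_train, y_train)

# The median stored for salary_midpoint comes from the training rows only
learned_median = model.named_steps["impute"].statistics_[0]

# A new posting with missing salary is imputed with that training median
new_posting = pd.DataFrame(
    {"salary_midpoint": [np.nan], "description_word_count": [200]}
)
prediction = model.predict(new_posting)
```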

🔵 Clustering: KMeans k=6

Features used for clustering (12 total, leakage-checked): title_word_count, description_word_count, salary_log, description_density, has_salary_info, is_senior_role, is_entry_role, is_software_role, is_data_role, is_manager_role, is_sales_role, is_marketing_role

Methods used to select k:

  1. Elbow method: inconclusive, no sharp elbow
  2. KMeans silhouette scores on full training matrix
  3. Cluster-size stability table
  4. Interactive K-Means widget (visualization aid, uses sample)
  5. Hierarchical clustering dendrogram (Ward linkage, 300 obs)
  6. Agglomerative clustering comparison (k=2–10)

Silhouette scores by k (full training matrix):

  k=2   ████████░░░░░░░░░░░░  0.198  smallest cluster: 6,830 (28.9%)
  k=3   █████████░░░░░░░░░░░  0.221  smallest cluster: 2,100 (8.9%)
  k=4   ████████████████░░░░  0.312  ← strong BUT largest=72% of data
  k=5   ██████████░░░░░░░░░░  0.250  smallest: 526 (unstable)
  k=6   ████████████░░░░░░░░  0.290  ← SELECTED ★  smallest: 583 (2.5%)
  k=7   ████████████░░░░░░░░  0.286  singleton cluster appeared
  k=8+                               singleton clusters appeared

Why NOT k=10 (highest score): singleton cluster (1 observation)
Why NOT k=4 (strong score):   largest cluster = 72%, not meaningful separation
Why k=6: no singletons, stable sizes, interpretable profiles, silhouette 0.290
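
A table of silhouette score and smallest cluster size per k, like the one above, can be produced with a short scan. The sketch below runs on synthetic blobs and uses a hypothetical scan_k helper; n_init=10 and the scaling of the real clustering features are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def scan_k(X, k_values, random_state=42):
    """Fit KMeans for each k; record silhouette and smallest cluster size."""
    rows = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = km.fit_predict(X)
        sizes = np.bincount(labels, minlength=k)
        rows.append({
            "k": k,
            "silhouette": float(silhouette_score(X, labels)),
            "smallest": int(sizes.min()),
        })
    return rows


# Synthetic stand-in for the 12-feature clustering matrix
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

results = scan_k(X, range(2, 6))
for row in results:
    print(row)
```

Reading the score and the smallest-cluster size side by side supports the trade-off described above: a high silhouette is discounted when it comes with a singleton or one dominant cluster.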

Cluster profiles at k=6 (training set n=23,657):

Cluster  Label               Size    Share  Key Signal
─────────────────────────────────────────────────────────────────────
0        Manager-focused     4,571   19%    is_manager_role=1.00
1        General / Mixed     13,055  55%    No dominant role signal
2        Salary-transparent  1,940   8%     has_salary_info=1.00
3        Data roles          1,451   6%     is_data_role=1.00
4        Software roles      2,057   9%     is_software_role=1.00
5        Entry / low salary  583     2%     Smallest cluster

Official final silhouette score: 0.290 (full training matrix)

Cluster labels one-hot encoded as 6 dummy features. Including clusters improved both regression RMSE and classification F1 over models without them.
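
One way to turn cluster labels into dummy features is sketched below. It assumes KMeans is fit on the training matrix and only applied to the test matrix; the cluster_dummies helper, the random data, and k=3 are illustrative (the card uses k=6).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def cluster_dummies(kmeans, X, n_clusters):
    """Assign clusters with a fitted KMeans and one-hot encode the labels."""
    labels = kmeans.predict(X)
    # Fixed categories keep the column set identical for train and test
    categorical = pd.Categorical(labels, categories=list(range(n_clusters)))
    return pd.get_dummies(categorical, prefix="cluster").astype(int)


rng = np.random.default_rng(42)
X_train = rng.normal(size=(60, 2))
X_test = rng.normal(size=(15, 2))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train)

train_dummies = cluster_dummies(kmeans, X_train, 3)
test_dummies = cluster_dummies(kmeans, X_test, 3)
print(test_dummies.head())
```

Fixing the categories up front means a cluster that happens to be absent from the test set still gets an all-zero column, so the feature matrix keeps the same width.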


📈 Regression: Predicting log1p(views)

Baseline

Mean Baseline (predict training mean for all observations):
  RMSE_log = 0.8708   R² = -0.0002   ← floor every model must beat
  MAE_views ≈ 10.64

Baseline Linear Regression (20 features, no clustering):
  RMSE_log = 0.8425   R² = 0.0639
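
The baseline metrics can be computed as sketched below. The evaluate_log_model helper is hypothetical, the data is made up, and computing MAE_views on back-transformed predictions is an assumption about how that figure was obtained.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_log_model(model, X_test, y_test_log):
    """RMSE and R² on the log scale, plus MAE after converting back to views."""
    pred_log = model.predict(X_test)
    rmse_log = float(np.sqrt(mean_squared_error(y_test_log, pred_log)))
    r2 = float(r2_score(y_test_log, pred_log))
    mae_views = float(np.mean(np.abs(np.expm1(y_test_log) - np.expm1(pred_log))))
    return {"RMSE_log": rmse_log, "R2": r2, "MAE_views": mae_views}


# Made-up targets; the features are irrelevant for a mean baseline
y_train_log = np.log1p(np.array([0, 3, 5, 20]))
y_test_log = np.log1p(np.array([1, 4, 50]))
X_train = np.zeros((4, 1))
X_test = np.zeros((3, 1))

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train_log)
metrics = evaluate_log_model(baseline, X_test, y_test_log)
print(metrics)
```

A constant prediction can never score above zero R² on the test set, which is why the reported baseline sits at roughly zero and serves as the floor.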

Full Model Comparison (after feature engineering + clustering)

Model                         RMSE_log ↓  R² ↑     Notes
─────────────────────────────────────────────────────────────────────
Random Forest (Tuned) ★       0.8347      0.0811   RandomizedSearchCV winner
Random Forest (Controlled)    0.8349      0.0807   Manual constraints
Gradient Boosting             0.8370      0.0770   -
Linear Regression + Features  0.8420      0.0640   -
RidgeCV                       0.8420      0.0640   -
Lasso Regression              0.8430      0.0640   -
PCA + Linear Regression       0.8440      0.0600   15 components, 96.3% variance
Mean Baseline                 0.8708      -0.0002  Floor

Key lessons:

  • Unrestricted RF → train R²=0.854, test R²=0.003 (massive overfit). Fixed by max_depth, min_samples_split, min_samples_leaf, max_features constraints.
  • 3-fold CV mean RMSE_log: 0.8747 (±0.0125), stable across folds
  • Outlier robustness test: capping views at 99th pct → RMSE_log 0.8147, R²=0.0812
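
A constrained random forest search along the lines of the first lesson might look like the sketch below. The four constrained hyperparameters are the ones named above; the candidate values, n_iter, n_estimators, and the synthetic data are assumptions and not the notebook's actual search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data with a weak signal, loosely mimicking a low-R² problem
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 0.3 * X[:, 0] + rng.normal(scale=1.0, size=200)

# Candidate values are illustrative; each one limits tree complexity
param_distributions = {
    "max_depth": [3, 5, 8],
    "min_samples_split": [10, 20, 50],
    "min_samples_leaf": [5, 10, 20],
    "max_features": ["sqrt", 0.5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

best = search.best_params_
print(best, round(-search.best_score_, 4))
```

Selecting by cross-validated RMSE rather than training fit is what guards against the train/test gap described in the first lesson.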

Top Feature Importances (RF Tuned)

description_density           ████████████  #1: content quality proxy
description_length            ██████████░░  #2: raw description size
description_word_count        ████████░░░░  #3: word count
title-description interaction ████████░░░░  #4: combined text signal
is_software_role              ██████░░░░░░  #5: tech role demand
is_data_role                  █████░░░░░░░  #6: data role demand
salary_log / has_salary_info  ████░░░░░░░░  #7+: salary signals

desc_salary_interaction ranks #2 in SHAP analysis but lower in Gini importance; both methods agree that description quality and salary are the top drivers.

Why R² = 0.081 Is Acceptable

R² = 0.081 → model explains ~8% of variance in log(views+1)

✓ Beats mean baseline (R²≈0): real posting-level signal captured
✓ Social engagement is inherently noisy, and platform factors dominate
✓ ~92% of variance is unexplained by posting-level features and is likely driven by unobservable sources (algorithm, followers, ads)
✓ Practical use = ranking postings, not forecasting exact counts

🟠 Classification β€” High Engagement vs. Normal

Target: high_engagement = 1 if views β‰₯ 75th percentile of TRAINING views
Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1)
Feature matrix: X_clf uses 24 features (see notebook cell 207)
Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance)
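
Deriving the threshold from the training views only, then reusing it for the test set, can be sketched as follows. The make_engagement_labels helper and the view counts are illustrative.

```python
import numpy as np


def make_engagement_labels(views_train, views_test, quantile=75):
    """Label views >= the training-set percentile as high engagement (1)."""
    threshold = np.percentile(views_train, quantile)
    y_train = (np.asarray(views_train) >= threshold).astype(int)
    y_test = (np.asarray(views_test) >= threshold).astype(int)
    return threshold, y_train, y_test


# Made-up view counts
views_train = [1, 2, 3, 4, 5, 6, 7, 100]
views_test = [2, 6, 50]

threshold, y_train, y_test = make_engagement_labels(views_train, views_test)
print(threshold, y_train.tolist(), y_test.tolist())
```

Computing the percentile on the training split alone keeps information about the test distribution out of the target definition.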

Model Comparison

Model                F1 (Class 1)  Recall (Class 1)  Notes
─────────────────────────────────────────────────────────────────────
Decision Tree ★      HIGHEST       HIGHEST           max_depth=8, class_weight="balanced"
Logistic Regression  near-best     high              Close to DT in F1
Random Forest        moderate      lower             Lowest FP count
Dummy Baseline       0.00          0.00              Always predicts Class 0

5-fold CV F1: 0.4424 ± 0.0152, stable across folds

Error Cost Analysis

FN (missed high-engagement) = most costly error
  → Company fails to prioritize, promote, or learn from a strong posting

FP (false alarm) = also costly
  → Recruiter wastes time and budget on a posting that won't perform

Decision Tree minimises FN (catches most high-engagement postings) but produces more FP. Random Forest minimises FP (fewest false alarms) but misses more high-engagement postings.
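
The FN/FP trade-off can be read directly from a confusion matrix. The sketch below uses the decision tree settings named in this card (max_depth=8, class_weight="balanced") on synthetic data with a roughly 75/25 class split; the data and the resulting numbers are illustrative, not the card's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the engagement data
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = DecisionTreeClassifier(max_depth=8, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
report = {
    "FN": int(fn),
    "FP": int(fp),
    "F1_class1": float(f1_score(y_test, pred, pos_label=1)),
    "Recall_class1": float(recall_score(y_test, pred, pos_label=1)),
}
print(report)
```

Running the same report for each candidate model makes the comparison above concrete: a model tuned for recall lowers FN at the cost of more FP, and vice versa.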


💡 Business Insights

  1. Salary transparency is the single highest-leverage action: disclosing salary is associated with 74.3% more views at no cost. Fewer than 30% of postings disclose salary today.
  2. Description structure matters: description_density ranked #1 in the regression model and #2 in the classification model. Sweet spot: 250–500 words.
  3. Tech roles attract disproportionate engagement: is_software_role and is_data_role carry real signal beyond salary.
  4. Work type is associated with engagement: contract roles lead, but full-time dominates volume (80%).
  5. Platform factors dominate: R²≈0.08 is expected and acceptable. Model value is in ranking postings, not exact prediction.

🎁 Bonus Work

🧠 SHAP Explainability

SHAP mean |value| β€” RF Tuned regression (test observations)

description_density      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  strongest positive impact ↑
desc_salary_interaction  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘  salary Γ— description synergy ↑
salary_log               β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘  salary level ↑
has_salary_info          β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘  disclosed β†’ more views ↑
posting_weekend          β–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘  weekend β†’ fewer views ↓

desc_salary_interaction ranks #2 in SHAP but lower in Gini β€” confirms it captures genuine non-linear interaction that neither feature achieves alone.

📊 Feature Importance: Regression vs Classification

                        Regression RF    Classification DT
description_density     #1               #2
desc_salary_interaction #2 (SHAP)        varies
salary_log              #7+              varies
is_entry_role           lower            rises in classification
is_data_role            #6               varies
──────────────────────────────────────────────────────────
Agreement:  description quality + salary dominate both models
Divergence: seniority/role flags matter more for threshold-crossing
            (classification) than for predicting exact counts (regression)

🔬 Additional Extras

  • Interactive K-Means Widget: explore different k values visually (notebook cell 4.11)
  • Hierarchical Clustering Dendrogram: Ward linkage, 300 obs sample (cell 4.12)
  • Agglomerative Clustering Diagnostic: k=2–10 comparison (cell 4.13)
  • Outlier Robustness Test: views capped at 99th percentile, RMSE_log 0.8147 vs 0.8347 uncapped
  • 3-fold CV for regression: mean RMSE_log 0.8747 ± 0.0125

🛠️ How to Use the Models

import pickle

import numpy as np

# Load the two fitted models shipped in this repository
with open("linkedin_regression_model.pkl", "rb") as f:
    reg_model = pickle.load(f)
with open("linkedin_classification_model.pkl", "rb") as f:
    clf_model = pickle.load(f)

# X_test_fe and X_clf are the feature matrices produced by the notebook pipeline

# Regression: predict log(views+1), then convert back to a raw view estimate
log_views_pred = reg_model.predict(X_test_fe)
views_pred = np.expm1(log_views_pred)

# Classification: predict the high-engagement label (0 = Normal, 1 = High)
label = clf_model.predict(X_clf)

Regression model expects 30-column X_test_fe (including cluster dummies). Classification model expects 24-column X_clf (see notebook cell 207). Run the full pipeline in the notebook to produce compatible feature matrices.
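
Before calling predict, it can help to confirm that a feature matrix has the column count the model was fit on. The sketch below relies on scikit-learn's n_features_in_ attribute; check_feature_count is a hypothetical helper, and the stand-in model is fit on 3 columns purely for illustration. Whether the pickled models expose this attribute depends on the scikit-learn version they were saved with, so the helper skips the check when it is absent.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def check_feature_count(model, X):
    """Raise a clear error if X has a different column count than the model saw."""
    expected = getattr(model, "n_features_in_", None)
    actual = np.asarray(X).shape[1]
    if expected is not None and actual != expected:
        raise ValueError(f"Model expects {expected} columns, got {actual}")
    return actual


# Stand-in model fit on 3 columns (the real regression model expects 30)
toy_model = LinearRegression().fit(np.zeros((4, 3)), np.arange(4.0))

n_cols = check_feature_count(toy_model, np.zeros((2, 3)))  # passes

try:
    check_feature_count(toy_model, np.zeros((2, 5)))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```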


Assignment 2: Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)
