LinkedIn Job Posting Engagement Analysis
Which LinkedIn job posting characteristics predict candidate engagement (views), and how well can engagement be predicted or classified using only posting-level features?
Personal motivation: As someone in entrepreneurship, understanding which job posting features attract candidates is directly relevant to future hiring decisions.
Presentation Video
<video src="https://huggingface.co/datasets/YOUR_USERNAME/YOUR_REPO/resolve/main/presentation.mp4" controls style="max-width:720px;"></video>
Interactive Dashboard
Open the LinkedIn Job Engagement Dashboard
| Tab | Description |
|---|---|
| Engagement Predictor | Enter posting details to get predicted views and a High/Normal classification in real time |
| EDA Dashboard | All 5 EDA findings as interactive charts |
| About | Feature groups, model details, limitations |
Dataset at a Glance
| Property | Value |
|---|---|
| Source | LinkedIn Job Postings: arshkon/linkedin-job-postings (Kaggle) |
| Original size | 123,850 rows × 49 columns |
| Working sample | 30,000 rows · random_state=42 |
| After join with companies | 30,000 rows × 40 columns |
| After cleaning | 29,572 rows × 51 columns (in df_model) |
| Train / Test split | 23,657 / 5,915 (80/20, random_state=42) |
| Regression target | log_views = log1p(views), log-transformed to handle right skew |
| Classification target | high_engagement: top 25% of training views (threshold derived from training set only) |
Scope & Limitations
LinkedIn's algorithm, sponsored status, and company follower counts drive the majority of view variance and are unobservable in this dataset. Models use posting-level features only. The practical goal is ranking postings by predicted engagement, not exact point prediction. Results show associations, not causal relationships.
Repository Files
| File | Description |
|---|---|
| notebook.ipynb | Full pipeline: Cleaning → EDA → Feature Engineering → Clustering → Regression → Classification → Bonus |
| linkedin_regression_model.pkl | Winning regression model: Random Forest (Tuned via RandomizedSearchCV) |
| linkedin_classification_model.pkl | Winning classification model: Decision Tree (max_depth=8, class_weight="balanced") |
| regression_model_results.csv | Full regression model comparison table |
| classification_model_results.csv | Full classification model comparison table |
Data Cleaning Pipeline
7 steps from 123,850 raw rows to a clean, leakage-free modelling matrix:
Step 1: Reproducible sampling
- 123,850 rows → sample(n=30,000, random_state=42)
- Joined with companies.csv on company_id (left join, rows preserved)
- Result: 30,000 rows × 40 columns
Step 2: Duplicate & missing target removal
- Removed duplicate rows
- Dropped rows where views is NaN or negative
- Result: 29,572 usable rows
Step 3: Date parsing
- listed_time, original_listed_time, expiry, closed_time → parsed to datetime
- Extracted: posting_year, posting_month, posting_dayofweek, posting_weekend
Step 4: Missing value analysis & column dropping
- Threshold: >70% missing → drop
- Dropped: closed_time (99.2%), skills_desc (98.1%), med_salary (95.1%), remote_allowed (87.9%), applies (81.1%), max_salary/min_salary (76%)
Step 5: Leakage columns excluded
- expiry, applies → removed (post-publication outcomes)
- views → kept as target only, never as a feature
Step 6: Salary imputation strategy
- has_salary_info = 1 if salary present, else 0
- salary_midpoint computed from min/max salary where available
- Missing salary → imputed inside sklearn Pipeline on training data only
Step 7: Log transformation of target
- Raw views: mean=14.9, std=98.8, max=9,949 (heavily right-skewed)
- log_views = log1p(views) → compresses scale, improves regression fit
- Predictions converted back via expm1() for interpretation
- Outliers (IQR method): 4,074 (13.8%), kept rather than removed
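The log transform in Step 7 and the IQR outlier count can be sketched with the standard library alone. The view counts below are made up for illustration, and `statistics.quantiles` is an assumption: the source does not say how the notebook computes quartiles, so its exact fences may differ.

```python
import math
import statistics

# Hypothetical raw view counts (not taken from the dataset).
raw_views = [0, 2, 3, 4, 5, 7, 12, 40, 9949]

# Step 7: compress the right-skewed target with log1p, invert with expm1.
log_views = [math.log1p(v) for v in raw_views]
recovered = [math.expm1(lv) for lv in log_views]

# IQR rule: flag values above Q3 + 1.5 * IQR (flagged, but kept in the data).
q1, _median, q3 = statistics.quantiles(raw_views, n=4)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = [v for v in raw_views if v > upper_fence]

print("largest log_views:", round(max(log_views), 3))
print("flagged as outliers:", outliers)
```

Because `expm1` is the exact inverse of `log1p`, predictions made on the log scale can be mapped back to view counts without an additional correction term for the `+1` offset.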
EDA: 5 Research Questions
Note on notebook ordering: Q1=Work type, Q2=Salary, Q3=Description, Q4=Day of week, Q5=Seniority. Presented below in order of business impact.
Q2: Salary Transparency vs Views
| Salary info | Avg views | Share of postings |
|---|---|---|
| No salary info | ~12 | 70.1% |
| Has salary info | ~21 | 29.9% |
Lift: +74.3%
Only 8,562 of 29,572 postings (29.9%) disclose salary. Transparent postings attract 74.3% more views on average. This is the highest-leverage, lowest-cost recruiter action available.
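As a quick check on the lift figure, the percentage can be recomputed from the two group means. This sketch uses the rounded averages shown above (~12 and ~21), so it lands near, not exactly on, the reported 74.3%, which presumably comes from the unrounded means; `percent_lift` is a hypothetical helper name.

```python
def percent_lift(baseline: float, treated: float) -> float:
    """Relative increase of `treated` over `baseline`, in percent."""
    return (treated - baseline) / baseline * 100.0

# Rounded group means from the table above.
lift = percent_lift(baseline=12.0, treated=21.0)
print(f"lift from rounded means: {lift:.1f}%")
```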
Q3: Description Length vs Views
| Description length | Avg views | Note |
|---|---|---|
| < 100 words | ~8 | signals incomplete posting |
| 100-250 words | ~13 | |
| 250-500 words | ~24 | peak (sweet spot) |
| 500-750 words | ~18 | |
| > 1000 words | ~10 | overwhelms candidates |
Non-linear relationship confirmed. Sweet spot: 250-500 words. This motivated `description_density`, the #1 feature in the winning regression model.
Q4: Day of Week vs Views
| Day | Avg views | Note |
|---|---|---|
| Monday | 39 | best day (n=1,837) |
| Tuesday | 25 | |
| Wednesday | 22 | |
| Thursday | 18 | |
| Friday | 7 | worst day (n=10,076) |
| Saturday | 28 | weekend (n=2,116 total, noisier) |
| Sunday | 28 | weekend (noisier) |
Counterintuitive finding: weekend postings show higher averages (~28) than most weekdays, but the weekend sample is small (2,116 observations in total), making these estimates unreliable. Monday is the clear best weekday at 39 avg views. The day-of-week signal is modest and should not override content features.
Q1: Work Type vs Views
| Work type | Avg views | Median views | Note |
|---|---|---|---|
| Contract | 29.97 | 7.0 | |
| Internship | 25.71 | 5.0 | |
| Full-time | 13.70 | 4.0 | 80% of volume |
| Other | 11.27 | 4.0 | |
| Part-time | 9.59 | 4.0 | |
Contract and Internship roles show the highest engagement. However, Full-time dominates volume (23,674 of 29,572 postings = 80%). Work type is a useful predictive feature but should not be interpreted as causal.
Q5: Seniority Level vs Views
| Seniority | Avg views | n |
|---|---|---|
| Entry-level | 18 | 792 |
| Senior-level | 16 | 3,577 |
| Other/Mid | 15 | 25,203 |
- Entry vs Senior: +12.4% more views
- Entry vs Other: +18.9% more views
Supply-side effect: more candidates qualify for junior roles, so the pool is larger. `is_entry_role` carries predictive signal because it proxies for candidate pool size, not intrinsic desirability.
Feature Correlation with log(views+1)
| Feature | Corr | Direction | Note |
|---|---|---|---|
| desc_salary_interaction | +0.18 | ↑ views | strongest single predictor |
| has_salary_info | +0.14 | ↑ views | salary transparency |
| salary_log | +0.12 | ↑ views | salary level |
| description_density | +0.10 | ↑ views | content quality |
| description_word_count | +0.08 | ↑ views | description length |
| is_software_role | +0.08 | ↑ views | tech role demand |
| is_data_role | +0.07 | ↑ views | data role demand |
| is_entry_role | +0.06 | ↑ views | larger candidate pool |
| posting_weekend | -0.04 | ↓ views | small negative signal |
| is_senior_role | -0.03 | ↓ views | smaller candidate pool |
Internal correlations (structural, not data leakage):
| Feature pair | Corr | Note |
|---|---|---|
| salary_log ↔ salary_midpoint | +0.96 | log transform of same variable |
| desc_wc ↔ desc_density | +0.55 | density uses length in formula |
| is_software ↔ is_data | +0.35 | often co-occur in job titles |
| is_senior ↔ is_entry | -0.28 | mutually exclusive by construction |
Most features show weak linear correlation, and no single feature dominates. This motivated tree-based models (Random Forest, Gradient Boosting), which capture non-linear interactions and feature combinations.
Correlation Heatmap (feature-to-feature + target)
| | log_views | desc_dens | has_sal | sal_log | desc_wc | is_soft | is_data | is_entr | post_wknd | is_snr |
|---|---|---|---|---|---|---|---|---|---|---|
| log_views | 1.00 | 0.10 | 0.14 | 0.12 | 0.08 | 0.08 | 0.07 | 0.06 | -0.04 | -0.03 |
| description_density | 0.10 | 1.00 | 0.02 | 0.04 | 0.55 | 0.01 | 0.01 | -0.01 | 0.00 | 0.00 |
| has_salary_info | 0.14 | 0.02 | 1.00 | 0.72 | 0.03 | 0.06 | 0.07 | -0.03 | -0.01 | -0.02 |
| salary_log | 0.12 | 0.04 | 0.72 | 1.00 | 0.04 | 0.05 | 0.06 | -0.02 | -0.01 | -0.01 |
| description_word_count | 0.08 | 0.55 | 0.03 | 0.04 | 1.00 | 0.01 | 0.01 | -0.01 | 0.00 | 0.00 |
| is_software_role | 0.08 | 0.01 | 0.06 | 0.05 | 0.01 | 1.00 | 0.35 | -0.08 | 0.00 | -0.05 |
| is_data_role | 0.07 | 0.01 | 0.07 | 0.06 | 0.01 | 0.35 | 1.00 | -0.06 | 0.00 | -0.04 |
| is_entry_role | 0.06 | -0.01 | -0.03 | -0.02 | -0.01 | -0.08 | -0.06 | 1.00 | 0.01 | -0.28 |
| posting_weekend | -0.04 | 0.00 | -0.01 | -0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 1.00 | 0.00 |
| is_senior_role | -0.03 | 0.00 | -0.02 | -0.01 | 0.00 | -0.05 | -0.04 | -0.28 | 0.00 | 1.00 |
Key structural correlations:
| Feature pair | Corr | Note |
|---|---|---|
| salary_log ↔ has_salary_info | +0.72 | same underlying signal, different form |
| desc_wc ↔ desc_density | +0.55 | density formula uses word count |
| is_software ↔ is_data | +0.35 | frequently co-occur in job titles |
| is_entry ↔ is_senior | -0.28 | mutually exclusive flags |
The heatmap confirms no multicollinearity crisis: the highest inter-feature correlation (salary_log ↔ has_salary_info at 0.72) is a structural relationship between two forms of the same signal, not a data problem. All correlations with log_views are weak, which supports the move to non-linear tree-based models.
Feature Engineering: 30 Total Features (Including 6 Cluster Dummies)
| Group | Features |
|---|---|
| Text length | title_length, title_word_count, description_length, description_word_count |
| Text structure | description_density, title_desc_ratio |
| Salary | salary_midpoint, salary_range, has_salary_info, salary_log |
| Role keywords | is_senior_role, is_entry_role, is_software_role, is_data_role, is_manager_role, is_sales_role, is_marketing_role, is_remote_text |
| Interactions | desc_salary_interaction, senior_salary, weekend_remote, title_desc_word_interaction, salary_density_interaction, salary_description_interaction, title_density_interaction |
| Clustering | cluster_0, cluster_1, cluster_2, cluster_3, cluster_4, cluster_5 |
Missing value strategy:
- Columns with >70% missing → dropped
- Salary → `has_salary_info` flag + `salary_midpoint` where available; remaining NaN imputed inside sklearn Pipeline on training data only
- Remaining numeric → `SimpleImputer(strategy="median")` inside Pipeline
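A minimal sketch of the imputation strategy above, assuming a two-column toy matrix and a plain `LinearRegression` as a stand-in estimator (the notebook's actual estimator and column set are not reproduced here). Placing `SimpleImputer` inside the `Pipeline` means the medians are learned from the training rows only and then reused on test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: columns are [salary_midpoint, description_word_count].
X_train = np.array([
    [50000.0, 300.0],
    [np.nan, 120.0],
    [90000.0, np.nan],
    [70000.0, 450.0],
])
y_train = np.array([2.1, 1.3, 2.8, 2.5])  # made-up log_views values
X_test = np.array([[np.nan, 200.0]])

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # medians fit on X_train only
    ("regress", LinearRegression()),               # stand-in estimator
])
model.fit(X_train, y_train)

# Medians learned during fit; the same values fill NaNs in X_test.
learned_medians = model.named_steps["impute"].statistics_
y_pred = model.predict(X_test)
print("training medians:", learned_medians)
```

Fitting the imputer on the full dataset before splitting would let test-set values influence the fill values, which is the leakage this layout avoids.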
Clustering: KMeans k=6
Features used for clustering (12 total, leakage-checked):
title_word_count, description_word_count, salary_log, description_density, has_salary_info, is_senior_role, is_entry_role, is_software_role, is_data_role, is_manager_role, is_sales_role, is_marketing_role
Methods used to select k:
- Elbow method: inconclusive, no sharp elbow
- KMeans silhouette scores on full training matrix
- Cluster-size stability table
- Interactive K-Means widget (visualization aid, uses sample)
- Hierarchical clustering dendrogram (Ward linkage, 300 obs)
- Agglomerative clustering comparison (k=2-10)
Silhouette scores by k (full training matrix):
| k | Silhouette | Note |
|---|---|---|
| 2 | 0.198 | smallest cluster: 6,830 (28.9%) |
| 3 | 0.221 | smallest cluster: 2,100 (8.9%) |
| 4 | 0.312 | strong, but largest cluster = 72% of data |
| 5 | 0.250 | smallest cluster: 526 (unstable) |
| 6 | 0.290 | selected; smallest cluster: 583 (2.5%) |
| 7 | 0.286 | singleton cluster appeared |
| 8+ | | singleton clusters appeared |
- Why not k=10 (highest score): singleton cluster (1 observation)
- Why not k=4 (strong score): largest cluster = 72%, not meaningful separation
- Why k=6: no singletons, stable sizes, interpretable profiles, silhouette 0.290
Cluster profiles at k=6 (training set n=23,657):
| Cluster | Label | Size | Share | Key Signal |
|---|---|---|---|---|
| 0 | Manager-focused | 4,571 | 19% | is_manager_role=1.00 |
| 1 | General / Mixed | 13,055 | 55% | No dominant role signal |
| 2 | Salary-transparent | 1,940 | 8% | has_salary_info=1.00 |
| 3 | Data roles | 1,451 | 6% | is_data_role=1.00 |
| 4 | Software roles | 2,057 | 9% | is_software_role=1.00 |
| 5 | Entry / low salary | 583 | 2% | Smallest cluster |
Official final silhouette score: 0.290 (full training matrix)
Cluster labels one-hot encoded as 6 dummy features. Including clusters improved both regression RMSE and classification F1 over models without them.
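The selection procedure described above (silhouette score plus smallest-cluster size per k, then one-hot encoding of the chosen labels) can be sketched as follows. The data here is synthetic (`make_blobs`), and `n_init=10` is an assumption; the notebook's scaling and settings are not shown in the source.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the 12-feature clustering matrix.
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=42)

# For each candidate k, record the silhouette and the smallest cluster size,
# so a high score driven by a degenerate (tiny or singleton) cluster is visible.
results = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    sizes = np.bincount(labels, minlength=k)
    results[k] = {
        "silhouette": float(silhouette_score(X, labels)),
        "smallest": int(sizes.min()),
    }

# One-hot encode the labels for the chosen k (k=6 in the text above).
chosen_k = 6
final_labels = KMeans(n_clusters=chosen_k, n_init=10, random_state=42).fit_predict(X)
cluster_dummies = np.eye(chosen_k, dtype=int)[final_labels]  # shape (n, 6)

for k, row in results.items():
    print(k, round(row["silhouette"], 3), row["smallest"])
```

Reporting the smallest cluster next to the silhouette score mirrors the reasoning in the table: the k with the best score is not chosen if it produces singleton or dominant clusters.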
Regression: Predicting log1p(views)
Baseline
Mean Baseline (predict training mean for all observations):
- RMSE_log = 0.8708, R² = -0.0002 (floor every model must beat)
- MAE_views ≈ 10.64
Baseline Linear Regression (20 features, no clustering):
- RMSE_log = 0.8425, R² = 0.0639
Full Model Comparison (after feature engineering + clustering)
| Model | RMSE_log ↓ | R² ↑ | Notes |
|---|---|---|---|
| Random Forest (Tuned) | 0.8347 | 0.0811 | RandomizedSearchCV winner |
| Random Forest (Controlled) | 0.8349 | 0.0807 | Manual constraints |
| Gradient Boosting | 0.8370 | 0.0770 | |
| Linear Regression + Features | 0.8420 | 0.0640 | |
| RidgeCV | 0.8420 | 0.0640 | |
| Lasso Regression | 0.8430 | 0.0640 | |
| PCA + Linear Regression | 0.8440 | 0.0600 | 15 components, 96.3% variance |
| Mean Baseline | 0.8708 | -0.0002 | Floor |
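The mean baseline's slightly negative R² is expected: predicting the training mean on test data can never beat the test mean itself, so test R² is at most 0. A minimal sketch with made-up log-view values (not the dataset's) shows the calculation.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination, relative to the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical log1p(views) values for a train and a test split.
y_train_log = [0.7, 1.6, 2.3, 1.1, 3.0]
y_test_log = [1.0, 2.0, 2.7]

# Mean baseline: predict the TRAINING mean for every test observation.
train_mean = sum(y_train_log) / len(y_train_log)
baseline_pred = [train_mean] * len(y_test_log)

baseline_rmse = rmse(y_test_log, baseline_pred)
baseline_r2 = r2(y_test_log, baseline_pred)
print(f"RMSE_log={baseline_rmse:.4f}  R2={baseline_r2:.4f}")
```

Any model in the comparison table is useful only to the extent that its RMSE_log falls below this floor.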
Key lessons:
- Unrestricted RF → train R²=0.854, test R²=0.003 (massive overfit). Fixed by `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features` constraints.
- 3-fold CV mean RMSE_log: 0.8747 (±0.0125), stable across folds
- Outlier robustness test: capping views at 99th pct → RMSE_log 0.8147, R²=0.0812
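The overfitting fix in the first lesson can be sketched as a randomized search over the four constraint parameters named above. The candidate values, `n_estimators=50`, `n_iter=5`, and the synthetic data are all assumptions for illustration; the notebook's actual search space is not given in the source.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic noisy regression problem standing in for the feature matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=200)

# Hypothetical search space over the constraints that limit tree growth.
param_distributions = {
    "max_depth": [3, 5, 8],
    "min_samples_split": [10, 20, 50],
    "min_samples_leaf": [5, 10, 20],
    "max_features": ["sqrt", 0.5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,                                   # 3-fold CV, as in the lessons above
    scoring="neg_root_mean_squared_error",  # sklearn maximises, hence the sign
    random_state=42,
)
search.fit(X, y)

best_params = search.best_params_
cv_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_params, round(cv_rmse, 4))
```

Comparing train and test R² for the selected model is a quick way to confirm that the gap seen with the unrestricted forest has closed.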
Top Feature Importances (RF Tuned)
| Rank | Feature | Note |
|---|---|---|
| #1 | description_density | content quality proxy |
| #2 | description_length | raw description size |
| #3 | description_word_count | word count |
| #4 | title-description interaction | combined text signal |
| #5 | is_software_role | tech role demand |
| #6 | is_data_role | data role demand |
| #7+ | salary_log / has_salary_info | salary signals |
`desc_salary_interaction` ranks #2 in SHAP analysis but further down in Gini importance; both agree on description quality and salary as top drivers.
Why R² = 0.081 Is Acceptable
R² = 0.081 → model explains ~8% of variance in log(views+1)
- Beats mean baseline (R²≈0): real posting-level signal captured
- Social engagement is inherently noisy: platform factors dominate
- 92% of variance from unobservable sources (algorithm, followers, ads)
- Practical use = ranking postings, not forecasting exact counts
Classification: High Engagement vs. Normal
- Target: high_engagement = 1 if views ≥ 75th percentile of TRAINING views
- Class balance: ~75% Normal (Class 0) / ~25% High Engagement (Class 1)
- Feature matrix: X_clf uses 24 features (see notebook cell 207)
- Metric: F1-score for Class 1 (accuracy misleading with 75/25 imbalance)
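A minimal sketch of the target construction, assuming small made-up view counts: the 75th-percentile threshold is computed from the training views only and then applied unchanged to the test views, so no test information enters the label definition.

```python
import numpy as np

# Hypothetical view counts for the two splits.
train_views = np.array([1, 2, 3, 4, 5, 8, 13, 40])
test_views = np.array([2, 9, 100])

# Threshold derived from the TRAINING split only.
threshold = np.percentile(train_views, 75)

# Same threshold applied to both splits.
y_train_clf = (train_views >= threshold).astype(int)
y_test_clf = (test_views >= threshold).astype(int)

print("threshold:", threshold)
print("train positives share:", y_train_clf.mean())
print("test labels:", y_test_clf.tolist())
```

By construction about a quarter of training rows fall in Class 1, which is where the ~75/25 class balance comes from.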
Model Comparison
| Model | F1 (Class 1) | Recall (Class 1) | Notes |
|---|---|---|---|
| Decision Tree (winner) | HIGHEST | HIGHEST | max_depth=8, class_weight="balanced" |
| Logistic Regression | near-best | high | Close to DT in F1 |
| Random Forest | moderate | lower | Lowest FP count |
| Dummy Baseline | 0.00 | 0.00 | Always predicts Class 0 |
5-fold CV F1: 0.4424 ± 0.0152 (stable, no lucky split)
Error Cost Analysis
- FN (missed high-engagement) = most costly error: the company fails to prioritize, promote, or learn from a strong posting
- FP (false alarm) = also costly: the recruiter wastes time and budget on a posting that won't perform
Decision Tree minimises FN (catches most high-engagement postings) but produces more FP. Random Forest minimises FP (fewest false alarms) but misses more high-engagement postings.
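The FN/FP trade-off can be made concrete by counting errors for two hypothetical prediction vectors, one recall-oriented and one precision-oriented. The labels below are invented for illustration and are not the models' actual outputs.

```python
def class1_metrics(y_true, y_pred):
    """Confusion counts plus precision, recall and F1 for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}

# 4 high-engagement postings out of 12 (hypothetical).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Recall-oriented: catches 3 of 4 positives, at the cost of 3 false alarms.
recall_oriented = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0]
# Precision-oriented: no false alarms, but misses 3 of 4 positives.
precision_oriented = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Dummy baseline: always predicts Class 0.
always_normal = [0] * len(y_true)

m_recall = class1_metrics(y_true, recall_oriented)
m_precision = class1_metrics(y_true, precision_oriented)
m_dummy = class1_metrics(y_true, always_normal)
print(m_recall)
print(m_precision)
print(m_dummy)
```

In this toy case the recall-oriented predictions score the higher F1, which matches the pattern described above where the model with fewer missed positives wins on F1 despite more false alarms.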
Business Insights
- Salary transparency is the single highest-leverage action: 74.3% more views at no cost. Fewer than 30% of postings disclose salary today.
- Description structure matters: `description_density` was a top feature in both models (#1 in regression, #2 in classification). Sweet spot: 250-500 words.
- Tech roles attract disproportionate engagement: `is_software_role` and `is_data_role` carry real signal beyond salary.
- Work type is associated with engagement: contract roles lead, but full-time dominates volume (80%).
- Platform factors dominate: R²≈0.08 is expected and acceptable. Model value is in ranking postings, not exact prediction.
Bonus Work
SHAP Explainability
SHAP mean |value|, RF Tuned regression (test observations):
| Feature | Effect |
|---|---|
| description_density | strongest positive impact ↑ |
| desc_salary_interaction | salary × description synergy ↑ |
| salary_log | salary level ↑ |
| has_salary_info | disclosed → more views ↑ |
| posting_weekend | weekend → fewer views ↓ |
desc_salary_interaction ranks #2 in SHAP but lower in Gini, suggesting it captures a genuine non-linear interaction that neither feature achieves alone.
Feature Importance: Regression vs Classification
| Feature | Regression RF | Classification DT |
|---|---|---|
| description_density | #1 | #2 |
| desc_salary_interaction | #2 (SHAP) | varies |
| salary_log | #7+ | varies |
| is_entry_role | lower | rises in classification |
| is_data_role | #6 | varies |
- Agreement: description quality + salary dominate both models
- Divergence: seniority/role flags matter more for threshold-crossing (classification) than for predicting exact counts (regression)
Additional Extras
- Interactive K-Means Widget: explore different k values visually (notebook cell 4.11)
- Hierarchical Clustering Dendrogram: Ward linkage, 300 obs sample (cell 4.12)
- Agglomerative Clustering Diagnostic: k=2-10 comparison (cell 4.13)
- Outlier Robustness Test: views capped at 99th percentile, RMSE_log 0.8147 vs 0.8347 uncapped
- 3-fold CV for regression: mean RMSE_log 0.8747 ± 0.0125
How to Use the Models

    import pickle
    import numpy as np

    # Only unpickle files from sources you trust.
    with open("linkedin_regression_model.pkl", "rb") as f:
        reg_model = pickle.load(f)
    with open("linkedin_classification_model.pkl", "rb") as f:
        clf_model = pickle.load(f)

    # X_test_fe and X_clf are the feature matrices produced by the notebook pipeline.
    # Regression: predict log(views+1), then convert back to a raw view estimate
    log_views_pred = reg_model.predict(X_test_fe)
    views_pred = np.expm1(log_views_pred)
    # Classification: predict high-engagement label (0 = Normal, 1 = High)
    label = clf_model.predict(X_clf)

The regression model expects the 30-column `X_test_fe` (including cluster dummies). The classification model expects the 24-column `X_clf` (see notebook cell 207). Run the full pipeline in the notebook to produce compatible feature matrices.
Assignment 2: Classification, Regression, Clustering, Evaluation | LinkedIn Job Postings · arshkon/linkedin-job-postings (Kaggle)