ML for Targeted Mortgage Marketing
A Cornell M.Eng. consulting-style machine learning project completed with Deluxe, focused on predicting the likelihood that a household will take out a home equity loan, in order to sharpen targeted marketing and customer acquisition strategy.
Project Snapshot
- Partner: Deluxe
- Context: Financial services / targeted marketing analytics
- Dataset: 103,835 records and 898 input features
- Problem Type: Binary classification
- Models: Logistic Regression, XGBoost, LightGBM
- Tools: Python, scikit-learn, XGBoost, LightGBM
Overview
This project was designed to simulate a real-world data science consulting engagement. Working with a team of Cornell graduate students, I helped build an end-to-end modeling pipeline to predict whether a customer would respond to a mortgage-related offer. The broader business goal was to improve marketing precision by identifying high-propensity households.
The dataset was high-dimensional and messy, with extensive missing values, transformed financial variables, class imbalance, and outliers. Because of that, the project was not just about training a model; it was about building a thoughtful preprocessing and feature engineering workflow that could improve both performance and explainability.
What I Worked On
- Prepared and modeled a large structured dataset with 100k+ observations
- Handled missing values, outliers, scaling, and class imbalance
- Built baseline and improved classification pipelines
- Tested interpretable and tree-based models for comparison
- Engineered transformed features to improve predictive signal
- Presented findings in a final report and stakeholder-ready deliverables
Technical Approach
In Phase I, we established baseline models with relatively simple preprocessing. This included dropping columns with more than 50% missingness, imputing remaining missing values, scaling variables, and using SMOTE to address severe class imbalance.
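The Phase I steps can be sketched as follows. The data and column names here are illustrative stand-ins for the real 103,835 x 898 table, and the oversampling step is a minimal from-scratch sketch of the SMOTE idea (interpolating between minority-class nearest neighbors); in practice a library implementation such as imbalanced-learn's `SMOTE` would be used.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy stand-in for the real table (column names are invented for illustration).
df = pd.DataFrame({
    "equity_balance": rng.normal(50_000, 20_000, 200),
    "tx_count": rng.poisson(10, 200).astype(float),
    "mostly_missing": np.where(rng.random(200) < 0.7, np.nan, 1.0),
})
df.loc[rng.choice(200, 30, replace=False), "tx_count"] = np.nan
y = (rng.random(200) < 0.1).astype(int)  # severe class imbalance

# 1) Drop columns with more than 50% missingness.
keep = df.columns[df.isna().mean() <= 0.5]
X = df[keep]

# 2) Impute the remaining gaps (median), then standardize.
X_imp = SimpleImputer(strategy="median").fit_transform(X)
X_std = StandardScaler().fit_transform(X_imp)

# 3) SMOTE-style oversampling: synthesize new minority points by
#    interpolating between a minority sample and one of its
#    minority-class nearest neighbors.
def smote_like(X, y, k=5, random_state=0):
    rng = np.random.default_rng(random_state)
    minority = X[y == 1]
    n_new = (y == 0).sum() - (y == 1).sum()
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(minority))).fit(minority)
    _, idx = nn.kneighbors(minority)
    base = rng.integers(0, len(minority), n_new)
    neigh = idx[base, rng.integers(1, idx.shape[1], n_new)]
    gap = rng.random((n_new, 1))
    X_new = minority[base] + gap * (minority[neigh] - minority[base])
    return np.vstack([X, X_new]), np.concatenate([y, np.ones(n_new, dtype=int)])

X_bal, y_bal = smote_like(X_std, y)
```

After this step the two classes are equally represented, which keeps the baseline models from simply predicting the majority class.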
In Phase II, we built a more advanced feature engineering pipeline. That work included:
- Iterative multivariate imputation instead of simple median/mode imputation
- Outlier detection using IQR and creation of outlier indicator features
- Missingness indicator features for added signal
- Distribution-based transformations like Yeo-Johnson and quantile methods
- Cubic spline feature generation
- Discretization / binning of numeric variables
- Sparse feature selection for more interpretable models
Results
One of the biggest project takeaways was that the enhanced feature engineering pipeline meaningfully improved the sparse logistic regression model. Test accuracy improved from 91.28% in Phase I to 94.95% in Phase II, while test AUROC improved from 0.9672 to 0.9826.
XGBoost performed strongly as a baseline model, reaching 98.77% test accuracy in Phase I. The project also surfaced which features were most influential, including home equity activity, mortgage history, and engineered indicators derived from transaction behavior.
Why It Matters
Beyond model accuracy, this project was a strong example of how data science can support business decision-making. The work connected data cleaning, feature engineering, modeling, and interpretation in a way that mapped directly to a real customer acquisition use case. It also reinforced an important lesson: in many applied ML problems, better preprocessing and feature design can matter just as much as model choice.
Key Takeaways
- Advanced feature engineering substantially improved interpretable model performance
- Missingness and outlier handling were central to model quality
- Discretization and Yeo-Johnson transformations were especially useful in Phase II
- Strong ML projects depend on careful data preparation and clear business framing, not just algorithms