ML for Targeted Mortgage Marketing
A Cornell M.Eng. consulting-style machine learning project completed with Deluxe, focused on predicting the likelihood that a household will take out a home equity loan, in order to sharpen targeted marketing and customer acquisition strategy.
Project Snapshot
- Partner: Deluxe
- Context: Financial services / targeted marketing analytics
- Dataset: 103,835 records and 898 input features
- Problem Type: Binary classification
- Models: Logistic Regression, XGBoost, LightGBM
- Tools: Python, scikit-learn, XGBoost, LightGBM
Overview
This project was designed to simulate a real-world data science consulting engagement. Working with a team of Cornell graduate students, I helped build an end-to-end modeling pipeline to predict whether a customer would respond to a mortgage-related offer. The broader business goal was to improve marketing precision by identifying high-propensity households.
The dataset was high-dimensional and messy, with extensive missing values, transformed financial variables, class imbalance, and outliers. Because of that, the project was not just about training a model; it was about building a thoughtful preprocessing and feature engineering workflow that could improve both performance and explainability.
What I Worked On
- Prepared and modeled a large structured dataset with 100k+ observations
- Handled missing values, outliers, scaling, and class imbalance
- Built baseline and improved classification pipelines
- Tested interpretable and tree-based models for comparison
- Engineered transformed features to improve predictive signal
- Presented findings in a final report and stakeholder-ready deliverables
Technical Approach
In Phase I, we established baseline models with relatively simple preprocessing. This included dropping columns with more than 50% missingness, imputing remaining missing values, scaling variables, and using SMOTE to address severe class imbalance.
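The Phase I steps can be sketched as follows. The data and column names here are illustrative stand-ins for the real 103,835 x 898 table, and the oversampling step is a minimal from-scratch sketch of the SMOTE idea (interpolating between minority-class nearest neighbors); in practice a library implementation such as imbalanced-learn's `SMOTE` would be used.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy stand-in for the real table (column names are invented for illustration).
df = pd.DataFrame({
    "equity_balance": rng.normal(50_000, 20_000, 200),
    "tx_count": rng.poisson(10, 200).astype(float),
    "mostly_missing": np.where(rng.random(200) < 0.7, np.nan, 1.0),
})
df.loc[rng.choice(200, 30, replace=False), "tx_count"] = np.nan
y = (rng.random(200) < 0.1).astype(int)  # severe class imbalance

# 1) Drop columns with more than 50% missingness.
keep = df.columns[df.isna().mean() <= 0.5]
X = df[keep]

# 2) Impute the remaining gaps (median), then standardize.
X_imp = SimpleImputer(strategy="median").fit_transform(X)
X_std = StandardScaler().fit_transform(X_imp)

# 3) SMOTE-style oversampling: synthesize new minority points by
#    interpolating between a minority sample and one of its
#    minority-class nearest neighbors.
def smote_like(X, y, k=5, random_state=0):
    rng = np.random.default_rng(random_state)
    minority = X[y == 1]
    n_new = (y == 0).sum() - (y == 1).sum()
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(minority))).fit(minority)
    _, idx = nn.kneighbors(minority)
    base = rng.integers(0, len(minority), n_new)
    neigh = idx[base, rng.integers(1, idx.shape[1], n_new)]
    gap = rng.random((n_new, 1))
    X_new = minority[base] + gap * (minority[neigh] - minority[base])
    return np.vstack([X, X_new]), np.concatenate([y, np.ones(n_new, dtype=int)])

X_bal, y_bal = smote_like(X_std, y)
```

After this step the two classes are equally represented, which keeps the baseline models from simply predicting the majority class.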
In Phase II, we built a more advanced feature engineering pipeline. That work included:
- Iterative multivariate imputation instead of simple median/mode imputation
- Outlier detection using IQR and creation of outlier indicator features
- Missingness indicator features for added signal
- Distribution-based transformations like Yeo-Johnson and quantile methods
- Cubic spline feature generation
- Discretization / binning of numeric variables
- Sparse feature selection for more interpretable models
Results
One of the biggest project takeaways was that the enhanced feature engineering pipeline meaningfully improved the sparse logistic regression model. Test accuracy improved from 91.28% in Phase I to 94.95% in Phase II, while test AUROC improved from 0.9672 to 0.9826.
XGBoost performed strongly as a baseline model, reaching 98.77% test accuracy in Phase I. The project also surfaced which features were most influential, including home equity activity, mortgage history, and engineered indicators derived from transaction behavior.
Why It Matters
Beyond model accuracy, this project was a strong example of how data science can support business decision-making. The work connected data cleaning, feature engineering, modeling, and interpretation in a way that mapped directly to a real customer acquisition use case. It also reinforced an important lesson: in many applied ML problems, better preprocessing and feature design can matter just as much as model choice.
Key Takeaways
- Advanced feature engineering substantially improved interpretable model performance
- Missingness and outlier handling were central to model quality
- Discretization and Yeo-Johnson transformations were especially useful in Phase II
- Strong ML projects depend on careful data preparation and clear business framing, not just algorithms