Applied ML · Retail & Commerce

PolicyPredict

An end-to-end deep learning pipeline that predicts how likely each customer is to purchase a new insurance policy, delivering purchase probability scores for targeted marketing campaigns.

Architecture Deep Learning · Binary Classification

Tech Stack

Python · TensorFlow · scikit-learn · pandas · matplotlib / seaborn


70%

Classification Accuracy on Purchase Intent

85%

ROC-AUC — Discrimination Across Thresholds

Per-Customer

Purchase Likelihood Scores for Targeted Campaigns

The Problem

Insurance marketers send the same campaigns to all customers — without knowing which ones are actually likely to buy

Insurance companies hold rich customer data — demographics, policy history, vehicle and health attributes, past interactions — but most marketing decisions are made without systematically using it to predict purchase intent. Campaigns are broadcast broadly rather than targeted to customers with genuine purchase likelihood, resulting in low conversion rates, wasted marketing spend, and a poor customer experience from irrelevant outreach. The challenge is transforming customer attribute data into a probability score that separates likely buyers from the rest — enabling the marketing team to concentrate resources on customers where outreach is most likely to convert, and to personalize the message based on predicted intent rather than segment-level assumptions.

The Solution

A fully connected neural network pipeline that preprocesses customer data with scikit-learn and outputs per-customer purchase probability scores via TensorFlow

PolicyPredict implements an end-to-end pipeline from raw customer data to actionable purchase likelihood scores. Data preparation and feature engineering are handled by scikit-learn — encoding categorical variables, scaling numerical features, and constructing a clean input matrix ready for neural network training. A fully connected TensorFlow neural network then learns the nonlinear relationships between customer attributes and purchase behavior, outputting per-customer probabilities that can be ranked and acted on directly by the marketing team. Model performance is evaluated using both accuracy and ROC-AUC — the latter capturing how well the model discriminates between buyers and non-buyers across all possible decision thresholds, which is the metric that matters most when deciding where to draw the targeting cutoff. Results are visualized through a confusion matrix and ROC curve, and through matplotlib and seaborn plots that illuminate which customer attributes are most influential in driving predicted purchase intent.

Key Outcome

A deep learning pipeline that achieves 70% accuracy and 85% ROC-AUC on insurance purchase intent prediction — delivering per-customer probability scores and interpretable feature visualizations that enable data-driven targeting, improved marketing ROI, and personalized policy recommendations at scale.

Technical Deep Dive

Architecture & Design

Modeling Pipeline

Stage 1 — Data Preparation & Feature Engineering

Preprocessing

pandas & scikit-learn

Data cleaning · Categorical encoding · Numerical feature scaling

Feature Engineering

Pipeline Construction

scikit-learn pipeline ensures identical transformations at train and inference time

▼

Stage 2 — Neural Network Architecture · TensorFlow

Fully Connected Neural Network

Dense Layers → Sigmoid Output

Hidden layers with activation functions learn nonlinear feature interactions · Sigmoid output layer produces per-customer purchase probability [0, 1] · Binary cross-entropy loss optimized with Adam

▼

Stage 3 — Evaluation

Metric 1

Accuracy — 70%

Overall correct classification rate on held-out test set

Metric 2

ROC-AUC — 85%

Discrimination power across all classification thresholds

Artifact

Confusion Matrix & ROC Curve

Visual diagnostics for threshold selection and error type analysis

▼

Stage 4 — Insight Delivery

Output A

Purchase Likelihood Scores

Per-customer probabilities ranked for targeted campaign prioritization

Output B

Feature Visualizations

matplotlib & seaborn plots revealing influential customer attributes and behavior patterns

Stage 1

Data Preparation & Feature Engineering

Raw customer data is loaded and cleaned with pandas, then processed through a scikit-learn pipeline that encodes categorical variables, scales numerical features, and produces a consistent input matrix for model training. Crucially, the pipeline is fit on the training split only, so the preprocessing steps learned there are applied unchanged to the test set and to new inference data. This avoids the data leakage that arises when scaling statistics are computed on the full dataset before splitting.
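A minimal sketch of this stage, under stated assumptions: the column names (`age`, `annual_premium`, `vehicle_age`, `purchased`) and the toy rows are hypothetical, since the real dataset schema is not shown here. The preprocessor is fit on the training split and then reused, unchanged, on the test split.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "age": [23, 45, 31, 52, 38, 27, 60, 41],
    "annual_premium": [300.0, 520.0, 410.0, 610.0, 450.0, 330.0, 700.0, 480.0],
    "vehicle_age": ["<1y", "1-2y", ">2y", "1-2y", "<1y", ">2y", "1-2y", "<1y"],
    "purchased": [0, 1, 0, 1, 0, 0, 1, 1],
})

numeric_cols = ["age", "annual_premium"]
categorical_cols = ["vehicle_age"]

# Scale numeric columns and one-hot encode categorical ones.
# sparse_threshold=0.0 forces a dense output matrix.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,
)
pipeline = Pipeline([("preprocess", preprocessor)])

X = df.drop(columns="purchased")
y = df["purchased"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit on the training split only, then apply the same fitted steps to the test split.
X_train_t = pipeline.fit_transform(X_train)
X_test_t = pipeline.transform(X_test)
```

`handle_unknown="ignore"` is one way to tolerate categories that appear only at inference time; whether the project uses that option is not stated.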

Stage 2

Neural Network Architecture

A fully connected TensorFlow neural network learns from the processed customer feature matrix. Hidden layers with nonlinear activation functions allow the model to capture complex interaction effects between customer attributes that linear models cannot represent — relationships such as the joint effect of age, vehicle type, and prior claims history on purchase intent. The output layer uses a sigmoid activation to produce a continuous probability score between 0 and 1 for each customer, which can be thresholded at any cutoff to generate binary purchase/no-purchase predictions. The model is optimized with Adam using binary cross-entropy loss.
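The architecture described above could be sketched in Keras roughly as follows. The layer widths (32 and 16), the ReLU activations, the feature count, and the random demo data are all assumptions chosen for illustration; the text specifies only dense hidden layers, a sigmoid output, Adam, and binary cross-entropy.

```python
import numpy as np
import tensorflow as tf

n_features = 10  # assumed width of the preprocessed feature matrix

# Dense hidden layers followed by a single sigmoid unit.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),   # assumed width
    tf.keras.layers.Dense(16, activation="relu"),   # assumed width
    tf.keras.layers.Dense(1, activation="sigmoid"),  # purchase probability
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")],
)

# Random stand-in data, only to show the training and scoring calls.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(64, n_features)).astype("float32")
y_demo = rng.integers(0, 2, size=(64,)).astype("float32")

model.fit(X_demo, y_demo, epochs=2, batch_size=16, verbose=0)
scores = model.predict(X_demo, verbose=0).ravel()  # one probability per row
```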

Stage 3

Evaluation

Performance is assessed on two complementary metrics. Accuracy at 70% gives a single-threshold view of correct predictions on the held-out test set. ROC-AUC at 85% provides a threshold-independent measure of how well the model ranks buyers above non-buyers across all possible cutoffs. This is the operationally meaningful metric when the marketing team needs to decide how many customers to target rather than simply whether any given customer will buy. The confusion matrix shows the distribution of true positives, false positives, true negatives, and false negatives, while the ROC curve visualizes the tradeoff between true positive rate and false positive rate as the threshold varies.
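To make the relationship between these diagnostics concrete, here is a small pure-Python sketch. The labels and scores are made-up toy values, not project results. Each threshold yields one confusion matrix, and each confusion matrix yields one point on the ROC curve.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, tn, fn) when predicting 1 for score >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    return fp / (fp + tn), tp / (tp + fn)


# Toy labels (1 = buyer) and model scores, for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]

tp, fp, tn, fn = confusion_counts(y_true, scores, threshold=0.5)
accuracy = (tp + tn) / len(y_true)

# Sweeping the threshold traces out points on the ROC curve.
roc = [roc_point(y_true, scores, t) for t in (0.25, 0.5, 0.75)]
```

In practice scikit-learn's `confusion_matrix` and `roc_curve` cover the same ground; the hand-written version is only meant to show what they compute.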

Stage 4

Insight Delivery

The model outputs a per-customer purchase probability score that can be ranked and segmented to prioritize outreach. Customers scoring above a chosen threshold become the target list for a campaign; those near the boundary become candidates for a softer engagement or nurture sequence. Matplotlib and seaborn visualizations provide stakeholders with interpretable insight into which customer attributes most strongly drive purchase predictions — enabling the marketing team to understand not just who to target, but why those customers are predicted to buy, supporting more personalized and effective messaging.
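As an illustration of how the scores might be turned into campaign lists, the sketch below ranks customers and splits them into a target group and a nurture group. The customer IDs, the scores, and both cutoffs (0.6 and 0.4) are hypothetical; the real cutoffs would come from campaign budget and conversion economics.

```python
def segment_customers(scores_by_id, target_cutoff, nurture_cutoff):
    """Rank customers by score and split them into target and nurture lists."""
    ranked = sorted(scores_by_id.items(), key=lambda item: item[1], reverse=True)
    target = [cid for cid, score in ranked if score >= target_cutoff]
    nurture = [
        cid for cid, score in ranked if nurture_cutoff <= score < target_cutoff
    ]
    return target, nurture


# Hypothetical customer IDs and purchase probabilities.
scores_by_id = {
    "C001": 0.91,
    "C002": 0.12,
    "C003": 0.58,
    "C004": 0.47,
    "C005": 0.73,
}

target, nurture = segment_customers(
    scores_by_id, target_cutoff=0.6, nurture_cutoff=0.4
)
```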

Key Design Decisions

ROC-AUC is the primary performance metric — not accuracy

Purchase intent datasets are typically imbalanced — non-buyers substantially outnumber buyers. In this setting, a model that predicts "no purchase" for every customer can achieve high accuracy simply by reflecting the class distribution, while being entirely useless for targeting. ROC-AUC is threshold-independent: it measures the probability that the model ranks a randomly chosen buyer above a randomly chosen non-buyer, regardless of where the decision boundary is drawn. An AUC of 85% means the model's probability scores are a meaningful ranking of purchase likelihood — the property that actually matters when deciding which customers to contact.
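The contrast can be checked on a toy example. The 10% buyer rate below is an assumption for illustration, not the project's actual class balance. A model that always predicts "no purchase" reaches 90% accuracy, yet its AUC, computed directly from the pairwise-ranking definition with ties counted as half, is only 0.5.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pairwise_auc(y_true, scores):
    """Probability that a random buyer outscores a random non-buyer.

    Ties count as half a win, which matches the usual ROC-AUC definition.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Assumed imbalance for illustration: 10 buyers, 90 non-buyers.
y_true = [1] * 10 + [0] * 90

# A degenerate model that gives everyone the same "no purchase" score.
always_no_scores = [0.0] * 100
always_no_pred = [0] * 100

baseline_accuracy = accuracy(y_true, always_no_pred)
baseline_auc = pairwise_auc(y_true, always_no_scores)
```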

Scikit-learn pipeline integration prevents data leakage across train and test

A common error in ML pipelines is fitting preprocessing transformers — scalers, encoders — on the entire dataset before splitting into train and test sets. This leaks information from the test set into the preprocessing step, producing optimistically biased evaluation metrics and a model that will underperform when deployed on genuinely unseen data. Encapsulating all preprocessing in a scikit-learn pipeline that is fit only on training data ensures the test set is processed using only statistics derived from training — producing evaluation metrics that accurately reflect expected deployment performance.
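A small standard-library sketch of the effect, using made-up numbers: when the scaler's mean and standard deviation are computed on train and test together, the test values shift the statistics, so the transformed test set no longer looks the way genuinely unseen data would.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn standardization parameters (mean, population std) from data."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Standardize values using previously learned parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Toy feature values; the test split happens to contain larger values.
train = [20.0, 30.0, 40.0, 50.0]
test = [90.0, 100.0]

leaky_params = fit_scaler(train + test)  # wrong: the test set influences the fit
clean_params = fit_scaler(train)         # right: training data only

test_leaky = transform(test, leaky_params)
test_clean = transform(test, clean_params)
```

With the clean fit, the test values sit far outside the training range, which is what the model would actually face at deployment; the leaky fit hides that shift.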

Deep learning over classical models to capture nonlinear customer attribute interactions

Customer purchase intent is rarely driven by any single attribute in isolation. It emerges from combinations: a young customer with a recent claim and an older vehicle behaves differently from an otherwise identical customer whose vehicle is new. Logistic regression models these interactions only if they are explicitly engineered as interaction terms, which requires domain knowledge about which combinations matter. A fully connected neural network learns these interaction effects automatically through its hidden layers, making it more capable of capturing the complex, multi-way relationships in customer data. The cost is some loss of interpretability, which the feature visualization outputs partially restore for stakeholder communication.
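The point about interaction terms can be illustrated with an XOR-style toy pattern, where the label depends on the combination of two binary attributes rather than on either alone. No linear score over the two raw features can fit this pattern, but adding the hand-engineered product term makes it linearly separable. The weights below are hand-picked for the demonstration, not learned.

```python
def linear_score(features, weights, bias):
    """Weighted sum of features plus bias, as in a linear model's logit."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def with_interaction(row):
    """Append the product x1 * x2 as an explicit interaction feature."""
    x1, x2 = row
    return (x1, x2, x1 * x2)


# XOR-style toy data: the label is 1 only when exactly one attribute is set.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

# Hand-picked weights for (x1, x2, x1 * x2); illustrative, not learned.
weights, bias = (1.0, 1.0, -2.0), -0.5

predictions = [
    int(linear_score(with_interaction(row), weights, bias) > 0) for row in rows
]
```

A network with hidden layers can learn an equivalent combination on its own, which is the tradeoff described above.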

Tech Stack

Technology	Purpose
TensorFlow	Fully connected neural network model development and training
scikit-learn	Preprocessing pipeline, feature engineering, and evaluation metrics
pandas	Data loading, cleaning, transformation, and analysis
NumPy	High-performance numerical computation underlying all array operations
matplotlib / seaborn	Confusion matrix, ROC curve, and feature behavior visualizations
Python	Core language and end-to-end notebook orchestration

Results & Metrics

What the system delivers

70%

Classification Accuracy

Correct purchase intent predictions on the held-out test set

85%

ROC-AUC Score

Model ranks buyers above non-buyers across all decision thresholds

Per-Customer

Probability Scores

Continuous likelihood output rankable for direct campaign prioritization

🎯

85% ROC-AUC delivers reliable customer ranking for campaign targeting

An AUC of 85% means that, on average, in about 85 out of 100 random comparisons between a buyer and a non-buyer, the model assigns the higher probability score to the buyer. This discrimination power is what enables the marketing team to cut the scored customer list at any budget-driven threshold and expect that the customers above the cutoff are genuinely more likely to purchase, rather than simply being the most common demographic profile in the dataset.

🧠

Neural network captures nonlinear attribute interactions that linear models miss

Customer purchase intent is shaped by combinations of attributes — age interacting with vehicle type, claim history interacting with policy tenure — that linear models can represent only if those interactions are explicitly constructed as engineered features. The fully connected architecture learns these interaction effects automatically through its hidden layers, capturing the complex, multi-way relationships in customer behavior data without requiring domain-specific feature engineering to specify which combinations matter in advance.

📊

Confusion matrix and ROC curve support threshold selection for deployment

A single accuracy number does not reveal how prediction errors are distributed between false positives — customers contacted who will not buy, incurring outreach cost — and false negatives — buyers missed by the targeting cutoff. The confusion matrix makes this distribution visible, allowing the marketing team to tune the decision threshold based on the relative cost of each error type. The ROC curve shows how that tradeoff shifts across all possible thresholds, enabling an informed choice of cutoff that matches the campaign's budget and conversion economics.
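One way to make the cost-based choice explicit is to score each candidate threshold by the total cost of its errors. The sketch below uses toy labels and scores and hypothetical unit costs; it is not the project's actual selection procedure.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of false positives and false negatives at one threshold."""
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    return fp * cost_fp + fn * cost_fn


def best_threshold(y_true, scores, candidates, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total error cost."""
    return min(
        candidates,
        key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn),
    )


# Toy labels (1 = buyer) and scores, for illustration only.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
candidates = [0.25, 0.5, 0.75]

# Cheap outreach, expensive missed buyers: a low cutoff wins.
cheap_outreach = best_threshold(y_true, scores, candidates, cost_fp=1, cost_fn=10)

# Expensive outreach, cheap missed buyers: a high cutoff wins.
costly_outreach = best_threshold(y_true, scores, candidates, cost_fp=10, cost_fn=1)
```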

🔒

Scikit-learn pipeline integration produces deployment-ready preprocessing

Encapsulating all preprocessing — encoding, scaling, transformation — in a scikit-learn pipeline ensures that new customer records at inference time are processed identically to training records, using only statistics derived from the training set. This prevents the data leakage and deployment inconsistencies that arise when preprocessing is applied manually, and makes the pipeline directly reusable for future policy product lines or customer segments without restructuring the preprocessing logic.

📈

Feature visualizations give stakeholders interpretable insight into prediction drivers

Matplotlib and seaborn visualizations surface which customer attributes are most strongly associated with high purchase probability scores — translating the model's learned representations into patterns that the marketing and product teams can act on. Understanding that, for example, customers with a specific vehicle age and prior policy tenure score highest allows marketing to refine message personalization and product teams to identify which customer profiles are underserved by the current policy offering.
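A simple way to prepare data for such a plot is to average the predicted score within each value of an attribute. The sketch below assumes a hypothetical `vehicle_age` attribute and made-up scores. It shows association only, and is not necessarily how the project's own visualizations are built.

```python
from collections import defaultdict


def mean_score_by(records, attribute):
    """Average predicted score for each distinct value of an attribute."""
    totals = defaultdict(lambda: [0.0, 0])  # value -> [score sum, count]
    for record in records:
        bucket = totals[record[attribute]]
        bucket[0] += record["score"]
        bucket[1] += 1
    return {value: total / count for value, (total, count) in totals.items()}


# Hypothetical scored customers; attribute name and values are assumptions.
records = [
    {"vehicle_age": "<1y", "score": 0.2},
    {"vehicle_age": "<1y", "score": 0.4},
    {"vehicle_age": ">2y", "score": 0.8},
    {"vehicle_age": ">2y", "score": 0.6},
]

by_vehicle_age = mean_score_by(records, "vehicle_age")
```

The resulting dictionary maps each attribute value to its mean score and could be passed to a bar chart in matplotlib or seaborn.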
