Applied ML · Binary Classification

Diabetic Retinopathy Prediction

Predicting the probability of diabetic retinopathy from retinal image features using binary logistic regression with two-step variable selection, maximum likelihood estimation, and Hosmer-Lemeshow goodness-of-fit assessment.

Method: Binary Logistic Regression

Tech Stack

R · AIC · Hosmer-Lemeshow · GLM · STATS Package · Backward Elimination

1,151

Dataset Instances · No Missing Values

4

Final Predictors · Selected from 20 Variables

64.3%

Balanced Accuracy · Training Sample

The Problem

Diabetic retinopathy diagnosis depends on specialist manual review of retinal images — too slow and costly for population-scale early screening

Diabetic retinopathy (DR) is a diabetes complication that damages the blood vessels of the retina, and early detection is the single most important factor in effective treatment. Yet diagnosis currently depends on a trained specialist manually reviewing retinal images and identifying microaneurysms, exudate deposits, and anatomical anomalies — a process that is expensive, time-consuming, and impossible to scale for the millions of diabetic patients who require regular eye screening. Automated image analysis tools can already extract dozens of quantitative features from retinal photographs, but the downstream question — how do those features translate into a DR probability score for each patient? — still lacks a fast, interpretable, and statistically rigorous answer. The gap is between the data that already exists in retinal image feature extraction pipelines and the ability to convert those features into a clinically useful probability of DR occurrence.

The Solution

A binary logistic regression model with two-step variable selection that maps retinal image features to a calibrated DR probability score

The analysis uses binary logistic regression, a standard and interpretable method for binary outcomes, to model the probability of DR occurrence from retinal image features. The 20-variable dataset presented a substantial multicollinearity problem: the six MA variables (3–8) were all highly correlated, and the eight Exudate variables (9–16) split into two internally collinear sub-groups. A two-step variable selection strategy resolved this. First, one representative was selected from each correlated group to eliminate multicollinearity, reducing the candidate set to eight predictors. Then AIC-guided backward elimination pruned variables that were not significant at the α = 0.05 level (p ≥ 0.05), producing a final four-predictor model. The model was fit on a 65% training sample using maximum likelihood estimation, with coefficient significance assessed via the Wald statistic, and validated on a held-out 35% test set. Goodness-of-fit was assessed through both the Chi-square deviance test and the Hosmer-Lemeshow test, which checks whether predicted probabilities are consistent with observed event rates.

Key Outcome

A clinically interpretable logistic regression model that reduces 20 retinal image features to four significant predictors (Prescreening result, MA count, and two Exudate measures), while finding no significant effect of anatomical features such as optic disc diameter and macula distance on DR probability. The Hosmer-Lemeshow test shows no evidence of poor calibration (p = 0.82), and the model reaches 62.1% balanced accuracy on a held-out test set.

Technical Deep Dive

Methodology & Analysis

Analytical Workflow

Stage 1 — Data Profiling & Multicollinearity Mapping

Step 1

Variable Inventory

20 variables · 1,151 instances · No missing values · Binary outcome (DR / no DR)

Step 2

Class Balance Check

46.8% no DR · 53.2% DR · Near-balanced — no resampling required

Step 3

Correlation Analysis

MAs (vars 3–8) highly correlated · Exudates (9–12) collinear · Exudates (13–16) collinear

Stage 2 — Variable Selection

Step A · Multicollinearity Resolution

Group Representatives

One variable selected per correlated group · 20 variables reduced to 8 candidate predictors

Step B · Significance Pruning

AIC & Backward Elimination

Variables not significant at α = 0.05 removed · 8 candidates reduced to 4 final predictors

Stage 3 — Model Fitting & Diagnostics

Step 1

Train-Test Split

65% training · 35% holdout · Both sets exceed minimum size (400 instances)

Step 2

MLE via glm()

R stats package · Fisher scoring (IRLS), equivalent to Newton's method for the logit link · Wald statistic for coefficient significance

Step 3

Residual Diagnostics

Fewer than 1% of standardized residuals significant · No Cook's distance >1 · Leverage points retained

Stage 4 — Goodness-of-Fit & Evaluation

Test 1

Chi-Square Deviance

p < 0.01 (reported as 0.00) · Model significantly outperforms the null (intercept-only) model

Test 2

Hosmer-Lemeshow

p = 0.82 · No evidence of miscalibration between predicted probabilities and observed event rates

Test 3

Confusion Matrix

Precision · Sensitivity · Balanced accuracy · Evaluated on both training and test sets

Stage 1

Data Profiling & Multicollinearity Mapping

The dataset contains 1,151 instances across 20 variables with no missing values and a near-balanced outcome (46.8% no DR, 53.2% DR). Correlation analysis revealed three collinear clusters that required resolution before modeling: the six MA variables (3–8) were all mutually highly correlated; the eight Exudate variables split into two internally collinear sub-groups (9–12 and 13–16). No other variables showed multicollinearity. The near-balanced class distribution meant no resampling strategy was needed prior to fitting.
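
The correlation screen described above can be illustrated with a short, standard-library Python sketch (the project itself is in R, so this is illustration only). The 0.8 cutoff, the column names, and the toy values are assumptions; the page does not state which threshold was used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Toy columns, made up for illustration (not the project's data).
toy = {
    "ma_a": [1, 2, 3, 4, 5, 6],
    "ma_b": [2, 4, 5, 8, 10, 13],   # tracks ma_a closely
    "other": [5, 1, 4, 2, 6, 3],    # essentially unrelated to the MA columns
}
flagged = collinear_pairs(toy)
print(flagged)
```

Only the two columns that move together are flagged here; in a real screen, the flagged pairs would be merged into groups and one representative kept per group.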

Stage 2

Two-Step Variable Selection

Step A addressed multicollinearity by keeping one representative from each correlated group (MAS_0.05 for the MA variables; Exudates_0.03 and Exudates_0.07 for the two Exudate sub-groups) alongside the five variables that showed no collinearity (Quality Assessment, Prescreening, Euclidean Distance, Optic Disc Diameter, and AMFM). This reduced the candidate set to eight predictors with no collinearity issues. Step B applied AIC-guided backward elimination, removing variables that were not significant at α = 0.05, yielding the final four-predictor model: Prescreening, MAS_0.05, Exudates_0.03, and Exudates_0.07.
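
The backward-elimination loop in Step B can be sketched generically in Python. This is a minimal illustration under stated assumptions, not the project's code: `aic` is any callable that scores a set of variables, and `toy_aic` with its "gain" values is entirely made up so the example runs without fitting a model. In R, the `step()` function with `direction = "backward"` performs this kind of AIC-driven search on a fitted `glm`.

```python
def backward_eliminate(candidates, aic):
    """Drop one variable at a time while doing so lowers the AIC.

    `aic` is a callable mapping a list of variable names to an AIC value.
    Returns the retained variables and their AIC.
    """
    current = list(candidates)
    best = aic(current)
    while len(current) > 1:
        # Score every model that omits exactly one of the current variables.
        trials = [(aic([v for v in current if v != drop]), drop) for drop in current]
        trial_aic, drop = min(trials)
        if trial_aic >= best:      # no single removal improves the AIC
            break
        current.remove(drop)
        best = trial_aic
    return current, best

# Hypothetical deviance reductions per variable (illustration only).
GAIN = {"x1": 30.0, "x2": 12.0, "x3": 0.4, "x4": 1.1}

def toy_aic(variables):
    """AIC = deviance + 2 * (number of parameters, including the intercept)."""
    deviance = 1000.0 - sum(GAIN[v] for v in variables)
    return deviance + 2 * (len(variables) + 1)

kept, kept_aic = backward_eliminate(["x1", "x2", "x3", "x4"], toy_aic)
print(kept, kept_aic)
```

A variable survives only if the deviance it explains outweighs the 2-point AIC penalty for its extra parameter, which is why the two weak toy variables are dropped.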

Stage 3

Model Fitting & Diagnostics

The dataset was split 65/35 into training and holdout sets, both exceeding the minimum size requirement of 400 instances. The model was fit using R's glm() function from the stats package, which obtains the maximum likelihood estimates by iteratively reweighted least squares (Fisher scoring); for the logit link this is equivalent to Newton's method. Coefficient significance was assessed with the Wald statistic. Diagnostic checks found that fewer than 1% of standardized residuals were significant and that no observation had a Cook's distance above 1; leverage points were retained after confirming they had no material effect on results.
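
To make the fitting step concrete, here is a minimal Python sketch of a Newton-type iteration for a logistic model with one predictor and an intercept, followed by a Wald z statistic. It is an illustration under simplifying assumptions (a single predictor, toy data, a hypothetical function name), not a reproduction of `glm()` or of the project's four-predictor model.

```python
import math

def fit_logistic_1d(x, y, iters=25, tol=1e-10):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson; return (b0, b1, se0, se1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information) * delta = score with the 2x2 inverse.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard errors from the inverse information of the last iteration
    # (the final step is below tol, so this is effectively at the estimate).
    se0 = math.sqrt(h11 / det)
    se1 = math.sqrt(h00 / det)
    return b0, b1, se0, se1

# Toy, non-separable data for illustration only.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1, se0, se1 = fit_logistic_1d(x, y)
print(f"b1 = {b1:.3f}, Wald z = {b1 / se1:.3f}, Exp(B) = {math.exp(b1):.3f}")
```

The Wald z value is the coefficient divided by its standard error, and Exp(B) is the per-unit odds ratio reported in the results.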

Stage 4

Goodness-of-Fit & Evaluation

Model fit was evaluated on three dimensions. The Chi-square deviance test (p < 0.01, reported as 0.00) showed that the model significantly outperforms the null. The Hosmer-Lemeshow test (p = 0.82) found no evidence of miscalibration between predicted probabilities and observed event rates across deciles. Nagelkerke R² of 0.17 indicates modest explanatory power. The confusion matrix was evaluated on both training and holdout sets: balanced accuracy of 64.3% (train) and 62.1% (test) suggests the model generalizes with little overfitting.
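
For reference, the confusion-matrix metrics named here can be computed as in the Python sketch below. The counts are hypothetical and chosen only to make the arithmetic easy to follow; the project's actual confusion matrices are not shown on this page.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, sensitivity, specificity and balanced accuracy from counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)      # true positive rate (recall)
    specificity = tn / (tn + fp)      # true negative rate
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical counts for illustration only (not the project's confusion matrix).
metrics = classification_metrics(tp=40, fp=20, fn=10, tn=30)
print(metrics)
```

Balanced accuracy averages sensitivity and specificity, so it does not reward a model for simply predicting the majority class.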

Key Methodological Choices

Two-step variable selection eliminates multicollinearity before significance testing

Running AIC or backward elimination directly on collinear predictors inflates standard errors and distorts p-values — variables that are truly significant may appear non-significant simply because their variance is shared with a correlated neighbor. By first resolving multicollinearity through group representation, then applying AIC-guided pruning, the selection process operates on a clean candidate set where each variable's significance can be estimated independently and reliably.

MLE is the correct estimator for binary outcomes — OLS assumptions do not apply

Ordinary least squares assumes a continuous response with normally distributed, constant-variance errors, none of which hold for a binary outcome, and its fitted values can fall outside the 0 to 1 range. Maximum likelihood estimation is the principled choice: it directly maximizes the probability of observing the data under the logistic model, produces fitted probabilities constrained between 0 and 1, and supports inference on the coefficients through Wald statistics and confidence intervals.
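
For reference, the logistic model and the log-likelihood that MLE maximizes can be written as follows. The notation is generic (x_i is the predictor vector for instance i, β the coefficient vector, n the number of training instances) and is an assumption of this sketch rather than something taken from the project.

```latex
\Pr(Y_i = 1 \mid \mathbf{x}_i) = p_i
  = \frac{1}{1 + e^{-(\beta_0 + \mathbf{x}_i^{\top} \boldsymbol{\beta})}},
\qquad
\ell(\beta_0, \boldsymbol{\beta})
  = \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr]
```

Because p_i passes through the logistic function, every fitted probability lies strictly between 0 and 1, and exponentiating a coefficient gives the per-unit odds ratio reported as Exp(B).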

Hosmer-Lemeshow tests calibration directly — deviance alone is insufficient

The Chi-square deviance test shows that the model performs better than a null model, but it does not check whether predicted probabilities match observed event rates. The Hosmer-Lemeshow test fills this gap by grouping instances into deciles of predicted probability and comparing predicted to observed event counts within each group. A non-significant result (p = 0.82) means the test found no evidence of miscalibration; it does not prove calibration, but it is consistent with predicted probabilities that track observed event rates.
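
The decile comparison can be sketched in standard-library Python as follows. This is an illustrative implementation under stated assumptions: groups are formed by sorting on predicted probability and cutting into equal-sized bins, the reference distribution is chi-square with (groups - 2) degrees of freedom, and the data are simulated so that the predictions are calibrated by construction. The function names are hypothetical, and R packages may bin ties slightly differently.

```python
import math
import random

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df % 2:
        raise ValueError("closed form requires even df")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, df, p_value) for the Hosmer-Lemeshow test."""
    order = sorted(range(len(y)), key=lambda i: p[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        n_g = len(idx)
        observed = sum(y[i] for i in idx)   # observed events in the group
        expected = sum(p[i] for i in idx)   # expected events in the group
        # Pearson contributions from events and from non-events.
        stat += (observed - expected) ** 2 / expected
        stat += (observed - expected) ** 2 / (n_g - expected)
    df = groups - 2
    return stat, df, chi2_sf_even_df(stat, df)

# Simulated data for illustration: outcomes are drawn from the stated
# probabilities, so the "model" is calibrated by construction.
rng = random.Random(0)
p_hat = [rng.uniform(0.05, 0.95) for _ in range(500)]
y_obs = [1 if rng.random() < p else 0 for p in p_hat]
stat, df, p_value = hosmer_lemeshow(y_obs, p_hat)
print(f"HL statistic = {stat:.2f}, df = {df}, p = {p_value:.3f}")
```

A large p-value means the observed and expected counts differ by no more than chance would explain; a small one signals that some probability bands are systematically over- or under-predicted.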

Tech Stack

Technology	Purpose
R	Core language for statistical modeling and analysis
GLM · STATS Package	Binary logistic regression model fitting via maximum likelihood estimation
AIC	Information criterion guiding backward elimination variable selection
Backward Elimination	Stepwise removal of non-significant predictors to produce a parsimonious final model
Hosmer-Lemeshow Test	Goodness-of-fit assessment verifying calibration of predicted probabilities
Confusion Matrix	Predictive accuracy evaluation — precision, sensitivity, and balanced accuracy on train and test sets

Results & Metrics

What the analysis reveals

64.3%

Balanced Accuracy

Training sample · Precision 66.7% · Sensitivity 68.4%

62.1%

Balanced Accuracy

Holdout test set · Precision 62.2% · Sensitivity 69.4%

p = 0.82

Hosmer-Lemeshow

No evidence of miscalibration · Consistent with good model fit

🎯

Two-step selection distilled 20 variables down to 4 significant predictors

By first eliminating multicollinearity across three correlated variable groups, then applying AIC-guided backward elimination, the model retained only Prescreening, MAS_0.05, Exudates_0.03, and Exudates_0.07 as statistically significant predictors. Quality Assessment, Euclidean Distance, Optic Disc Diameter, and AMFM were all eliminated, which suggests that lesion-based indicators (microaneurysms and exudates), rather than anatomical geometry, carry the predictive signal for DR in this dataset.

📊

Model fit confirmed on multiple dimensions — significant improvement over null with well-calibrated probabilities

The Chi-square deviance test (p < 0.01, reported as 0.00) showed that the model significantly outperforms the intercept-only null model. The Hosmer-Lemeshow test (p = 0.82) found no evidence that predicted probabilities depart from observed event rates across deciles, so the model is not merely better than nothing: its probability outputs are consistent with good calibration. Nagelkerke R² of 0.17 and Cox & Snell R² of 0.13 indicate modest explanatory power, consistent with the complexity of DR as a multifactorial condition.
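
The two pseudo-R² values are linked: Nagelkerke's measure is Cox & Snell's divided by its maximum attainable value, which depends only on the event rate. The Python sketch below checks that the two reported figures are mutually consistent. It assumes the training sample has roughly the full dataset's 53.2% DR rate, which the page does not state explicitly.

```python
import math

def max_cox_snell(event_rate):
    """Largest value Cox & Snell R^2 can take for a given event rate.

    Equals 1 - exp(2 * mean log-likelihood per observation of the null model).
    """
    q = event_rate
    mean_null_loglik = q * math.log(q) + (1 - q) * math.log(1 - q)
    return 1.0 - math.exp(2.0 * mean_null_loglik)

def nagelkerke_from_cox_snell(cox_snell, event_rate):
    """Rescale Cox & Snell R^2 so that its maximum is 1."""
    return cox_snell / max_cox_snell(event_rate)

# Assumption: training-sample DR rate is about the dataset-wide 53.2%.
implied = nagelkerke_from_cox_snell(0.13, 0.532)
print(round(implied, 2))
```

Under that assumption, the reported Cox & Snell value of 0.13 rescales to about 0.17, matching the reported Nagelkerke figure.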

🔬

Prescreening is the dominant predictor — abnormality detection at triage stage is strongly protective

Prescreening carries the largest per-unit effect in the model (Exp(B) = 0.381). For patients who are prescreened and flagged with a severe retinal abnormality, the odds of a DR diagnosis are 0.38 times the odds for those not flagged, holding the other predictors fixed. One possible explanation for this inverse relationship is that the prescreening step identifies and routes high-risk cases for immediate specialist review rather than letting them progress through the standard diagnostic pipeline. The confidence interval for the odds ratio excludes 1, so the direction of the effect is statistically distinguishable from no effect.

📈

Microaneurysms and exudate levels directly elevate DR risk — Exudates_0.07 carries the largest positive effect

Each unit increase in MA count at α = 0.05 multiplies the odds of DR by 1.027 (Exp(B) = 1.027). Exudates at α = 0.03 contribute a smaller per-unit effect (Exp(B) = 1.004), while Exudates at α = 0.07 carry the largest per-unit positive effect among the three (Exp(B) = 1.311): each unit increase multiplies the odds of DR by 1.31. Because these are per-unit odds ratios, their sizes depend on each variable's scale and are not directly comparable. All confidence intervals exclude 1, so the direction of each effect is statistically distinguishable from no effect.
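
Because Exp(B) values act on odds rather than on probabilities, a short Python sketch may help with reading them. It uses only the Exp(B) values reported on this page; the 10-unit change and the 50% baseline probability are hypothetical choices made for illustration.

```python
import math

def odds_ratio_over(exp_b, units):
    """Odds ratio for a change of `units` in a predictor with per-unit Exp(B)."""
    return exp_b ** units

def shift_probability(baseline_prob, odds_ratio):
    """Apply an odds ratio to a baseline probability; return the new probability."""
    odds = baseline_prob / (1.0 - baseline_prob) * odds_ratio
    return odds / (1.0 + odds)

# Coefficient B behind Exp(B) = 1.027 for MAS_0.05.
print(round(math.log(1.027), 4))
# Odds ratio for ten additional units of MAS_0.05 (hypothetical change).
print(round(odds_ratio_over(1.027, 10), 3))
# Prescreening odds ratio applied to a hypothetical 50% baseline probability.
print(round(shift_probability(0.5, 0.381), 3))
```

Ten additional units of MAS_0.05 multiply the odds by roughly 1.3, and an odds ratio of 0.381 moves a 50% baseline to roughly 28%, not to 19%: multiplying odds is not the same as multiplying probability.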

✅

Model generalizes without overfitting — near-identical performance on training and holdout sets

The small gap between training balanced accuracy (64.3%) and holdout balanced accuracy (62.1%) suggests the model has not materially overfit the training sample. Sensitivity is consistent across both sets (68.4% training, 69.4% test), indicating that the model's ability to identify true DR cases holds up on unseen data. This is the most clinically important performance dimension for an early screening tool, where missed cases carry a higher cost than false positives.
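
Specificity is not reported on this page, but it can be backed out from the figures that are. The Python sketch below assumes the standard definition of balanced accuracy as the mean of sensitivity and specificity; because the inputs are rounded, the results are approximate.

```python
def implied_specificity(balanced_accuracy, sensitivity):
    """Balanced accuracy is the mean of sensitivity and specificity,
    so specificity = 2 * balanced_accuracy - sensitivity."""
    return 2 * balanced_accuracy - sensitivity

train_spec = implied_specificity(0.643, 0.684)   # training sample
test_spec = implied_specificity(0.621, 0.694)    # holdout test set
print(f"train specificity ~ {train_spec:.3f}")
print(f"test specificity  ~ {test_spec:.3f}")
```

Under that definition, the implied specificity is about 60% on the training sample and about 55% on the holdout set, so the stable sensitivity comes with a larger share of false positives on unseen data.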
