Diabetic Retinopathy Prediction
Predicting the probability of diabetic retinopathy from retinal image features using binary logistic regression with two-step variable selection, maximum likelihood estimation, and Hosmer-Lemeshow goodness-of-fit assessment.
1,151
Dataset Instances · No Missing Values
4
Final Predictors Selected from 20 Variables
64.3%
Balanced Accuracy · Training Sample
The Problem
Diabetic retinopathy diagnosis depends on specialist manual review of retinal images — too slow and costly for population-scale early screening
Diabetic retinopathy (DR) is a diabetes complication that damages the blood vessels of the retina, and early detection is the single most important factor in effective treatment. Yet diagnosis currently depends on a trained specialist manually reviewing retinal images and identifying microaneurysms, exudate deposits, and anatomical anomalies — a process that is expensive, time-consuming, and impossible to scale for the millions of diabetic patients who require regular eye screening. Automated image analysis tools can already extract dozens of quantitative features from retinal photographs, but the downstream question — how do those features translate into a DR probability score for each patient? — still lacks a fast, interpretable, and statistically rigorous answer. The gap is between the data that already exists in retinal image feature extraction pipelines and the ability to convert those features into a clinically useful probability of DR occurrence.
The Solution
A binary logistic regression model with two-step variable selection that maps retinal image features to a calibrated DR probability score
The analysis uses binary logistic regression, the standard method for binary outcomes, to model the probability of DR occurrence from retinal image features. The 20-variable dataset presented a significant multicollinearity challenge: the six MA variables (3–8) were all highly correlated, and the eight Exudate variables (9–16) split into two internally collinear sub-groups. A two-step variable selection strategy resolved this. First, one representative was selected from each correlated group to eliminate multicollinearity, reducing the candidate set to eight predictors. Then AIC-guided backward elimination removed variables that were not significant at the α = 0.05 level (p ≥ 0.05), producing a final four-predictor model. The model was fit on a 65% training sample using maximum likelihood estimation, with coefficient significance assessed via the Wald statistic, and validated on a held-out 35% test set. Goodness-of-fit was assessed through both the Chi-square deviance test and the Hosmer-Lemeshow test, which checks whether predicted probabilities agree with observed event rates.
Key Outcome
A clinically interpretable logistic regression model that reduces 20 retinal image features to four significant predictors (Prescreening result, MA count, and two Exudate measures), while finding that anatomical features including optic disc diameter and macula distance have no significant effect on DR probability. The Hosmer-Lemeshow test (p = 0.82) shows no evidence of poor fit, and the model reaches 62.1% balanced accuracy on a held-out test set.
Technical Deep Dive
Methodology & Analysis
Analytical Workflow
Stage 1 — Data Profiling & Multicollinearity Mapping
Step 1
Variable Inventory
20 variables · 1,151 instances · No missing values · Binary outcome (DR / no DR)
Step 2
Class Balance Check
46.8% no DR · 53.2% DR · Near-balanced — no resampling required
Step 3
Correlation Analysis
MAs (vars 3–8) highly correlated · Exudates (9–12) collinear · Exudates (13–16) collinear
Stage 2 — Variable Selection
Step A · Multicollinearity Resolution
Group Representatives
One variable selected per correlated group · 20 variables reduced to 8 candidate predictors
Step B · Significance Pruning
AIC & Backward Elimination
Non-significant variables removed (p ≥ 0.05 at α = 0.05) · 8 candidates reduced to 4 final predictors
Stage 3 — Model Fitting & Diagnostics
Step 1
Train-Test Split
65% training · 35% holdout · Both sets exceed minimum size (400 instances)
Step 2
MLE via glm()
R stats package · Fisher scoring (IRLS), equivalent to Newton's method for the logit link · Wald statistic for coefficient significance
Step 3
Residual Diagnostics
Under 1% of standardized residuals significant · No Cook's distance > 1 · Leverage points retained
Stage 4 — Goodness-of-Fit & Evaluation
Test 1
Chi-Square Deviance
p < 0.01 (reported as 0.00) · Model significantly outperforms the null (intercept-only) model
Test 2
Hosmer-Lemeshow
p = 0.82 · No evidence of miscalibration between predicted probabilities and observed event rates
Test 3
Confusion Matrix
Precision · Sensitivity · Balanced accuracy · Evaluated on both training and test sets
Stage 1
Data Profiling & Multicollinearity Mapping
The dataset contains 1,151 instances across 20 variables with no missing values and a near-balanced outcome (46.8% no DR, 53.2% DR). Correlation analysis revealed three collinear clusters that required resolution before modeling: the six MA variables (3–8) were all mutually highly correlated; the eight Exudate variables split into two internally collinear sub-groups (9–12 and 13–16). No other variables showed multicollinearity. The near-balanced class distribution meant no resampling strategy was needed prior to fitting.
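The correlation screen described above could look roughly like the following sketch. It runs on simulated stand-in data: the column names, the 0.8 cutoff, and the group structure are assumptions for illustration, not values taken from the analysis.

```r
# Simulated stand-in data: column names are hypothetical, not the real dataset's.
set.seed(1)
n <- 500
base_ma <- rnorm(n)                      # shared signal behind a correlated "MA" group
feats <- data.frame(
  ma_1 = base_ma + rnorm(n, sd = 0.2),
  ma_2 = base_ma + rnorm(n, sd = 0.2),
  ma_3 = base_ma + rnorm(n, sd = 0.2),
  disc_diameter = rnorm(n)               # feature unrelated to the group
)

# Pairwise Pearson correlations between all features.
cor_mat <- cor(feats)

# List every pair whose absolute correlation exceeds an assumed cutoff of 0.8.
# upper.tri() keeps each pair once and drops the diagonal.
cutoff <- 0.8
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var_a = rownames(cor_mat)[idx[, "row"]],
  var_b = colnames(cor_mat)[idx[, "col"]],
  r     = cor_mat[idx]
)
print(high_pairs)
```

Variables that appear together in `high_pairs` form a cluster from which a single representative would be kept.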
Stage 2
Two-Step Variable Selection
Step A addressed multicollinearity by selecting one representative from each correlated group and keeping the uncorrelated variables (Quality Assessment, Prescreening, MAS_0.05, Exudates_0.03, Exudates_0.07, Euclidean Distance, Optic Disc Diameter, and AMFM), reducing the candidate set to eight predictors with no collinearity issues. Step B applied AIC-guided backward elimination, removing variables that failed to reach significance at the α = 0.05 level (p ≥ 0.05), yielding the final four-predictor model: Prescreening, MAS_0.05, Exudates_0.03, and Exudates_0.07.
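A minimal sketch of Step B, assuming the pruning was done with something like R's `step()`. The data are simulated and the variable names are hypothetical. Note that `step()` selects on AIC alone, so the α = 0.05 significance check mentioned above would be a separate look at the Wald p-values of the reduced model.

```r
# Simulated candidate set: two informative predictors and two pure-noise ones.
set.seed(2)
n <- 600
d <- data.frame(
  x_signal1 = rnorm(n),
  x_signal2 = rnorm(n),
  x_noise1  = rnorm(n),
  x_noise2  = rnorm(n)
)
d$y <- rbinom(n, 1, plogis(-0.2 + 1.0 * d$x_signal1 + 0.8 * d$x_signal2))

# Start from the full model and drop terms one at a time while AIC improves.
full    <- glm(y ~ ., data = d, family = binomial)
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
# Wald z statistics and p-values for the retained terms.
print(summary(reduced)$coefficients)
```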
Stage 3
Model Fitting & Diagnostics
The dataset was split 65/35 into training and holdout sets, both exceeding the minimum size requirement of 400 instances. The model was fit using glm() from R's stats package, which obtains the maximum likelihood estimates by iteratively reweighted least squares (Fisher scoring); for the logit link this coincides with Newton's method. Coefficient significance was assessed with the Wald statistic. Diagnostic checks found that fewer than 1% of standardized residuals were significant and that no Cook's distance exceeded 1; leverage points were retained after confirming they had no material effect on results.
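The fitting and diagnostic steps above can be sketched as follows. The data are simulated, and the predictor names, seed, and the 1.96 residual cutoff are assumptions for illustration; only the 1,151-row size and the 65/35 split come from the text.

```r
# Simulated data with the same number of rows as the dataset described above.
set.seed(3)
n <- 1151
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.1 + 0.7 * d$x1 - 0.4 * d$x2))

# 65/35 random split into training and holdout rows.
train_idx <- sample(seq_len(n), size = floor(0.65 * n))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Binary logistic regression; glm() fits by iteratively reweighted least squares.
fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# Coefficient table: Estimate, Std. Error, Wald z value, and p-value.
print(summary(fit)$coefficients)

# Diagnostics: share of large standardized residuals and the largest Cook's distance.
std_res     <- rstandard(fit)
share_large <- mean(abs(std_res) > 1.96)
max_cook    <- max(cooks.distance(fit))
print(c(share_large = share_large, max_cook = max_cook))
```

With 1,151 rows, the split yields 748 training and 403 holdout instances, which matches the statement that both sets exceed 400.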
Stage 4
Goodness-of-Fit & Evaluation
Model fit was evaluated on three dimensions. The Chi-square deviance test (p < 0.01, reported as 0.00) showed that the model significantly outperforms the null. The Hosmer-Lemeshow test (p = 0.82) found no evidence of miscalibration when predicted probabilities are compared with observed event rates across deciles. Nagelkerke R² of 0.17 indicates modest explanatory power. The confusion matrix was evaluated on both training and holdout sets: balanced accuracy of 64.3% (train) and 62.1% (test) suggests the model generalizes with little overfitting.
Key Methodological Choices
Two-step variable selection eliminates multicollinearity before significance testing
Running AIC or backward elimination directly on collinear predictors inflates standard errors and distorts p-values — variables that are truly significant may appear non-significant simply because their variance is shared with a correlated neighbor. By first resolving multicollinearity through group representation, then applying AIC-guided pruning, the selection process operates on a clean candidate set where each variable's significance can be estimated independently and reliably.
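The standard-error inflation described above is easy to reproduce on simulated data. In this sketch, `x2` is a hypothetical near-duplicate of `x1`; nothing here is taken from the actual dataset.

```r
# Two nearly identical predictors; only x1 truly drives the outcome.
set.seed(9)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)           # near-duplicate of x1
y  <- rbinom(n, 1, plogis(0.8 * x1))

fit_both <- glm(y ~ x1 + x2, family = binomial)  # collinear pair together
fit_one  <- glm(y ~ x1, family = binomial)       # one representative only

# Standard error of the x1 coefficient under each specification.
se_both <- summary(fit_both)$coefficients["x1", "Std. Error"]
se_one  <- summary(fit_one)$coefficients["x1", "Std. Error"]
print(c(se_both = se_both, se_one = se_one, ratio = se_both / se_one))
```

The much larger standard error in the two-predictor fit is what can make a genuinely relevant variable look non-significant.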
MLE is the correct estimator for binary outcomes — OLS assumptions do not apply
Ordinary least squares inference assumes a continuous response with normally distributed, constant-variance errors, none of which holds for a binary outcome, and its fitted values can fall outside the 0 to 1 range. Maximum likelihood estimation is the principled choice: it directly maximizes the probability of observing the data under the logistic model, produces predicted probabilities constrained between 0 and 1, and yields standard errors that support Wald tests and confidence intervals for the coefficients.
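To make the "maximizes the probability of observing the data" idea concrete, the sketch below minimizes the negative Bernoulli log-likelihood directly with a general-purpose optimizer and compares the result with `glm()`. The data are simulated, and `neg_loglik` is a helper written for this illustration.

```r
# Simulated binary outcome with one predictor.
set.seed(5)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.3 + 0.9 * x))
X <- cbind(1, x)                         # design matrix with intercept column

# Negative log-likelihood of the logistic model:
# -sum( y * eta - log(1 + exp(eta)) ), where eta = X %*% beta.
neg_loglik <- function(beta) {
  eta <- drop(X %*% beta)
  -sum(y * eta - log1p(exp(eta)))
}

# Direct numerical maximization of the likelihood.
opt <- optim(c(0, 0), neg_loglik, method = "BFGS")

# Reference fit from glm(), which reaches the same optimum by IRLS.
fit <- glm(y ~ x, family = binomial)

print(rbind(optim = opt$par, glm = unname(coef(fit))))
```

The two rows should agree to within the optimizer's tolerance, since both procedures target the same likelihood.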
Hosmer-Lemeshow tests calibration directly — deviance alone is insufficient
The Chi-square deviance test shows that the model performs better than a null model, but it does not verify that predicted probabilities match observed event rates. The Hosmer-Lemeshow test fills this gap by grouping instances into deciles of predicted probability and comparing predicted to observed event counts within each group. A non-significant result (p = 0.82) means the test found no evidence that predicted and observed rates diverge, which supports, though it cannot prove, that the predicted probabilities are adequately calibrated.
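The decile comparison described above can be written out in a few lines of base R. This is a hand-rolled sketch of the usual Hosmer-Lemeshow statistic, not necessarily the implementation used in the analysis; `hosmer_lemeshow` is a hypothetical helper and the data are simulated.

```r
# Hosmer-Lemeshow statistic: group by quantiles of predicted probability,
# then compare observed and expected event counts per group.
hosmer_lemeshow <- function(y, p, g = 10) {
  breaks <- quantile(p, probs = seq(0, 1, length.out = g + 1))
  grp    <- cut(p, breaks = breaks, include.lowest = TRUE)
  obs1   <- tapply(y, grp, sum)          # observed events per group
  exp1   <- tapply(p, grp, sum)          # expected events per group
  n_g    <- tapply(y, grp, length)       # group sizes
  chisq  <- sum((obs1 - exp1)^2 / (exp1 * (1 - exp1 / n_g)))
  df     <- g - 2
  list(statistic = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Simulated example with a correctly specified model.
set.seed(6)
n <- 750
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.2 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

hl <- hosmer_lemeshow(y, fitted(fit))
print(hl)
```

A large p-value here only indicates that the test did not detect miscalibration; the test is known to be sensitive to the number of groups and to sample size.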
Tech Stack
| Technology | Purpose |
|---|---|
| R | Core language for statistical modeling and analysis |
| glm() · stats package | Binary logistic regression model fitting via maximum likelihood estimation |
| AIC | Information criterion guiding backward elimination variable selection |
| Backward Elimination | Stepwise removal of non-significant predictors to produce a parsimonious final model |
| Hosmer-Lemeshow Test | Goodness-of-fit assessment verifying calibration of predicted probabilities |
| Confusion Matrix | Predictive accuracy evaluation — precision, sensitivity, and balanced accuracy on train and test sets |
Results & Metrics
What the analysis reveals
64.3%
Balanced Accuracy
Training sample · Precision 66.7% · Sensitivity 68.4%
62.1%
Balanced Accuracy
Holdout test set · Precision 62.2% · Sensitivity 69.4%
p = 0.82
Hosmer-Lemeshow
No evidence of miscalibration · Adequate model fit
Two-step selection distilled 20 variables down to 4 significant predictors
By first eliminating multicollinearity across three correlated variable groups, then applying AIC-guided backward elimination, the model retained only Prescreening, MAS_0.05, Exudates_0.03, and Exudates_0.07 as statistically significant predictors. Quality Assessment, Euclidean Distance, Optic Disc Diameter, and AMFM were all eliminated. This suggests that, in this dataset, the prescreening result and lesion indicators (microaneurysms and exudates), rather than anatomical geometry, carry the predictive signal for DR.
Model fit supported on multiple dimensions: significant improvement over the null and no evidence of miscalibration
The Chi-square deviance test (p < 0.01, reported as 0.00) showed that the model significantly outperforms the intercept-only null model. The Hosmer-Lemeshow test (p = 0.82) found no significant difference between predicted probabilities and observed event rates across deciles, so the model is not merely better than nothing: its probability outputs show no detectable miscalibration. Nagelkerke R² of 0.17 and Cox & Snell R² of 0.13 indicate modest explanatory power, consistent with the complexity of DR as a multifactorial condition.
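As a sketch of how these fit statistics relate, the code below computes the deviance (likelihood-ratio) test and both pseudo-R² measures from a fitted `glm` on simulated data. The final lines are a rough consistency check on the reported values, under the assumption that the training sample has about the same 53.2% DR share as the full dataset.

```r
# Simulated model; predictor names are hypothetical.
set.seed(7)
n <- 700
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.1 + 0.6 * d$x1 + 0.4 * d$x2))
fit <- glm(y ~ x1 + x2, data = d, family = binomial)

# Likelihood-ratio (deviance) test against the intercept-only model.
lr_stat <- fit$null.deviance - fit$deviance
lr_df   <- fit$df.null - fit$df.residual
lr_p    <- pchisq(lr_stat, lr_df, lower.tail = FALSE)

# For ungrouped 0/1 data the log-likelihood equals -deviance / 2.
ll_null <- -fit$null.deviance / 2
ll_full <- -fit$deviance / 2
n_obs   <- nobs(fit)

cox_snell  <- 1 - exp((2 / n_obs) * (ll_null - ll_full))
nagelkerke <- cox_snell / (1 - exp((2 / n_obs) * ll_null))
print(c(lr_p = lr_p, cox_snell = cox_snell, nagelkerke = nagelkerke))

# Consistency check on the reported values (assumes a 53.2% DR share in training).
p_dr <- 0.532
ll_null_per_obs <- p_dr * log(p_dr) + (1 - p_dr) * log(1 - p_dr)
cs_max <- 1 - exp(2 * ll_null_per_obs)   # upper bound of Cox & Snell R², about 0.75
implied_nagelkerke <- 0.13 / cs_max      # about 0.17
print(round(implied_nagelkerke, 2))
```

Under that assumption, a Cox & Snell value of 0.13 rescales to a Nagelkerke value of about 0.17, which agrees with the two figures reported above.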
Prescreening is the dominant predictor: a flagged abnormality at the triage stage is associated with lower odds of DR
Prescreening carries the largest effect size in the model (Exp(B) = 0.381). Patients who are flagged with a severe retinal abnormality at prescreening have 0.38 times the odds of a DR diagnosis compared with those not flagged, holding the other predictors constant. One proposed explanation for this inverse relationship is that the prescreening step identifies high-risk cases and routes them for immediate specialist review rather than through the standard diagnostic pipeline. The confidence interval for the odds ratio excludes 1, so the direction of the association is statistically distinguishable from no effect.
Microaneurysm and exudate measures are associated with higher DR odds: Exudates_0.07 carries the largest per-unit effect
Each unit increase in MA count at α = 0.05 multiplies the odds of DR by 1.027 (Exp(B) = 1.027). Exudates at α = 0.03 contribute a smaller per-unit effect (Exp(B) = 1.004), while Exudates at α = 0.07 carry the largest per-unit positive effect among vascular predictors (Exp(B) = 1.311), meaning each unit increase multiplies the odds of a DR diagnosis by 1.31. Because these are per-unit odds ratios, their relative size depends on each variable's measurement scale. All three confidence intervals exclude 1, indicating a consistent positive direction across the vascular predictors.
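For readers who want to see where Exp(B) values and their intervals come from, here is a sketch on simulated data with hypothetical variable names. It uses Wald intervals via `confint.default()`; whether the original analysis used Wald or profile-likelihood intervals is not stated. The last lines use the reported MA odds ratio to illustrate how per-unit effects compound.

```r
# Simulated model with one binary flag and one count predictor.
set.seed(8)
n <- 800
d <- data.frame(flag = rbinom(n, 1, 0.5), count = rpois(n, 20))
d$y <- rbinom(n, 1, plogis(-0.3 - 0.9 * d$flag + 0.03 * d$count))
fit <- glm(y ~ flag + count, data = d, family = binomial)

# Odds ratios are exponentiated coefficients; Wald 95% limits are
# exponentiated from the coefficient scale.
or_table <- cbind(OR = exp(coef(fit)), exp(confint.default(fit)))
print(round(or_table, 3))

# Per-unit odds ratios compound multiplicatively over larger changes.
# Using the reported MA odds ratio of 1.027 per unit:
or_ma_10 <- 1.027^10                     # about 1.305 for a 10-unit increase
print(round(or_ma_10, 3))
```

A 10-unit rise in MA count therefore corresponds to roughly the same odds multiplier as a one-unit rise in Exudates_0.07, which is why per-unit odds ratios should be compared with the variables' scales in mind.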
Little sign of overfitting: similar performance on training and holdout sets
The small gap between training balanced accuracy (64.3%) and holdout balanced accuracy (62.1%) suggests the model has not materially overfit the training sample. Sensitivity is consistent across both sets (68.4% training, 69.4% test), indicating that the model's ability to correctly identify true DR cases holds up on unseen data. This is the most clinically important performance dimension for an early screening tool, where missed cases carry a higher cost than false positives.
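The metrics quoted above follow from a confusion matrix as in the sketch below. The counts passed to `classification_metrics` are hypothetical, since the actual confusion matrices are not shown. The second part assumes the standard definition of balanced accuracy as the mean of sensitivity and specificity and back-calculates the specificity implied by the reported figures.

```r
# Precision, sensitivity, specificity, and balanced accuracy from raw counts.
classification_metrics <- function(tp, fp, tn, fn) {
  sensitivity <- tp / (tp + fn)          # true positive rate
  specificity <- tn / (tn + fp)          # true negative rate
  precision   <- tp / (tp + fp)          # positive predictive value
  c(precision = precision,
    sensitivity = sensitivity,
    specificity = specificity,
    balanced_accuracy = (sensitivity + specificity) / 2)
}

# Hypothetical counts, for illustration only.
m <- classification_metrics(tp = 70, fp = 35, tn = 65, fn = 30)
print(round(m, 3))

# Specificity implied by the reported balanced accuracy and sensitivity,
# assuming balanced accuracy = (sensitivity + specificity) / 2.
implied_specificity <- function(bal_acc, sens) 2 * bal_acc - sens
spec_train <- implied_specificity(0.643, 0.684)   # about 0.602
spec_test  <- implied_specificity(0.621, 0.694)   # about 0.548
print(c(train = spec_train, test = spec_test))
```

Under that definition, the implied holdout specificity is roughly 55%, so the stable sensitivity comes alongside a substantial false-positive rate.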