Applied ML · Ensemble Classification

Coronary Heart Disease Risk Prediction

A head-to-head comparison of Mixture Discriminant Analysis and Random Forests across 10 independent train-test splits, evaluating predictive accuracy and result stability for 10-year coronary heart disease risk.

Methods: MDA · Random Forests
Tech Stack: R · BIC · OOB Error · ARI

4,080

Patient Instances · Framingham Heart Study

10 Runs

Independent Seeds · Stability-Tested Results

15.7%

RF Test Misclassification · Best Avg. Performance

The Problem

Coronary heart disease is a leading cause of adult mortality — physicians need a reliable, data-driven tool for identifying 10-year CHD risk from routine clinical measurements

Coronary heart disease (CHD) develops when plaque accumulates inside the coronary arteries, progressively restricting the oxygen-rich blood supply to the heart. It remains one of the leading causes of mortality among adults in Europe and North America. Early identification of high-risk patients allows physicians to intervene before a cardiac event occurs — but risk assessment from routine clinical data requires a classification model that is both accurate and consistent. The Framingham Heart Study dataset presents a realistic and challenging version of this problem: 4,080 patients described by 16 clinical variables spanning demographics, lifestyle factors, vitals, and lab results, with a class imbalance of only 15% CHD-positive cases. The challenge is not just building a classifier, but understanding which approach generalizes reliably across different data samples — a question that cannot be answered by a single train-test split.

The Solution

A 10-run head-to-head comparison of MDA and Random Forests evaluating both accuracy and stability across independent train-test splits

Rather than fitting a single model and reporting a single accuracy figure, this analysis evaluated Mixture Discriminant Analysis (MDA) and Random Forests across 10 independent 75/25 train-test splits, each drawn with a different random seed. The result is a distribution of misclassification rate and Adjusted Rand Index (ARI) for both training and test sets. This design surfaces result stability alongside average accuracy, which is critical when selecting a model for clinical deployment. MDA fits a Gaussian mixture model to each class, with BIC choosing the number of components, so it can capture complex within-class structure. Random Forests decorrelate bootstrapped trees by restricting each split to a random subset of variables, which reduces the variance of the ensemble at little cost in bias. The 9.3% missing data was handled via complete-case analysis, and features were scaled before MDA fitting so that differences in variable magnitude would not distort the Gaussian components.

Key Outcome

MDA and Random Forests perform in a similar range, but Random Forests is the preferred method: it produces lower average test misclassification (15.7% vs 20.7%) and markedly more consistent results across runs (std 0.010 vs 0.059). Systolic Blood Pressure, Diastolic Blood Pressure, and Age emerge as the three most important predictors of 10-year CHD risk.

Technical Deep Dive

Methodology & Analysis

Analytical Workflow

Stage 1 — Data Profiling & Preprocessing

Step 1

Variable Inventory

4,080 instances · 16 variables · 8 continuous · 8 categorical · Binary outcome

Step 2

Missing Data

9.3% missing · Complete-case analysis applied · Sufficient sample retained

Step 3

Class & Collinearity Check

15% CHD positive · 5 moderate variable pairs (all < 0.80) · No high multicollinearity

Stage 2 — Experimental Setup

Split Strategy

10 Independent Train-Test Splits

75/25 ratio · 10 different random seeds · Stability evaluated across all runs

Preprocessing

Scaling for MDA

Feature scaling applied before MDA · Prevents variable magnitude distorting Gaussian components

Stage 3 — Model Fitting

Model A · MDA

Mixture Discriminant Analysis

BIC selects Gaussian components per class · EEV covariance model · Class 0: 5 components · Class 1: 3 components

Model B · Random Forests

Random Forests

950 trees · 9 variables per split · OOB error guides model selection · Best: seed 555, OOB 14.77%

Stage 4 — Evaluation & Comparison

Metric 1

Misclassification Rate

Avg. & std. across 10 runs · Train and test sets reported separately

Metric 2

Adjusted Rand Index

Corrects for chance agreement · More informative on imbalanced classes

Metric 3

Variable Importance

RF variable importance plot · Identifies clinical predictors driving CHD risk

Stage 1

Data Profiling & Preprocessing

The Framingham Heart Study dataset contains 4,080 patients described by 16 variables — 8 continuous (age, CigsPerDay, cholesterol, systolic BP, diastolic BP, BMI, heart rate, glucose) and 8 categorical (sex, education, smoking status, BP medications, prevalent stroke, hypertension, diabetes, outcome). With 9.3% missing data and sufficient sample size, a complete-case approach was adopted. The dataset shows a class imbalance of 15% CHD-positive cases and five pairs of moderate correlations, none exceeding 0.80 — acceptable for the methods used.
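The profiling steps described here can be sketched in base R. This is a minimal illustration on simulated data, not the project's actual script: the data frame `heart`, its column names, and the injected missing values are all assumptions made for the example.

```r
# Simulated stand-in for the Framingham data; column names are hypothetical
set.seed(1)
n <- 200
heart <- data.frame(
  age     = round(rnorm(n, 50, 8)),
  sysBP   = rnorm(n, 130, 20),
  glucose = rnorm(n, 80, 15)
)
heart$diaBP <- 0.5 * heart$sysBP + rnorm(n, 18, 7)  # built to correlate with sysBP
heart$chd   <- factor(rbinom(n, 1, 0.15))
heart$glucose[sample(n, 15)] <- NA                   # inject missing values

# Share of rows with at least one missing value
missing_share <- mean(!complete.cases(heart))

# Complete-case analysis: drop every row that contains an NA
heart_cc <- na.omit(heart)

# Class balance after dropping incomplete rows
class_share <- prop.table(table(heart_cc$chd))

# List predictor pairs whose absolute correlation exceeds 0.80
num_cols <- c("age", "sysBP", "diaBP", "glucose")
cor_mat  <- cor(heart_cc[, num_cols])
high     <- which(abs(cor_mat) > 0.80 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
```

An empty `high_pairs` corresponds to the "no high multicollinearity" finding; lowering the threshold would list the moderate pairs instead.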

Stage 2

10-Run Experimental Design

The dataset was split 75/25 into training and test sets and this process was repeated 10 times using different random seeds, producing 10 independent MDA models and 10 independent Random Forests models. Reporting averages and standard deviations across all 10 runs provides a stable estimate of each method's expected performance and reveals how sensitive results are to the specific data sample chosen — a critical consideration when selecting between methods for a real clinical application. Feature scaling was applied before each MDA run to prevent variable magnitude from distorting Gaussian component fitting.
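One way to organize the repeated-split design is a small harness that takes a seed, draws the split, scales the features, and returns the test misclassification rate. The sketch below makes several assumptions: `run_split` and `fit_predict` are hypothetical names, the seed values other than 555 and 777 are placeholders, scaling uses training-set statistics only (the write-up does not say how scaling was fitted), and a majority-class rule stands in for the real models so the example stays in base R.

```r
# Run one 75/25 split for a given seed and return the test misclassification rate.
# fit_predict is a hypothetical callback:
#   function(train_x, train_y, test_x) -> predicted labels for test_x
run_split <- function(seed, x, y, fit_predict, train_frac = 0.75) {
  set.seed(seed)
  n         <- nrow(x)
  train_idx <- sample(n, size = floor(train_frac * n))

  # Scale with training-set means and SDs, then apply the same values to the test set
  train_x <- scale(x[train_idx, , drop = FALSE])
  test_x  <- scale(x[-train_idx, , drop = FALSE],
                   center = attr(train_x, "scaled:center"),
                   scale  = attr(train_x, "scaled:scale"))

  pred <- fit_predict(train_x, y[train_idx], test_x)
  mean(as.character(pred) != as.character(y[-train_idx]))
}

# Stand-in classifier for illustration: always predict the majority training class
majority_class <- function(train_x, train_y, test_x) {
  top <- names(which.max(table(train_y)))
  factor(rep(top, nrow(test_x)), levels = levels(train_y))
}

# Simulated data with roughly 15% positives
set.seed(42)
x <- matrix(rnorm(400 * 3), ncol = 3, dimnames = list(NULL, c("v1", "v2", "v3")))
y <- factor(rbinom(400, 1, 0.15), levels = c(0, 1))

# Ten seeds; apart from 555 and 777 these values are arbitrary placeholders
seeds  <- c(111, 222, 333, 444, 555, 666, 777, 888, 999, 1010)
errors <- sapply(seeds, run_split, x = x, y = y, fit_predict = majority_class)

# Average performance and run-to-run stability
c(mean = mean(errors), sd = sd(errors))
```

Swapping `majority_class` for a wrapper around an MDA or Random Forests fit would reuse the same splits for both methods, which keeps the comparison paired.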

Stage 3

Model Fitting — MDA & Random Forests

MDA fits Gaussian mixture models independently for each class, with BIC selecting the number of components. The best model (seed 777) used an EEV covariance structure, with 5 Gaussian components for class 0 (85% of observations) and 3 for class 1 (15%). Random Forests grew 950 trees per model and restricted each split to 9 randomly selected variables out of the 15 predictors (the 16 variables minus the outcome) to decorrelate the trees. The best RF model (seed 555) achieved an OOB error of 14.77%, which serves as an out-of-sample performance estimate without requiring a separate validation set.
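For the Random Forests side, a fit with the quoted settings might look like the sketch below. It uses the `randomForest` package named in the tech stack; the data are simulated with 15 predictors and hypothetical effects, so the OOB error it prints will not match the 14.77% reported here.

```r
library(randomForest)

# Simulated stand-in data: 15 predictors and a binary outcome (all hypothetical)
set.seed(555)
n <- 500
x <- as.data.frame(matrix(rnorm(n * 15), ncol = 15))
y <- factor(rbinom(n, 1, plogis(-2 + x$V1 + 0.5 * x$V2)))

# 950 trees, 9 candidate variables tried at each split (values quoted in the text)
rf <- randomForest(x = x, y = y, ntree = 950, mtry = 9)

# OOB error after the final tree: each observation is scored only by the trees
# whose bootstrap sample did not contain it
oob_error <- rf$err.rate[rf$ntree, "OOB"]
oob_error
```

Because every tree leaves out part of the training data, the OOB figure comes for free from the fit itself, which is why it can guide model selection across seeds without a held-out validation set.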

Stage 4

Evaluation & Model Comparison

Each of the 10 models per method was evaluated using misclassification rate and Adjusted Rand Index (ARI) on both training and test sets. ARI was chosen alongside misclassification rate because it corrects for chance agreement and is more informative when classes are imbalanced: a classifier that predicts every instance as the majority class achieves a low misclassification rate (about 15% here) but an ARI of zero. Averages and standard deviations across 10 runs were compared between methods, and the best RF model's variable importance plot was used to identify the clinical predictors driving CHD risk.

Key Methodological Choices

10-run evaluation reveals stability — a single train-test split cannot

A single split produces a single accuracy figure that may be unusually high or low depending on how the data happens to be divided. By repeating the split 10 times with different seeds and reporting the mean and standard deviation of performance across all runs, the analysis surfaces which method is not only more accurate on average but also more consistent. Random Forests' standard deviation of test misclassification (0.010) is roughly one-sixth of MDA's (0.059), a stability difference that matters far more than the difference in average accuracy alone when choosing a model for clinical use.

BIC-selected Gaussian components allow MDA to model complex within-class structure

Standard linear discriminant analysis assumes each class follows a single Gaussian distribution with a covariance matrix shared across classes. MDA relaxes this by fitting a mixture of Gaussians per class, with BIC determining how many components describe the data best without overfitting. This is particularly valuable for a class like CHD-positive patients, which likely contains multiple sub-populations (e.g., hypertension-driven vs. lifestyle-driven cases). The best model fitted 3 components to the CHD-positive class and 5 to the majority class, which is consistent with within-class heterogeneity that a single Gaussian would miss.
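The per-class BIC selection can be illustrated with `MclustDA` from the `mclust` package. This is a sketch on simulated two-dimensional data in which the minority class is deliberately built from two separate groups; the component range `G = 1:5` is an assumption for the example, and the covariance model BIC picks here need not be EEV.

```r
library(mclust)

# Simulated two-class data; the minority class is itself a mixture of two groups
set.seed(777)
x0 <- matrix(rnorm(340 * 2), ncol = 2)
x1 <- rbind(matrix(rnorm(30 * 2, mean = 3), ncol = 2),
            matrix(rnorm(30 * 2, mean = -3), ncol = 2))
x  <- scale(rbind(x0, x1))                 # scale features before fitting
y  <- factor(rep(c(0, 1), times = c(340, 60)))

# For each class, fit mixtures with 1 to 5 components and keep the BIC-best one
mda <- MclustDA(x, class = y, G = 1:5, modelType = "MclustDA", verbose = FALSE)
summary(mda)   # covariance model and component count chosen per class

# Number of components selected for each class
n_components <- sapply(mda$models, function(m) m$G)
n_components

# Training misclassification rate
pred <- predict(mda, newdata = x)$classification
mean(as.character(pred) != as.character(y))
```

Setting `modelType = "EDDA"` instead would force a single Gaussian per class, which is a convenient baseline for checking how much the extra components actually help.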

ARI is essential alongside misclassification rate on imbalanced data

With only 15% CHD-positive cases, a naive classifier that predicts all patients as negative achieves 85% accuracy (a 15% misclassification rate), a misleadingly strong figure that conceals total failure on the minority class. The Adjusted Rand Index corrects for chance agreement and provides a meaningful measure of how well predicted labels align with true labels across both classes. Reporting both metrics together ensures that neither method can appear to perform well simply by exploiting the class imbalance.
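The contrast between the two metrics is easy to verify numerically. The sketch below implements ARI from pair counts in base R (the helper name `adjusted_rand_index` is an assumption; `mclust::adjustedRandIndex` computes the same quantity) and applies it to 1,000 hypothetical patients with a 15% positive rate.

```r
# Adjusted Rand Index from pair counts in the contingency table (base R only)
adjusted_rand_index <- function(truth, pred) {
  tab      <- table(truth, pred)
  pairs_ij <- sum(choose(tab, 2))            # pairs grouped together in both labelings
  pairs_i  <- sum(choose(rowSums(tab), 2))   # pairs grouped together in the truth
  pairs_j  <- sum(choose(colSums(tab), 2))   # pairs grouped together in the prediction
  expected <- pairs_i * pairs_j / choose(sum(tab), 2)
  (pairs_ij - expected) / ((pairs_i + pairs_j) / 2 - expected)
}

# 1,000 hypothetical patients, 15% CHD-positive
truth <- factor(rep(c(0, 1), times = c(850, 150)), levels = c(0, 1))

# Naive rule: label every patient as negative
naive <- factor(rep(0, length(truth)), levels = c(0, 1))

mean(naive != truth)               # misclassification rate: 0.15
adjusted_rand_index(truth, naive)  # 0: no agreement beyond chance
adjusted_rand_index(truth, truth)  # 1: perfect agreement
```

The naive rule scores 0.15 on misclassification, close to the averages reported for the fitted models, yet its ARI is exactly zero, which is why the two metrics are read together.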

Tech Stack

Technology | Purpose
R | Core language for statistical modeling and analysis
mclust | Mixture Discriminant Analysis with BIC-based component selection
randomForest | Random Forests ensemble classifier with OOB error estimation
BIC | Bayesian Information Criterion for selecting number of Gaussian components per class in MDA
OOB Error | Out-of-bag error for Random Forests model selection without a separate validation set
ARI | Adjusted Rand Index, a chance-corrected classification performance metric for imbalanced data

Results & Metrics

What the analysis reveals

15.7%

RF Avg. Test Misclassification

Std. 0.010 · Highly consistent across 10 runs

20.7%

MDA Avg. Test Misclassification

Std. 0.059 · Higher variance across runs

14.77%

Best RF OOB Error

Seed 555 · 950 trees · 9 variables per split

🏆

Random Forests outperforms MDA on both accuracy and consistency

Across 10 independent test sets, Random Forests achieved an average misclassification rate of 15.7% compared to MDA's 20.7%. More importantly, RF's standard deviation of 0.010 is roughly one-sixth of MDA's 0.059, indicating that RF results depend far less on how the data is split. Both methods show small train-test gaps (RF: 15.4% train vs 15.7% test; MDA: 18.7% train vs 20.7% test), which suggests that neither method is substantially overfitting the training sample.

🩺

Systolic BP, Diastolic BP, and Age are the dominant clinical predictors

The RF variable importance plot identifies Systolic Blood Pressure and Diastolic Blood Pressure as the two most important predictors. Their closely matched importance values are likely explained by their strong correlation (0.79, just under the 0.80 cutoff used in the collinearity check). Age ranks third as an independent predictor. A mid-importance cluster of Prevalent Hypertension, BMI, Glucose Level, Total Cholesterol, and CigsPerDay follows, while the remaining variables, including sex, education, and diabetes, show low predictive importance for 10-year CHD risk.
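A ranking of this kind can be extracted from a fitted forest with `importance()` from the `randomForest` package. The sketch below uses simulated data with hypothetical column names, builds `sysBP` and `diaBP` to be correlated, and reports the Gini-based measure; the write-up does not state which importance measure its plot used, so that choice is an assumption.

```r
library(randomForest)

# Simulated data with hypothetical column names; sysBP and diaBP are correlated
set.seed(555)
n     <- 600
sysBP <- rnorm(n, 130, 20)
diaBP <- 0.5 * sysBP + rnorm(n, 18, 8)
age   <- rnorm(n, 50, 8)
noise <- rnorm(n)                          # predictor unrelated to the outcome
risk  <- plogis(-2 + 0.03 * (sysBP - 130) + 0.05 * (age - 50))
dat   <- data.frame(sysBP, diaBP, age, noise, chd = factor(rbinom(n, 1, risk)))

# importance = TRUE also stores the permutation-based measure
rf <- randomForest(chd ~ ., data = dat, ntree = 500, importance = TRUE)

# type = 1: mean decrease in accuracy; type = 2: mean decrease in Gini impurity
imp <- importance(rf, type = 2)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]

# varImpPlot(rf) draws the corresponding dot chart
```

When two predictors are strongly correlated, the forest can split on either one, so importance tends to be shared between them rather than concentrated in one.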

🔬

MDA captures genuine within-class structure through Gaussian mixture components

The best MDA model (seed 777) selected an EEV covariance structure with 5 Gaussian components for the CHD-negative class and 3 for the CHD-positive class. This is consistent with heterogeneity within patient subgroups, which a model assuming a single multivariate Gaussian per class cannot express. While MDA ultimately underperformed RF in this analysis, the BIC-guided component selection demonstrates a principled, data-driven approach to capturing complex class structure that standard discriminant analysis cannot represent.

📊

Class imbalance limits overall performance — additional data and techniques could improve separation

With only 15% CHD-positive cases, the dataset presents a genuine class separation challenge for both methods. ARI values — RF test 0.096, MDA test 0.103 — indicate modest agreement with true labels beyond chance, reflecting the difficulty of the classification problem rather than model failure. Strategies such as oversampling, cost-sensitive learning, or class-weighted training could improve minority class recall. Additional data and more granular clinical features (e.g., imaging biomarkers) may be required for a clinically deployable screening tool.

Random Forests is the preferred classifier — lower error, lower variance, and clinically interpretable importance scores

Random Forests is the clear preferred method for this dataset. It combines lower average test misclassification (15.7% vs 20.7%), markedly higher result stability (std 0.010 vs 0.059), and variable importance scores that align with established clinical knowledge, with blood pressure and age as leading CHD risk factors. The OOB error of 14.77% from the best model provides an out-of-sample performance estimate without requiring a separate validation set, further supporting its suitability for clinical risk stratification applications.