Applied ML · Public Health & Sustainability

Life Expectancy Trend Modeler

Comparative analysis of k-NN regression and Gaussian kernel smoothing across ten parameter configurations on 217 years of Russian life expectancy data — illustrating the bias-variance tradeoff in non-parametric regression.

Methods: k-NN Regression · Gaussian Kernel Smoothing

Tech Stack

R · k-NN Regression · Kernel Smoothing

217

Annual Data Points · Russia 1800–2016

10

Smoothing Configurations Evaluated

16 → 70.87

Life Expectancy Range (years) · 1943 Low · 2016 High

The Problem

Russia's life expectancy from 1800 to 2016 follows a complex multi-phase trend with high local variation — a pattern that linear regression systematically misrepresents and that demands a flexible non-parametric approach

Life expectancy trends over long historical periods are rarely smooth or linear — they reflect the compounded effects of disease, war, medical advances, economic conditions, and public health infrastructure, each operating on different timescales and with different magnitudes. Russia's life expectancy record from 1800 to 2016 is a particularly challenging modeling target: it holds nearly constant through the mid-19th century, begins a non-steady upward climb through the early 20th century with significant interruptions (reaching its minimum of 16 years in 1943 during wartime), recovers and grows through the post-war era, then plateaus again approaching 2016 with a maximum of 70.87 years. This three-phase structure — constant, non-monotone growth, constant — violates the linearity assumption at multiple points simultaneously. A single linear fit would produce systematic errors across all three phases. The modeling challenge is not just fitting the trend, but doing so in a way that respects the high local variation in the growth phase without overfitting the noise that accompanies it.

The Solution

A ten-configuration benchmark of k-NN regression and Gaussian kernel smoothing — five parameter values per method — evaluated on the bias-variance tradeoff to identify the best non-parametric approach for this longitudinal health trend

k-NN regression is evaluated at five neighborhood sizes: k=1, 3, 8, 25, and 85. At k=1, the model interpolates the training data exactly: zero training error, maximum variance. At k=85, about 39% of the dataset (85 of 217 points) contributes to each prediction: high bias, low variance. Gaussian kernel smoothing is evaluated at five bandwidth values: h=0.5, 5, 10, 25, and 85. At h=0.5, only the query year's immediate neighbors receive meaningful weight, producing an overfit curve comparable to k=1. At h=85, distant points receive nearly as much weight as nearby ones, collapsing the curve toward a flat mean. Both methods are compared through visual inspection of fitted curves, point-wise squared error (reported as MSE) at representative test points (including the 1943 minimum at y=16 and the 2015 near-maximum at y=70.83), and qualitative analysis of boundary behavior and spike characteristics. The goal is not to select a single optimal parameter but to characterize the full bias-variance spectrum for each method and determine which approach is more appropriate for this type of data.

Key Outcome

Both methods perform comparably across their parameter ranges — neither is definitively superior on this dataset — but Gaussian kernel smoothing is preferred for its distance-weighted averaging, which produces smoother and less spiky curves than k-NN's flat neighborhood averaging, and for its more stable boundary behavior where k-NN's fixed nearest-neighbor set causes boundary flattening that kernel smoothing avoids.

Technical Deep Dive

Methodology & Analysis

Analytical Workflow

Stage 1 — Dataset Selection & Characterization

Dataset

Russia · 217 Annual Points

Year (1800–2016) vs Life Expectancy · Min 16 yrs (1943) · Max 70.87 yrs (2016) · Mean 41.99 yrs

Trend Structure

Three Structural Phases

Near-constant (pre-1850s) · Non-steady growth (1850s–1980s) · Near-constant (1980s–2016)

Country Selection

Russia — Most Challenging Non-Linear Case

Selected from 16 countries · Linear fit performs worst here · High local variation in growth phase

Stage 2 — k-NN Regression · Five Configurations

Low k — High Variance

k = 1, 3, 8

k=1: interpolates training data exactly · Near-zero bias · Maximum variance · Highly spiky curve

High k — High Bias

k = 25, 85

k=85 at (1943, 16): predicted 47.56, MSE 995.96 · Flattened boundaries · Important features lost

Stage 3 — Gaussian Kernel Smoothing · Five Configurations

Small h — High Variance

h = 0.5, 5, 10

h=0.5: nearest neighbor dominates weight · Near-zero bias · Curve nearly interpolates training data

Large h — High Bias

h = 25, 85

h=85 at (1943, 16): predicted 46.98, MSE 959.82 · Distant points weighted almost as heavily as nearby ones · Over-smoothed

Stage 4 — Comparative Evaluation & Method Preference

Conclusion

Gaussian Kernel Preferred — Smoother Curves · Better Boundary Behavior · Distance-Weighted

Both methods perform reasonably · Kernel produces less spiky curves · k-NN boundaries flatten because the same nearest-neighbor set is reused near the edges

Stage 2

k-NN Regression

k-NN regression estimates the life expectancy at a given year by averaging the observed values of the k nearest years in the training set, with each of the k neighbors contributing equally regardless of distance. At k=1, the prediction at any training year equals that year's observed value, producing a curve that follows the training data perfectly at the cost of learning its noise: zero training error, high variance on new data. At k=85, approximately 39% of all data points contribute equally to each prediction; the sharp wartime minimum of 16 years in 1943 is pulled toward 47.56 (MSE 995.96) by the surrounding higher-valued years that dominate the large neighborhood. The 2015 near-maximum (actual 70.83) is predicted at 60.61 (MSE 104.54 at k=85), illustrating how large k also flattens the boundaries, where almost all of the 85 neighbors lie on one side of the query year and reach back to the 1930s.
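
The averaging rule described above can be sketched in a few lines of base R. This is an illustrative sketch, not the project's actual code: the function name `knn_predict` and the nine-point toy series are assumptions made for the example, and the toy values are not the Russian data.

```r
# Minimal k-NN regression for a one-dimensional predictor (base R only).
# x_train, y_train: training years and observed values
# x_query: years at which to predict; k: number of neighbours to average
knn_predict <- function(x_train, y_train, x_query, k) {
  stopifnot(length(x_train) == length(y_train), k >= 1, k <= length(x_train))
  sapply(x_query, function(x0) {
    # order() is stable, so distance ties go to the earlier training row
    nearest <- order(abs(x_train - x0))[seq_len(k)]
    mean(y_train[nearest])  # every neighbour gets weight 1/k
  })
}

# Toy series (NOT the Russian data): flat, one sharp dip, then a recovery
years  <- 2000:2008
values <- c(50, 50, 51, 30, 52, 53, 54, 54, 55)

fit_k1 <- knn_predict(years, values, years, k = 1)  # reproduces values exactly
fit_k3 <- knn_predict(years, values, years, k = 3)  # dip at 2003 becomes mean(51, 30, 52)
```

Even on this toy series the tradeoff is visible: k=1 returns the observed dip unchanged, while k=3 already lifts it by averaging in the two adjacent years.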

Stage 3

Gaussian Kernel Smoothing

Gaussian kernel smoothing uses all 217 data points to estimate each prediction but assigns exponentially decaying weights based on squared distance from the target year — points close to the query year receive high weight, distant points receive near-zero weight. Bandwidth h controls the rate of decay: at h=0.5 the kernel is extremely narrow and only the nearest neighbor receives meaningful weight, reproducing the overfitting behavior of k=1; at h=85, the kernel is so wide that 1800s observations materially influence 1990s predictions — pulling the 1943 minimum to 46.98 (MSE 959.82), nearly identical to k=85's result. The key structural difference from k-NN is that kernel weighting is continuous and distance-sensitive, producing smoother curves with fewer sharp spikes and more stable behavior at the data boundaries.
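
A minimal base R sketch of this weighted average (the Nadaraya-Watson form of kernel smoothing) follows. It assumes that h is the standard deviation of the Gaussian kernel in years; the original analysis may parameterize bandwidth differently, and `kernel_predict` and the toy series are hypothetical names and values used only for illustration.

```r
# Gaussian kernel smoother as a weighted average over all points (base R only).
# Assumption: h is the kernel's standard deviation, in years.
kernel_predict <- function(x_train, y_train, x_query, h) {
  stopifnot(length(x_train) == length(y_train), h > 0)
  sapply(x_query, function(x0) {
    w <- dnorm(x_train - x0, sd = h)  # weight decays with squared distance
    sum(w * y_train) / sum(w)         # weighted average over all points
  })
}

# Toy series (NOT the Russian data): flat, one sharp dip, then a recovery
years  <- 2000:2008
values <- c(50, 50, 51, 30, 52, 53, 54, 54, 55)

fit_narrow <- kernel_predict(years, values, years, h = 0.5)  # tracks the dip closely
fit_wide   <- kernel_predict(years, values, years, h = 85)   # close to mean(values)
```

With the wide bandwidth every fitted value sits near the overall mean of the toy series, which is the same collapse toward a flat line described for h=85.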

Key Methodological Choices

Russia selected as the most challenging non-linear case across 16 countries

The full dataset contains life expectancy records for 16 countries from 1800 to 2016. Most countries exhibit a relatively smooth monotone increase — making them tractable for even moderately flexible regression methods. Russia was selected precisely because it is the hardest case: a non-monotone growth phase with severe disruptions (wartime collapse to 16 years in 1943, post-war recovery, late-century plateau) that would defeat any linear fit. By selecting the most challenging country, the analysis maximizes the discriminating power of the comparison — revealing differences between k-NN and kernel smoothing that would be invisible on a smooth dataset.

Equal neighborhood weighting in k-NN versus distance-sensitive weighting in kernel smoothing

The fundamental distinction between the two methods is the weighting scheme: k-NN assigns weight 1/k to each of the k nearest neighbors and zero to all others — regardless of how far the nearest neighbors are from the query point. Kernel smoothing assigns continuously decaying weights to all points based on distance — the closest points contribute most, distant points contribute negligibly. In practice this means k-NN treats all k neighbors as equally informative, even if some are much farther away than others. Kernel smoothing's distance-sensitive weighting produces smoother curves because it avoids the sudden weight changes that occur in k-NN when points enter or leave the k-nearest neighborhood as the query point moves.
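
The two weighting schemes can be compared directly by computing the weight each training year receives for a single query year. The sketch below uses only the year grid, so no life expectancy values are needed; k=25 and h=10 are picked from the evaluated grids purely for illustration, and h is assumed to be the kernel's standard deviation.

```r
# Weights assigned to each training year for one query year,
# under k-NN (flat) and a Gaussian kernel (decaying). Base R only.
years <- 1800:2016  # same span as the dataset
x0    <- 1943       # query year

# k-NN: 1/k for the k nearest years, 0 for all others
k <- 25
w_knn <- numeric(length(years))
w_knn[order(abs(years - x0))[seq_len(k)]] <- 1 / k

# Gaussian kernel (assumption: h is the standard deviation in years)
h <- 10
w_ker <- dnorm(years - x0, sd = h)
w_ker <- w_ker / sum(w_ker)  # normalise so the weights sum to 1

unique(w_knn)                           # only two distinct values: 0 and 1/k
w_ker[years %in% c(1943, 1953, 1963)]   # falls off smoothly with distance
```

The k-NN weight vector is a step function that jumps between 0 and 1/k, whereas the kernel weights are positive everywhere and shrink gradually, which is the source of the smoother fitted curves.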

Kernel smoothing preferred — smoother curves, distance-weighted averaging, stable boundaries

Both methods capture the general multi-phase trend in Russia's life expectancy with appropriate parameter tuning — neither is definitively superior in predictive terms. The preference for kernel smoothing rests on three structural advantages. First, its distance-weighted averaging produces smoother curves with fewer sharp spikes than k-NN's flat averaging — particularly visible in the high-variation growth phase of the data. Second, k-NN's boundary behavior degrades because the same limited set of nearby points keeps contributing as nearest neighbors at the data edges, flattening the curve artificially. Third, kernel smoothing's use of all data points with decaying weights is more robust to the fixed-neighborhood artifacts that make k-NN inconsistent in regions of low data density.

Tech Stack

Technology	Purpose
R	Statistical analysis environment and primary implementation language
k-NN Regression	Non-parametric neighborhood averaging — equal weights across k nearest neighbors; evaluated at k=1, 3, 8, 25, 85
Gaussian Kernel Smoothing	Distance-weighted non-parametric regression using the Gaussian (Normal) kernel; evaluated at h=0.5, 5, 10, 25, 85
MSE (Point-wise)	Evaluation metric at representative test points — 1943 minimum (y=16) and 2015 near-maximum (y=70.83)

Results & Metrics

What the analysis reveals

995.96

MSE — k=85 at 1943 Low

Predicted 47.56 vs actual 16 — large neighborhood pulls prediction toward surrounding higher-valued years

959.82

MSE — h=85 at 1943 Low

Predicted 46.98 vs actual 16 — confirms both methods fail comparably at extreme over-smoothing

10

Configurations Benchmarked

Five k values and five bandwidth values — covering the full bias-variance spectrum for each method

📉

At low k and low h — both methods overfit and learn noise

At k=1, the predicted life expectancy at any query year equals the observed value of its nearest neighboring year exactly — the training error is zero but the model has memorized noise rather than learned the trend. The fitted curve is highly spiky, reacting to every fluctuation in the historical data including temporary disruptions and measurement inconsistencies. At h=0.5, the Gaussian kernel is so narrow that only the immediate neighbor receives meaningful weight — producing a curve nearly identical in behavior to k=1. Both extremes are textbook high-variance models: perfectly fit to training data, poorly generalizable to unseen years or new country records.
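
How narrow the h=0.5 kernel is can be checked on the annual grid alone. The sketch below computes the share of total weight that a training year assigns to itself, again assuming h is the Gaussian standard deviation in years; `self_share` is a hypothetical helper, not part of the original analysis.

```r
# Share of total Gaussian weight that a training year gives to itself
# on an annual grid (assumption: h is the standard deviation in years).
years <- 1800:2016

self_share <- function(x0, h) {
  w <- dnorm(years - x0, sd = h)
  w[years == x0] / sum(w)
}

self_share(1900, h = 0.5)  # about 0.79
self_share(1900, h = 5)    # about 0.08
```

Under this parameterization the query year itself carries roughly 79% of the weight at h=0.5, with most of the remainder on the two adjacent years, so the fit stays close to the data without interpolating it exactly as k=1 does.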

📈

At high k and high h — both methods over-smooth and miss critical features

At k=85 and h=85, predictions at the 1943 wartime minimum (actual value: 16 years) reach 47.56 and 46.98 respectively, with squared errors approaching 1,000 (995.96 and 959.82), because the large neighborhood or wide bandwidth pulls the prediction toward the surrounding higher-valued decades. The sharp wartime collapse is completely absorbed into the 85-point neighborhood or the h=85 kernel window, producing a smooth curve that entirely misses the most historically significant event in the dataset. Both methods produce nearly identical results at maximum over-smoothing, confirming that the bias at extreme parameter values is driven by the same underlying mechanism: too much averaging over structurally different periods.
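
The point-wise errors at the 1943 minimum are squared differences between prediction and observation. Recomputing them from the rounded predictions quoted above gives values within about 0.1 of the reported figures; the small gap is presumably because the reported errors were computed from unrounded predictions.

```r
# Point-wise squared error at the 1943 minimum, using the rounded
# predictions quoted in the text.
actual    <- 16
predicted <- c(knn_k85 = 47.56, kernel_h85 = 46.98)

sq_error <- (predicted - actual)^2
round(sq_error, 2)
# knn_k85: 996.03, kernel_h85: 959.76
```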

⚖️

Both methods perform comparably across intermediate parameter values

In the middle of the parameter range — k=8 or k=25 for k-NN, h=5 or h=10 for kernel smoothing — both methods capture the general three-phase structure of Russia's life expectancy trend without catastrophically overfitting or over-smoothing. Neither method is demonstrably superior in predictive accuracy at intermediate parameters; the choice between them at equivalent smoothing levels does not produce meaningfully different fitted curves for this dataset. The performance gap between methods is most visible at the extremes of the parameter range, and in the smoothness and boundary behavior of the fitted curve rather than in the MSE at specific test points.

🌊

Kernel smoothing produces less spiky curves and more stable boundary behavior

The structural advantage of kernel smoothing over k-NN manifests in two ways. First, k-NN's flat equal-weight averaging over the k-nearest neighborhood produces visible spikes wherever consecutive data points have sharply different values, because an anomalous year contributes the full 1/k weight as soon as it enters the neighborhood and nothing once it leaves. Kernel smoothing's exponentially decaying weights phase the influence of such a year in and out gradually as the query year moves, producing a smoother curve at the same effective smoothing level. Second, k-NN's boundary behavior degrades because the same nearby points form the neighborhood for every query year near the data edges, flattening the early 1800s and post-2010 regions artificially. Kernel smoothing avoids this artifact because all data points remain in the weighted average throughout and their weights shift continuously with the query year.
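
The boundary effect can be reproduced from the year grid alone by listing which years form the k-NN neighborhood as the query year approaches the right-hand edge. The sketch assumes all 217 annual points serve as training data; `neighbour_range` is a hypothetical helper introduced for the example.

```r
# Which training years form the k-NN neighbourhood near the right-hand edge?
years <- 1800:2016
k     <- 85

neighbour_range <- function(x0) {
  nearest <- order(abs(years - x0))[seq_len(k)]
  range(years[nearest])  # earliest and latest year in the neighbourhood
}

neighbour_range(2016)  # 1932 2016
neighbour_range(1990)  # 1932 2016
neighbour_range(1974)  # 1932 2016
neighbour_range(1973)  # 1931 2015
```

Under that assumption, every query year from 1974 to 2016 averages the same 85 observations at k=85, so the fitted curve is exactly flat over that stretch. A kernel smoother re-weights the points for each query year and does not produce this plateau.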

🗺️

Russia's life expectancy data is a robust benchmark for bias-variance tradeoff analysis

The three-phase structure of Russia's life expectancy record — with a near-constant early period, a turbulent non-monotone growth phase including a wartime collapse to 16 years, and a late plateau — creates a natural stress test for non-parametric smoothers. The early flat period rewards high-k or high-h configurations. The growth phase demands local responsiveness. The wartime minimum is a sharp outlier that punishes over-smoothing. No single parameter value handles all three phases optimally — making this dataset a clean pedagogical benchmark for the bias-variance tradeoff and a practical demonstration that parameter selection for non-parametric regression is inherently a problem-specific decision.
