Applied ML · Forensics

Glass Type Forensic Clustering

Unsupervised clustering of 214 crime-scene glass specimens using Hierarchical, K-Means, and Gaussian Mixture Model methods to identify which chemical composition variables best discriminate glass types as forensic evidence.

Methods: Unsupervised Clustering
Tech Stack: R, mclust, K-Means, Ward's Linkage

214

Crime-Scene Glass Specimens · 9 Chemical Variables

3

Clustering Methods Benchmarked

0.45

Best Silhouette — K-Means (2 Clusters)

The Problem

Glass evidence recovered from crime scenes needs to be classified by type, but the chemical composition of different glass categories overlaps substantially, making unsupervised separation a genuinely difficult clustering problem.

Glass fragments recovered at crime scenes are a key form of physical evidence in forensic science: the type of glass (building window, vehicle window, container, tableware, headlamp) can link suspects to locations, incidents, or vehicles. Identification relies on chemical analysis of the glass: its refractive index plus the weight percentages of eight elements measured as oxides (sodium, magnesium, aluminum, silicon, potassium, calcium, barium, and iron), nine variables in total. The challenge is that different glass types share overlapping chemical profiles. Silicon and calcium, for instance, show limited discriminating power across glass categories, while barium and magnesium vary more meaningfully between types. With no labeled training data available at the point of evidence analysis, the classification problem is inherently unsupervised: the goal is to discover which specimens cluster together based on chemical similarity, and whether those clusters correspond interpretably to known glass types without being trained on those labels.

The Solution

A three-method unsupervised clustering benchmark — Hierarchical (Ward's and complete linkage), K-Means, and Gaussian Mixture Models — evaluated on silhouette width, ARI, and misclassification rate across both free and semi-supervised configurations

The analysis applies three clustering families to 214 glass specimens, each described by 9 scaled chemical composition variables. Hierarchical clustering is tested with Ward's and complete linkage using Euclidean distance, with the optimal cluster count determined by silhouette width. K-Means clustering considers both the best-silhouette and elbow-plot methods for choosing the cluster count, falling back on best-silhouette when the elbow plot yields no clear inflection. Gaussian Mixture Model clustering via the mclust package fits 14 model families with varying covariance constraints and selects the best model by BIC. All three methods are evaluated at their unconstrained optimal cluster counts and in a semi-supervised configuration where the cluster count is fixed at 6 to match the number of known glass types. Each configuration is assessed on three metrics: silhouette width (cluster compactness and separation), Adjusted Rand Index (ARI, agreement with known glass types), and misclassification rate. The glass type labels are withheld during clustering and used only post-hoc for evaluation.

Key Outcome

K-Means leads or ties on every evaluation metric: the highest silhouette (0.45 at 2 clusters), a tie for the highest ARI (0.19), and the lowest misclassification rate (0.57 at 6 clusters). Its 2-cluster model interpretably separates headlamp glass from all other types: the headlamp cluster is high in barium, aluminum, and sodium, while the remaining cluster is high in magnesium and iron, replicating the dominant chemical distinction between lamp glass and window or container glass.

Technical Deep Dive

Methodology & Analysis

Analytical Workflow

Stage 1 — Data Preparation

Dataset

214 Specimens · 9 Variables

RI, Na, Mg, Al, Si, K, Ca, Ba, Fe · No missing data · No multicollinearity (<0.50) · Outliers retained

Preprocessing

Variable Scaling

All 9 variables scaled to zero mean & unit variance · Prevents magnitude-driven distance distortion

Evaluation

3-Metric Framework

Silhouette width · ARI vs known glass types · Misclassification rate — labels used post-hoc only

Stage 2 — Three-Method Clustering

Hierarchical

Ward's & Complete Linkage

Euclidean distance · Ward's best: silhouette 0.402, ARI 0.15 (2 clusters) · Semi-supervised 6-cluster: ARI 0.19

K-Means

Best-Silhouette Method

2 clusters optimal · Silhouette 0.45, ARI 0.19 · Semi-supervised 6 clusters: misclassification 0.57 (best)

GMM · mclust

VEV Model · 5 Clusters

14 model families evaluated · BIC-selected VEV (ellipsoidal, equal shape) · BIC −1322.79 · ARI 0.15

Stage 3 — Post-Hoc Interpretation & Variable Analysis

Parallel Coordinates & Variable Discrimination

K-Means 2-Cluster Model — Headlamp vs All Other Glass Types

Cluster 2 (headlamps): high Na, Al, Ba · low Mg, Fe · Cluster 1 (all others): high Mg, Fe · low Na, Al, Ba · RI, Si, K, Ca — poor discriminators across all methods

Stage 1

Data Preparation & Scaling

The dataset contains 214 glass specimens with 9 continuous chemical composition variables: refractive index (RI) and weight percentages of sodium (Na), magnesium (Mg), aluminum (Al), silicon (Si), potassium (K), calcium (Ca), barium (Ba), and iron (Fe). The glass type variable (categories 1, 2, 3, 5, 6, 7) is excluded from clustering and reserved for post-hoc evaluation. The dataset contains no missing data and no multicollinearity between predictors (all correlations below 0.50). Some outliers were detected but retained as forensically meaningful observations. All 9 variables are scaled to zero mean and unit variance before clustering. Without scaling, variables measured on larger scales, such as silicon and calcium, would dominate the Euclidean distance calculations and drown out lower-magnitude but potentially more discriminative variables like barium and iron.
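The scaling step can be illustrated with a minimal R sketch. The data frame below is a synthetic stand-in with made-up values (an assumption for illustration, not the real specimens), and the variable names `glass_like` and `glass_scaled` are hypothetical.

```r
# Synthetic stand-in for three of the nine glass variables
# (hypothetical values, not the real specimens).
set.seed(1)
glass_like <- data.frame(
  RI = rnorm(50, mean = 1.518, sd = 0.003),
  Si = rnorm(50, mean = 72.6,  sd = 0.8),
  Ba = rexp(50, rate = 5)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1.
glass_scaled <- scale(glass_like)

# Before scaling, the column with the largest spread contributes most to
# Euclidean distance; after scaling, every column has the same spread.
raw_sd    <- sapply(glass_like, sd)
scaled_sd <- apply(glass_scaled, 2, sd)
print(round(raw_sd, 4))
print(round(scaled_sd, 4))
```

The same call applied to the full 9-column data frame would produce the matrix passed to every clustering method.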

Stage 2

Three-Method Clustering Benchmark

Hierarchical clustering is applied with Ward's method (which minimizes within-cluster variance) and complete linkage, both using Euclidean distance. The optimal cluster count is identified by silhouette width, and both linkage methods select 2 clusters. Semi-supervised configurations with 6 clusters are also evaluated. K-Means clustering iterates until cluster assignments stabilize, with the cluster count selected by silhouette width because the elbow plot gives no clear inflection. GMM-based clustering via mclust evaluates all 14 covariance model families, which vary the constraints on cluster volume, shape, and orientation, and selects the best model by BIC. The BIC-selected model is VEV (ellipsoidal clusters with equal shape but variable volume and orientation), producing 5 clusters.
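A minimal sketch of the hierarchical step in base R follows. The data are synthetic (an assumption), and the write-up does not say whether `"ward.D"` or `"ward.D2"` was used, so the choice of `"ward.D2"` here is also an assumption.

```r
set.seed(42)
# Hypothetical stand-in: two separated groups in 4 dimensions.
x <- rbind(
  matrix(rnorm(40 * 4, mean = 0), ncol = 4),
  matrix(rnorm(20 * 4, mean = 4), ncol = 4)
)
x <- scale(x)

# Pairwise Euclidean distances between scaled observations.
d <- dist(x, method = "euclidean")

# "ward.D2" applies Ward's criterion to unsquared Euclidean distances.
hc_ward     <- hclust(d, method = "ward.D2")
hc_complete <- hclust(d, method = "complete")

# Cut each dendrogram into k groups (k = 2 here; k = 6 for the forced case).
ward_2     <- cutree(hc_ward, k = 2)
complete_2 <- cutree(hc_complete, k = 2)

print(table(ward_2))
print(table(ward = ward_2, complete = complete_2))
```

The cross-tabulation at the end shows how far the two linkage methods agree on the same cut.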

Stage 3

Post-Hoc Interpretation & Variable Analysis

The best-performing K-Means 2-cluster model is analyzed using a parallel coordinates plot across all 9 chemical variables. The plot reveals which variables drive cluster separation and which provide no discriminating power. Cluster profiles are compared against the known glass type structure, revealing that the 2-cluster solution cleanly separates headlamp glass (Type 7) from all other types based on its distinctive chemical signature: high barium, aluminum, and sodium versus low magnesium and iron. This finding is cross-checked against a parallel coordinates plot of the original glass types, which shows that the same variables (RI, Si, K, Ca) fail to separate glass types in both the unsupervised clusters and the known type structure.
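A numeric counterpart to reading the parallel coordinates plot is to compare per-cluster means of the scaled variables. The sketch below uses synthetic data built so that two variables separate the groups and one does not; the values and the name `toy` are hypothetical.

```r
set.seed(7)
# Hypothetical stand-in: groups differ on Ba and Mg but not on Si.
n1 <- 60; n2 <- 20
toy <- data.frame(
  Ba = c(rnorm(n1, 0.1, 0.05), rnorm(n2, 1.0, 0.3)),
  Mg = c(rnorm(n1, 3.5, 0.3),  rnorm(n2, 0.5, 0.3)),
  Si = rnorm(n1 + n2, 72.6, 0.7)
)
toy_scaled <- scale(toy)

km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Per-cluster means of the scaled variables, one row per cluster.
profile <- aggregate(as.data.frame(toy_scaled),
                     by = list(cluster = km$cluster), FUN = mean)

# Absolute gap between the two cluster means for each variable;
# a gap near 0 marks a variable that does little to separate the clusters.
gap <- abs(unlist(profile[1, -1]) - unlist(profile[2, -1]))
print(round(sort(gap, decreasing = TRUE), 2))
```

For the plot itself, `MASS::parcoord(toy_scaled, col = km$cluster)` draws one line per observation colored by cluster.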

Key Methodological Choices

Best-silhouette over elbow for K-Means cluster selection

The elbow plot of within-cluster sum of squares versus cluster count shows a gradual, continuous decline with no sharp inflection, a common occurrence in real-world data with overlapping classes and high intra-class variability, so it offers no clear guidance here. The silhouette method, which measures how similar each point is to its own cluster versus the nearest alternative cluster, provides a principled, data-driven criterion that works even when the elbow is ambiguous. Silhouette width peaks at 2 clusters (0.45), indicating that the partition the data supports most strongly is binary rather than six-way, despite the dataset containing six known glass categories.
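The selection rule described above can be sketched as follows, assuming the `cluster` package is available (it is a recommended package bundled with most R installations). The data are synthetic and the range of candidate counts `2:8` is an assumption.

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Hypothetical stand-in: two separated groups in 5 scaled dimensions.
x <- scale(rbind(
  matrix(rnorm(60 * 5, mean = 0), ncol = 5),
  matrix(rnorm(30 * 5, mean = 3), ncol = 5)
))
d <- dist(x)

ks <- 2:8

# Average silhouette width for each candidate cluster count.
avg_sil <- sapply(ks, function(k) {
  km  <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])
})

# Total within-cluster sum of squares, the quantity an elbow plot displays.
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_sil = round(avg_sil, 3), wss = round(wss, 1)))
print(best_k)
```

The within-cluster sum of squares keeps falling as k grows, which is why it needs a visible elbow to be useful, whereas the silhouette average has a maximum that can be read off directly.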

BIC for GMM model selection — evaluating all 14 covariance families

Gaussian Mixture Models in mclust offer 14 parameterizations of the covariance matrix, ranging from spherical clusters of equal volume (EII) to fully variable ellipsoidal clusters (VVV), each encoding different assumptions about cluster geometry. Rather than fixing a single covariance model by assumption, mclust evaluates all 14 families across a range of cluster counts and selects the best combination by BIC, which balances goodness of fit against model complexity. mclust reports BIC with the sign convention in which larger values are better, so the selected model is the one with the highest BIC. The selected VEV model (ellipsoidal, equal shape) represents a principled compromise: clusters can vary in volume and orientation but share the same shape, which suits a dataset where glass types may differ in how tightly they cluster and in which direction they spread while keeping the same relative elongation.
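A minimal sketch of BIC-based model selection with mclust is shown below. It assumes the `mclust` package is installed, uses synthetic data, and picks `G = 1:6` as the candidate range, which is an assumption rather than the range used in the analysis.

```r
library(mclust)  # assumed installed, e.g. via install.packages("mclust")

set.seed(2024)
# Hypothetical stand-in: two groups with different spread in 3 dimensions.
x <- scale(rbind(
  matrix(rnorm(80 * 3, mean = 0, sd = 1.0), ncol = 3),
  matrix(rnorm(40 * 3, mean = 5, sd = 0.5), ncol = 3)
))

# With modelNames left unspecified, Mclust() tries every covariance family
# available for multivariate data across the cluster counts in G.
fit <- Mclust(x, G = 1:6, verbose = FALSE)

print(fit$modelName)  # selected covariance family
print(fit$G)          # selected number of mixture components
print(fit$bic)        # BIC of the selected model (larger is better in mclust)
```

The per-observation cluster labels are in `fit$classification`, and `plot(fit, what = "BIC")` shows the BIC curve for every family.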

Semi-supervised evaluation at 6 clusters — testing forced alignment with known categories

All three methods are evaluated not only at their unsupervised optimal cluster count but also in a semi-supervised configuration where 6 clusters are forced, one per known glass type. Only the cluster count is borrowed from the labels; the labels themselves are never used in fitting. This two-stage evaluation is methodologically important: the unsupervised configuration reveals the structure the data supports on its own, while the forced 6-cluster configuration tests whether the algorithms can recover the full glass type taxonomy when given the correct number of groups. The comparison shows that semi-supervised K-Means reduces misclassification from 0.67 to 0.57 at the cost of a lower silhouette (0.32 vs 0.45), suggesting that part of the 6-type structure is present in the data but is harder to recover than the dominant 2-group separation.
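The two label-based metrics can be sketched in base R. The ARI function follows the standard contingency-table formula. The misclassification function relabels each cluster with its majority class, which is one possible convention and an assumption here, since the write-up does not state how clusters were matched to types. The function names and the toy labels are hypothetical.

```r
# Adjusted Rand Index from the contingency table of two labelings.
adjusted_rand <- function(a, b) {
  tab   <- table(a, b)
  pairs <- function(n) n * (n - 1) / 2
  sum_cells <- sum(pairs(tab))
  sum_rows  <- sum(pairs(rowSums(tab)))
  sum_cols  <- sum(pairs(colSums(tab)))
  expected  <- sum_rows * sum_cols / pairs(length(a))
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Misclassification rate after relabeling each cluster with its majority class
# (an assumed matching rule, not necessarily the one used in the analysis).
misclassification <- function(cluster, truth) {
  tab <- table(cluster, truth)
  1 - sum(apply(tab, 1, max)) / length(truth)
}

# Tiny hypothetical example: 3 true types, 2 clusters.
truth   <- c("A", "A", "A", "B", "B", "B", "C", "C")
cluster <- c(1, 1, 1, 1, 1, 1, 2, 2)

print(adjusted_rand(cluster, truth))      # 0.4
print(misclassification(cluster, truth))  # 0.375
```

With fewer clusters than types, some types can never be the majority of any cluster, which is one reason a 2-cluster solution scores a high misclassification rate against six labels. The mclust package also ships `adjustedRandIndex()` and `classError()` for the same purpose.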

Tech Stack

Technology | Purpose
R | Statistical analysis environment and primary implementation language
mclust (R package) | Gaussian Mixture Model clustering; evaluates all 14 covariance families with BIC-based model selection
K-Means | Non-hierarchical partitioning minimizing within-cluster variation; cluster count selected by silhouette
Hierarchical Clustering | Agglomerative bottom-up clustering with Ward's method and complete linkage using Euclidean distance
Silhouette Width | Primary cluster count selection criterion; measures intra-cluster cohesion versus inter-cluster separation
Adjusted Rand Index | Post-hoc agreement measure between discovered clusters and known glass type labels

Results & Metrics

What the analysis reveals

0.45

Best Silhouette — K-Means

At 2 clusters — highest cluster compactness and separation across all methods and configurations

0.57

Best Misclassification — K-Means

Semi-supervised 6-cluster configuration — lowest misclassification rate across all methods

4

Poor Discriminating Variables

RI, Si, K, Ca — fail to separate glass types in both clusters and original type structure

🏆

K-Means leads or ties on silhouette, ARI, and misclassification

K-Means achieves the highest silhouette width (0.45 at 2 clusters), ties for the highest ARI (0.19 at 2 clusters), and achieves the lowest misclassification rate (0.57 at 6 clusters). No other method beats it on any individual metric. Ward's hierarchical clustering reaches silhouette 0.402 and ARI 0.15 at 2 clusters, matching the 0.19 ARI only in its forced 6-cluster configuration; complete linkage reaches silhouette 0.407 but an ARI of only 0.03; GMM reaches ARI 0.15 and misclassification 0.66. This consistent multi-metric performance makes K-Means the preferred method for this dataset despite the overall difficulty of the clustering problem.

💡

The 2-cluster solution isolates headlamp glass from all other types

Parallel coordinates analysis of the K-Means 2-cluster model reveals a clear and interpretable separation. Cluster 2 is characterized by high sodium, aluminum, and barium, and by low magnesium and iron. Cluster 1 has the opposite profile: high magnesium and iron, low sodium, aluminum, and barium. Comparison with the original glass type labels shows that Cluster 2 corresponds to Type 7 (headlamps), a glass category chemically distinct from building windows, vehicle windows, containers, and tableware, most visibly in its barium content. The unsupervised algorithm recovers this chemically meaningful distinction without any label information.

🔬

RI, Si, K, and Ca show poor discriminating power — confirmed independently by clusters and known types

The parallel coordinates analysis identifies four variables that fail to separate either the K-Means clusters or the original glass type categories: refractive index (RI), silicon (Si), potassium (K), and calcium (Ca). The finding is reinforced by the agreement of two views of the data: the unsupervised clustering result and the known type structure both point to the same four variables. Silicon and calcium are present in high concentrations across nearly all glass types, and their values overlap so heavily across categories that they contribute negligible discriminating signal regardless of the clustering algorithm used.

⚖️

High intra-class variability limits all methods — further analysis required for full 6-type separation

All methods struggle to recover the full six-category glass type structure: ARI values top out at 0.19 and misclassification rates stay at or above 0.57 even in the best configurations. The primary cause is high chemical variability within each glass type. Building windows manufactured by different processes share a type label but can vary substantially in elemental composition, making them hard to cluster together without supervision. The low ARIs reflect a genuinely difficult clustering problem with fuzzy class boundaries rather than algorithmic failure. Further analysis, potentially incorporating additional chemical markers or dimensionality reduction, would be needed to achieve meaningful six-way separation.