Applied ML · Agriculture

AgriGrain: Grain Type Classification

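A multiclass classification pipeline that benchmarks five algorithm families: Bagging, Random Forests, Boosting, Mixture Discriminant Analysis (MDA), and Artificial Neural Networks (ANNs). The models are compared on a stratified sample drawn from 13,611 grain records with 16 geometric shape features to automate grain type identification.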

Architecture Ensemble Learning · Neural Networks · Discriminant Analysis

Tech Stack

R caret randomForest

Source Code View on GitHub

13,611

Grain Records · 7 Classes · 16 Shape Features

5

ML Algorithm Families Benchmarked

91.1%

Best Test Accuracy (RF & ANN)

The Problem

Manual grain type identification is slow, inconsistent, and hard to scale. Agricultural quality control needs a data-driven classification system that can reliably distinguish grain varieties from geometric image features alone.

Grain type identification is a critical step in agricultural quality control and food production, because different grain varieties have different nutritional profiles, market values, and processing requirements. Traditional inspection relies on manual visual assessment, which is time-intensive, subject to human error, and impractical at the volumes of grain processed in modern agricultural operations. Computer vision systems can extract rich geometric and shape-based features from grain images (perimeter, roundness, compactness, axis lengths, shape factors), but converting those features into reliable multiclass predictions across seven grain varieties requires a classification model that can handle the structural overlap and within-class variability that make this a genuinely difficult problem. No single algorithm is guaranteed to be optimal for multiclass grain classification; the right choice depends on which family of decision boundaries best matches the feature geometry of each grain type.

The Solution

A five-algorithm multiclass classification benchmark in R, with 5-fold cross-validated hyperparameter tuning across Bagging, Random Forests, Boosting, MDA, and ANNs — evaluated on misclassification rate and ARI

AgriGrain implements a comprehensive classification pipeline across five algorithm families using 16 geometric shape features extracted from 13,611 grain images across 7 varieties. A stratified random sample of 1,364 observations is used for computational feasibility, split 75/25 into training and testing sets. All hyperparameters — number of trees, variables per split, interaction depth, shrinkage, hidden layer size, and weight decay — are tuned using 5-fold cross-validation to ensure each algorithm is evaluated at its best configuration rather than a default. Bagging is tuned to 2,100 trees, Random Forests to 6 variables per split with 500 trees, Boosting to depth 3 with λ=0.05 and 200 trees, and ANNs to size 11 with decay 4. MDA fits a BIC-selected Gaussian mixture model per class. All five models are evaluated on misclassification rate and Adjusted Rand Index (ARI) across both training and testing sets, with variable importance analysis conducted for the tree-based methods.

Key Outcome

Random Forests and ANNs jointly achieve the best test accuracy at 91.1% (8.9% misclassification). All five algorithms land within 1.5 percentage points of each other on the test set, which suggests that geometric shape features provide a consistently strong signal for automated grain classification regardless of the modeling approach used.

Technical Deep Dive

Architecture & Design

Classification Pipeline

Stage 1 — Data Setup & Sampling

Dataset

13,611 Grain Records

7 grain classes · 16 geometric shape features · No missing data

Sampling

Stratified Sample — 1,364 Records

Class proportions preserved · 75% train / 25% test split · 5-fold CV for tuning

Features

16 Geometric Shape Predictors

Area, Perimeter, Axis Lengths, Roundness, Compactness, Shape Factors · High inter-feature correlations noted

▼

Stage 2 — Multi-Model Training & CV Tuning

Bagging

2,100 Trees

OOB error 8.87% · Bootstrap aggregation · adabag

Random Forests

500 Trees · 6 Vars/Split

OOB error 8.77% · Decorrelated trees · randomForest

Boosting

200 Trees · λ=0.05 · Depth 3

Sequential residual fitting · adabag

MDA

BIC-Selected Gaussian Mixture

Mixture model per class · Scaled inputs · mclust

ANN

Size 11 · Decay 4

Single hidden layer · Scaled inputs · Softmax output · nnet

▼

Stage 3 — Evaluation & Variable Importance

Metrics

Misclassification Rate & ARI

Reported on both training and testing sets · Standard deviation across models computed for consistency

Variable Importance

Mean Decrease Accuracy & Gini

Computed for Bagging, RF & Boosting · Heatmap visualization across methods

Result

RF & ANN Best on Test: 91.1% Accuracy · All Models Within 1.5 Percentage Points

Perimeter, Shape Factor 1, Compactness, Minor Axis Length, Major Axis Length identified as top predictors · Extent consistently least important

Stage 1

Data Setup & Stratified Sampling

The full dataset contains 13,611 grain records across 7 classes, with 16 continuous geometric shape features extracted from high-resolution grain images using computer vision techniques. Features include area, perimeter, major and minor axis lengths, aspect ratio, eccentricity, roundness, compactness, and four shape factors, capturing both the size and the morphology of each grain. Because fitting and tuning five models on all 13,611 observations is computationally expensive, a stratified random sample of 1,364 records (about 10% of the data) is drawn, preserving the proportion of each grain variety. The sample is split 75/25 into training and testing sets, and 5-fold cross-validation is used for every hyperparameter tuning decision.
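The sampling and splitting steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's R code: the function names, the toy labels, and the choice to stratify the 75/25 split as well as the sample are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(labels, fraction, seed=0):
    """Return sorted row indices that sample about `fraction` of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    chosen = []
    for label in sorted(by_class):
        rows = by_class[label]
        n_take = max(1, round(fraction * len(rows)))
        chosen.extend(rng.sample(rows, n_take))
    return sorted(chosen)

def stratified_split(indices, labels, train_fraction=0.75, seed=0):
    """Split indices into train/test, applying train_fraction within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    train, test = [], []
    for label in sorted(by_class):
        rows = by_class[label][:]
        rng.shuffle(rows)
        cut = round(train_fraction * len(rows))
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test

# Hypothetical imbalanced labels standing in for the grain classes.
labels = ["A"] * 600 + ["B"] * 300 + ["C"] * 100
sample = stratified_sample(labels, fraction=0.10)
train, test = stratified_split(sample, labels)
sample_counts = Counter(labels[i] for i in sample)
```

In this toy run the 10% sample keeps the 6:3:1 class ratio of the hypothetical data, and the train and test index sets do not overlap.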

Stage 2

Multi-Model Training & Hyperparameter Tuning

All five algorithms are trained with CV-optimized hyperparameters. Bagging (2,100 trees) and Random Forests (500 trees, 6 variables per split) use bootstrap aggregation to reduce variance, with Random Forests further decorrelating the trees by considering only a random subset of features at each split. Boosting grows 200 shallow trees (depth 3) sequentially on residuals with shrinkage λ=0.05, building a strong learner incrementally. MDA fits a BIC-selected Gaussian mixture model within each class, so a class need not follow a single elliptical Gaussian and the resulting decision boundaries can be more flexible. The ANN uses a single hidden layer of 11 nodes with weight decay 4 and a softmax output layer for 7-class prediction, trained on scaled inputs so that features with large magnitudes do not dominate.
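Two of the ideas in this stage, input scaling and the softmax output, are easy to show in isolation. The sketch below is plain Python with made-up numbers, not the project's nnet code; it assumes z-score scaling fitted on the training rows only, which is one common way to scale inputs.

```python
import math
import statistics

def fit_scaler(rows):
    """Per-column mean and sample standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return means, sds

def scale(rows, means, sds):
    """Apply z-score scaling using statistics learned on the training set."""
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical rows: an area-like feature in the tens of thousands
# next to a roundness-like feature below 1.
train_rows = [[28000.0, 0.80], [41000.0, 0.90], [53000.0, 0.70]]
means, sds = fit_scaler(train_rows)
scaled = scale(train_rows, means, sds)

# Seven made-up output scores, one per class.
probs = softmax([2.0, 1.0, 0.1, 0.0, -1.0, -2.0, 0.5])
predicted_class = probs.index(max(probs))
```

After scaling, both columns have mean 0 and unit standard deviation, so neither feature dominates the weighted sums in the hidden layer purely because of its units.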

Stage 3

Evaluation & Variable Importance Analysis

All models are evaluated on misclassification rate and ARI on both the training and testing sets. The train/test gap is the key diagnostic for overfitting: Boosting shows the largest gap (0.2% training vs 9.5% test), indicating more variance than the averaging-based ensembles or the neural network. Variable importance is analyzed for Bagging, Random Forests, and Boosting using mean decrease in accuracy and mean decrease in Gini impurity, and visualized as a cross-model heatmap. Perimeter, Shape Factor 1, Compactness, Minor Axis Length, and Major Axis Length emerge as the most consistently important predictors. Extent is the least important variable across all three tree-based methods, a ranking on which the accuracy-based and Gini-based measures agree.
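For readers unfamiliar with the two metrics, here is a small self-contained sketch of how they can be computed from true and predicted labels. It is written in plain Python for illustration and is not taken from the project; the four-observation example is invented.

```python
from collections import Counter
from math import comb

def misclassification_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true one."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def adjusted_rand_index(y_true, y_pred):
    """ARI computed from the contingency table of two labelings."""
    n = len(y_true)
    cell_counts = Counter(zip(y_true, y_pred))
    same_both = sum(comb(c, 2) for c in cell_counts.values())
    same_true = sum(comb(c, 2) for c in Counter(y_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(y_pred).values())
    expected = same_true * same_pred / comb(n, 2)
    maximum = (same_true + same_pred) / 2
    if maximum == expected:  # degenerate case, e.g. every label identical
        return 1.0
    return (same_both - expected) / (maximum - expected)

y_true = ["a", "a", "b", "b"]
y_pred = ["a", "a", "b", "c"]
error = misclassification_rate(y_true, y_pred)  # 1 of 4 labels is wrong
ari = adjusted_rand_index(y_true, y_pred)
```

Misclassification rate compares labels position by position, while ARI compares how pairs of observations are grouped and corrects for chance agreement, so the two metrics can rank models slightly differently.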

Key Design Decisions

Stratified sampling preserves class distribution at scale

Simple random sampling from a dataset with unequal class sizes risks underrepresenting minority classes, which yields models that are tuned toward the majority class and perform poorly on rarer grain varieties. Stratified sampling ensures that each of the 7 grain classes appears in the 1,364-record sample in the same proportion as in the full 13,611-record dataset. The training and testing sets therefore stay close to the true class distribution of the problem, which makes the evaluation metrics meaningful and keeps every class represented during hyperparameter tuning.

5-fold cross-validation applied consistently across all five algorithms

Using the same cross-validation strategy for all five models puts hyperparameter selection on a level playing field: no algorithm benefits from more favorable tuning conditions than another. 5-fold CV balances the statistical reliability of the error estimate against the computational cost of fitting five complex models over multiple hyperparameter configurations. Because tuning relies only on cross-validation, the final test set is a genuine holdout, untouched during any tuning step, so the reported misclassification rates and ARI values are honest estimates of out-of-sample performance.
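The tuning loop behind this decision can be sketched generically. The code below is an illustrative plain-Python version, not the project's caret setup: the fold construction, the toy 1-D nearest-neighbour classifier, and the candidate grid are all assumptions chosen to keep the example short.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 once and deal the indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_misclassification(X, y, fit_predict, param, k=5, seed=0):
    """Mean held-out misclassification rate for one hyperparameter value."""
    rates = []
    for held_out in kfold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in held_out], param)
        wrong = sum(p != y[i] for p, i in zip(preds, held_out))
        rates.append(wrong / len(held_out))
    return sum(rates) / k

def knn_fit_predict(X_train, y_train, X_new, n_neighbors):
    """Toy 1-D nearest-neighbour classifier, used only to exercise the CV loop."""
    preds = []
    for x in X_new:
        order = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
        votes = [y_train[i] for i in order[:n_neighbors]]
        preds.append(max(sorted(set(votes)), key=votes.count))
    return preds

# Toy data: two well-separated 1-D classes.
X = [0.1 * i for i in range(20)] + [5 + 0.1 * i for i in range(20)]
y = ["a"] * 20 + ["b"] * 20

grid = [1, 3, 5]  # hypothetical candidate values
cv_scores = {p: cv_misclassification(X, y, knn_fit_predict, p) for p in grid}
best_param = min(grid, key=lambda p: cv_scores[p])
```

The same fold assignment (fixed seed) is reused for every candidate value, so differences in the CV score come from the hyperparameter rather than from a different random partition.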

Five algorithm families cover a broad range of inductive biases

Bagging and Random Forests reduce variance through averaging and decorrelation. Boosting reduces bias through sequential residual fitting. MDA uses a probabilistic generative model that can capture non-elliptical class shapes. ANNs learn nonlinear combinations of the input features. Benchmarking all five families, rather than committing to a single approach upfront, makes the conclusion robust to the question of which family best suits this feature geometry. The close performance across all five models suggests that the geometric features are genuinely discriminative regardless of which type of decision boundary is used.

Tech Stack

Technology	Purpose
R	Primary modeling environment for all five classification algorithms
caret	Unified training, cross-validation, and preprocessing interface across all models
randomForest	Random Forest implementation with OOB error estimation and variable importance
adabag	Bagging and Boosting implementations for tree ensemble methods
nnet	Single hidden layer feedforward neural network with Softmax output for multiclass prediction
mclust	Mixture Discriminant Analysis with BIC-based Gaussian mixture model selection per class

Results & Metrics

What the system delivers

91.1%

Best Test Accuracy

Random Forests and ANN tied — 8.9% misclassification across 7 grain classes

<1.5%

Performance Spread

All five algorithms within 1.5 percentage points of each other on test misclassification · std dev 0.6 points

Top Shape Features

Perimeter, Shape Factor 1, Compactness, Minor & Major Axis Length · consistent across the three tree-based methods

🌲

Random Forests and ANNs jointly achieve the best test performance

Both Random Forests (500 trees, 6 variables per split) and the ANN (size 11, decay 4) achieve 8.9% test misclassification, with ARI of 0.785 and 0.779 respectively, so the two are effectively tied within measurement precision. Random Forests also achieves the best OOB error among the tree-based methods at 8.77% and shows the smallest gap between training and testing performance, indicating the most stable generalization. The ANN achieves 7.9% training misclassification, which points to slightly more variance than RF, but its test error is the same.

⚡

Boosting achieves near-perfect training accuracy but shows the largest train/test gap

Boosting achieves 0.2% training misclassification (ARI 0.994), the best training performance of all five models by a wide margin. However, its test misclassification rises to 9.5% (ARI 0.768), the largest train-to-test gap in the benchmark. This reflects Boosting's tendency to overfit when sequential residual fitting drives training error close to zero. The 200-tree, depth-3, λ=0.05 configuration was already selected by cross-validation, so the remaining gap stems from the algorithm's mechanism, which produces more variance than the averaging-based ensemble methods.
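The gap itself is simple arithmetic on the reported rates. The snippet below, in plain Python, uses only the train and test figures that appear on this page; training rates for the other three models are not listed here, so they are left out rather than guessed.

```python
# Misclassification rates (%) as reported on this page.
reported = {
    "Boosting": {"train": 0.2, "test": 9.5},
    "ANN": {"train": 7.9, "test": 8.9},
}

# Train/test gap in percentage points, rounded to one decimal place.
gaps = {model: round(r["test"] - r["train"], 1) for model, r in reported.items()}
largest_gap_model = max(gaps, key=gaps.get)
```

For these two models the gap is 9.3 points for Boosting against 1.0 point for the ANN.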

📊

All five algorithms perform within 1.5 percentage points, so geometric features provide robust signal

The standard deviation of test misclassification across all five models is just 0.6 percentage points, and the ARI standard deviation is 0.009, indicating that the geometric shape features are discriminative enough that the choice of algorithm matters far less than feature quality. This is a meaningful result for deployment: it suggests that even conceptually simpler methods like Bagging (10.1% test misclassification) can serve as viable production classifiers where the alternative is slow and inconsistent manual inspection.
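The spread statistics can be reproduced approximately from the figures on this page. The sketch below is plain Python and uses only the four test rates that are stated here; MDA's test rate is not listed, so the result is a partial check rather than the five-model figure.

```python
import statistics

# Test misclassification rates (%) stated on this page; MDA is not listed, so it is omitted.
test_error = {"Random Forests": 8.9, "ANN": 8.9, "Boosting": 9.5, "Bagging": 10.1}

rates = list(test_error.values())
spread = max(rates) - min(rates)   # range in percentage points
sd = statistics.stdev(rates)       # sample standard deviation
```

With these four rates the spread is 1.2 points and the sample standard deviation is about 0.57, in line with the 0.6 reported for all five models.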

🔬

Perimeter, compactness, and axis lengths are the most important shape features

Variable importance analysis across Bagging, Random Forests, and Boosting consistently identifies Perimeter, Shape Factor 1, Compactness, Minor Axis Length, and Major Axis Length as the top predictors of grain type. These features capture the boundary geometry and proportions of the grain — the characteristics that most reliably distinguish between varieties with different elongation, roundness, and edge profiles. Extent is consistently the least important variable across all three methods, suggesting that the ratio of grain pixels to bounding box area adds little discriminative power beyond what the other shape descriptors already capture.
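The accuracy-based importance measure follows a permutation idea: shuffle one feature, re-score the model, and record how much accuracy drops. The sketch below illustrates that idea in plain Python on a held-out set with a hypothetical toy model; the project's R packages compute their own versions of this measure (randomForest, for instance, works on out-of-bag samples), so this is a simplified stand-in.

```python
import random

def accuracy(predict, X, y):
    """Share of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical fitted model: it uses feature 0 and ignores feature 1.
def toy_predict(row):
    return "long" if row[0] > 5.0 else "round"

X = [[float(i), float(i % 3)] for i in range(20)]
y = ["long" if row[0] > 5.0 else "round" for row in X]
importances = permutation_importance(toy_predict, X, y)
```

A feature the model never uses, like feature 1 here, gets an importance of exactly zero, which is the pattern the analysis reports for a weak predictor such as Extent.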

🚀

The pipeline offers a scalable, automated alternative to manual inspection

The full pipeline, from image-derived geometric features through five independently benchmarked classifiers to a final accuracy evaluation, provides a foundation for automated grain quality control. At 91.1% accuracy across seven grain varieties, the system offers a faster and more consistent alternative to manual visual inspection. The modular R implementation allows any of the five models to be swapped in depending on the computational constraints of the deployment environment; on this benchmark, test misclassification varied by less than 1.5 percentage points across the five models.
