Applied ML · Retail & Commerce

Product Purchase Prediction

A multi-output sequence modeling pipeline that forecasts per-product daily purchase demand using a 1D CNN+LSTM hybrid architecture trained on engineered lag and calendar features.

Architecture CNN+LSTM · Multi-Output Forecasting

Tech Stack

Python PyTorch pandas scikit-learn matplotlib

Source Code View on GitHub

Multi-Output

N-Product Demand in One Forward Pass

CNN+LSTM

1D Conv Block + LSTM Temporal Modeling

GPU-Ready

Seamless CPU / GPU Execution via .to(device)

The Problem

Retailers need product-level demand forecasts — but tabular models miss the temporal patterns and multi-product dynamics that drive daily purchasing behavior

Inventory planning, procurement scheduling, and promotional targeting all depend on knowing how much of each product will be purchased tomorrow: not just on average, but per SKU, per day. Classical regression models treat each day as independent, discarding the sequential structure that makes demand forecastable: what sold yesterday helps predict what sells today. Many pipelines also train a separate model per product rather than predicting all products jointly, which scales poorly as catalog size grows. The gap is between the sequential, multi-product structure of retail demand data and the modeling capacity of approaches that treat each product and each day in isolation.

The Solution

A CNN+LSTM pipeline that engineers lag and calendar features, windows them into sequences, and forecasts all products simultaneously in a single forward pass

The pipeline begins with temporal feature engineering: lag features are computed per product to capture recent purchase history, and calendar features (month, day-of-week, and a weekend flag) are added to encode seasonal and weekly patterns. The resulting tabular data is then windowed into fixed-length sequences (seq_len=3), so that each window of three consecutive rows becomes one (seq_len, features) sample, and samples are batched into (batch, seq_len, features) tensors that the model can process as a time series. A 1D CNN block applies convolutions across the feature dimension to extract local patterns within each window, feeding its output into an LSTM layer that models temporal dependencies across steps. A final dense output layer maps the LSTM hidden state to predictions for all N products, so a single forward pass produces a full demand forecast across the entire catalog. The model is trained with MSELoss and the Adam optimizer, with GPU execution supported via PyTorch's .to(device), so the same code runs on a laptop CPU or a cloud GPU.

Key Outcome

A GPU-ready multi-output demand forecasting pipeline that combines lag and calendar feature engineering with sequence windowing and a CNN+LSTM architecture — predicting next-day purchase volumes for N products simultaneously in a single forward pass, with per-product MAE evaluation and clean train/test loss convergence.

Technical Deep Dive

Architecture & Design

Modeling Pipeline

Stage 1 — Temporal Feature Engineering

Lag Features

Per-Product Purchase History

Lagged demand values computed per product · Encodes recent purchase momentum for each SKU

Calendar Features

Month · Day-of-Week · Weekend Flag

Seasonal and weekly periodicity encoded per row · Added uniformly across all products

▼

Stage 2 — Sequence Windowing

Tensor Construction · seq_len = 3

Tabular → (batch, seq_len, features)

Rolling windows of length 3 slide across the time axis · Each window becomes one training sample · MinMaxScaler normalizes features before windowing · DataLoader batches tensors for GPU-ready training

▼

Stage 3 — CNN+LSTM Architecture

Block 1

1D Conv Block

Convolutions across feature channels · Extracts local within-window patterns

Block 2

LSTM Layer

Temporal dependencies across time steps · Hidden state carries sequence context forward

Block 3

Dense Output Layer

Maps LSTM hidden state → N-product demand predictions in one forward pass

▼

Stage 4 — Training & Evaluation

Training

MSELoss + Adam Optimizer

GPU-ready via .to(device) · Clean train/test loss convergence across epochs

Evaluation

Per-Product MAE & Overall MSE

scikit-learn mean_absolute_error per SKU · Loss curves visualized with matplotlib

Stage 1

Temporal Feature Engineering

Lag features are computed per product to capture each SKU's recent purchase history — giving the model explicit signal about demand momentum without requiring it to learn that structure from raw counts alone. Calendar features — month, day-of-week, and a weekend flag — are then added to every row, encoding the seasonal and weekly periodicity that drives retail demand. Together, the lag and calendar features form a feature matrix that is rich in both product-specific history and temporal context.
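The feature engineering described above could be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the wide layout (one row per day, one column per product), the column names sku_a and sku_b, the helper name add_lag_and_calendar_features, and the choice of lags 1 to 3 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def add_lag_and_calendar_features(wide: pd.DataFrame, lags=(1, 2, 3)) -> pd.DataFrame:
    """Add per-product lag columns and calendar columns.

    `wide` is assumed to have a DatetimeIndex (one row per day) and one
    column of purchase counts per product. The lag set is illustrative.
    """
    features = wide.copy()
    for lag in lags:
        # shift(lag) moves each product's history down by `lag` days,
        # so row t holds the value observed at day t - lag.
        features = features.join(wide.shift(lag).add_suffix(f"_lag{lag}"))
    features["month"] = features.index.month
    features["day_of_week"] = features.index.dayofweek  # Monday = 0
    features["is_weekend"] = (features.index.dayofweek >= 5).astype(int)
    # The first max(lags) rows have no complete history and are dropped.
    return features.dropna()


# Synthetic purchase counts for two hypothetical products.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
rng = np.random.default_rng(0)
wide = pd.DataFrame(
    rng.integers(0, 20, size=(10, 2)), index=dates, columns=["sku_a", "sku_b"]
)
features = add_lag_and_calendar_features(wide)
```

Because the calendar columns depend only on the date index, they are identical for every product, which matches the "added uniformly across all products" description.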

Stage 2

Sequence Windowing

The engineered feature matrix is normalized with MinMaxScaler, then converted into fixed-length sequences using a rolling window of length 3. Each window becomes one training sample, a (seq_len, features) tensor, and the target is the demand vector for the day immediately following that window. PyTorch's DataLoader batches these samples into (batch, seq_len, features) tensors and handles shuffling; each batch is then moved to the CPU or GPU with .to(device), so the data pipeline is the same on either.
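One way the scaling and windowing step might look is sketched below. The helper name make_windows, the synthetic array sizes, and the batch size are assumptions for illustration. Fitting the scaler on the training rows only is also an assumption, chosen here to avoid leaking test-period statistics; the source does not say how the scaler was fitted.

```python
import numpy as np
import torch
from sklearn.preprocessing import MinMaxScaler
from torch.utils.data import DataLoader, TensorDataset


def make_windows(features, targets, seq_len=3):
    """Slide a window of `seq_len` rows over time.

    Each window of rows [t, t + seq_len) is paired with the target
    vector of the row immediately after it (t + seq_len).
    """
    xs, ys = [], []
    for start in range(len(features) - seq_len):
        xs.append(features[start:start + seq_len])
        ys.append(targets[start + seq_len])
    return np.stack(xs), np.stack(ys)


# Synthetic stand-in data: 20 days, 5 features, 2 products.
rng = np.random.default_rng(0)
raw_features = rng.random((20, 5)) * 100
targets = rng.random((20, 2))

# Assumption: fit the scaler on the first 16 days only, then apply it to all rows.
n_train = 16
scaler = MinMaxScaler().fit(raw_features[:n_train])
scaled = scaler.transform(raw_features)

X, y = make_windows(scaled, targets, seq_len=3)
dataset = TensorDataset(
    torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)
)
loader = DataLoader(dataset, batch_size=8, shuffle=True)
xb, yb = next(iter(loader))  # one (batch, seq_len, features) batch and its targets
```

With 20 rows and seq_len=3 this yields 17 samples, since the last window must leave one following row to serve as its target.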

Stage 3

CNN+LSTM Architecture

The model stacks three blocks. A 1D convolutional block applies filters across the feature dimension within each time step, learning local co-purchase patterns and feature interactions. Its output is passed to an LSTM layer, which processes the sequence across the time axis and captures how patterns evolve from step to step. The final LSTM hidden state is projected through a dense output layer that predicts all N products at once, so one forward pass yields the full forecast.
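A minimal PyTorch sketch of such a three-block model is shown below. The class name CNNLSTM, the channel and hidden sizes, and the kernel size are assumptions, not values taken from the project. Note one design choice in particular: here the features are treated as Conv1d input channels, so each filter mixes all features, and with kernel_size=3 it also spans neighbouring time steps. Setting kernel_size=1 would restrict the mixing to a single time step.

```python
import torch
from torch import nn


class CNNLSTM(nn.Module):
    """1D conv block -> LSTM -> dense multi-output head (illustrative sizes)."""

    def __init__(self, n_features, n_products, conv_channels=32,
                 hidden_size=64, kernel_size=3):
        super().__init__()
        # Features act as input channels; padding keeps seq_len unchanged
        # for odd kernel sizes.
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, conv_channels, kernel_size,
                      padding=kernel_size // 2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(conv_channels, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_products)

    def forward(self, x):
        # x: (batch, seq_len, features); Conv1d expects (batch, channels, length).
        z = self.conv(x.permute(0, 2, 1))   # (batch, conv_channels, seq_len)
        z = z.permute(0, 2, 1)              # (batch, seq_len, conv_channels)
        _, (h_n, _) = self.lstm(z)          # h_n: (num_layers, batch, hidden_size)
        return self.head(h_n[-1])           # (batch, n_products)


model = CNNLSTM(n_features=5, n_products=4)
out = model(torch.randn(8, 3, 5))  # 8 windows of seq_len=3 with 5 features each
```

The output has one column per product, which is what allows a single forward pass to cover the whole catalog.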

Stage 4

Training & Evaluation

The model is trained with MSELoss and the Adam optimizer, with full GPU support via PyTorch's .to(device) for seamless CPU/GPU execution. Training and test loss are tracked across epochs and visualized as convergence curves; train and test loss falling together indicates that the model is generalizing rather than overfitting the training sequences. Evaluation reports per-product MAE using scikit-learn's mean_absolute_error, giving granular insight into which SKUs are predicted well and which may require additional features or longer sequence windows.
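The training procedure could be sketched as below. To keep the example short and self-contained, a small linear stand-in model replaces the CNN+LSTM, and the data, learning rate, batch size, and epoch count are all assumed values rather than the project's settings. The parts that mirror the description are MSELoss, Adam, the .to(device) calls, and the per-epoch train/test loss tracking.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in data: windows of shape (seq_len, features) and a toy target.
seq_len, n_features, n_products = 3, 5, 2
X = torch.rand(200, seq_len, n_features)
y = X[:, -1, :n_products] * 2.0

# Chronological split: the last 40 windows are held out.
train_loader = DataLoader(TensorDataset(X[:160], y[:160]), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(X[160:], y[160:]), batch_size=32)

# Stand-in model (assumption): any module mapping (batch, seq_len, features)
# to (batch, n_products) can be dropped in here.
model = nn.Sequential(nn.Flatten(), nn.Linear(seq_len * n_features, n_products)).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)


def run_epoch(loader, train):
    """Return the sample-weighted mean loss over one pass of `loader`."""
    model.train(train)
    total, count = 0.0, 0
    with torch.set_grad_enabled(train):
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            loss = criterion(model(xb), yb)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * len(xb)
            count += len(xb)
    return total / count


history = {"train": [], "test": []}
for epoch in range(20):
    history["train"].append(run_epoch(train_loader, train=True))
    history["test"].append(run_epoch(test_loader, train=False))
```

The history dictionary holds one loss value per epoch for each split, which is the input a loss-curve plot needs.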

Key Design Decisions

CNN before LSTM separates local feature interaction from temporal dynamics

Applying 1D convolutions before the LSTM is a deliberate architectural choice rather than a default. The convolutional block operates across the feature channels at each time step — learning which combinations of lag values and calendar signals co-occur with demand spikes. The LSTM then operates on these learned representations rather than on raw features, giving it a richer input at each step. This decomposition makes each component's job simpler: the CNN handles what features matter together; the LSTM handles how those patterns evolve over time.

A single multi-output head replaces N independent models

The dense output layer maps the LSTM hidden state directly to predictions for all N products in one operation, rather than training a separate model per SKU. This matters for two reasons. First, it scales — adding new products requires only retraining, not restructuring the pipeline. Second, it allows the model to implicitly learn cross-product patterns: if products A and B tend to spike together, the shared LSTM representation can capture that relationship in a way that per-product models cannot. The per-product MAE evaluation then surfaces whether any individual SKU is underserved by the shared representation.

Calendar features encode periodicity that lag features alone cannot capture

Lag features encode recent demand momentum — but they cannot directly represent the day-of-week or monthly cycles that are independent of recent history. A product with flat sales history can still experience a weekend spike; a lag-only model would have no signal for it. Adding month, day-of-week, and a binary weekend flag gives the model explicit access to the cyclical structure of retail demand without requiring it to infer periodicity from lag values alone — a pattern that is particularly difficult to learn when training sequences are short.

Tech Stack

Technology	Purpose
PyTorch	CNN+LSTM model definition, training loop, and DataLoader pipelines
pandas / NumPy	Data loading, lag feature construction, and calendar feature engineering
scikit-learn	MinMaxScaler for normalization, train_test_split, and per-product MAE evaluation
matplotlib	Training vs. test loss convergence curves and performance visualizations

Results & Metrics

What the system delivers

Multi-Output

N-Product Demand Per Pass

All SKUs forecasted simultaneously — no per-product models required

CNN+LSTM

Hybrid Sequence Architecture

1D conv for local feature interactions · LSTM for cross-step temporal dynamics

seq_len=3

Rolling Temporal Window

3-step windows convert tabular data into (batch, seq_len, features) tensors for the model

📦

Single forward pass forecasts demand across the entire product catalog

The dense output layer maps the LSTM hidden state to N simultaneous product demand predictions, eliminating the need to train, maintain, and serve a separate model per SKU. This keeps the pipeline manageable as catalog size grows: adding products widens the output layer and requires retraining, but does not require restructuring the architecture or rewriting inference logic.

📅

Lag and calendar features together cover both momentum and periodicity

Per-product lag features encode recent purchase momentum — how much was sold in preceding days. Calendar features — month, day-of-week, and weekend flag — encode the cyclical patterns that repeat regardless of recent history. The combination means the model has explicit signal for both a product's recent trajectory and the structural demand patterns of the day it is forecasting, without needing to infer either from raw counts alone.

🔁

Sequence windowing preserves the temporal structure that tabular models discard

Rolling windows of length 3 convert the flat feature matrix into (batch, seq_len, features) tensors, ensuring that the model sees each day in the context of the days that preceded it — not as an isolated observation. This sequential structure is what enables the LSTM to model how demand patterns evolve across consecutive steps, which a standard regression model operating on rows independently cannot do.

📊

Per-product MAE surfaces SKU-level forecast quality across the catalog

Reporting a single aggregate error metric hides performance variation across products. Per-product MAE using scikit-learn's mean_absolute_error gives a granular view of which SKUs the model forecasts accurately and which underperform — enabling targeted investigation of products that may benefit from longer sequence windows, additional features, or product-specific fine-tuning within the shared architecture.
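As a small illustration of the per-product breakdown, scikit-learn's mean_absolute_error accepts multioutput="raw_values" to return one error per output column instead of a single average. The arrays below are made-up numbers for two hypothetical products, not results from the project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Rows are days, columns are products (illustrative values only).
y_true = np.array([[10, 2], [12, 4], [8, 6]])
y_pred = np.array([[11, 2], [10, 5], [8, 9]])

# One MAE per product column.
per_product_mae = mean_absolute_error(y_true, y_pred, multioutput="raw_values")

# Default behaviour: a single number averaged uniformly across products.
overall_mae = mean_absolute_error(y_true, y_pred)
```

In this toy case the second product has the larger error, a difference that the single averaged figure would hide.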

📉

Clean train/test loss convergence confirms the model generalizes to unseen sequences

Training and test loss curves tracked across epochs converge in parallel without diverging, which indicates that the CNN+LSTM is learning generalizable demand patterns rather than memorizing the training sequences. The matplotlib loss-curve plot provides a visual check on training stability, making it straightforward to diagnose overfitting or underfitting and adjust sequence length, network depth, or learning rate accordingly.
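A loss-curve plot of this kind might be produced as follows. The function name plot_loss_curves is hypothetical, and the loss values are placeholders chosen only to make the example runnable; they are not the project's measured losses.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_loss_curves(train_losses, test_losses):
    """Plot per-epoch train and test loss on shared axes."""
    fig, ax = plt.subplots(figsize=(6, 4))
    epochs = range(1, len(train_losses) + 1)
    ax.plot(epochs, train_losses, label="train MSE")
    ax.plot(epochs, test_losses, label="test MSE")
    ax.set_xlabel("epoch")
    ax.set_ylabel("MSE loss")
    ax.legend()
    return fig, ax


# Placeholder values for illustration only.
train_losses = [0.90, 0.55, 0.38, 0.30, 0.26]
test_losses = [0.95, 0.62, 0.45, 0.39, 0.36]
fig, ax = plot_loss_curves(train_losses, test_losses)
```

A test curve that turns upward while the train curve keeps falling would be the usual visual sign of overfitting; calling fig.savefig with a file path would write the plot to disk.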
