Product Purchase Prediction
A multi-output sequence modeling pipeline that forecasts per-product daily purchase demand using a 1D CNN+LSTM hybrid architecture trained on engineered lag and calendar features.
Multi-Output
N-Product Demand in One Forward Pass
CNN+LSTM
1D Conv Block + LSTM Temporal Modeling
GPU-Ready
Seamless CPU / GPU Execution via .to(device)
The Problem
Retailers need product-level demand forecasts — but tabular models miss the temporal patterns and multi-product dynamics that drive daily purchasing behavior
Inventory planning, procurement scheduling, and promotional targeting all depend on knowing how much of each product will be purchased tomorrow: not just on average, but per SKU, per day. Classical regression models treat each day as independent, discarding the sequential structure that makes demand forecastable: what sold yesterday helps predict what sells today. Approaches that predict all products at once are less common; many pipelines train a separate model per product, which scales poorly as catalog size grows. The gap is between the sequential, multi-product structure of retail demand data and the modeling capacity of approaches that treat each product and each day in isolation.
The Solution
A CNN+LSTM pipeline that engineers lag and calendar features, windows them into sequences, and forecasts all products simultaneously in a single forward pass
The pipeline begins with temporal feature engineering: lag features are computed per product to capture recent purchase history, and calendar features (month, day-of-week, and a weekend flag) are added to encode seasonal and weekly patterns. The resulting tabular data is then windowed into fixed-length sequences (seq_len=3): each run of three consecutive rows becomes one (seq_len, features) sample, and samples are batched into (batch, seq_len, features) tensors that the model processes as a time series. A 1D CNN block applies convolutions across the feature dimension to extract local patterns within each window, feeding its output into an LSTM layer that models temporal dependencies across steps. A final dense output layer maps the LSTM hidden state to predictions for all N products, so a single forward pass produces a demand forecast for the entire catalog. The model is trained with MSELoss and the Adam optimizer, and PyTorch's .to(device) lets the same code run on CPU or GPU.
Key Outcome
A GPU-ready multi-output demand forecasting pipeline that combines lag and calendar feature engineering with sequence windowing and a CNN+LSTM architecture — predicting next-day purchase volumes for N products simultaneously in a single forward pass, with per-product MAE evaluation and clean train/test loss convergence.
Technical Deep Dive
Architecture & Design
Modeling Pipeline
Stage 1 — Temporal Feature Engineering
Lag Features
Per-Product Purchase History
Lagged demand values computed per product · Encodes recent purchase momentum for each SKU
Calendar Features
Month · Day-of-Week · Weekend Flag
Seasonal and weekly periodicity encoded per row · Added uniformly across all products
Stage 2 — Sequence Windowing
Tensor Construction · seq_len = 3
Tabular → (batch, seq_len, features)
Rolling windows of length 3 slide across the time axis · Each window becomes one training sample · MinMaxScaler normalizes features before windowing · DataLoader batches tensors for GPU-ready training
Stage 3 — CNN+LSTM Architecture
Block 1
1D Conv Block
Convolutions across feature channels · Extracts local within-window patterns
Block 2
LSTM Layer
Temporal dependencies across time steps · Hidden state carries sequence context forward
Block 3
Dense Output Layer
Maps LSTM hidden state → N-product demand predictions in one forward pass
Stage 4 — Training & Evaluation
Training
MSELoss + Adam Optimizer
GPU-ready via .to(device) · Clean train/test loss convergence across epochs
Evaluation
Per-Product MAE & Overall MSE
scikit-learn mean_absolute_error per SKU · Loss curves visualized with matplotlib
Stage 1
Temporal Feature Engineering
Lag features are computed per product to capture each SKU's recent purchase history — giving the model explicit signal about demand momentum without requiring it to learn that structure from raw counts alone. Calendar features — month, day-of-week, and a weekend flag — are then added to every row, encoding the seasonal and weekly periodicity that drives retail demand. Together, the lag and calendar features form a feature matrix that is rich in both product-specific history and temporal context.
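The feature engineering described above can be sketched in pandas. This is a minimal illustration rather than the project's actual code: the wide date-indexed layout, the function name `add_lag_and_calendar_features`, the column suffixes, and the choice of two lags are all assumptions, since the source does not state how many lags are used.

```python
import numpy as np
import pandas as pd


def add_lag_and_calendar_features(purchases: pd.DataFrame, n_lags: int = 2) -> pd.DataFrame:
    """Assumes `purchases` has a DatetimeIndex and one column per product."""
    features = purchases.copy()

    # Lag features: each product's own demand shifted back by 1..n_lags days.
    for lag in range(1, n_lags + 1):
        lagged = purchases.shift(lag).add_suffix(f"_lag{lag}")
        features = features.join(lagged)

    # Calendar features, identical for every product on a given day.
    features["month"] = features.index.month
    features["day_of_week"] = features.index.dayofweek  # Monday = 0
    features["is_weekend"] = (features.index.dayofweek >= 5).astype(int)

    # The first n_lags rows have no complete history, so they are dropped.
    return features.dropna()


# Synthetic example data: 10 days, 2 hypothetical products.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
rng = np.random.default_rng(0)
purchases = pd.DataFrame(
    rng.integers(0, 20, size=(10, 2)),
    index=dates,
    columns=["product_a", "product_b"],
)
features = add_lag_and_calendar_features(purchases, n_lags=2)
```

Each row of `features` then holds the current-day counts, the lagged counts for every product, and the three calendar columns.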
Stage 2
Sequence Windowing
The engineered feature matrix is normalized with MinMaxScaler, then converted into fixed-length sequences using a rolling window of length 3. Each window becomes one training sample — a (seq_len, features) tensor — and the target is the demand vector for the day immediately following that window. PyTorch DataLoader batches these into (batch, seq_len, features) tensors and handles shuffling, allowing the model to be trained on CPU or GPU without pipeline changes.
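The windowing step can be illustrated with a short NumPy sketch. The function name `make_windows` and the toy arrays are hypothetical; scaling with MinMaxScaler and batching with DataLoader are omitted here so that the slicing logic stays visible.

```python
import numpy as np


def make_windows(features: np.ndarray, targets: np.ndarray, seq_len: int = 3):
    """Slice a (days, n_features) matrix into rolling windows.

    Returns X with shape (samples, seq_len, n_features) and y with shape
    (samples, n_products), where y[i] is the demand vector for the day
    immediately after window i.
    """
    X, y = [], []
    for start in range(len(features) - seq_len):
        X.append(features[start:start + seq_len])  # days start .. start+seq_len-1
        y.append(targets[start + seq_len])         # the following day
    return np.stack(X), np.stack(y)


# Toy data: 10 days, 2 features, 1 target column.
toy_features = np.arange(20, dtype=float).reshape(10, 2)
toy_targets = np.arange(10, dtype=float).reshape(10, 1)
X, y = make_windows(toy_features, toy_targets, seq_len=3)
```

With 10 days and seq_len=3 this yields 7 samples, because the last window needs one further day to serve as its target.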
Stage 3
CNN+LSTM Architecture
The model stacks three blocks. A 1D convolutional block applies filters across the feature dimension within each time step, learning local co-purchase patterns and feature interactions. Its output is passed to an LSTM layer, which processes the sequence along the time axis and captures how those patterns evolve from step to step. The final LSTM hidden state is projected through a dense output layer that produces predictions for all N products in a single forward pass.
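One way to express this three-block stack in PyTorch is sketched below. The class name `CNNLSTMForecaster`, the layer sizes, and the kernel size are assumptions; the source does not give them. A kernel size of 1 is used here so that the convolution mixes features within each time step only, matching the description above. A wider kernel would also mix neighbouring steps.

```python
import torch
from torch import nn


class CNNLSTMForecaster(nn.Module):
    """Hypothetical CNN+LSTM sketch: conv block -> LSTM -> dense multi-output head."""

    def __init__(self, n_features: int, n_products: int,
                 conv_channels: int = 32, hidden_size: int = 64):
        super().__init__()
        # Features are treated as input channels; kernel_size=1 is an assumption.
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, conv_channels, kernel_size=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(conv_channels, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_products)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, features); Conv1d expects (batch, channels, length).
        z = self.conv(x.permute(0, 2, 1))  # (batch, conv_channels, seq_len)
        z = z.permute(0, 2, 1)             # (batch, seq_len, conv_channels)
        _, (h_n, _) = self.lstm(z)         # h_n: (num_layers, batch, hidden_size)
        return self.head(h_n[-1])          # (batch, n_products)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CNNLSTMForecaster(n_features=9, n_products=2).to(device)
batch = torch.randn(4, 3, 9, device=device)  # (batch, seq_len=3, features)
predictions = model(batch)
```

The single `nn.Linear` head is what makes the model multi-output: its width equals the number of products, so one call returns a forecast for every SKU.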
Stage 4
Training & Evaluation
The model is trained with MSELoss and the Adam optimizer, and PyTorch's .to(device) lets the same training code run on CPU or GPU. Training and test loss are tracked across epochs and visualized as convergence curves. The two curves converge together, which indicates that the model generalizes rather than overfits the training sequences. Evaluation reports per-product MAE using scikit-learn's mean_absolute_error, giving granular insight into which SKUs are predicted well and which may require additional features or longer sequence windows.
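A training loop along these lines might look like the following sketch. The data is random stand-in data, the learning rate, batch size, and epoch count are assumptions, and a small linear placeholder model is used so the block stays self-contained; in the pipeline described here the CNN+LSTM model would take its place.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Random stand-in tensors shaped like the windowed data: (samples, seq_len, features).
X_train, y_train = torch.rand(64, 3, 9), torch.rand(64, 2)
X_test, y_test = torch.rand(16, 3, 9), torch.rand(16, 2)
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=16, shuffle=True)
test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=16)

# Placeholder model (assumption); substitute the CNN+LSTM here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 9, 2)).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)


def run_epoch(loader: DataLoader, train: bool) -> float:
    """Return the sample-weighted mean MSE over one pass; update weights if train=True."""
    model.train(train)
    total, count = 0.0, 0
    with torch.set_grad_enabled(train):
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            loss = criterion(model(xb), yb)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * len(xb)
            count += len(xb)
    return total / count


train_losses, test_losses = [], []
for epoch in range(5):
    train_losses.append(run_epoch(train_loader, train=True))
    test_losses.append(run_epoch(test_loader, train=False))
```

The two lists hold one value per epoch and can be passed directly to matplotlib to draw the train and test loss curves.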
Key Design Decisions
CNN before LSTM separates local feature interaction from temporal dynamics
Applying 1D convolutions before the LSTM is a deliberate architectural choice rather than a default. The convolutional block operates across the feature channels at each time step — learning which combinations of lag values and calendar signals co-occur with demand spikes. The LSTM then operates on these learned representations rather than on raw features, giving it a richer input at each step. This decomposition makes each component's job simpler: the CNN handles what features matter together; the LSTM handles how those patterns evolve over time.
A single multi-output head replaces N independent models
The dense output layer maps the LSTM hidden state directly to predictions for all N products in one operation, rather than training a separate model per SKU. This matters for two reasons. First, it scales — adding new products requires only retraining, not restructuring the pipeline. Second, it allows the model to implicitly learn cross-product patterns: if products A and B tend to spike together, the shared LSTM representation can capture that relationship in a way that per-product models cannot. The per-product MAE evaluation then surfaces whether any individual SKU is underserved by the shared representation.
Calendar features encode periodicity that lag features alone cannot capture
Lag features encode recent demand momentum — but they cannot directly represent the day-of-week or monthly cycles that are independent of recent history. A product with flat sales history can still experience a weekend spike; a lag-only model would have no signal for it. Adding month, day-of-week, and a binary weekend flag gives the model explicit access to the cyclical structure of retail demand without requiring it to infer periodicity from lag values alone — a pattern that is particularly difficult to learn when training sequences are short.
Tech Stack
| Technology | Purpose |
|---|---|
| PyTorch | CNN+LSTM model definition, training loop, and DataLoader pipelines |
| pandas / NumPy | Data loading, lag feature construction, and calendar feature engineering |
| scikit-learn | MinMaxScaler for normalization, train_test_split, and per-product MAE evaluation |
| matplotlib | Training vs. test loss convergence curves and performance visualizations |
Results & Metrics
What the system delivers
Multi-Output
N-Product Demand Per Pass
All SKUs forecasted simultaneously — no per-product models required
CNN+LSTM
Hybrid Sequence Architecture
1D conv for local feature interactions · LSTM for cross-step temporal dynamics
seq_len=3
Rolling Temporal Window
3-step windows convert tabular data into (batch, seq_len, features) tensors for the model
Single forward pass forecasts demand across the entire product catalog
The dense output layer maps the LSTM hidden state to N simultaneous product demand predictions, eliminating the need to train, maintain, and serve a separate model per SKU. This keeps the pipeline manageable as catalog size grows: adding products widens the output layer and requires retraining, but the architecture and inference logic stay the same.
Lag and calendar features together cover both momentum and periodicity
Per-product lag features encode recent purchase momentum — how much was sold in preceding days. Calendar features — month, day-of-week, and weekend flag — encode the cyclical patterns that repeat regardless of recent history. The combination means the model has explicit signal for both a product's recent trajectory and the structural demand patterns of the day it is forecasting, without needing to infer either from raw counts alone.
Sequence windowing preserves the temporal structure that tabular models discard
Rolling windows of length 3 convert the flat feature matrix into (batch, seq_len, features) tensors, ensuring that the model sees each day in the context of the days that preceded it — not as an isolated observation. This sequential structure is what enables the LSTM to model how demand patterns evolve across consecutive steps, which a standard regression model operating on rows independently cannot do.
Per-product MAE surfaces SKU-level forecast quality across the catalog
Reporting a single aggregate error metric hides performance variation across products. Per-product MAE using scikit-learn's mean_absolute_error gives a granular view of which SKUs the model forecasts accurately and which underperform — enabling targeted investigation of products that may benefit from longer sequence windows, additional features, or product-specific fine-tuning within the shared architecture.
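As a small illustration of per-product MAE, scikit-learn's `mean_absolute_error` accepts `multioutput="raw_values"`, which returns one error per output column. Whether the project uses this option or loops over columns is not stated, so treat this as one possible approach; the product names and numbers below are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up demand values: rows are days, columns are products.
y_true = np.array([[10.0, 2.0], [12.0, 4.0], [8.0, 6.0]])
y_pred = np.array([[11.0, 2.0], [10.0, 5.0], [8.0, 9.0]])
product_names = ["product_a", "product_b"]  # hypothetical SKUs

# One MAE per column instead of a single averaged score.
per_product_mae = mean_absolute_error(y_true, y_pred, multioutput="raw_values")
report = dict(zip(product_names, per_product_mae))

for name, mae in report.items():
    print(f"{name}: MAE = {mae:.3f}")
```

In this toy case product_a has absolute errors of 1, 2, and 0 and product_b has 0, 1, and 3, so the second product is forecast less accurately even though an averaged score would blur the difference.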
Clean train/test loss convergence confirms the model generalizes to unseen sequences
Training and test loss curves tracked across epochs converge in parallel without diverging, which indicates that the CNN+LSTM is learning generalizable demand patterns rather than memorizing the training sequences. The matplotlib loss plot provides a visual check on training stability, making it straightforward to diagnose overfitting or underfitting and adjust sequence length, network depth, or learning rate accordingly.