
DeepVision: Video Analysis

A deep learning solution for automated video content analysis — combining convolutional spatial feature extraction with LSTM temporal modeling to classify action sequences across the UCF-101 dataset.

Architecture: ConvLSTM

Tech Stack: Python, TensorFlow, OpenCV, CNN, LSTM, NumPy


Classification Accuracy on UCF-101: 84%

AUC (strong discriminative power): 97%

Precision (high-confidence predictions): 85%

The Problem

Standard image classifiers cannot understand video — they see frames in isolation and miss everything that happens between them

Video is a fundamentally different data type from images. A single frame can tell you what is in a scene — but it cannot tell you what is happening. Action recognition, anomaly detection, and quality inspection all depend on understanding how a scene evolves across time: the trajectory of motion, the rhythm of repeated movement, the progression from one state to another. Standard convolutional networks process each frame independently, discarding all temporal context. Feeding their outputs into a separate LSTM partially addresses this — but the CNN and LSTM operate on different representations, and the spatial structure of feature maps is lost in the handoff. The core challenge is building a single model that captures spatial patterns and temporal dynamics jointly, without discarding the spatial structure that makes visual understanding possible.

The Solution

A ConvLSTM architecture that processes video frames as spatiotemporal sequences — preserving spatial structure across the temporal dimension in a single unified model

DeepVision applies a ConvLSTM-based classification model to the UCF-101 action recognition dataset. An OpenCV pipeline ingests raw video files, extracts frames at uniform intervals, resizes them to a consistent spatial resolution, and normalizes pixel values. Frames are then grouped into fixed-length temporal sequences and converted into (batch, frames, height, width, channels) tensors — the input format required for spatiotemporal modeling. The ConvLSTM architecture processes these sequences by applying convolutional operations at each timestep within the LSTM recurrence — extracting spatial features per frame while simultaneously propagating temporal context across the sequence. A dense classification head then maps the learned spatiotemporal representations to action class predictions. The complete workflow — from raw video through preprocessing, model training, and evaluation on unseen clips — is implemented in a reproducible Jupyter Notebook.

Key Outcome

A ConvLSTM-based video classification system achieving 84% accuracy and 97% AUC on UCF-101 action recognition — built on an end-to-end pipeline from raw video ingestion through OpenCV preprocessing, spatiotemporal sequence modeling, and evaluation on unseen clips, demonstrating the model's ability to capture both what appears in frames and how it evolves across time.

Technical Deep Dive

Architecture & Design

ConvLSTM Pipeline

Stage 1: Video Ingestion & Frame Extraction

Input (UCF-101 Video Files): 101 action categories · Raw .avi video clips per class

OpenCV (Frame Extraction): Uniform frame sampling at fixed intervals · Consistent temporal coverage per clip

Stage 2: Frame Preprocessing

Step 1 (Resize): All frames resized to uniform spatial dimensions (H × W)

Step 2 (Normalize): Pixel values scaled to [0, 1] for stable gradient flow during training

Step 3 (Sequence Construction): Frames stacked into fixed-length temporal sequences · Shape: (batch, T, H, W, C)

Stage 3: ConvLSTM Model Architecture

CNN Block (Spatial Feature Extraction): Convolutional filters applied per timestep · Learns spatial patterns within each frame

LSTM Block (Temporal Sequence Modeling): Recurrent connections carry context across frames · Captures motion dynamics over time

Output (Dense Classification Head): Fully connected layer maps spatiotemporal features to action class probabilities

Stage 4: Training & Evaluation

Output: Accuracy 84% · Precision 85% · Recall 84% · AUC 97%

Categorical cross-entropy loss · All metrics evaluated on unseen video clips

Stage 1 & 2

Video Ingestion & Preprocessing

OpenCV reads raw UCF-101 video clips and extracts frames at uniform sampling intervals, ensuring consistent temporal coverage regardless of each clip's original frame rate or duration. Extracted frames are resized to a fixed spatial resolution and pixel values are normalized to the [0, 1] range, producing stable inputs for gradient-based optimization. Frames are then stacked into fixed-length temporal sequences and converted to (batch, T, H, W, C) tensors — the input format required for spatiotemporal modeling.

Stage 3

ConvLSTM Architecture

The ConvLSTM architecture applies convolutional operations at each recurrent timestep — extracting spatial feature maps from each frame while simultaneously propagating hidden state context across the temporal sequence. This allows the model to learn what is in each frame and how the scene evolves across frames within a single unified operation, avoiding the information loss that occurs when CNN outputs are flattened before being passed to a separate LSTM. A dense classification head receives the final hidden state and maps the learned spatiotemporal representation to action class probabilities.
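A minimal sketch of such a model, assuming the Keras `ConvLSTM2D` layer. The single recurrent layer, the 16 filters, the 3 × 3 kernel, and the `Flatten` step before the dense head are illustrative assumptions, not the project's confirmed configuration.

```python
import numpy as np
import tensorflow as tf


def build_convlstm_classifier(num_frames, height, width, channels, num_classes):
    """Hypothetical ConvLSTM classifier; layer sizes are placeholders."""
    inputs = tf.keras.Input(shape=(num_frames, height, width, channels))
    # ConvLSTM2D convolves inside the recurrence. With return_sequences=False
    # it returns only the final hidden state: (batch, height, width, filters).
    x = tf.keras.layers.ConvLSTM2D(
        filters=16, kernel_size=(3, 3), padding="same", return_sequences=False
    )(inputs)
    # The dense head needs a vector, so the final spatial state is flattened here.
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


model = build_convlstm_classifier(
    num_frames=8, height=32, width=32, channels=3, num_classes=101
)
# A random batch of two clips, only to check shapes end to end.
probs = model.predict(np.random.rand(2, 8, 32, 32, 3).astype("float32"), verbose=0)
print(probs.shape)  # (2, 101)
```

Each output row is a probability distribution over the 101 action classes.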

Stage 4

Training & Evaluation

The model is trained using categorical cross-entropy loss with performance tracked across accuracy, precision, recall, and AUC throughout training. Evaluation is conducted on held-out video clips not seen during training, ensuring that reported metrics reflect genuine generalization to unseen sequences. The complete pipeline — from raw video ingestion through preprocessing, training, and evaluation — is implemented in a reproducible Jupyter Notebook, allowing step-by-step review, parameter experimentation, and direct application to new video datasets.
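The training and evaluation setup might look roughly like the sketch below, which uses random toy data so that it runs on its own. The optimizer, the toy model, the split sizes, and the use of the built-in Keras metric objects are assumptions; the source names only the loss and the four tracked metrics.

```python
import numpy as np
import tensorflow as tf

num_classes, num_frames, size = 5, 4, 16  # toy sizes, not the project's

# Small stand-in model so the example is self-contained.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(num_frames, size, size, 3)),
    tf.keras.layers.ConvLSTM2D(4, (3, 3), padding="same"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(
    optimizer="adam",                  # assumed; the optimizer is not stated
    loss="categorical_crossentropy",   # expects one-hot labels
    metrics=[
        tf.keras.metrics.CategoricalAccuracy(name="accuracy"),
        # Keras Precision/Recall threshold each class probability at 0.5 by
        # default, and AUC pools all (clip, class) entries. Whether the project
        # used these exact settings is not stated.
        tf.keras.metrics.Precision(name="precision"),
        tf.keras.metrics.Recall(name="recall"),
        tf.keras.metrics.AUC(name="auc"),
    ],
)

rng = np.random.default_rng(0)
x = rng.random((20, num_frames, size, size, 3), dtype=np.float32)
y = tf.keras.utils.to_categorical(rng.integers(0, num_classes, 20), num_classes)

# Hold out clips that the model never sees during training.
x_train, y_train, x_test, y_test = x[:16], y[:16], x[16:], y[16:]
model.fit(x_train, y_train, epochs=1, batch_size=4, verbose=0)
results = model.evaluate(x_test, y_test, verbose=0, return_dict=True)
print(sorted(results))
```

On random data the metric values are meaningless; the point is only the wiring of loss, metrics, and the held-out split.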

Key Design Decisions

ConvLSTM preserves spatial structure across the temporal dimension

A separate CNN-then-LSTM pipeline typically flattens (or pools) CNN feature maps before feeding them into the recurrent layer, discarding the 2D spatial structure of what was learned in each frame. ConvLSTM avoids this by applying convolutional operations directly within the recurrent computation, so the hidden state at each timestep remains a spatial feature map rather than a flat vector. The model can track where motion is occurring in the frame across time, not just that something is changing.
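For reference, one common formulation of the ConvLSTM update replaces the matrix products of a standard LSTM with convolutions. It is written here without peephole connections, which is an assumption; the source does not specify the exact variant. X_t is the input at timestep t, H_t and C_t are the hidden and cell states, `*` denotes convolution, and `\circ` the element-wise product.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
\tilde{C}_t &= \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tilde{C}_t \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```

Because X_t, H_t, and C_t are all 3D tensors (height, width, channels), every gate is computed per spatial location, which is what lets the hidden state record where in the frame something is changing.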

Fixed-length uniform frame sampling ensures consistent input shape across variable-length clips

UCF-101 clips vary in duration. Processing every frame would produce variable-length sequences incompatible with batched tensor operations. Sampling a fixed number of frames at uniform intervals across each clip's full duration standardizes sequence length, enables batched training, and ensures that each sampled sequence covers the full temporal extent of the action — capturing the beginning, middle, and end of the activity rather than biasing toward a specific portion of the clip.
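The index arithmetic behind this kind of sampling can be sketched with NumPy alone. The helper name `sample_frame_indices` and the choice of `np.linspace` with truncation are assumptions made for illustration.

```python
import numpy as np


def sample_frame_indices(total_frames, num_frames):
    """Pick num_frames evenly spaced frame indices across a clip (hypothetical helper)."""
    if total_frames <= 0:
        raise ValueError("clip has no frames")
    # linspace includes both endpoints, so the first and last frames are always
    # sampled; astype(int) truncates the fractional positions in between.
    return np.linspace(0, total_frames - 1, num_frames).astype(int)


# Clips of different lengths yield sequences of the same length.
print(sample_frame_indices(100, 5).tolist())  # [0, 24, 49, 74, 99]
print(sample_frame_indices(250, 5).tolist())  # [0, 62, 124, 186, 249]
# A clip shorter than the sequence length repeats frames rather than failing.
print(sample_frame_indices(3, 5).tolist())    # [0, 0, 1, 1, 2]
```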

OpenCV gives frame-level control over extraction, resizing, and normalization

OpenCV provides direct access to video frame buffers, frame rate metadata, and per-frame image operations — making it well-suited for building a reproducible preprocessing pipeline that works consistently across videos of varying resolution, codec, and duration. Handling resize and normalization at the frame level before sequence construction ensures that the spatial standardization is applied uniformly to every input, regardless of the source video's native properties.

Tech Stack

Technology	Purpose
Python	Core programming language and notebook environment
TensorFlow	ConvLSTM model development, training loop, and evaluation
OpenCV	Video frame extraction, resizing, and normalization
CNN	Spatial feature extraction from individual video frames at each timestep
LSTM	Temporal sequence modeling — propagates context across the frame sequence
NumPy	Array operations, tensor construction, and data manipulation

Results & Metrics

What the system delivers

Classification Accuracy: 84%

Evaluated on unseen UCF-101 video clips, showing generalization across 101 action categories

AUC: 97%

Strong class discrimination: the model's scores rank the correct class above incorrect ones in the large majority of comparisons

Precision: 85%

High-confidence predictions: aggregated over classes, about 85% of the model's positive class assignments are correct

🎯

Strong generalization across 101 action categories

84% accuracy on unseen UCF-101 clips demonstrates that the ConvLSTM model generalizes well across a diverse and challenging benchmark covering 101 distinct human action classes — ranging from sports and exercise to daily activities and instrument playing. This breadth of categories makes UCF-101 a meaningful test of spatiotemporal generalization rather than narrow category memorization.

📈

Strong discriminative power at 97% AUC

A 97% AUC indicates that the model's confidence scores separate correct from incorrect classes well across the full range of decision thresholds, not just at the default classification threshold. For a 101-class problem, AUC is an aggregate over binary comparisons between classes, and it typically sits well above top-1 accuracy, so it is best read alongside the 84% accuracy rather than in isolation. Within that framing, it shows that the learned spatiotemporal features rank the correct action highly even when it is not the top prediction.

⚖️

Balanced precision and recall — confident without missing detections

85% precision and 84% recall are closely matched, so in aggregate the model is not trading false positives for false negatives or the reverse. Aggregate figures can hide per-class differences, particularly where action categories share similar visual and temporal patterns, but the close match suggests that the learned spatiotemporal features are reasonably discriminative across the category space.
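To make the relationship between these metrics concrete, the sketch below computes accuracy and class-averaged precision and recall from a confusion matrix. The function name `summarize_predictions`, the macro averaging, and the toy labels are assumptions; the source does not state how its precision and recall were averaged.

```python
import numpy as np


def summarize_predictions(y_true, y_pred, num_classes):
    """Accuracy plus macro-averaged precision and recall (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # cm[t, p] counts clips whose true class is t and predicted class is p.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)

    correct = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # clips assigned to each class
    actual = cm.sum(axis=1)     # clips that truly belong to each class

    # Per-class precision and recall; classes with no samples get 0.
    precision = np.divide(correct, predicted, out=np.zeros(num_classes), where=predicted > 0)
    recall = np.divide(correct, actual, out=np.zeros(num_classes), where=actual > 0)

    return {
        "accuracy": correct.sum() / cm.sum(),
        "macro_precision": precision.mean(),
        "macro_recall": recall.mean(),
        "per_class_precision": precision,
        "per_class_recall": recall,
    }


# Toy 3-class example; labels are made up for illustration.
report = summarize_predictions(
    y_true=[0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    y_pred=[0, 0, 0, 1, 1, 1, 2, 2, 2, 0],
    num_classes=3,
)
print(report["accuracy"], report["macro_precision"], report["macro_recall"])
```

Inspecting `per_class_precision` and `per_class_recall` is what reveals whether a balanced aggregate hides weak individual categories.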

🔁

End-to-end reproducible pipeline from raw video to evaluated model

The complete workflow — UCF-101 video ingestion, OpenCV frame extraction, resizing, normalization, sequence construction, ConvLSTM training, and evaluation on unseen clips — is implemented in a single reproducible Jupyter Notebook. Every preprocessing and modeling decision is visible and executable step-by-step, supporting experimentation with different frame counts, sequence lengths, or model configurations without rebuilding the pipeline from scratch.

🚀

Architecture directly transferable to custom video classification tasks

The ConvLSTM architecture and OpenCV preprocessing pipeline are not specific to UCF-101. Retargeting the system toward quality inspection footage, security camera feeds, or another video classification domain means replacing the input video folder and class labels and retraining the model on the new data. The 84% accuracy and 97% AUC reported here are specific to UCF-101; performance on a new domain depends on its data and has to be measured separately.
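As a rough illustration of swapping in a new dataset, the sketch below derives class labels from a folder layout with one subdirectory per class. That layout, the function name `index_dataset`, and the accepted file extensions are assumptions, not details taken from the project.

```python
from pathlib import Path

import numpy as np


def index_dataset(root, extensions=(".avi", ".mp4")):
    """Map <root>/<class_name>/<video> files to paths and one-hot labels.

    Hypothetical helper; assumes one subdirectory per class.
    """
    root = Path(root)
    # Sorting keeps the class-to-index mapping stable between runs.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())

    paths, labels = [], []
    for label, name in enumerate(class_names):
        for video in sorted((root / name).iterdir()):
            if video.suffix.lower() in extensions:  # skip non-video files
                paths.append(str(video))
                labels.append(label)

    # One-hot rows match the categorical cross-entropy loss used for training.
    one_hot = np.eye(len(class_names), dtype=np.float32)[np.asarray(labels, dtype=int)]
    return paths, one_hot, class_names
```

Pointing `root` at a different folder changes the class list and label width; the model's output layer then has to be sized to `len(class_names)` and trained on the new clips.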
