Applied ML · NLP

Transformer Architecture

A clean, modular implementation of the full Transformer architecture built from scratch in PyTorch — covering positional encoding, multi-head self-attention, feedforward layers, and complete encoder and decoder stacks.

Architecture: Encoder-Decoder · Multi-Head Attention

Tech Stack

Python · PyTorch · PyTorch Lightning

Source Code View on GitHub

5 Blocks

Positional Encoding, Attention, Feedforward, Encoder, Decoder

Pure PyTorch

No HuggingFace — Every Component Built from Scratch

Seq-to-Seq

Extensible to Translation, Summarization, Generation

The Problem

Transformers are the foundation of modern NLP — but high-level libraries abstract away the mechanics that make them work

The Transformer architecture introduced in Vaswani et al. (2017) fundamentally reshaped sequence modeling — replacing recurrence with self-attention and enabling parallelizable, context-aware representations across entire sequences. Today, virtually every state-of-the-art NLP system is built on Transformer foundations. Yet frameworks like HuggingFace, while powerful for application development, wrap the architecture in layers of abstraction that conceal the core mechanics: how queries, keys, and values interact in scaled dot-product attention; how positional information is injected into token embeddings; how encoder context flows into the decoder through cross-attention. Practitioners who rely solely on high-level libraries can use Transformers without understanding them — and that gap becomes a liability when debugging, adapting architectures, or reasoning about model behavior.

The Solution

A modular, from-scratch PyTorch implementation that builds the full Transformer component by component — making every design decision explicit and inspectable

This project implements the complete Transformer encoder-decoder architecture in PyTorch, following the original Vaswani et al. (2017) paper, without any dependency on HuggingFace or pre-built Transformer classes. Each subcomponent is implemented as a standalone, inspectable module: positional encoding injects sinusoidal position information into token embeddings; multi-head self-attention computes scaled dot-product attention across multiple representation subspaces in parallel; position-wise feedforward layers apply two linear transformations with a non-linearity between them; encoder blocks stack self-attention and feedforward layers with residual connections and LayerNorm; decoder blocks add a cross-attention layer that attends to encoder output, allowing the decoder to condition generation on the full source sequence. The complete encoder-decoder Transformer is assembled from these blocks and trained with the Adam optimizer, with LayerNorm after each sublayer helping to keep training stable. An optional PyTorch Lightning module is provided for structured training orchestration. The architecture is fully configurable: the number of layers, the number of attention heads, and the embedding dimension are all adjustable parameters.
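For reference, the formulas from Vaswani et al. (2017) that these modules correspond to can be written as below. The notation (Q, K, V, d_k, the projection matrices W, pos, i) follows the paper rather than this project's code, and d_k = d_model / h is the per-head dimension.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(QW_i^{Q},\; KW_i^{K},\; VW_i^{V}\right) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O} \\
\mathrm{FFN}(x) &= \max(0,\; xW_1 + b_1)\, W_2 + b_2 \\
PE_{(pos,\,2i)} &= \sin\!\left(\frac{pos}{10000^{\,2i/d_{\mathrm{model}}}}\right) \\
PE_{(pos,\,2i+1)} &= \cos\!\left(\frac{pos}{10000^{\,2i/d_{\mathrm{model}}}}\right)
\end{aligned}
```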

Key Outcome

A complete, from-scratch PyTorch implementation of the Transformer encoder-decoder architecture — five independently inspectable components assembled into a fully configurable seq-to-seq model that trains stably with Adam and LayerNorm, and can be extended to translation, summarization, or generation with minimal modifications.

Technical Deep Dive

Architecture & Design

Transformer Pipeline

Stage 1 — Input Embedding & Positional Encoding

Token Embedding

Token → Dense Vector

Input tokens mapped to d_model-dimensional embedding space

Positional Encoding

Position → Sinusoidal Signal

Sinusoidal encodings added to token embeddings · Injects sequence order without recurrence

▼

Stage 2 — Encoder Stack

Multi-Head Self-Attention

Scaled Dot-Product Attention

Q, K, V projections across h parallel heads · Captures diverse token relationships simultaneously

Feedforward + Residual

Position-wise FFN & LayerNorm

Two linear layers with ReLU · Residual connections + LayerNorm for stable gradient flow

▼

Stage 3 — Decoder Stack

Masked Self-Attention

Causal Masking

Prevents attending to future tokens during autoregressive generation

Cross-Attention

Encoder-Decoder Attention

Decoder queries attend to encoder K, V · Conditions generation on full source context

Feedforward + Residual

Position-wise FFN & LayerNorm

Same structure as encoder FFN · Residual + LayerNorm after each sublayer

▼

Stage 4 — Output Projection & Training

Linear + Softmax

Token Probability Distribution

Projects decoder output to vocabulary size · Softmax produces next-token probabilities

Training

Adam · PyTorch Lightning (optional)

Adam optimizer · LayerNorm for stable gradients · Lightning module for structured orchestration

Stage 1

Input Embedding & Positional Encoding

Input tokens are first converted to dense vectors in a d_model-dimensional embedding space. Because the Transformer has no recurrence, it has no inherent sense of token order; positional encoding addresses this by adding sinusoidal signals to the token embeddings. Each pair of dimensions uses a sine and a cosine of a different wavelength, so for any fixed offset k the encoding at position pos + k is a linear function of the encoding at position pos, which gives the model a simple way to reason about relative positions. Vaswani et al. also hypothesized that this formulation may let the model extrapolate to sequence lengths longer than those seen during training.
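A minimal sketch of how such a positional encoding module might look in PyTorch. The class name `SinusoidalPositionalEncoding` and the `max_len` parameter are illustrative assumptions, not names taken from the project, and the sketch assumes an even `d_model`.

```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position signals to token embeddings.

    Hypothetical sketch; assumes d_model is even.
    """

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
        # 1 / 10000^(2i / d_model) for every even feature index 2i
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even features
        pe[:, 1::2] = torch.cos(position * div_term)  # odd features
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, : x.size(1)]


embeddings = torch.zeros(2, 10, 16)  # stand-in for token embeddings
encoded = SinusoidalPositionalEncoding(d_model=16)(embeddings)
print(encoded.shape)  # torch.Size([2, 10, 16])
```

Because the input here is all zeros, `encoded` is the raw encoding itself: position 0 alternates 0 (sine) and 1 (cosine) across features.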

Stage 2

Encoder Stack

Each encoder block applies multi-head self-attention, computing scaled dot-product attention across h parallel subspaces simultaneously, followed by a position-wise feedforward network. A residual connection wraps each sublayer and LayerNorm is applied to the sum, so each sublayer computes LayerNorm(x + Sublayer(x)) as in the original paper. This keeps activations on a consistent scale and gives gradients a direct path through deep stacks. The number of encoder layers is a configurable parameter, enabling experimentation with depth without changing any other component.
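The block structure described above can be sketched roughly as follows. To keep the example short, `nn.MultiheadAttention` is used as a stand-in for the project's own attention class; the name `EncoderBlock` and its arguments are assumptions for illustration.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Self-attention and feedforward sublayers, each wrapped as LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        # Stand-in for a from-scratch multi-head attention module.
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # key_padding_mask: (batch, seq_len) bool, True marks padding to ignore.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x


# Depth is just the length of the list of blocks.
layers = nn.ModuleList([EncoderBlock(d_model=64, num_heads=4, d_ff=256) for _ in range(2)])
x = torch.randn(2, 5, 64)
for layer in layers:
    x = layer(x)
print(x.shape)  # torch.Size([2, 5, 64])
```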

Stage 3

Decoder Stack

Each decoder block contains three sublayers. Masked self-attention applies causal masking to prevent the decoder from attending to future positions during autoregressive generation. Cross-attention takes decoder queries and attends to encoder keys and values — conditioning each generated token on the full source sequence. A position-wise feedforward network follows, with residual connections and LayerNorm after each sublayer, matching the encoder structure.
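A rough sketch of the three sublayers, again with `nn.MultiheadAttention` standing in for a from-scratch attention class. `DecoderBlock`, `memory` (the encoder output), and the other names are illustrative assumptions. The final lines check the causal property: changing the last target token leaves the outputs at earlier positions unchanged.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Masked self-attention, cross-attention over encoder output, then a feedforward network."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.0):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, tgt, memory):
        t = tgt.size(1)
        # True marks positions that must NOT be attended to (strictly future tokens).
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=tgt.device), diagonal=1)
        a, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal, need_weights=False)
        x = self.norms[0](tgt + self.dropout(a))
        # Queries come from the decoder; keys and values come from the encoder output.
        c, cross_weights = self.cross_attn(x, memory, memory)
        x = self.norms[1](x + self.dropout(c))
        x = self.norms[2](x + self.dropout(self.ffn(x)))
        return x, cross_weights


torch.manual_seed(0)
block = DecoderBlock(d_model=32, num_heads=4, d_ff=64)
memory = torch.randn(2, 7, 32)  # encoder output: source length 7
tgt = torch.randn(2, 4, 32)     # decoder input: target length 4
out, cross_weights = block(tgt, memory)

# Perturb only the last target position and compare earlier positions.
tgt_changed = tgt.clone()
tgt_changed[:, -1] += 1.0
out_changed, _ = block(tgt_changed, memory)
earlier_unchanged = torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5)
```

`cross_weights` has shape (batch, target length, source length), averaged over heads, which is one convenient place to inspect how target positions align with source positions.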

Stage 4

Output Projection & Training

The decoder's final hidden states are projected to vocabulary size via a linear layer, and a softmax produces a probability distribution over the next token. Training uses the Adam optimizer, and the LayerNorm layers throughout the network help keep gradients stable. An optional PyTorch Lightning module wraps the training loop for structured orchestration, separating model definition from training logic while keeping both inspectable and modifiable.
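The projection and optimizer step might look roughly like this. The source does not name the loss function, so the use of cross-entropy here is an assumption, as are the variable names and sizes; random tensors stand in for real decoder hidden states and targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model = 50, 32

# Stand-ins for the decoder's final hidden states and the next-token targets.
hidden = torch.randn(4, 6, d_model)             # (batch, tgt_len, d_model)
targets = torch.randint(0, vocab_size, (4, 6))  # (batch, tgt_len)

projection = nn.Linear(d_model, vocab_size)
optimizer = torch.optim.Adam(projection.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()  # assumed loss; applies log-softmax to the logits internally

losses = []
for step in range(20):
    logits = projection(hidden)  # (batch, tgt_len, vocab_size)
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# At inference time an explicit softmax turns logits into next-token probabilities.
probs = torch.softmax(projection(hidden), dim=-1)
```

During training the softmax is folded into the loss for numerical stability, so the explicit softmax is only needed when reading out probabilities.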

Key Design Decisions

Each subcomponent is a standalone module — the architecture is transparent by design

Positional encoding, multi-head attention, feedforward layers, encoder blocks, and decoder blocks are each implemented as separate, independently testable classes. This modularity means you can inspect, modify, or replace any single component without touching the rest of the architecture. It also makes the implementation readable as a direct translation of the Vaswani et al. paper — each class corresponds to a named section of the original architecture diagram.
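As an illustration of what "independently testable" can mean in practice, here is a sketch of a feedforward module with a small check run in isolation. The class name `PositionwiseFeedForward` and the test function are hypothetical, not taken from the project.

```python
import torch
import torch.nn as nn


class PositionwiseFeedForward(nn.Module):
    """Two linear layers with a ReLU in between, applied to every position separately."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.0):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(torch.relu(self.linear1(x))))


def test_ffn_is_position_wise():
    torch.manual_seed(0)
    ffn = PositionwiseFeedForward(d_model=8, d_ff=32)
    x = torch.randn(2, 5, 8)
    full = ffn(x)
    assert full.shape == x.shape
    # Feeding one position alone must give the same vector as within the full sequence.
    single = ffn(x[:, 2:3])
    assert torch.allclose(full[:, 2:3], single, atol=1e-5)


test_ffn_is_position_wise()
```

The check needs no attention, encoder, or decoder code, which is the practical benefit of keeping each component in its own class.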

No HuggingFace dependency — every attention computation is explicit

Using HuggingFace or torch.nn.Transformer hides the Q, K, V projection logic, the scaling factor, the attention masking, and the multi-head concatenation behind a single API call. This implementation exposes all of it: the projection matrices are explicit parameters, the scaled dot-product attention formula is written out, and the masking logic for both encoder self-attention and decoder causal masking is visible. This makes the implementation genuinely educational — and makes adapting the attention mechanism (e.g., for sparse attention or custom masking schemes) straightforward.
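A minimal sketch of what an explicit multi-head attention module can look like, with the projections, scaling, masking, and head concatenation all written out. The class and attribute names are illustrative assumptions. The mask convention here (boolean, True means "may attend") is a choice made for this sketch.

```python
import math

import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Explicit projection matrices for queries, keys, values, and the output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def _split_heads(self, x):
        b, t, _ = x.shape
        return x.view(b, t, self.num_heads, self.d_k).transpose(1, 2)  # (b, h, t, d_k)

    def forward(self, query, key, value, mask=None):
        q = self._split_heads(self.w_q(query))
        k = self._split_heads(self.w_k(key))
        v = self._split_heads(self.w_v(value))
        # Scaled dot-product scores: (b, h, t_q, t_k)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            # mask is boolean and broadcastable to scores; False positions are blocked.
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        context = weights @ v  # (b, h, t_q, d_k)
        b, _, t_q, _ = context.shape
        # Concatenate heads back into d_model features, then mix them with w_o.
        merged = context.transpose(1, 2).reshape(b, t_q, self.num_heads * self.d_k)
        return self.w_o(merged), weights


torch.manual_seed(0)
mha = MultiHeadAttention(d_model=32, num_heads=4)
x = torch.randn(2, 6, 32)
causal = torch.tril(torch.ones(6, 6, dtype=torch.bool))  # position i sees positions <= i
out, weights = mha(x, x, x, mask=causal)
print(out.shape, weights.shape)  # torch.Size([2, 6, 32]) torch.Size([2, 4, 6, 6])
```

Returning `weights` with a separate head dimension is what makes per-head inspection or plotting possible.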

Configurable depth and width — one architecture, many experiments

The number of encoder and decoder layers, the number of attention heads, the embedding dimension d_model, and the feedforward layer width d_ff are all constructor parameters. Changing from a 2-layer 4-head model to a 6-layer 8-head model means changing a handful of integers at instantiation (the layer counts and head count, plus d_model and d_ff if the width should grow too); nothing else in the code changes. This enables systematic experimentation with model capacity without architectural surgery, and makes the implementation a practical base for adapting the Transformer to new tasks or datasets.
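One way to picture this is a small configuration object that carries those integers. The `TransformerConfig` class, its field names, and the default values below are hypothetical and chosen only for illustration; the divisibility check reflects the usual requirement that heads split d_model evenly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    # Illustrative defaults for a small model, not values from the project.
    num_encoder_layers: int = 2
    num_decoder_layers: int = 2
    num_heads: int = 4
    d_model: int = 128
    d_ff: int = 512

    def __post_init__(self):
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        # Per-head dimension d_k = d_model / h
        return self.d_model // self.num_heads


small = TransformerConfig()
base = TransformerConfig(
    num_encoder_layers=6, num_decoder_layers=6, num_heads=8, d_model=512, d_ff=2048
)
print(small.head_dim, base.head_dim)  # 32 64
```

The second configuration uses the base-model sizes reported in Vaswani et al. (2017).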

Tech Stack

Technology	Purpose
PyTorch	Core implementation — attention, encoder/decoder blocks, training loop, DataLoader compatibility
PyTorch Lightning	Optional training orchestration module — separates model definition from training logic

Results & Metrics

What the implementation delivers

5 Blocks

Independently Inspectable Modules

Positional encoding, multi-head attention, feedforward layers, encoder, and decoder — each a standalone class

Stable

Training Convergence

Adam optimizer with LayerNorm throughout — consistent loss reduction across training runs

Seq-to-Seq

Ready for Real NLP Tasks

Translation, summarization, or generation — pluggable with minimal modifications to the data pipeline

🔍

Multi-head attention successfully captures token-level dependencies

The multi-head self-attention implementation correctly computes scaled dot-product attention across parallel subspaces, allowing the model to simultaneously attend to different aspects of the input sequence from different representation perspectives. Attention weights are visualizable per head, making it possible to inspect what relationships each head has learned to prioritize.

🔗

Encoder-decoder stacking enables contextual sequence generation

Cross-attention in the decoder correctly routes encoder keys and values to each decoder layer, so every generated token is conditioned on the full source sequence representation. This is the mechanism that makes translation and summarization possible — the decoder does not generate from a fixed bottleneck vector but attends directly to the most relevant parts of the encoded source at each generation step.

📉

Stable training with Adam and LayerNorm throughout

LayerNorm after every sublayer in both encoder and decoder stacks normalizes each token's activations across the feature dimension, which keeps activation scales consistent from layer to layer regardless of network depth. Combined with residual connections, this produced stable loss curves across training runs on the toy datasets used in demonstration, without learning rate warm-up schedules or gradient clipping (the original paper, by contrast, used a warm-up schedule for its full-scale models).
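A small sketch of what LayerNorm does in the residual arrangement: each token vector is normalized over its own features, independent of the batch. The tensors here are random stand-ins, and the manual computation is included only to make the operation explicit.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
x = torch.randn(2, 5, d_model) * 10 + 3   # deliberately badly scaled activations
sublayer_out = torch.randn(2, 5, d_model)  # stand-in for attention or FFN output

norm = nn.LayerNorm(d_model)
y = norm(x + sublayer_out)  # post-norm residual: LayerNorm(x + Sublayer(x))

# Same computation by hand: normalize each token over its d_model features.
z = x + sublayer_out
mean = z.mean(dim=-1, keepdim=True)
var = z.var(dim=-1, unbiased=False, keepdim=True)
manual = (z - mean) / torch.sqrt(var + norm.eps)

print(torch.allclose(y, manual, atol=1e-5))  # True
```

At initialization the learnable scale is 1 and the shift is 0, which is why the manual version matches exactly; after training, the module additionally applies its learned scale and shift.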

🧩

Fully configurable — depth, width, and heads are constructor parameters

Number of layers, attention heads, embedding dimension, and feedforward width are all set at instantiation. The same codebase supports a lightweight 2-layer 4-head model for fast experimentation and a deeper 6-layer 8-head model for more expressive capacity — with no changes to any component class. This makes the implementation directly usable as a starting point for custom Transformer variants.

🚀

Minimal modifications needed to apply to real NLP tasks

Swapping the toy random-sequence dataset for a real parallel corpus (e.g., English-Spanish sentence pairs) requires replacing only the DataLoader and tokenizer — the model architecture, training loop, and attention mechanism remain unchanged. This makes the implementation a genuine starting point for translation, abstractive summarization, or any sequence-to-sequence task, not just a demonstration of the architecture in isolation.
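As a sketch of what the replacement data pipeline could involve, the following wraps tokenized sentence pairs in a `Dataset` and pads each batch in a collate function. The `ParallelCorpus` class, the `PAD_ID` value, and the token ids are all hypothetical; a real setup would obtain ids from an actual tokenizer.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

PAD_ID = 0  # assumed padding id; use whatever the tokenizer reserves


class ParallelCorpus(Dataset):
    """Pairs of already-tokenized (source_ids, target_ids) sequences."""

    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, tgt = self.pairs[idx]
        return torch.tensor(src), torch.tensor(tgt)


def collate(batch):
    srcs, tgts = zip(*batch)
    # Pad every sequence in the batch to the length of the longest one.
    src = pad_sequence(list(srcs), batch_first=True, padding_value=PAD_ID)
    tgt = pad_sequence(list(tgts), batch_first=True, padding_value=PAD_ID)
    src_mask = src != PAD_ID  # True where the source holds real tokens
    return src, tgt, src_mask


# Hypothetical token ids standing in for tokenized English-Spanish pairs.
pairs = [([5, 6, 7], [8, 9]), ([5, 10], [11, 12, 13, 14]), ([15], [16])]
loader = DataLoader(ParallelCorpus(pairs), batch_size=3, collate_fn=collate)
src, tgt, src_mask = next(iter(loader))
print(src.shape, tgt.shape)  # torch.Size([3, 3]) torch.Size([3, 4])
```

The padding mask is the piece the attention layers need so that padded source positions are ignored.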
