Transformer Architecture
A clean, modular implementation of the full Transformer architecture built from scratch in PyTorch — covering positional encoding, multi-head self-attention, feedforward layers, and complete encoder and decoder stacks.
5 Blocks
Positional Encoding, Attention, Feedforward, Encoder, Decoder
Pure PyTorch
No HuggingFace — Every Component Built from Scratch
Seq-to-Seq
Extensible to Translation, Summarization, Generation
The Problem
Transformers are the foundation of modern NLP — but high-level libraries abstract away the mechanics that make them work
The Transformer architecture introduced in Vaswani et al. (2017) fundamentally reshaped sequence modeling — replacing recurrence with self-attention and enabling parallelizable, context-aware representations across entire sequences. Today, virtually every state-of-the-art NLP system is built on Transformer foundations. Yet frameworks like HuggingFace, while powerful for application development, wrap the architecture in layers of abstraction that conceal the core mechanics: how queries, keys, and values interact in scaled dot-product attention; how positional information is injected into token embeddings; how encoder context flows into the decoder through cross-attention. Practitioners who rely solely on high-level libraries can use Transformers without understanding them — and that gap becomes a liability when debugging, adapting architectures, or reasoning about model behavior.
The Solution
A modular, from-scratch PyTorch implementation that builds the full Transformer component by component — making every design decision explicit and inspectable
This project implements the complete Transformer encoder-decoder architecture in PyTorch, following the original Vaswani et al. (2017) paper, without any dependency on HuggingFace or pre-built Transformer classes. Each subcomponent is implemented as a standalone, inspectable module: positional encoding adds sinusoidal position information to token embeddings; multi-head self-attention computes scaled dot-product attention across multiple representation subspaces in parallel; position-wise feedforward layers apply two linear transformations with a non-linearity between them; encoder blocks stack self-attention and feedforward layers with residual connections and LayerNorm; decoder blocks add a cross-attention layer that attends to encoder output, allowing the decoder to condition generation on the full source sequence. The complete encoder-decoder Transformer is assembled from these blocks and trained with the Adam optimizer, with LayerNorm in every block keeping activations well scaled. An optional PyTorch Lightning module is provided for structured training orchestration. The architecture is fully configurable: the number of layers, attention heads, and embedding dimensions are all adjustable parameters.
Key Outcome
A complete, from-scratch PyTorch implementation of the Transformer encoder-decoder architecture — five independently inspectable components assembled into a fully configurable seq-to-seq model that trains stably with Adam and LayerNorm, and can be extended to translation, summarization, or generation with minimal modifications.
Technical Deep Dive
Architecture & Design
Transformer Pipeline
Stage 1 — Input Embedding & Positional Encoding
Token Embedding
Token → Dense Vector
Input tokens mapped to d_model-dimensional embedding space
Positional Encoding
Position → Sinusoidal Signal
Sinusoidal encodings added to token embeddings · Injects sequence order without recurrence
Stage 2 — Encoder Stack
Multi-Head Self-Attention
Scaled Dot-Product Attention
Q, K, V projections across h parallel heads · Captures diverse token relationships simultaneously
Feedforward + Residual
Position-wise FFN & LayerNorm
Two linear layers with ReLU · Residual connections + LayerNorm for stable gradient flow
Stage 3 — Decoder Stack
Masked Self-Attention
Causal Masking
Prevents attending to future tokens during autoregressive generation
Cross-Attention
Encoder-Decoder Attention
Decoder queries attend to encoder K, V · Conditions generation on full source context
Feedforward + Residual
Position-wise FFN & LayerNorm
Same structure as encoder FFN · Residual + LayerNorm after each sublayer
Stage 4 — Output Projection & Training
Linear + Softmax
Token Probability Distribution
Projects decoder output to vocabulary size · Softmax produces next-token probabilities
Training
Adam · PyTorch Lightning (optional)
Adam optimizer · LayerNorm for stable gradients · Lightning module for structured orchestration
Stage 1
Input Embedding & Positional Encoding
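Input tokens are first converted to dense vectors in a d_model-dimensional embedding space. Because the Transformer has no recurrence, it has no inherent sense of token order; positional encoding addresses this by adding sinusoidal signals to the token embeddings. Each pair of dimensions holds a sine and a cosine at a fixed frequency, so the encoding at position pos + k is a linear function (a rotation) of the encoding at position pos, which gives the model a simple way to attend by relative position. Vaswani et al. also hypothesized that this formulation may let the model extrapolate to sequence lengths longer than those seen during training; the encoding is defined for any position, but that is not a guarantee of model quality at those lengths.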
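A minimal sketch of such a module, assuming batch-first tensors of shape (batch, seq_len, d_model) and an even d_model. The class name and the max_len parameter are illustrative and are not taken from the project's code.

```python
import math

import torch
from torch import nn


class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position signals to token embeddings."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        if d_model % 2 != 0:
            raise ValueError("this sketch assumes an even d_model")
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
        # 1 / 10000^(2i / d_model) for each even feature index 2i
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the encoding broadcasts over the batch.
        return x + self.pe[:, : x.size(1)]


pos_enc = SinusoidalPositionalEncoding(d_model=16, max_len=50)
embeddings = torch.zeros(2, 10, 16)  # zeros, so the output equals the encoding itself
encoded = pos_enc(embeddings)
```

Because the table is precomputed up to max_len, this version raises a shape error for longer inputs; computing the encoding on the fly would remove that limit at a small runtime cost.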
Stage 2
Encoder Stack
Each encoder block applies multi-head self-attention — computing scaled dot-product attention across h parallel subspaces simultaneously — followed by a position-wise feedforward network. Residual connections wrap both sublayers, and LayerNorm is applied after each, ensuring stable gradient flow through deep stacks. The number of encoder layers is a configurable parameter, enabling experimentation with depth without changing any other component.
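Written out in the notation of Vaswani et al., one encoder block computes the following. The symbols X (block input), Y, and Z (block output) are introduced here for illustration, and d_k = d_model / h follows the paper's convention rather than anything confirmed about this project.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\big(X W_i^{Q},\; X W_i^{K},\; X W_i^{V}\big), \qquad i = 1, \dots, h \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O} \\
Y &= \mathrm{LayerNorm}\big(X + \mathrm{MultiHead}(X)\big) \\
\mathrm{FFN}(y) &= \max(0,\; y W_1 + b_1)\, W_2 + b_2 \\
Z &= \mathrm{LayerNorm}\big(Y + \mathrm{FFN}(Y)\big)
\end{aligned}
```

The last three lines are the "post-LN" arrangement: the residual sum is formed first and LayerNorm is applied to the result.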
Stage 3
Decoder Stack
Each decoder block contains three sublayers. Masked self-attention applies causal masking to prevent the decoder from attending to future positions during autoregressive generation. Cross-attention takes decoder queries and attends to encoder keys and values — conditioning each generated token on the full source sequence. A position-wise feedforward network follows, with residual connections and LayerNorm after each sublayer, matching the encoder structure.
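The two attention sublayers differ only in their mask and in where the keys and values come from. The sketch below illustrates this with plain tensors; it omits the learned Q, K, V projections and the multi-head split, and the tensor names and sizes are made up for the example.

```python
import math

import torch


def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attending is allowed."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # Disallowed positions get -inf, so softmax assigns them zero weight.
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


torch.manual_seed(0)
src_len, tgt_len, d_k = 6, 4, 8
encoder_out = torch.randn(1, src_len, d_k)  # stand-in for the encoder stack output
decoder_in = torch.randn(1, tgt_len, d_k)   # stand-in for decoder hidden states

# Masked self-attention: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
_, self_weights = attention(decoder_in, decoder_in, decoder_in, causal_mask)

# Cross-attention: queries from the decoder, keys and values from the encoder.
context, cross_weights = attention(decoder_in, encoder_out, encoder_out)
```

Here self_weights is lower triangular with shape (1, tgt_len, tgt_len), while cross_weights has shape (1, tgt_len, src_len): every target position sees the whole source.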
Stage 4
Output Projection & Training
The decoder's final hidden states are projected to vocabulary size via a linear layer, and a softmax produces a probability distribution over the next token. Training uses the Adam optimizer with LayerNorm throughout the network providing gradient stabilization. An optional PyTorch Lightning module wraps the training loop for structured orchestration — separating model definition from training logic while keeping both inspectable and modifiable.
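A rough sketch of the projection and a few Adam steps follows. Random tensors stand in for the decoder output, and the cross-entropy loss, learning rate, and sizes are assumptions for illustration; the source does not state the loss or the hyperparameters used.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model, batch, tgt_len = 100, 32, 4, 7

# Stand-in for the decoder's final hidden states; a real run would take
# these from the decoder stack.
hidden = torch.randn(batch, tgt_len, d_model)
targets = torch.randint(0, vocab_size, (batch, tgt_len))

output_proj = nn.Linear(d_model, vocab_size)  # d_model -> vocabulary logits
optimizer = torch.optim.Adam(output_proj.parameters(), lr=1e-3)  # lr is an assumption

losses = []
for _ in range(20):
    logits = output_proj(hidden)  # (batch, tgt_len, vocab_size)
    # cross_entropy applies log-softmax internally, so it takes raw logits.
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# At inference time, softmax turns the logits into next-token probabilities.
probs = torch.softmax(output_proj(hidden), dim=-1)
```

Applying softmax only at inference and feeding raw logits to the loss during training avoids computing the normalization twice and is numerically safer.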
Key Design Decisions
Each subcomponent is a standalone module — the architecture is transparent by design
Positional encoding, multi-head attention, feedforward layers, encoder blocks, and decoder blocks are each implemented as separate, independently testable classes. This modularity means you can inspect, modify, or replace any single component without touching the rest of the architecture. It also makes the implementation readable as a direct translation of the Vaswani et al. paper — each class corresponds to a named section of the original architecture diagram.
No HuggingFace dependency — every attention computation is explicit
Using HuggingFace or torch.nn.Transformer hides the Q, K, V projection logic, the scaling factor, the attention masking, and the multi-head concatenation behind a single API call. This implementation exposes all of it: the projection matrices are explicit parameters, the scaled dot-product attention formula is written out, and the masking logic for both encoder self-attention and decoder causal masking is visible. This makes the implementation genuinely educational — and makes adapting the attention mechanism (e.g., for sparse attention or custom masking schemes) straightforward.
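To make those pieces concrete, here is one way such a module could look. This is a sketch, not the project's code: the class and attribute names are invented, the mask convention (True means "may attend") is a choice made for the example, and nn.Linear adds bias terms that the paper's projection matrices do not have.

```python
import math

import torch
from torch import nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Explicit projection matrices for queries, keys, values, and output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        q = self._split_heads(self.w_q(query))
        k = self._split_heads(self.w_k(key))
        v = self._split_heads(self.w_v(value))
        # Scaling by sqrt(d_k) keeps the dot products from growing with head size.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)  # (batch, heads, q_len, k_len)
        heads = weights @ v                      # (batch, heads, q_len, d_k)
        b, _, n, _ = heads.shape
        # Concatenate the heads back into a single d_model-wide vector per token.
        concat = heads.transpose(1, 2).contiguous().view(b, n, self.num_heads * self.d_k)
        return self.w_o(concat), weights


torch.manual_seed(0)
mha = MultiHeadAttention(d_model=32, num_heads=4)
x = torch.randn(2, 5, 32)
out, weights = mha(x, x, x)  # self-attention: query, key, and value are the same tensor
```

Returning the weights alongside the output makes per-head inspection easy: weights[0, 1] is the 5 x 5 attention map of head 1 for the first sequence. Passing encoder output as key and value turns the same module into cross-attention.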
Configurable depth and width — one architecture, many experiments
The number of encoder and decoder layers, the number of attention heads, the embedding dimension d_model, and the feedforward layer width d_ff are all constructor parameters. Changing from a 2-layer 4-head model to a 6-layer 8-head model requires changing only the layer and head counts at instantiation; nothing else in the code changes. In the standard formulation the one constraint is that d_model must be divisible by the number of heads, since each head works on a d_model / h slice. This enables systematic experimentation with model capacity without architectural surgery, and makes the implementation a practical base for adapting the Transformer to new tasks or datasets.
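One way to keep these settings together is a small configuration object that validates them up front. The field names below are hypothetical, since the project's actual constructor signature is not shown; the defaults are the base-model sizes from Vaswani et al.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    # Hypothetical field names; the project's constructor arguments may differ.
    num_encoder_layers: int = 6
    num_decoder_layers: int = 6
    num_heads: int = 8
    d_model: int = 512
    d_ff: int = 2048

    def __post_init__(self):
        # Each head gets an equal slice of the embedding, so the split must be exact.
        if self.d_model % self.num_heads != 0:
            raise ValueError(
                f"d_model={self.d_model} is not divisible by num_heads={self.num_heads}"
            )

    @property
    def d_head(self) -> int:
        return self.d_model // self.num_heads


small = TransformerConfig(
    num_encoder_layers=2, num_decoder_layers=2, num_heads=4, d_model=128, d_ff=512
)
base = TransformerConfig()
```

Failing at construction time surfaces an invalid head count immediately, rather than as a reshape error deep inside the attention module.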
Tech Stack
| Technology | Purpose |
|---|---|
| PyTorch | Core implementation — attention, encoder/decoder blocks, training loop, DataLoader compatibility |
| PyTorch Lightning | Optional training orchestration module — separates model definition from training logic |
Results & Metrics
What the implementation delivers
5 Blocks
Independently Inspectable Modules
Positional encoding, multi-head attention, feedforward layers, encoder, and decoder — each a standalone class
Stable
Training Convergence
Adam optimizer with LayerNorm throughout — consistent loss reduction across training runs
Seq-to-Seq
Ready for Real NLP Tasks
Translation, summarization, or generation — pluggable with minimal modifications to the data pipeline
Multi-head attention successfully captures token-level dependencies
The multi-head self-attention implementation correctly computes scaled dot-product attention across parallel subspaces, allowing the model to simultaneously attend to different aspects of the input sequence from different representation perspectives. Attention weights are visualizable per head, making it possible to inspect what relationships each head has learned to prioritize.
Encoder-decoder stacking enables contextual sequence generation
Cross-attention in the decoder correctly routes encoder keys and values to each decoder layer, so every generated token is conditioned on the full source sequence representation. This is the mechanism that makes translation and summarization possible — the decoder does not generate from a fixed bottleneck vector but attends directly to the most relevant parts of the encoded source at each generation step.
Stable training with Adam and LayerNorm throughout
LayerNorm after every sublayer in both encoder and decoder stacks normalizes each token's activations across the feature dimension, which keeps their scale roughly constant from layer to layer regardless of depth. Combined with residual connections, this produces stable loss curves across training runs on the toy datasets used in demonstration, without learning rate warm-up schedules or gradient clipping. Larger models and real datasets may still need both; the original paper trained with a warm-up schedule.
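The normalization itself is short enough to write by hand. The sketch below compares a manual version against nn.LayerNorm and then applies the post-LN residual pattern; the nn.Linear sublayer is only a stand-in for attention or the feedforward network, and the sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model = 8
x = torch.randn(2, 5, d_model) * 3.0 + 1.0  # (batch, seq_len, d_model)

layer_norm = nn.LayerNorm(d_model)  # learnable gain starts at 1, bias at 0

# The same computation by hand: statistics over the last (feature) dimension
# only, computed separately for every token in every sequence.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

# Post-LN residual sublayer as in the original paper: LayerNorm(x + Sublayer(x)).
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or the FFN
y = layer_norm(x + sublayer(x))
```

Whatever the scale of x + sublayer(x), each token vector in y has approximately zero mean and unit variance at initialization, which is what keeps activations comparable across layers.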
Fully configurable — depth, width, and heads are constructor parameters
Number of layers, attention heads, embedding dimension, and feedforward width are all set at instantiation. The same codebase supports a lightweight 2-layer 4-head model for fast experimentation and a deeper 6-layer 8-head model for more expressive capacity — with no changes to any component class. This makes the implementation directly usable as a starting point for custom Transformer variants.
Minimal modifications needed to apply to real NLP tasks
Swapping the toy random-sequence dataset for a real parallel corpus (e.g., English-Spanish sentence pairs) requires replacing only the DataLoader and tokenizer — the model architecture, training loop, and attention mechanism remain unchanged. This makes the implementation a genuine starting point for translation, abstractive summarization, or any sequence-to-sequence task, not just a demonstration of the architecture in isolation.
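For reference, a toy dataset of this kind could look like the sketch below. The copy task, the class name, and the use of token id 0 as a start-of-sequence marker are all assumptions; the source only says the demonstration data is a toy random-sequence dataset.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RandomCopyDataset(Dataset):
    """Toy seq-to-seq data: the target is the source sequence itself."""

    def __init__(self, num_samples: int, seq_len: int, vocab_size: int, seed: int = 0):
        gen = torch.Generator().manual_seed(seed)
        # Token id 0 is reserved here as a start-of-sequence marker (an assumption).
        self.src = torch.randint(1, vocab_size, (num_samples, seq_len), generator=gen)

    def __len__(self) -> int:
        return self.src.size(0)

    def __getitem__(self, idx: int):
        src = self.src[idx]
        bos = torch.zeros(1, dtype=src.dtype)
        # Teacher forcing: the decoder sees the target shifted right by one step.
        decoder_input = torch.cat([bos, src[:-1]])
        return src, decoder_input, src  # (source, decoder input, labels)


loader = DataLoader(
    RandomCopyDataset(64, seq_len=10, vocab_size=50), batch_size=16, shuffle=True
)
src, decoder_input, labels = next(iter(loader))
```

A real parallel corpus would replace this class with one that tokenizes sentence pairs and pads them to a common length, yielding the same three tensors per batch plus a padding mask.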