KubeTFX: Kubernetes-Native ML Pipeline
A TFX-based ML pipeline orchestrated via Kubeflow Pipelines on a local Minikube Kubernetes cluster — demonstrating containerized, cloud-native workflow execution with Dockerized components and YAML-based infrastructure configuration.
AirflowTFX vs KubeTFX — Both projects use the same TFX pipeline and dataset. The distinction is entirely infrastructural: AirflowTFX demonstrates DAG-based orchestration and reproducibility in a local Airflow environment. KubeTFX replaces the orchestration layer with Kubeflow Pipelines on Kubernetes — shifting the emphasis from workflow automation to containerized, cloud-native scalability and deployment readiness.
K8s
Kubernetes-Native Execution via Minikube
Kubeflow
Cloud-Native Pipeline Orchestration
Docker
Containerized Portable Components
The Problem
DAG orchestration solves reproducibility — but it does not solve scalability, portability, or cloud deployment readiness
An Airflow-orchestrated TFX pipeline running in a local Python environment is reproducible and automated — but it is not scalable, not containerized, and not cloud-ready. Each pipeline component runs in the same environment with the same dependencies, making it impossible to scale individual components independently. The pipeline cannot be ported to a cloud Kubernetes cluster without significant rearchitecting. There is no container-level isolation between components, no infrastructure-as-code configuration for storage and compute, and no Kubernetes-native execution model that maps cleanly to how production ML platforms like Vertex AI, AWS SageMaker, or Azure ML actually run workloads. Moving from a working local pipeline to a production-grade cloud deployment requires addressing all of these gaps — and that requires a fundamentally different orchestration layer.
The Solution
The same TFX pipeline — rebuilt on Kubeflow Pipelines and Kubernetes to demonstrate cloud-native scalability and deployment readiness
KubeTFX takes the same TFX pipeline from AirflowTFX — data ingestion, schema validation, feature transformation, model training, and evaluation — and replaces the orchestration layer entirely. Kubeflow Pipelines on a local Minikube Kubernetes cluster orchestrates each TFX component as a containerized Kubernetes pod, scheduled and managed by the Kubernetes control plane. Pipeline components are Dockerized for environment reproducibility and portability. Persistent storage is defined using YAML-based PersistentVolume and PersistentVolumeClaim configurations — infrastructure-as-code that maps directly to how storage is managed on cloud Kubernetes clusters. Pipeline metadata is tracked in SQLite for full traceability. The compiled pipeline is submitted to the Kubeflow Pipelines UI where runs can be monitored, compared, and managed. The result is a cloud-ready ML pipeline that can be lifted from Minikube to a managed Kubernetes service with minimal changes.
AirflowTFX — Local DAG Orchestration
Runs in a local Python environment · Airflow manages task dependencies as a DAG · Emphasis on reproducibility, logging, and workflow automation · Single environment for all components · Not containerized
KubeTFX — Kubernetes-Native Execution
Runs on Kubernetes via Kubeflow Pipelines · Each component executes as an isolated Docker container · Emphasis on scalability, portability, and cloud readiness · YAML-based infrastructure config · Cloud-deployable with minimal changes
Key Outcome
A containerized, Kubernetes-native TFX pipeline that demonstrates the infrastructure layer required to move ML workloads from local development to cloud-ready production — with each component running as an isolated Docker pod, storage managed via YAML PV/PVC configuration, and pipeline execution orchestrated and monitored through the Kubeflow Pipelines UI.
Technical Deep Dive
Architecture & Design
Kubernetes Infrastructure & TFX Pipeline
Infrastructure Layer — Kubernetes + Docker
Cluster
Minikube
Local Kubernetes cluster · 4 CPUs, 8GB RAM · Docker driver
Storage · pv.yaml + pvc.yaml
PersistentVolume + PVC
3Gi ReadWriteMany · hostPath mount · YAML infrastructure-as-code
Containerization
Docker
Each TFX component runs as an isolated Docker pod · Environment reproducibility
Kubeflow Pipelines — Kubernetes-Native Orchestration
Stage 1 · ExampleGen
Data Ingestion
Ingests insurance cost CSV from PV mount · Train/eval splits · TFRecord artifacts
Stage 2a · StatisticsGen
Statistics
Dataset statistics for profiling and drift visibility
Stage 2b · SchemaGen
Schema
Learns dataset schema from training examples
Stage 2c · ExampleValidator
Validation
Detects anomalies and schema violations before training
Stage 3 · Transform
Feature Engineering
TFT preprocessing graph · Consistent train/serve transforms · Stored to PVC
Stage 4 · Tuner
Hyperparameter Search
Optional search over candidate training configurations before final fit
Stage 5 · Trainer
Model Training
Regression model training · module.py logic · Artifacts written to PVC
Stage 6 · Model Resolver
Baseline Resolution
Fetches the latest blessed baseline model for candidate comparison
Stage 7 · Evaluator
Model Evaluation
TFMA evaluation · Candidate-versus-baseline checks · Blessing decision
Stage 8 · Pusher
Model Promotion
Pushes the blessed model to the serving model directory for downstream deployment
Kubeflow Pipelines UI — pipeline_run.py
Pipeline Metadata & Traceability
SQLite Metadata Store
Artifact Lineage & Pipeline State
Every artifact, execution, and component state logged · Full reproducibility across Kubeflow pipeline runs
Infrastructure
Minikube + Docker + PV/PVC
A local Minikube Kubernetes cluster (4 CPUs, 8GB RAM, Docker driver) simulates a production cloud Kubernetes environment. Each TFX component runs as an isolated Docker container — a Kubernetes pod — with its own environment and dependencies. Storage is defined using YAML-based PersistentVolume and PersistentVolumeClaim configurations (pv.yaml, pvc.yaml) that allocate 3Gi of ReadWriteMany storage, mirroring how persistent storage is managed on cloud Kubernetes clusters.
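The storage manifests described above can be sketched as follows. The resource names and hostPath shown here are illustrative assumptions, not necessarily the project's actual values; after starting the cluster (e.g. `minikube start --cpus=4 --memory=8192 --driver=docker`), they would be applied with `kubectl apply -f pv.yaml -f pvc.yaml`.

```yaml
# pv.yaml -- illustrative sketch; name and hostPath are assumptions
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kubetfx-pv
spec:
  capacity:
    storage: 3Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /data/kubetfx
---
# pvc.yaml -- claims the volume for the pipeline pods
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kubetfx-pvc
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 3Gi
  storageClassName: ""   # bind to the statically provisioned PV above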
Orchestration
Kubeflow Pipelines
pipeline_run.py compiles the TFX pipeline to a Kubeflow-compatible YAML artifact stored in pl_yaml_output/ and submits it to the Kubeflow Pipelines UI at port 8080. Kubeflow schedules each TFX component as a separate Kubernetes pod, manages execution order, monitors pod health, and surfaces the full pipeline graph and run history through its UI.
Pipeline Sequence
Full TFX Component Flow
The pipeline follows the full TFX progression: ExampleGen, StatisticsGen, SchemaGen, ExampleValidator, Transform, Tuner, Trainer, Model Resolver, Evaluator, and Pusher. This goes beyond a minimal ingest-validate-transform-train flow: the pipeline also performs hyperparameter search, baseline model resolution, model blessing via evaluation thresholds, and final promotion of the blessed model to the serving directory.
Metadata
SQLite Metadata Store
Every artifact, execution, and component state produced by the pipeline is logged in a SQLite metadata store — providing a complete lineage trail from raw data to pushed model. This metadata layer enables full reproducibility across pipeline runs, supports artifact comparison between runs, and provides the audit trail needed for production ML governance.
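In practice this store is managed by ML Metadata (MLMD), whose real tables are richer than what fits here. As a rough, stdlib-only illustration of the kind of lineage query the metadata layer enables, the following sketch uses a deliberately simplified toy schema:

```python
import sqlite3

# Toy, simplified lineage schema -- the real MLMD tables
# (Artifact, Execution, Event, ...) are more elaborate.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artifacts  (id INTEGER PRIMARY KEY, uri TEXT, type TEXT);
CREATE TABLE executions (id INTEGER PRIMARY KEY, component TEXT, state TEXT);
CREATE TABLE events     (artifact_id INTEGER, execution_id INTEGER, direction TEXT);
""")

# One hop of the pipeline: ExampleGen produced a TFRecord artifact,
# which Trainer then consumed to produce a model artifact.
conn.executemany("INSERT INTO artifacts VALUES (?, ?, ?)", [
    (1, "/mnt/pipeline/examples", "Examples"),
    (2, "/mnt/pipeline/model", "Model"),
])
conn.executemany("INSERT INTO executions VALUES (?, ?, ?)", [
    (1, "ExampleGen", "COMPLETE"),
    (2, "Trainer", "COMPLETE"),
])
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, 1, "output"),   # ExampleGen -> examples artifact
    (1, 2, "input"),    # examples artifact -> Trainer
    (2, 2, "output"),   # Trainer -> model artifact
])

def producing_component(artifact_uri: str) -> str:
    """Trace an artifact back to the component execution that produced it."""
    row = conn.execute("""
        SELECT e.component FROM executions e
        JOIN events ev    ON ev.execution_id = e.id AND ev.direction = 'output'
        JOIN artifacts a  ON a.id = ev.artifact_id
        WHERE a.uri = ?
    """, (artifact_uri,)).fetchone()
    return row[0]

print(producing_component("/mnt/pipeline/model"))  # -> Trainer
```

The same join pattern, walked repeatedly, reconstructs the full chain from a pushed model back to the raw data that produced it.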
Key Design Decisions
Minikube simulates production Kubernetes without cloud cost
Running Kubeflow Pipelines on a managed cloud Kubernetes service incurs significant infrastructure cost for development and experimentation. Minikube provides a functionally similar local Kubernetes environment — same kubectl workflow, same pod scheduling model, and same YAML configuration patterns — allowing the pipeline to be validated locally before cloud deployment.
YAML-based PV/PVC decouples storage from compute
Defining storage as PersistentVolume and PersistentVolumeClaim manifests rather than hardcoded local paths treats infrastructure as code — the storage layer is versioned, reviewable, and reproducible. The ReadWriteMany access mode allows multiple pipeline pods to read and write shared artifacts across the workflow.
Full TFX flow improves production readiness
Including Tuner, Model Resolver, Evaluator, and Pusher makes the architecture more representative of a production ML system. The pipeline no longer stops at training; it compares candidate models against an approved baseline, applies evaluation thresholds, and promotes only blessed models into a serving location.
Tech Stack
| Technology | Purpose |
|---|---|
| TensorFlow Extended (TFX) | Standardized ML pipeline framework — ExampleGen through Pusher |
| Kubeflow Pipelines | Kubernetes-native pipeline orchestration, scheduling, and UI |
| Minikube | Local Kubernetes cluster simulation — cloud-compatible execution environment |
| Docker | Containerization of TFX components for reproducibility and portability |
| PersistentVolume + PVC (YAML) | Infrastructure-as-code storage configuration — 3Gi ReadWriteMany shared across pods |
| TF Data Validation (TFDV) | Dataset statistics, schema inference, and anomaly detection |
| TF Transform (TFT) | Consistent preprocessing graph for training and serving |
| TF Model Analysis (TFMA) | Candidate-versus-baseline model evaluation and blessing |
| SQLite | Pipeline metadata store — artifact lineage and execution state tracking |
Results & Metrics
What the system delivers
K8s
Kubernetes-Native Execution
Each TFX component runs as an isolated Docker pod — independently scheduled and managed by the Kubernetes control plane
10
Pipeline Components
ExampleGen through Pusher — fully automated, containerized, and orchestrated via Kubeflow Pipelines
Docker
Containerized & Portable
YAML-based PV/PVC infrastructure config — pipeline lifts from Minikube to cloud Kubernetes with minimal changes
Containerized pipeline validated on local Kubernetes cluster
All 10 TFX components — from ExampleGen through Pusher — completed successfully as containerized Kubernetes pods in the Minikube cluster. The pipeline graph, pod states in the kubeflow namespace, and Minikube resource stats during execution were captured and verified through the Kubeflow Pipelines UI and kubectl, confirming end-to-end cloud-native execution.
Cloud-ready pipeline deployable to managed Kubernetes with minimal changes
The same pipeline YAML, Docker containers, and PV/PVC storage configuration used on Minikube map directly to managed Kubernetes services like GKE, EKS, or AKS. Migrating to cloud primarily means swapping the hostPath PersistentVolume for a cloud storage class that supports ReadWriteMany (for example Filestore-backed storage on GKE or EFS on EKS), updating the storageClassName in pvc.yaml, and pointing kubectl at the cloud cluster; the pipeline execution model, component isolation, and orchestration layer remain identical.
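As an illustration of the storage change only: on a managed cluster the hostPath PV is dropped and the claim targets an RWX-capable storage class instead. The class name below is an assumption that depends on the cluster (e.g. a Filestore-backed class on GKE, an EFS CSI class on EKS).

```yaml
# pvc.yaml (cloud variant) -- illustrative; class name depends on the cluster
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kubetfx-pvc
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 3Gi
  storageClassName: standard-rwx   # e.g. GKE Filestore; use an EFS class on EKS
```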
Pod-level isolation enables independent component scaling
Each pipeline component runs as an independent Docker container with its own resource allocation and lifecycle. Compute-intensive stages like Tuner and Trainer can be assigned more CPU and memory than lightweight stages like ExampleValidator — enabling resource-efficient scaling that a single-environment local pipeline like AirflowTFX cannot express.
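In Kubernetes terms this is ordinary per-container resources configuration on the component pods. A hypothetical sketch of how a Trainer pod's allocation could differ from a validator's (the values are illustrative assumptions, not measured requirements):

```yaml
# Illustrative per-pod resource allocation -- values are assumptions
# Trainer (compute-heavy stage):
resources:
  requests: {cpu: "2", memory: 4Gi}
  limits:   {cpu: "4", memory: 6Gi}
---
# ExampleValidator (lightweight stage):
resources:
  requests: {cpu: 250m, memory: 512Mi}
  limits:   {cpu: 500m, memory: 1Gi}
```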
YAML infrastructure-as-code makes storage reproducible and auditable
PersistentVolume and PersistentVolumeClaim configurations in pv.yaml and pvc.yaml define the storage layer as versioned, reviewable code rather than manual configuration. The ReadWriteMany access mode allows multiple pipeline pods to share artifacts concurrently — a requirement for parallel component execution that is configured once and enforced automatically by the Kubernetes storage layer.
Full artifact lineage tracked via SQLite metadata store
Every artifact, execution, and component state across all 10 pipeline stages is logged in the SQLite metadata store — creating a complete, queryable lineage trail from raw data ingestion through model pushing. Any pipeline run can be fully reconstructed, any artifact traced to its producing component, and any model version linked to the exact data and hyperparameter configuration that generated it.