
KubeTFX: Kubernetes-Native ML Pipeline

A TFX-based ML pipeline orchestrated via Kubeflow Pipelines on a local Minikube Kubernetes cluster — demonstrating containerized, cloud-native workflow execution with Dockerized components and YAML-based infrastructure configuration.

⚖️

AirflowTFX vs KubeTFX — Both projects use the same TFX pipeline and dataset. The distinction is entirely infrastructural: AirflowTFX demonstrates DAG-based orchestration and reproducibility in a local Airflow environment. KubeTFX replaces the orchestration layer with Kubeflow Pipelines on Kubernetes — shifting the emphasis from workflow automation to containerized, cloud-native scalability and deployment readiness.

Architecture TFX · Kubeflow · Kubernetes
Tech Stack
TFX Kubeflow Minikube Docker TFDV TFT TFMA SQLite

K8s

Kubernetes-Native Execution via Minikube

Kubeflow

Cloud-Native Pipeline Orchestration

Docker

Containerized Portable Components

The Problem

DAG orchestration solves reproducibility — but it does not solve scalability, portability, or cloud deployment readiness

An Airflow-orchestrated TFX pipeline running in a local Python environment is reproducible and automated — but it is not scalable, not containerized, and not cloud-ready. Each pipeline component runs in the same environment with the same dependencies, making it impossible to scale individual components independently. The pipeline cannot be ported to a cloud Kubernetes cluster without significant rearchitecting. There is no container-level isolation between components, no infrastructure-as-code configuration for storage and compute, and no Kubernetes-native execution model that maps cleanly to how production ML platforms like Vertex AI, AWS SageMaker, or Azure ML actually run workloads. Moving from a working local pipeline to a production-grade cloud deployment requires addressing all of these gaps — and that requires a fundamentally different orchestration layer.

The Solution

The same TFX pipeline — rebuilt on Kubeflow Pipelines and Kubernetes to demonstrate cloud-native scalability and deployment readiness

KubeTFX takes the same TFX pipeline from AirflowTFX — data ingestion, schema validation, feature transformation, model training, and evaluation — and replaces the orchestration layer entirely. Kubeflow Pipelines on a local Minikube Kubernetes cluster orchestrates each TFX component as a containerized Kubernetes pod, scheduled and managed by the Kubernetes control plane. Pipeline components are Dockerized for environment reproducibility and portability. Persistent storage is defined using YAML-based PersistentVolume and PersistentVolumeClaim configurations — infrastructure-as-code that maps directly to how storage is managed on cloud Kubernetes clusters. Pipeline metadata is tracked in SQLite for full traceability. The compiled pipeline is submitted to the Kubeflow Pipelines UI where runs can be monitored, compared, and managed. The result is a cloud-ready ML pipeline that can be lifted from Minikube to a managed Kubernetes service with minimal changes.

AirflowTFX — Local DAG Orchestration

Runs in a local Python environment · Airflow manages task dependencies as a DAG · Emphasis on reproducibility, logging, and workflow automation · Single environment for all components · Not containerized

KubeTFX — Kubernetes-Native Execution

Runs on Kubernetes via Kubeflow Pipelines · Each component executes as an isolated Docker container · Emphasis on scalability, portability, and cloud readiness · YAML-based infrastructure config · Cloud-deployable as-is

Key Outcome

A containerized, Kubernetes-native TFX pipeline that demonstrates the infrastructure layer required to move ML workloads from local development to cloud-ready production — with each component running as an isolated Docker pod, storage managed via YAML PV/PVC configuration, and pipeline execution orchestrated and monitored through the Kubeflow Pipelines UI.

Technical Deep Dive

Architecture & Design

Kubernetes Infrastructure & TFX Pipeline

Infrastructure Layer — Kubernetes + Docker

Cluster

Minikube

Local Kubernetes cluster · 4 CPUs, 8GB RAM · Docker driver

Storage · pv.yaml + pvc.yaml

PersistentVolume + PVC

3Gi ReadWriteMany · hostPath mount · YAML infrastructure-as-code

Containerization

Docker

Each TFX component runs as an isolated Docker pod · Environment reproducibility

Kubeflow Pipelines — Kubernetes-Native Orchestration

Stage 1 · ExampleGen

Data Ingestion

Ingests insurance cost CSV from PV mount · Train/eval splits · TFRecord artifacts

Stage 2a · StatisticsGen

Statistics

Dataset statistics for profiling and drift visibility

Stage 2b · SchemaGen

Schema

Learns dataset schema from training examples

Stage 2c · ExampleValidator

Validation

Detects anomalies and schema violations before training

Stage 3 · Transform

Feature Engineering

TFT preprocessing graph · Consistent train/serve transforms · Stored to PVC

Stage 4 · Tuner

Hyperparameter Search

Optional search over candidate training configurations before final fit

Stage 5 · Trainer

Model Training

Regression model training · module.py logic · Artifacts written to PVC

Stage 6 · Model Resolver

Baseline Resolution

Fetches the latest blessed baseline model for candidate comparison

Stage 7 · Evaluator

Model Evaluation

TFMA evaluation · Candidate-versus-baseline checks · Blessing decision

Stage 8 · Pusher

Model Promotion

Pushes the blessed model to the serving model directory for downstream deployment

Kubeflow Pipelines UI — pipeline_run.py

Pod-level component isolation · Compiled YAML in pl_yaml_output/ · KFP UI at :8080 · SQLite metadata store · Run monitoring & comparison

Pipeline Metadata & Traceability

SQLite Metadata Store

Artifact Lineage & Pipeline State

Every artifact, execution, and component state logged · Full reproducibility across Kubeflow pipeline runs

Infrastructure

Minikube + Docker + PV/PVC

A local Minikube Kubernetes cluster (4 CPUs, 8GB RAM, Docker driver) simulates a production cloud Kubernetes environment. Each TFX component runs as an isolated Docker container — a Kubernetes pod — with its own environment and dependencies. Storage is defined using YAML-based PersistentVolume and PersistentVolumeClaim configurations (pv.yaml, pvc.yaml) that allocate 3Gi of ReadWriteMany storage, mirroring how persistent storage is managed on cloud Kubernetes clusters.
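The storage manifests described here might look like the following sketch. The resource names and hostPath location are assumptions for illustration; the capacity and access mode follow the description above:

```yaml
# pv.yaml — sketch; names and hostPath location are assumptions,
# capacity and access mode as described (3Gi, ReadWriteMany)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tfx-pv
spec:
  capacity:
    storage: 3Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /data/tfx
---
# pvc.yaml — claims the volume so pipeline pods can mount it
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfx-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 3Gi
```

On Minikube, a hostPath-backed PV is the simplest way to satisfy a ReadWriteMany claim; on a cloud cluster the same PVC would bind to a network-backed volume instead.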

Orchestration

Kubeflow Pipelines

pipeline_run.py compiles the TFX pipeline to a Kubeflow-compatible YAML artifact stored in pl_yaml_output/ and submits it to the Kubeflow Pipelines UI at port 8080. Kubeflow schedules each TFX component as a separate Kubernetes pod, manages execution order, monitors pod health, and surfaces the full pipeline graph and run history through its UI.

Pipeline Sequence

Full TFX Component Flow

The pipeline follows the full TFX progression: ExampleGen, StatisticsGen, SchemaGen, ExampleValidator, Transform, Tuner, Trainer, Model Resolver, Evaluator, and Pusher. Beyond ingestion, validation, transformation, and training, the flow also covers hyperparameter search, baseline model resolution, model blessing, and final promotion to the serving directory.
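The execution order Kubeflow enforces over these components can be sketched as a small dependency graph. The edges below follow conventional TFX wiring and are assumptions for illustration, not extracted from this project's pipeline definition:

```python
from graphlib import TopologicalSorter

# Downstream component -> upstream dependencies (conventional TFX wiring;
# an assumption here, not read from this project's pipeline code)
deps = {
    "StatisticsGen": {"ExampleGen"},
    "SchemaGen": {"StatisticsGen"},
    "ExampleValidator": {"StatisticsGen", "SchemaGen"},
    "Transform": {"ExampleGen", "SchemaGen"},
    "Tuner": {"Transform"},
    "Trainer": {"Transform", "Tuner"},
    "Resolver": set(),  # fetches the latest blessed baseline independently
    "Evaluator": {"Trainer", "Resolver"},
    "Pusher": {"Evaluator"},
}

# static_order() yields a valid execution sequence: every upstream
# component precedes the components that consume its artifacts
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Kubeflow derives the same constraint set from the compiled pipeline YAML and schedules a pod for each component only once its upstream artifacts exist.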

Metadata

SQLite Metadata Store

Every artifact, execution, and component state produced by the pipeline is logged in a SQLite metadata store — providing a complete lineage trail from raw data to pushed model. This metadata layer enables full reproducibility across pipeline runs, supports artifact comparison between runs, and provides the audit trail needed for production ML governance.
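In a real TFX deployment this layer is ML Metadata (MLMD) backed by SQLite; the sketch below is a simplified stdlib illustration of the lineage idea (artifacts joined to the executions that produced them), not the actual MLMD schema:

```python
import sqlite3

# Simplified lineage store (illustrative only; MLMD's real schema differs)
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE executions (id INTEGER PRIMARY KEY, component TEXT, state TEXT);
CREATE TABLE artifacts (id INTEGER PRIMARY KEY, uri TEXT, producer_id INTEGER,
                        FOREIGN KEY (producer_id) REFERENCES executions(id));
""")
con.execute("INSERT INTO executions VALUES (1, 'ExampleGen', 'COMPLETE')")
con.execute("INSERT INTO executions VALUES (2, 'Trainer', 'COMPLETE')")
con.execute("INSERT INTO artifacts VALUES (1, '/pv/examples/train', 1)")
con.execute("INSERT INTO artifacts VALUES (2, '/pv/model/serving', 2)")

# Trace an artifact back to the component run that produced it
row = con.execute("""
    SELECT a.uri, e.component FROM artifacts a
    JOIN executions e ON a.producer_id = e.id
    WHERE a.uri LIKE '%model%'
""").fetchone()
print(row)  # ('/pv/model/serving', 'Trainer')
```

Because the store is plain SQLite, lineage questions reduce to SQL joins, which is what makes the audit trail queryable.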

Key Design Decisions

Minikube simulates production Kubernetes without cloud cost

Running Kubeflow Pipelines on a managed cloud Kubernetes service incurs significant infrastructure cost for development and experimentation. Minikube provides a functionally similar local Kubernetes environment — same kubectl workflow, same pod scheduling model, and same YAML configuration patterns — allowing the pipeline to be validated locally before cloud deployment.
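Under the setup described above, cluster bring-up and UI access look roughly like this; svc/ml-pipeline-ui is the standard KFP service name and is an assumption here rather than taken from the project:

```shell
# Start a local cluster sized as described (4 CPUs, 8 GB RAM, Docker driver)
minikube start --cpus=4 --memory=8192 --driver=docker

# Apply the infrastructure-as-code storage layer
kubectl apply -f pv.yaml -f pvc.yaml

# Expose the Kubeflow Pipelines UI on localhost:8080
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
```

The same kubectl workflow then carries over unchanged when the kubeconfig context points at a cloud cluster instead of Minikube.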

YAML-based PV/PVC decouples storage from compute

Defining storage as PersistentVolume and PersistentVolumeClaim manifests rather than hardcoded local paths treats infrastructure as code — the storage layer is versioned, reviewable, and reproducible. The ReadWriteMany access mode allows multiple pipeline pods to read and write shared artifacts across the workflow.

Full TFX flow improves production readiness

Including Tuner, Model Resolver, Evaluator, and Pusher makes the architecture more representative of a production ML system. The pipeline no longer stops at training; it compares candidate models against an approved baseline, applies evaluation thresholds, and promotes only blessed models into a serving location.

Tech Stack

Technology · Purpose
TensorFlow Extended (TFX) · Standardized ML pipeline framework — ExampleGen through Pusher
Kubeflow Pipelines · Kubernetes-native pipeline orchestration, scheduling, and UI
Minikube · Local Kubernetes cluster simulation — cloud-compatible execution environment
Docker · Containerization of TFX components for reproducibility and portability
PersistentVolume + PVC (YAML) · Infrastructure-as-code storage configuration — 3Gi ReadWriteMany shared across pods
TF Data Validation (TFDV) · Dataset statistics, schema inference, and anomaly detection
TF Transform (TFT) · Consistent preprocessing graph for training and serving
TF Model Analysis (TFMA) · Candidate-versus-baseline model evaluation and blessing
SQLite · Pipeline metadata store — artifact lineage and execution state tracking

Results & Metrics

What the system delivers

K8s

Kubernetes-Native Execution

Each TFX component runs as an isolated Docker pod — independently scheduled and managed by the Kubernetes control plane

10

Pipeline Components

ExampleGen through Pusher — fully automated, containerized, and orchestrated via Kubeflow Pipelines

Docker

Containerized & Portable

YAML-based PV/PVC infrastructure config — pipeline lifts from Minikube to cloud Kubernetes with minimal changes

Containerized pipeline validated on local Kubernetes cluster

All 10 TFX components — from ExampleGen through Pusher — completed successfully as containerized Kubernetes pods in the Minikube cluster. The pipeline graph, pod states in the kubeflow namespace, and Minikube resource stats during execution were captured and verified through the Kubeflow Pipelines UI and kubectl, confirming end-to-end cloud-native execution.

☸️

Cloud-ready pipeline deployable to managed Kubernetes with minimal changes

The same pipeline YAML, Docker containers, and PV/PVC storage configuration used on Minikube map directly to managed Kubernetes services like GKE, EKS, or AKS. Migrating to the cloud means swapping the hostPath-backed PersistentVolume for a provider storage class (updating storageClassName in pvc.yaml) and pointing kubectl at the cloud cluster; the pipeline execution model, component isolation, and orchestration layer remain identical.
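A sketch of that storage change, assuming the PVC keeps its local name; the storage class names are provider examples (e.g. GKE's Filestore CSI class for ReadWriteMany), not taken from the project:

```yaml
# pvc.yaml on a managed cluster — the claim stays the same, but a provider
# storage class replaces the manually defined hostPath PV
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfx-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: standard-rwx  # e.g. GKE Filestore; EFS CSI on EKS, Azure Files on AKS
  resources:
    requests:
      storage: 3Gi
```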

🐳

Pod-level isolation enables independent component scaling

Each pipeline component runs as an independent Docker container with its own resource allocation and lifecycle. Compute-intensive stages like Tuner and Trainer can be assigned more CPU and memory than lightweight stages like ExampleValidator — enabling resource-efficient scaling that is architecturally impossible in a single-environment local pipeline like AirflowTFX.
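Per-component sizing of this kind is expressed through standard Kubernetes resource requests and limits on each pod's container spec; the values below are illustrative assumptions, not taken from the project:

```yaml
# Pod spec fragment for a compute-heavy stage such as Trainer (values are
# assumptions) — lightweight stages would request far less
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 8Gi
```

The Kubernetes scheduler places each pod according to its own request, which is what makes independent scaling of Tuner and Trainer possible without touching the other stages.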

📋

YAML infrastructure-as-code makes storage reproducible and auditable

PersistentVolume and PersistentVolumeClaim configurations in pv.yaml and pvc.yaml define the storage layer as versioned, reviewable code rather than manual configuration. The ReadWriteMany access mode allows multiple pipeline pods to share artifacts concurrently — a requirement for parallel component execution that is configured once and enforced automatically by the Kubernetes storage layer.

🔍

Full artifact lineage tracked via SQLite metadata store

Every artifact, execution, and component state across all 10 pipeline stages is logged in the SQLite metadata store — creating a complete, queryable lineage trail from raw data ingestion through model pushing. Any pipeline run can be fully reconstructed, any artifact traced to its producing component, and any model version linked to the exact data and hyperparameter configuration that generated it.