AirflowTFX: End-to-End MLOps Pipeline
An end-to-end TFX-based ML pipeline for medical insurance cost prediction — orchestrated via Apache Airflow DAGs for automated, reproducible, and modular workflow execution.
DAG
Airflow-Orchestrated Task Dependencies
TFX
Standardized ML Pipeline Framework
8 Stages
Ingestion to Deployment — Fully Automated
The Problem
Notebook-based ML workflows are not reproducible, not auditable, and not production-ready
Most ML development happens in notebooks — a format optimized for exploration but fundamentally unsuited to production. Notebooks have no task dependency management, no schema validation to catch data quality issues before training, no consistent preprocessing contract between training and serving, no standardized evaluation framework, and no orchestration layer that ensures every step runs in the correct order with full logging and traceability. Every manual re-run is a potential source of inconsistency. Every ad hoc preprocessing step is a source of training-serving skew. Without a structured pipeline replacing notebook-based workflows, ML systems cannot be reliably reproduced, audited, or scaled — regardless of how good the model itself is.
The Solution
A TFX pipeline orchestrated via Apache Airflow DAGs — automated, modular, and fully reproducible from ingestion to evaluation
AirflowTFX replaces manual notebook-based ML development with a structured, automated pipeline for medical insurance cost prediction. The TFX pipeline covers eight stages — data ingestion via ExampleGen; schema inference and anomaly detection via StatisticsGen, SchemaGen, and ExampleValidator; consistent feature preprocessing via Transform; hyperparameter search via Tuner; model training via Trainer; baseline resolution via the Model Resolver; post-training evaluation via the TFMA-powered Evaluator; and blessing-gated deployment via Pusher. Apache Airflow orchestrates the entire pipeline as a DAG, managing task dependencies, scheduling, logging, and run history through its web UI. One key engineering challenge surfaced during implementation: Airflow and TFX have conflicting dependency requirements, resolved by running the compiled TFX pipeline in a dedicated virtual environment invoked directly from the Airflow DAG definition in TFX_Pipeline_DAG.py.
Key Outcome
A fully automated, DAG-orchestrated TFX pipeline that replaces manual notebook workflows with a reproducible, modular ML system — covering every stage from data ingestion through schema validation, feature transformation, hyperparameter tuning, model training, evaluation, and blessing-gated deployment, with full task dependency management and run logging via Apache Airflow.
Technical Deep Dive
Architecture & Design
TFX Pipeline & Airflow DAG
Apache Airflow — DAG Orchestration Layer
Stage 1 · ExampleGen
Data Ingestion
Ingests medical insurance cost CSV · Splits into train/eval sets · Produces TFRecord artifacts
Stage 2a · StatisticsGen
Statistics
Computes dataset stats for drift detection
Stage 2b · SchemaGen
Schema
Infers feature schema from training data
Stage 2c · ExampleValidator
Validation
Detects anomalies against inferred schema
Stage 3 · Transform
Feature Engineering
TFT preprocessing graph · Consistent transforms for training + serving · Eliminates training-serving skew
Stage 4 · Tuner
Hyperparameter Tuning
Keras Tuner integration · Searches optimal hyperparameter configuration · Best trial passed to Trainer
Stage 5 · Trainer
Model Training
Regression model on insurance cost data · module.py defines training logic · Artifacts saved to pipeline store
Stage 6 · Model Resolver
Model Comparison
Resolves the latest blessed production model as a baseline · Supplies it to the Evaluator for candidate-vs-baseline comparison · Blocks regressions from replacing the baseline
Stage 7 · Evaluator
Model Evaluation
TFMA post-training evaluation · Configurable metrics · Evaluation artifacts logged for comparison
Stage 8 · Pusher
Model Deployment
Blessed models pushed to serving infrastructure · Deployment gated by Evaluator blessing
Airflow DAG — TFX_Pipeline_DAG.py
Stage 1
Data Ingestion
ExampleGen ingests the medical insurance cost CSV dataset and partitions it into training and evaluation splits, producing standardized TFRecord artifacts. This ensures all downstream components receive data in a consistent, versioned format — eliminating the ad hoc data loading that makes notebook-based workflows non-reproducible across runs.
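ExampleGen's train/eval partitioning is hash-based, which is what makes it deterministic across runs. A minimal plain-Python stand-in (not the TFX API; the 2:1 split ratio and column names are illustrative) showing the idea:

```python
import csv
import hashlib
import io

def split_record(row_key: str, train_buckets: int = 2, total_buckets: int = 3) -> str:
    """Assign a record to 'train' or 'eval' by hash bucket, so the same
    record always lands in the same split on every run."""
    bucket = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % total_buckets
    return "train" if bucket < train_buckets else "eval"

def ingest(csv_text: str) -> dict:
    """Read a CSV and partition its rows into deterministic train/eval splits."""
    splits = {"train": [], "eval": []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        splits[split_record(",".join(row.values()))].append(row)
    return splits

sample = "age,bmi,charges\n19,27.9,16884\n33,22.7,1725\n28,33.0,4449\n"
splits = ingest(sample)
assert len(splits["train"]) + len(splits["eval"]) == 3
assert ingest(sample) == ingest(sample)  # identical splits on re-run
```

The hash of the record's content, not its position in the file, decides the split — so reordering the input CSV does not shuffle examples between train and eval.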
Stage 2
Schema Validation
Three components run in sequence — StatisticsGen computes descriptive statistics across all features, SchemaGen infers a formal data schema from the training set, and ExampleValidator checks the dataset's statistics against that schema to detect anomalies such as missing values, unexpected types, or out-of-range values. Data quality issues are caught before they reach training rather than after they silently degrade model performance.
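The infer-then-validate contract can be sketched in plain Python — this is a conceptual stand-in for what SchemaGen and ExampleValidator do, not the TFDV API, and it checks only feature presence and type rather than TFDV's full statistics-based anomaly catalog:

```python
def infer_schema(rows: list) -> dict:
    """Infer a simple schema (Python type name per feature) from training
    rows — roughly what SchemaGen derives from StatisticsGen's output."""
    schema = {}
    for row in rows:
        for name, value in row.items():
            schema.setdefault(name, type(value).__name__)
    return schema

def validate(rows: list, schema: dict) -> list:
    """Flag anomalies against the schema: missing or unexpected features
    and type mismatches, in the spirit of ExampleValidator."""
    anomalies = []
    for i, row in enumerate(rows):
        for name in schema:
            if name not in row:
                anomalies.append((i, name, "missing feature"))
        for name, value in row.items():
            if name not in schema:
                anomalies.append((i, name, "unexpected feature"))
            elif type(value).__name__ != schema[name]:
                anomalies.append((i, name, "type mismatch"))
    return anomalies

train = [{"age": 19, "bmi": 27.9}, {"age": 33, "bmi": 22.7}]
schema = infer_schema(train)
# A string "19" arriving where training saw integers is caught here,
# before it ever reaches the Trainer:
assert validate([{"age": "19", "bmi": 27.9}], schema) == [(0, "age", "type mismatch")]
```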
Stage 3
Feature Engineering
The Transform component applies TF Transform to preprocess features using a TensorFlow graph that is saved alongside the model. The same preprocessing graph is applied at both training and serving time — eliminating training-serving skew entirely. Feature engineering logic is defined once and guaranteed to be identical in every context where the model is used.
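The skew-elimination property boils down to one rule: both paths call the same preprocessing definition. A toy sketch (plain Python; TF Transform enforces this by exporting the function as a saved TensorFlow graph rather than by convention — feature names and statistics here are made up):

```python
def preprocessing_fn(features: dict, mean_bmi: float, std_bmi: float) -> dict:
    """One preprocessing definition, written once. TFT's equivalent is
    traced into a graph saved alongside the model."""
    return {
        "bmi_scaled": (features["bmi"] - mean_bmi) / std_bmi,
        "smoker_int": 1 if features["smoker"] == "yes" else 0,
    }

# Training path and serving path invoke the *same* function with the
# *same* training-time statistics — divergence is impossible by construction.
train_feats = preprocessing_fn({"bmi": 30.0, "smoker": "yes"}, mean_bmi=30.0, std_bmi=6.0)
serve_feats = preprocessing_fn({"bmi": 30.0, "smoker": "yes"}, mean_bmi=30.0, std_bmi=6.0)
assert train_feats == serve_feats
```

Note the statistics (`mean_bmi`, `std_bmi`) are computed once from the training data and frozen into the graph — a hand-rolled serving path recomputing them from live traffic is exactly the skew this design rules out.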
Stage 4
Hyperparameter Tuning
The Tuner component integrates Keras Tuner to search across model hyperparameters using the transformed features. The best trial is automatically passed to the Trainer — removing manual tuning from the pipeline and ensuring every run uses an optimized configuration without human intervention.
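The contract is simple: search a space, return the best trial, hand it to training. A minimal exhaustive-grid sketch (Keras Tuner typically uses smarter strategies such as random search or Hyperband; the space and toy objective below are invented for illustration):

```python
from itertools import product

def grid_search(objective, space: dict):
    """Score every configuration in the grid and return the best trial —
    as the Tuner component hands its best trial to the Trainer."""
    names = list(space)
    best = None
    for values in product(*(space[n] for n in names)):
        config = dict(zip(names, values))
        score = objective(config)          # lower loss is better
        if best is None or score < best[0]:
            best = (score, config)
    return best

space = {"learning_rate": [1e-3, 1e-2, 1e-1], "units": [16, 32, 64]}
# Toy objective whose optimum is lr=1e-2 with the widest layer:
loss, best_cfg = grid_search(lambda c: abs(c["learning_rate"] - 1e-2) + 1 / c["units"], space)
assert best_cfg == {"learning_rate": 1e-2, "units": 64}
```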
Stage 5
Model Training
The Trainer component runs the regression model defined in module.py using the tuned hyperparameters and transformed features. Training artifacts — model weights, saved model, and run metadata — are written to the pipeline artifact store, versioned and accessible to the Evaluator. The training logic is fully decoupled from the orchestration layer, making it independently testable and replaceable.
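The decoupling means the training logic is just a function over transformed features. A toy stand-in for the model defined in module.py — a single-feature linear regressor fit by gradient descent in plain Python (the real model is a Keras network; the data below is synthetic):

```python
def train_regressor(xs, ys, lr=0.05, epochs=2000):
    """Fit y ≈ w*x + b by batch gradient descent on mean squared error —
    an independently testable stand-in for module.py's training logic."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data lying exactly on y = 2x + 1:
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = train_regressor(xs, ys)
assert abs(w - 2.0) < 0.1 and abs(b - 1.0) < 0.1
```

Because the function takes data in and returns weights out, it can be unit-tested on synthetic inputs like this without touching Airflow or the artifact store — the testability the decoupling buys.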
Stage 6
Model Comparison
The Model Resolver retrieves the latest blessed model — the current production baseline — from pipeline metadata and hands it to the Evaluator. The new candidate is compared against this baseline and promoted only if it improves on the baseline metrics — preventing a model that performs worse than its predecessor from silently replacing it in the serving pipeline.
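The baseline-resolution and promotion gate can be sketched as two small functions (plain Python over an illustrative registry, not the TFX Resolver/ML Metadata API; RMSE as the comparison metric is an assumption):

```python
def resolve_baseline(registry: list):
    """Return the latest *blessed* model — the role the Resolver node
    plays by querying pipeline metadata."""
    blessed = [m for m in registry if m["blessed"]]
    return max(blessed, key=lambda m: m["version"]) if blessed else None

def should_promote(candidate_rmse: float, baseline_rmse) -> bool:
    """Promote only on improvement (lower RMSE); if no blessed baseline
    exists yet, the first candidate passes."""
    return baseline_rmse is None or candidate_rmse < baseline_rmse

registry = [
    {"version": 1, "blessed": True,  "rmse": 6100.0},
    {"version": 2, "blessed": False, "rmse": 6500.0},  # failed evaluation, never blessed
]
baseline = resolve_baseline(registry)
assert baseline["version"] == 1                    # unblessed v2 is ignored
assert should_promote(5900.0, baseline["rmse"])    # improvement → promote
assert not should_promote(6300.0, baseline["rmse"])  # regression → blocked
```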
Stage 7
Model Evaluation
The Evaluator uses TensorFlow Model Analysis (TFMA) to compute post-training metrics against the held-out evaluation split. Evaluation results are stored as versioned artifacts alongside the model — enabling metric comparison across pipeline runs and providing the audit trail needed to justify model promotion and deployment decisions.
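The shape of the output — metrics computed on the held-out split, serialized as a versioned artifact — can be sketched in plain Python (this mimics the spirit of TFMA's output, not its API; the metric names and run-id field are illustrative):

```python
import json
import math

def evaluate(predictions: list, labels: list) -> dict:
    """Compute regression metrics on the held-out evaluation split."""
    n = len(labels)
    errors = [p - y for p, y in zip(predictions, labels)]
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }

metrics = evaluate([110.0, 95.0], [100.0, 100.0])
assert metrics["mae"] == 7.5
assert metrics["rmse"] == math.sqrt(62.5)

# Stored alongside the model, keyed by run — what makes metrics
# comparable across pipeline runs:
artifact = json.dumps({"run_id": "2024-01-01T00:00:00", "metrics": metrics})
```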
Stage 8
Model Deployment
The Pusher component deploys blessed models to the serving infrastructure — deployment is gated by the Evaluator blessing, ensuring that only models that pass all evaluation thresholds are promoted to production. This closes the pipeline from training to serving in a fully automated, auditable sequence with no manual deployment steps.
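The blessing gate reduces to: no blessing, no copy into serving. A stdlib sketch of that contract (the versioned-directory layout mirrors common TF Serving conventions; paths and file names are illustrative, not the Pusher API):

```python
import pathlib
import shutil
import tempfile

def push_if_blessed(model_dir: pathlib.Path, blessed: bool,
                    serving_root: pathlib.Path, version: int) -> bool:
    """Copy the model into a versioned serving directory only when the
    Evaluator's blessing is present — the gate the Pusher enforces."""
    if not blessed:
        return False
    shutil.copytree(model_dir, serving_root / str(version))
    return True

with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    (tmp / "model").mkdir()
    (tmp / "model" / "saved_model.pb").write_text("weights")
    serving = tmp / "serving"
    serving.mkdir()
    assert not push_if_blessed(tmp / "model", False, serving, 1)  # gated out
    assert push_if_blessed(tmp / "model", True, serving, 1)       # deployed
    assert (serving / "1" / "saved_model.pb").exists()
```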
Orchestration
Apache Airflow DAG
TFX_Pipeline_DAG.py defines the pipeline as an Airflow DAG — managing task dependencies, execution order, scheduling, and run logging through the Airflow web UI on port 8080. One key engineering problem surfaced during implementation: Airflow and TFX have conflicting dependency requirements, resolved by invoking the compiled TFX pipeline from a dedicated virtual environment within the DAG definition — keeping both tools isolated without sacrificing integration.
Key Design Decisions
Airflow DAG orchestration replaces manual execution order
Without an orchestration layer, pipeline stages must be run manually in the correct order — and any execution error leaves the system in an undefined state. By defining the TFX pipeline as an Airflow DAG, task dependencies are declared explicitly and enforced automatically. A failed stage stops downstream tasks from running on stale or missing artifacts, and the full run history is logged and searchable through the Airflow UI.
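The enforcement described above can be sketched as a tiny executor over a topologically ordered task list (plain Python, not Airflow's API; the component names are taken from the pipeline, the failure is simulated):

```python
def run_dag(tasks: list, deps: dict, runner) -> dict:
    """Run tasks in dependency order; any task whose upstream did not
    succeed is skipped rather than run on stale or missing artifacts —
    the guarantee the Airflow DAG provides for the TFX pipeline."""
    status = {}
    for task in tasks:  # tasks listed in topological order
        if any(status[d] != "success" for d in deps.get(task, [])):
            status[task] = "skipped"
            continue
        try:
            runner(task)
            status[task] = "success"
        except Exception:
            status[task] = "failed"
    return status

tasks = ["examplegen", "statisticsgen", "schemagen", "trainer"]
deps = {"statisticsgen": ["examplegen"],
        "schemagen": ["statisticsgen"],
        "trainer": ["schemagen"]}

def runner(task):
    if task == "statisticsgen":
        raise RuntimeError("simulated stage failure")

status = run_dag(tasks, deps, runner)
assert status == {"examplegen": "success", "statisticsgen": "failed",
                  "schemagen": "skipped", "trainer": "skipped"}
```

One failed stage, and every downstream stage is skipped rather than run against whatever partial artifacts happen to be on disk — the "undefined state" of manual execution cannot occur.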
Dedicated TFX virtual environment resolves Airflow compatibility
Airflow and TFX have conflicting dependency requirements that prevent them from coexisting in the same Python environment. Rather than downgrading either tool, the pipeline compiles TFX separately and invokes it from a dedicated virtual environment within the Airflow DAG task. This isolation pattern — running TFX in its own environment and calling it from Airflow — is a practical, production-relevant solution to a real dependency management problem that practitioners encounter when building MLOps systems with heterogeneous toolchains.
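The isolation pattern amounts to: the Airflow task shells out to the venv's own interpreter instead of importing TFX into Airflow's process. A minimal sketch — the venv path and script name below are hypothetical, not taken from the repository:

```python
import subprocess

# Hypothetical locations — assumptions for illustration only:
TFX_VENV_PYTHON = "/opt/venvs/tfx-env/bin/python"
PIPELINE_SCRIPT = "run_tfx_pipeline.py"

def build_tfx_command() -> list:
    """Build the command an Airflow task (e.g. a BashOperator, or a
    PythonOperator wrapping subprocess) would use to run the compiled
    TFX pipeline inside its dedicated virtual environment."""
    return [TFX_VENV_PYTHON, PIPELINE_SCRIPT]

def run_tfx_pipeline():
    """Invoke the isolated interpreter. Airflow's process never imports
    TFX, so the two dependency sets never collide."""
    return subprocess.run(build_tfx_command(), check=True)

assert build_tfx_command() == ["/opt/venvs/tfx-env/bin/python", "run_tfx_pipeline.py"]
```

`check=True` makes a non-zero exit from the TFX run raise in the Airflow task, so a pipeline failure is surfaced as a failed DAG task rather than silently swallowed.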
TFT Transform graph eliminates training-serving skew by design
Training-serving skew — where preprocessing at training time differs from preprocessing at inference time — is one of the most common and hardest-to-debug production ML failures. TF Transform generates a preprocessing TensorFlow graph that is saved alongside the model and applied identically in both contexts. The skew is eliminated architecturally rather than by discipline, making it impossible for the two preprocessing paths to diverge across pipeline runs.
Tech Stack
| Technology | Purpose |
|---|---|
| TensorFlow Extended (TFX) | Standardized ML pipeline framework — ExampleGen through Pusher |
| Apache Airflow | DAG-based workflow orchestration, task dependency management, and scheduling |
| TF Data Validation (TFDV) | Dataset statistics, schema inference, and anomaly detection |
| TF Transform (TFT) | Consistent preprocessing graph for training and serving |
| TF Model Analysis (TFMA) | Post-training model evaluation with versioned metric artifacts |
| Keras Tuner | Automated hyperparameter search within TFX Tuner component |
| pandas | Data manipulation and feature engineering for the insurance cost dataset |
| Python | Core language and pipeline orchestration |
Results & Metrics
What the system delivers
DAG
Airflow-Orchestrated Pipeline
Task dependencies declared and enforced automatically — full run history and logging via Airflow UI
TFX
Standardized ML Framework
TFDV, TFT, and TFMA integrated — reproducible, auditable pipeline from ingestion to deployment
8 Stages
Fully Automated Workflow
Ingestion through deployment — every stage automated, logged, and reproducible across runs
Fully reproducible pipeline execution across every run
Every pipeline run produces identical results from the same inputs — data ingestion, schema validation, feature transformation, training, and evaluation all execute in a fixed, dependency-enforced order. TFX artifacts are versioned at every stage, making any run fully reproducible and any result traceable back to the exact data and code that produced it.
Data quality issues caught before training — not after
StatisticsGen, SchemaGen, and ExampleValidator form a three-component data quality gate that runs before any training begins. Schema violations, anomalous values, and missing features are detected and flagged at ingestion time — preventing silent data quality degradation from propagating into model weights and corrupting downstream evaluation metrics.
Single DAG trigger runs the entire pipeline end-to-end
Triggering the pipeline with `airflow dags trigger tfx_pipeline_dag` executes all eight stages in the correct order without any manual intervention. Airflow manages the execution graph, retries failed tasks, and surfaces the full run log in the UI — replacing a sequence of manual script executions with a single, auditable, repeatable workflow trigger.
Environment isolation solves real-world toolchain compatibility
The Airflow-TFX compatibility conflict is a genuine dependency management problem, not an academic exercise. The solution — compiling TFX in a dedicated virtual environment and invoking it from within the Airflow DAG — demonstrates the kind of practical production ML engineering that real deployments with heterogeneous toolchains demand.
Improved reproducibility and scalability over notebook workflows
The pipeline automates all eight stages from data ingestion through deployment — a clear improvement in reproducibility and scalability over manual notebook-based workflows. Every run produces versioned artifacts at each stage, evaluation metrics are stored and comparable across runs, and the full execution history is available through the Airflow UI for audit and debugging.