TFX Evaluator
A lightweight custom TFX component that fills a critical gap in the standard ecosystem — computing per-output and global metrics for multi-output TensorFlow models.
Per-Output + Global
Dual-Level Metric Granularity — Output-by-Output and Aggregated
TFMA-Compatible
Drop-in JSON Artifact for Downstream Pipeline Integration
PyPI
Published Package — MIT Licensed, Production-Ready
The Problem
The standard TFX Evaluator component cannot evaluate multi-output models — it treats the entire model as a single unit and produces no per-output metric granularity
TensorFlow Extended is the standard framework for production-grade ML pipelines — but its built-in Evaluator component is designed around single-output models. When a model produces predictions across multiple outputs simultaneously, the standard Evaluator collapses evaluation into a single aggregate score, making it impossible to understand how the model is performing on each individual output. For multi-output regression tasks — where different outputs may have vastly different scales, units, or error tolerances — this lack of granularity is a critical blind spot. A model that performs well overall can still be failing badly on specific outputs, and the standard TFX ecosystem provides no mechanism to detect this within the pipeline itself. Practitioners working with multi-output models are forced to write custom evaluation logic outside the pipeline, breaking reproducibility and making model analysis inconsistent across runs.
The Solution
A drop-in TFX component that computes per-output and global metrics for multi-output models and writes a TFMA-compatible artifact directly into the pipeline
The TFX Multi-Output Evaluator is a custom TFX component — MultiOutputEvaluator — that slots directly into any existing TFX pipeline in place of or alongside the standard Evaluator. It accepts the trained model, the transformed examples, and the TransformGraph artifact, along with a list of output names and the metrics to compute. The component's executor loads the SavedModel and the TF Transform output, builds an evaluation dataset via a pluggable input function, and computes MSE and MAE both per-output and globally across all outputs. Results are written as a TFMA-compatible JSON artifact — using the standardized per_output>>{name}>>{metric} and global>>{metric} key format — making them immediately consumable by downstream pipeline components, reporting tools, and custom analysis scripts without any format conversion.
Key Outcome
A published TFX component that fills a genuine gap in the standard ecosystem: multi-output model evaluation with full per-output metric granularity, directly inside TFX pipelines. Results are written as TFMA-compatible JSON artifacts that integrate seamlessly with downstream analysis, reporting, and pipeline orchestration, with no custom format handling required.
Technical Deep Dive
Architecture & Design
Evaluation Pipeline
Stage 1 — Component Inputs
Input A
Trained Model
Channel[Model] · SavedModel format · shape (batch, num_outputs)
Input B
Examples
Channel[Examples] · TFRecords · split specified via example_split
Input C
TransformGraph
Channel[TransformGraph] · Loaded via tft.TFTransformOutput
Stage 2 — Executor
Step 1
Load Artifacts
SavedModel loaded · TransformGraph loaded via tft.TFTransformOutput
Step 2
Build Dataset
input_fn_path resolved · Pluggable dataset function invoked · tf.data.Dataset returned
Step 3
Run Inference
SavedModel predictions · shape (batch, num_outputs) · aligned to output_names
Stage 3 — Metric Computation
Per-Output Metrics
MSE & MAE Per Output
per_output>>{output_name}>>mse · per_output>>{output_name}>>mae · one entry per output_names item
Global Metrics
MSE & MAE Aggregated
global>>mse · global>>mae · aggregated across all outputs and all batches
Stage 4 — Artifact Output
Output · ModelEvaluation Artifact
TFMA-Compatible JSON — evaluation.json
Written to ModelEvaluation artifact URI · Consumable by downstream pipeline components, reporting tools, and custom analysis scripts
Stage 1
Component Inputs
MultiOutputEvaluator accepts three TFX artifact channels: the trained SavedModel, the TFRecords of transformed examples for the split named by example_split, and the TransformGraph from the upstream Transform component. In addition, output_names defines the logical name of each model output dimension, metrics specifies which metrics to compute, and input_fn_path provides the dotted import path to the user-supplied dataset builder function.
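Wiring these inputs up might look roughly like the fragment below. This is a hypothetical sketch, not the verified tfx-moe API: the module name, parameter names, and channel keys are assumptions based on the description above, and `trainer` / `transform` stand for the upstream standard components.

```python
from tfx_moe import MultiOutputEvaluator  # assumed module name for tfx-moe

# trainer and transform are the upstream TFX components (not defined here);
# all parameter and channel names below are illustrative assumptions.
evaluator = MultiOutputEvaluator(
    model=trainer.outputs['model'],                        # Channel[Model]
    examples=transform.outputs['transformed_examples'],    # Channel[Examples]
    transform_graph=transform.outputs['transform_graph'],  # Channel[TransformGraph]
    example_split='eval',
    output_names=['out_a', 'out_b'],   # one logical name per output dimension
    metrics=['mse', 'mae'],
    input_fn_path='my_project.eval_inputs:eval_input_fn',  # hypothetical path
    input_fn_kwargs={'batch_size': 64},
)
```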
Stage 2
Executor
The executor loads the SavedModel and the TF Transform output, then resolves input_fn_path at runtime using Python's importlib — supporting both colon-separated and dot-separated path formats. The resolved function is called with the TFRecords file pattern, the TFTransformOutput, and any additional kwargs from input_fn_kwargs, returning a tf.data.Dataset of (features, labels) pairs. The SavedModel is then run over the dataset, producing predictions of shape (batch, num_outputs) aligned to output_names.
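The path-resolution step can be sketched with the standard library alone. `resolve_input_fn` is a hypothetical helper name, but the two supported formats (colon-separated `pkg.module:func` and dot-separated `pkg.module.func`) match the behavior described above.

```python
import importlib

def resolve_input_fn(path: str):
    """Resolve a dotted import path to a callable.

    Supports both 'pkg.module:func' (colon-separated) and
    'pkg.module.func' (dot-separated) formats.
    """
    if ":" in path:
        module_name, fn_name = path.split(":", 1)
    else:
        module_name, _, fn_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, fn_name)

# Both formats resolve to the same callable, e.g. the stdlib math.sqrt:
sqrt_a = resolve_input_fn("math:sqrt")
sqrt_b = resolve_input_fn("math.sqrt")
```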
Stage 3
Metric Computation
For each output index i, predictions[:, i] is compared against labels[:, 0, i] to compute per-output MSE and MAE. Keys follow the TFMA convention: per_output>>{output_name}>>mse and per_output>>{output_name}>>mae. Global metrics — global>>mse and global>>mae — are then computed by aggregating across all outputs and all batches, giving a single model-level performance summary alongside the granular per-output breakdown.
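The metric computation can be sketched in NumPy as follows. This is a simplified single-batch version (the real executor aggregates across batches), using the tensor shapes and key convention described above; `compute_metrics` is a hypothetical function name.

```python
import numpy as np

def compute_metrics(predictions, labels, output_names):
    """Per-output and global MSE/MAE, keyed in the format described above.

    predictions has shape (batch, num_outputs); labels has shape
    (batch, 1, num_outputs), matching the labels[:, 0, i] indexing.
    """
    metrics = {}
    errors = predictions - labels[:, 0, :]  # shape (batch, num_outputs)
    for i, name in enumerate(output_names):
        metrics[f"per_output>>{name}>>mse"] = float(np.mean(errors[:, i] ** 2))
        metrics[f"per_output>>{name}>>mae"] = float(np.mean(np.abs(errors[:, i])))
    # Global metrics aggregate over every output and every example at once.
    metrics["global>>mse"] = float(np.mean(errors ** 2))
    metrics["global>>mae"] = float(np.mean(np.abs(errors)))
    return metrics
```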
Stage 4
Artifact Output
All computed metrics are assembled into a single TFMA-style dictionary and written as evaluation.json under the ModelEvaluation artifact URI. The JSON structure is immediately parseable with standard Python — no TFMA library dependency required for reading results. The artifact URI is accessible via multi_output_evaluator.outputs['evaluation'].get()[0].uri, making it straightforward to load results into pandas, log to experiment trackers, or pass to downstream pipeline components.
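Reading the artifact back needs only the standard library. The sketch below writes a file shaped like the described output purely for illustration; the output name ("temperature"), the metric values, and the directory layout are placeholders, not real results.

```python
import json
import os
import tempfile

# Hypothetical artifact location; in a real pipeline the URI comes from
# multi_output_evaluator.outputs['evaluation'].get()[0].uri
eval_uri = os.path.join(tempfile.mkdtemp(), "evaluation", "7")
os.makedirs(eval_uri)

# Illustrative file shaped like the described evaluation.json output.
sample = {
    "per_output>>temperature>>mse": 0.42,
    "per_output>>temperature>>mae": 0.51,
    "global>>mse": 0.38,
    "global>>mae": 0.47,
}
with open(os.path.join(eval_uri, "evaluation.json"), "w") as f:
    json.dump(sample, f)

# Reading requires only the stdlib json module; no TFMA dependency.
with open(os.path.join(eval_uri, "evaluation.json")) as f:
    results = json.load(f)

global_mse = results["global>>mse"]
per_output = {k: v for k, v in results.items() if k.startswith("per_output>>")}
```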
Key Design Decisions
Filling a genuine gap in the TFX ecosystem
The standard TFX Evaluator is built around single-output models and provides no mechanism to compute per-output metrics for models with multiple simultaneous outputs. Rather than working around this limitation outside the pipeline — which breaks reproducibility and makes evaluation logic inconsistent — MultiOutputEvaluator extends TFX from within, conforming to the same artifact and channel conventions as every other TFX component. The component integrates without disrupting the rest of the pipeline and without requiring changes to existing components upstream or downstream.
Pluggable dataset function decouples evaluation from data format
Every multi-output model has a different feature schema, label structure, and data loading logic. Rather than baking in assumptions about data format, the component accepts input_fn_path — a dotted import path to a user-supplied function that returns a tf.data.Dataset. This means the executor handles all the TFX artifact resolution, model loading, and metric computation, while the user retains full control over how their data is read and prepared. Swapping the dataset function for a different split or schema requires only changing input_fn_path and input_fn_kwargs — no changes to the component itself.
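A user-supplied dataset builder matching this contract might look like the sketch below. It assumes TensorFlow is installed and follows the calling convention described earlier (file pattern, TFTransformOutput, plus extras from input_fn_kwargs); the label feature name ("labels") and the function name are assumptions for this example.

```python
import tensorflow as tf

def eval_input_fn(file_pattern, tf_transform_output, batch_size=64):
    """Sketch of a dataset builder referenced via input_fn_path.

    Must return a tf.data.Dataset of (features, labels) pairs. The
    label feature name "labels" is an assumed part of the schema.
    """
    feature_spec = dict(tf_transform_output.transformed_feature_spec())

    def parse(serialized):
        parsed = tf.io.parse_single_example(serialized, feature_spec)
        labels = parsed.pop("labels")  # split labels out of the features
        return parsed, labels

    files = tf.io.gfile.glob(file_pattern)
    return tf.data.TFRecordDataset(files).map(parse).batch(batch_size)
```

Swapping splits or schemas then only means pointing input_fn_path at a different function, exactly as described above.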
TFMA-compatible output format enables zero-friction downstream integration
The evaluation.json artifact follows the TFMA key naming convention — per_output>>{name}>>{metric} and global>>{metric} — which means it can be consumed by any tool already built around TFMA output without format adaptation. The JSON is written to a standard ModelEvaluation artifact URI, accessible through the normal TFX output channel API. Downstream components, experiment tracking integrations, and custom reporting scripts can all read the results using standard JSON parsing — no TFMA library installation required on the reading side.
Tech Stack
| Technology | Purpose |
|---|---|
| TensorFlow / Keras | SavedModel loading and inference over multi-output model predictions |
| TensorFlow Extended (TFX) | Pipeline component framework — artifact channels, executor base class, component spec |
| TF Transform (TFT) | TransformGraph loading via tft.TFTransformOutput for consistent preprocessing |
| TF Model Analysis (TFMA) | Output artifact format — TFMA-compatible JSON key convention for downstream compatibility |
| Python | Core language, importlib-based input function resolution, metric computation |
| PyPI | Package distribution — pip install tfx-moe |
Results & Metrics
What the component delivers
Per-Output + Global
Dual-Level Metric Granularity
Output-by-output MSE and MAE alongside aggregated global scores — in a single evaluation pass
TFMA-Compatible
Drop-in JSON Artifact
Standardized key format consumed by downstream pipeline components and reporting tools without format conversion
PyPI
Published & MIT Licensed
Installable in one command · Open source · Free for any use
Per-output metrics expose what aggregate scores hide
A model with a strong global MSE can still be failing badly on specific outputs, particularly in multi-output regression tasks where different outputs operate at different scales or have different error tolerances. The component computes MSE and MAE independently for each output dimension, surfacing output-level performance problems that global aggregation masks. This granularity is essential for diagnosing model behavior before promotion to production and for identifying which outputs need further attention in training.
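The effect is easy to demonstrate with made-up numbers: below, one output is predicted almost perfectly while the other is badly off, and the aggregate score averages the contrast away.

```python
import numpy as np

# Two outputs: output 0 predicted almost perfectly, output 1 badly off.
# All values here are illustrative, not real model results.
predictions = np.array([[1.00, 5.0], [2.00, 9.0], [3.00, 1.0]])
labels      = np.array([[1.01, 0.0], [2.02, 0.0], [3.01, 0.0]])

errors = predictions - labels
per_output_mse = np.mean(errors ** 2, axis=0)  # one value per output
global_mse = np.mean(errors ** 2)              # aggregate over everything

# per_output_mse exposes the failing output; global_mse sits in between,
# and with many well-behaved outputs it can look entirely unremarkable.
```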
Integrates directly into TFX pipelines without disrupting existing components
MultiOutputEvaluator conforms to TFX's artifact and channel conventions — it consumes standard Model, Examples, and TransformGraph channels and produces a standard ModelEvaluation artifact. Adding it to an existing pipeline requires only importing the component and wiring its inputs to existing upstream outputs. No changes to ExampleGen, Transform, Trainer, or any other component are needed. The component slots in as naturally as any built-in TFX component.
TFMA-compatible JSON artifact — readable with standard Python, no library dependency
The evaluation.json artifact follows TFMA's key naming convention, making it compatible with tools already built around TFMA output. Crucially, reading the results requires only Python's built-in json module — no TFMA installation is needed on the consuming side. Results can be loaded directly into a pandas DataFrame for analysis, logged to MLflow or any experiment tracker, or passed to downstream pipeline components through the standard artifact URI interface.
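Loading the results into a pandas table might look like the sketch below. The results file is written here purely for illustration; the output names and metric values are placeholders.

```python
import json
import tempfile

import pandas as pd

# Illustrative results shaped like the described evaluation.json artifact.
results = {
    "per_output>>out_a>>mse": 0.12, "per_output>>out_a>>mae": 0.25,
    "per_output>>out_b>>mse": 0.98, "per_output>>out_b>>mae": 0.71,
    "global>>mse": 0.55, "global>>mae": 0.48,
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(results, f)
    path = f.name

# Standard-library read, then a tidy per-output table in pandas.
with open(path) as f:
    loaded = json.load(f)

rows = []
for key, value in loaded.items():
    parts = key.split(">>")
    if parts[0] == "per_output":  # keep only the per-output entries
        rows.append({"output": parts[1], "metric": parts[2], "value": value})

df = pd.DataFrame(rows).pivot(index="output", columns="metric", values="value")
```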
Pluggable dataset function supports any data schema and label structure
The component imposes no constraints on how evaluation data is loaded or prepared. The input_fn_path parameter accepts a dotted import path to any function that returns a tf.data.Dataset of (features, labels) pairs — giving the practitioner full control over feature parsing, batching, and label alignment. Additional keyword arguments can be passed via input_fn_kwargs without modifying the component. This design means the same component works across entirely different datasets, schemas, and model architectures without any code changes to the component itself.
Applied in production — used inside the AirflowTFX and KubeTFX pipelines
The component was developed to address a real evaluation gap encountered during the construction of the AirflowTFX and KubeTFX production pipeline projects — both of which involve multi-output regression models where per-output metric tracking is essential. MultiOutputEvaluator is used directly in those pipelines as the evaluation stage, replacing any need for post-hoc evaluation scripts outside the pipeline and keeping the entire ML lifecycle — including output-level model analysis — fully reproducible within the TFX framework.