Open-Source Tools · Python Package · PyPI

TFX Multi-Output Evaluator

A lightweight custom TFX component that fills a critical gap in the standard ecosystem — computing per-output and global metrics for multi-output TensorFlow models.

Tech Stack
Python TensorFlow TFX TFMA Pipelines PyPI

Per-Output + Global

Dual-Level Metric Granularity — Output-by-Output and Aggregated

TFMA-Compatible

Drop-in JSON Artifact for Downstream Pipeline Integration

PyPI

Published Package — MIT Licensed, Production-Ready

The Problem

The standard TFX Evaluator component cannot evaluate multi-output models — it treats the entire model as a single unit and produces no per-output metric granularity

TensorFlow Extended is the standard framework for production-grade ML pipelines — but its built-in Evaluator component is designed around single-output models. When a model produces predictions across multiple outputs simultaneously, the standard Evaluator collapses evaluation into a single aggregate score, making it impossible to understand how the model is performing on each individual output. For multi-output regression tasks — where different outputs may have vastly different scales, units, or error tolerances — this lack of granularity is a critical blind spot. A model that performs well overall can still be failing badly on specific outputs, and the standard TFX ecosystem provides no mechanism to detect this within the pipeline itself. Practitioners working with multi-output models are forced to write custom evaluation logic outside the pipeline, breaking reproducibility and making model analysis inconsistent across runs.

The Solution

A drop-in TFX component that computes per-output and global metrics for multi-output models and writes a TFMA-compatible artifact directly into the pipeline

The TFX Multi-Output Evaluator is a custom TFX component — MultiOutputEvaluator — that slots directly into any existing TFX pipeline in place of or alongside the standard Evaluator. It accepts the trained model, the transformed examples, and the TransformGraph artifact, along with a list of output names and the metrics to compute. The component's executor loads the SavedModel and the TF Transform output, builds an evaluation dataset via a pluggable input function, and computes MSE and MAE both per-output and globally across all outputs. Results are written as a TFMA-compatible JSON artifact — using the standardized per_output>>{name}>>{metric} and global>>{metric} key format — making them immediately consumable by downstream pipeline components, reporting tools, and custom analysis scripts without any format conversion.

Key Outcome

A published TFX component that fills a genuine gap in the standard ecosystem — enabling multi-output model evaluation with full per-output metric granularity directly inside TFX pipelines, with results written as TFMA-compatible JSON artifacts that integrate seamlessly with downstream analysis, reporting, and pipeline orchestration without any custom format handling.

Technical Deep Dive

Architecture & Design

Evaluation Pipeline

Stage 1 — Component Inputs

Input A

Trained Model

Channel[Model] · SavedModel format · shape (batch, num_outputs)

Input B

Examples

Channel[Examples] · TFRecords · split specified via example_split

Input C

TransformGraph

Channel[TransformGraph] · Loaded via tft.TFTransformOutput

Stage 2 — Executor

Step 1

Load Artifacts

SavedModel loaded · TransformGraph loaded via tft.TFTransformOutput

Step 2

Build Dataset

input_fn_path resolved · Pluggable dataset function invoked · tf.data.Dataset returned

Step 3

Run Inference

SavedModel predictions · shape (batch, num_outputs) · aligned to output_names

Stage 3 — Metric Computation

Per-Output Metrics

MSE & MAE Per Output

per_output>>{output_name}>>mse · per_output>>{output_name}>>mae · one entry per output_names item

Global Metrics

MSE & MAE Aggregated

global>>mse · global>>mae · aggregated across all outputs and all batches

Stage 4 — Artifact Output

Output · ModelEvaluation Artifact

TFMA-Compatible JSON — evaluation.json

Written to ModelEvaluation artifact URI · Consumable by downstream pipeline components, reporting tools, and custom analysis scripts

Stage 1

Component Inputs

MultiOutputEvaluator accepts three TFX artifact channels — the trained SavedModel, the transformed examples (TFRecords) for the split specified via example_split, and the TransformGraph from the upstream Transform component. Additionally, output_names defines the logical name for each model output dimension, metrics specifies which metrics to compute, and input_fn_path provides the dotted import path to the user-supplied dataset builder function.

Stage 2

Executor

The executor loads the SavedModel and the TF Transform output, then resolves input_fn_path at runtime using Python's importlib — supporting both colon-separated and dot-separated path formats. The resolved function is called with the TFRecords file pattern, the TFTransformOutput, and any additional kwargs from input_fn_kwargs, returning a tf.data.Dataset of (features, labels) pairs. The SavedModel is then run over the dataset, producing predictions of shape (batch, num_outputs) aligned to output_names.
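The path-resolution step can be sketched in a few lines of stdlib Python. This is a minimal sketch, not the package's actual code: resolve_input_fn is a hypothetical name, and the json module stands in for a user's real input-function module.

```python
import importlib

def resolve_input_fn(path):
    """Resolve a user-supplied function from a colon-separated path
    ('my_pkg.inputs:eval_input_fn') or a dot-separated one
    ('my_pkg.inputs.eval_input_fn')."""
    if ":" in path:
        module_name, fn_name = path.split(":", 1)
    else:
        # Everything before the last dot is the module, the rest is the name.
        module_name, _, fn_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, fn_name)

# Both spellings resolve to the same callable (stdlib stand-in):
assert resolve_input_fn("json:loads") is resolve_input_fn("json.loads")
```

Because resolution happens at runtime inside the executor, the dataset builder never has to be importable at pipeline-definition time, only on the worker that runs the component.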

Stage 3

Metric Computation

For each output index i, predictions[:, i] is compared against labels[:, 0, i] to compute per-output MSE and MAE. Keys follow the TFMA convention: per_output>>{output_name}>>mse and per_output>>{output_name}>>mae. Global metrics — global>>mse and global>>mae — are then computed by aggregating across all outputs and all batches, giving a single model-level performance summary alongside the granular per-output breakdown.
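The per-output and global aggregation described above can be sketched with plain-Python stand-ins for the prediction and label tensors. compute_metrics and the output names "temp" and "pressure" are illustrative, not the package's internals; only the `>>` key convention comes from the component itself.

```python
def compute_metrics(predictions, labels, output_names):
    """Per-output and global MSE/MAE in the TFMA-style '>>' key format.

    predictions, labels: lists of rows, one float per output per row
    (plain-Python stand-ins for the (batch, num_outputs) tensors).
    """
    metrics = {}
    all_sq, all_abs = [], []
    for i, name in enumerate(output_names):
        errors = [p[i] - l[i] for p, l in zip(predictions, labels)]
        sq = [e * e for e in errors]
        ab = [abs(e) for e in errors]
        metrics[f"per_output>>{name}>>mse"] = sum(sq) / len(sq)
        metrics[f"per_output>>{name}>>mae"] = sum(ab) / len(ab)
        all_sq += sq   # global metrics pool every output and every example
        all_abs += ab
    metrics["global>>mse"] = sum(all_sq) / len(all_sq)
    metrics["global>>mae"] = sum(all_abs) / len(all_abs)
    return metrics

preds  = [[1.0, 10.0], [2.0, 20.0]]
labels = [[1.0, 12.0], [3.0, 18.0]]
m = compute_metrics(preds, labels, ["temp", "pressure"])
# temp errors: 0, -1 -> mse 0.5, mae 0.5; pressure errors: -2, 2 -> mse 4.0, mae 2.0
```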

Stage 4

Artifact Output

All computed metrics are assembled into a single TFMA-style dictionary and written as evaluation.json under the ModelEvaluation artifact URI. The JSON structure is immediately parseable with standard Python — no TFMA library dependency required for reading results. The artifact URI is accessible via multi_output_evaluator.outputs['evaluation'].get()[0].uri, making it straightforward to load results into pandas, log to experiment trackers, or pass to downstream pipeline components.
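Since the artifact is plain JSON, reading it needs nothing beyond the standard library. The snippet below is a sketch that assumes the file is a flat key-to-value dict in the `>>` convention (the metric values are illustrative) and pivots the per-output keys into rows ready for something like pandas.DataFrame.from_dict(rows, orient="index"):

```python
import json

# Hypothetical evaluation.json content; in a pipeline this would come from
# open(os.path.join(artifact_uri, "evaluation.json")).
raw = json.loads("""{
  "per_output>>temp>>mse": 0.5,
  "per_output>>temp>>mae": 0.5,
  "per_output>>pressure>>mse": 4.0,
  "per_output>>pressure>>mae": 2.0,
  "global>>mse": 2.25,
  "global>>mae": 1.25
}""")

# Pivot 'per_output>>{name}>>{metric}' keys into {name: {metric: value}}.
rows = {}
for key, value in raw.items():
    parts = key.split(">>")
    if parts[0] == "per_output":
        _, name, metric = parts
        rows.setdefault(name, {})[metric] = value
```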

Key Design Decisions

Filling a genuine gap in the TFX ecosystem

The standard TFX Evaluator is built around single-output models and provides no mechanism to compute per-output metrics for models with multiple simultaneous outputs. Rather than working around this limitation outside the pipeline — which breaks reproducibility and makes evaluation logic inconsistent — MultiOutputEvaluator extends TFX from within, conforming to the same artifact and channel conventions as every other TFX component. The component integrates without disrupting the rest of the pipeline and without requiring changes to existing components upstream or downstream.

Pluggable dataset function decouples evaluation from data format

Every multi-output model has a different feature schema, label structure, and data loading logic. Rather than baking in assumptions about data format, the component accepts input_fn_path — a dotted import path to a user-supplied function that returns a tf.data.Dataset. This means the executor handles all the TFX artifact resolution, model loading, and metric computation, while the user retains full control over how their data is read and prepared. Swapping the dataset function for a different split or schema requires only changing input_fn_path and input_fn_kwargs — no changes to the component itself.
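The call contract can be illustrated with plain-Python stand-ins. The function names and return values below are hypothetical; a real input function would build and return a tf.data.Dataset of (features, labels) pairs from the TFRecords pattern and the TFTransformOutput.

```python
# Two stand-in dataset builders a user might supply via input_fn_path.
def small_batches_fn(file_pattern, batch_size=32):
    return {"source": file_pattern, "batch_size": batch_size}

def large_batches_fn(file_pattern, batch_size=512):
    return {"source": file_pattern, "batch_size": batch_size}

def executor_build_dataset(input_fn, input_fn_kwargs):
    """What the executor does after resolving input_fn_path: it only knows
    the call contract, not the data format, and forwards the file pattern
    plus any extra kwargs unchanged."""
    return input_fn("eval-split-*.tfrecord", **input_fn_kwargs)

# Swapping the function or its kwargs changes the dataset, not the component:
ds_a = executor_build_dataset(small_batches_fn, {})
ds_b = executor_build_dataset(large_batches_fn, {"batch_size": 1024})
```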

TFMA-compatible output format enables zero-friction downstream integration

The evaluation.json artifact follows the TFMA key naming convention — per_output>>{name}>>{metric} and global>>{metric} — which means it can be consumed by any tool already built around TFMA output without format adaptation. The JSON is written to a standard ModelEvaluation artifact URI, accessible through the normal TFX output channel API. Downstream components, experiment tracking integrations, and custom reporting scripts can all read the results using standard JSON parsing — no TFMA library installation required on the reading side.

Tech Stack

Technology · Purpose
TensorFlow / Keras · SavedModel loading and inference over multi-output model predictions
TensorFlow Extended (TFX) · Pipeline component framework — artifact channels, executor base class, component spec
TF Transform (TFT) · TransformGraph loading via tft.TFTransformOutput for consistent preprocessing
TF Model Analysis (TFMA) · Output artifact format — TFMA-compatible JSON key convention for downstream compatibility
Python · Core language, importlib-based input function resolution, metric computation
PyPI · Package distribution — pip install tfx-moe

Results & Metrics

What the component delivers

Per-Output + Global

Dual-Level Metric Granularity

Output-by-output MSE and MAE alongside aggregated global scores — in a single evaluation pass

TFMA-Compatible

Drop-in JSON Artifact

Standardized key format consumed by downstream pipeline components and reporting tools without format conversion

PyPI

Published & MIT Licensed

Installable in one command · Open source · Free for any use

🔬

Per-output metrics expose what aggregate scores hide

A model with a strong global MSE can still be failing badly on specific outputs — particularly in multi-output regression tasks where different outputs operate at different scales or have different error tolerances. The component computes MSE and MAE independently for each output dimension, surfacing output-level performance problems that global aggregation masks. This granularity is essential for diagnosing model behavior before promotion to production and for identifying which outputs require further training attention.
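A toy calculation makes the masking effect concrete, assuming two hypothetical outputs where one is predicted perfectly and the other is always wrong by its full magnitude:

```python
# Output "a" (values ~100) is predicted perfectly; output "b" (values ~1)
# is off by 1.0 on every example, i.e. a 100% relative error.
preds  = [[100.0, 0.0], [200.0, 0.0]]
labels = [[100.0, 1.0], [200.0, 1.0]]

def mse(errors):
    return sum(e * e for e in errors) / len(errors)

mse_a = mse([p[0] - l[0] for p, l in zip(preds, labels)])   # 0.0
mse_b = mse([p[1] - l[1] for p, l in zip(preds, labels)])   # 1.0
mse_global = mse([p[i] - l[i] for p, l in zip(preds, labels) for i in range(2)])

# The global score (0.5) looks negligible next to output "a"'s scale,
# while the per-output view shows output "b" is completely wrong.
```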

🔗

Integrates directly into TFX pipelines without disrupting existing components

MultiOutputEvaluator conforms to TFX's artifact and channel conventions — it consumes standard Model, Examples, and TransformGraph channels and produces a standard ModelEvaluation artifact. Adding it to an existing pipeline requires only importing the component and wiring its inputs to existing upstream outputs. No changes to ExampleGen, Transform, Trainer, or any other component are needed. The component slots in as naturally as any built-in TFX component.

📄

TFMA-compatible JSON artifact — readable with standard Python, no library dependency

The evaluation.json artifact follows TFMA's key naming convention, making it compatible with tools already built around TFMA output. Crucially, reading the results requires only Python's built-in json module — no TFMA installation is needed on the consuming side. Results can be loaded directly into a pandas DataFrame for analysis, logged to MLflow or any experiment tracker, or passed to downstream pipeline components through the standard artifact URI interface.

🧩

Pluggable dataset function supports any data schema and label structure

The component imposes no constraints on how evaluation data is loaded or prepared. The input_fn_path parameter accepts a dotted import path to any function that returns a tf.data.Dataset of (features, labels) pairs — giving the practitioner full control over feature parsing, batching, and label alignment. Additional keyword arguments can be passed via input_fn_kwargs without modifying the component. This design means the same component works across entirely different datasets, schemas, and model architectures without any code changes to the component itself.

⚙️

Applied in production — used inside the AirflowTFX and KubeTFX pipelines

The component was developed to address a real evaluation gap encountered during the construction of the AirflowTFX and KubeTFX production pipeline projects — both of which involve multi-output regression models where per-output metric tracking is essential. MultiOutputEvaluator is used directly in those pipelines as the evaluation stage, replacing any need for post-hoc evaluation scripts outside the pipeline and keeping the entire ML lifecycle — including output-level model analysis — fully reproducible within the TFX framework.