Context
About 20 years ago, the software development industry got serious about continuous release of online software, characterized by frequent deployments, the ability to roll back or roll forward bad changes, and a pipeline of automated tests. Today, cutting-edge software teams are learning to continuously release machine learning models. This memo describes a north star vision for an engineering system that enables frequent model deployments.
Complexity and Complications
If this were easy and already a solved problem, there would be no need for this memo. Let’s get into the systems and the automation workflows that tie them together.
Systems
Transaction System
Here we have a traditional distributed system that runs some web-scale service. It deploys with a classic CI/CD system that runs tests on every change and generates release candidates from many different people’s merged changes. This system may also call a machine learning model in the middle of a transaction. For our purposes, this is ML in the “online” transaction pipeline.
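As a minimal sketch, an online call might look like the following, assuming a hypothetical HTTP model endpoint and payload shape. The tight timeout and neutral fallback keep a slow or failing model from blocking the transaction.

```python
import json
import urllib.request

MODEL_ENDPOINT = "http://inference.internal/v1/predict"  # hypothetical URL

def score_transaction(txn: dict) -> float:
    """Ask the model service for a score; fall back rather than block."""
    payload = json.dumps({"features": txn}).encode("utf-8")
    request = urllib.request.Request(
        MODEL_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        # Tight timeout: a slow model must not stall the transaction.
        with urllib.request.urlopen(request, timeout=0.2) as response:
            return json.loads(response.read())["score"]
    except Exception:
        return 0.5  # neutral fallback keeps the transaction flowing
```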
Data Warehouse
A copy of the data in the transaction system flows through a data pipeline to a data lake. A data lake is essentially an analytics database backed by a distributed storage system that scales relatively cheaply to very high storage volumes. Another service may post-process this data with a machine learning model. For our purposes, this is ML in the “offline” pipeline.
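A sketch of the offline pattern, using sqlite3 as a stand-in for the data lake and a dummy score() in place of a real model call; it assumes a hypothetical events(event_id, payload, day) table. The job reads one partition, scores each record, and writes predictions to a companion table.

```python
import sqlite3

def score(payload: str) -> float:  # hypothetical model stand-in
    return min(len(payload) / 100.0, 1.0)

def run_offline_scoring(db_path: str, day: str) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS event_scores (event_id, day, score)")
    # Read one day's partition from the "lake".
    rows = conn.execute(
        "SELECT event_id, payload FROM events WHERE day = ?", (day,)
    ).fetchall()
    # Write predictions back for downstream analytics.
    conn.executemany(
        "INSERT INTO event_scores VALUES (?, ?, ?)",
        [(event_id, day, score(payload)) for event_id, payload in rows],
    )
    conn.commit()
    conn.close()
```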
Training System
This is where machine learning models are trained. These machines can perform model fine-tuning, distillation, quantization, encoding, decoding, and other complex mathematical operations. They are typically GPU systems but can also be CPU-only.
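To make the idea concrete, here is a toy training run: gradient descent on a one-parameter linear model. Real jobs would use a framework such as PyTorch or JAX on the GPU fleet, but the shape is the same: load data, iterate, emit a trained artifact.

```python
def train(data, epochs=100, lr=0.01):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w  # the trained "model artifact"

weights = train([(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)])
print(f"learned weight: {weights:.3f}")  # approximately 2.0
```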
Inference System
This is the runtime where the machine learning models are deployed. These can be CPU-only, GPU, or tensor-processing systems, depending on the type of model. The model exposes an API service so the online or offline pipeline can request predictions, supplying input data and/or a prompt and receiving output data in return.
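A minimal serving sketch using only the standard library; a production system would use a serving framework, but the contract is the same: POST input data, get a prediction back. The payload shape and the hard-coded predict() are assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features: dict) -> dict:  # hypothetical model stand-in
    return {"score": 0.5, "model_version": "2024-01-01.1"}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and run inference.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = predict(json.loads(body))
        payload = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()
```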
Sampling System
This system iterates over the data warehouse to identify interesting data for training. The data pipeline can attach metadata to interesting records, such as records where the model prediction or the post-processing system produced a known error. These records can be used in re-training.
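A sketch of one sampling policy: always keep records whose metadata marks a known prediction error, plus a thin random slice of everything else so the training set is not all hard cases. The "prediction_error" metadata key is a hypothetical convention, not a standard.

```python
import random

def sample_for_training(records, error_key="prediction_error", base_rate=0.01):
    """Select records for the next training set."""
    sample = []
    for record in records:
        if record.get(error_key):          # always keep known failures
            sample.append(record)
        elif random.random() < base_rate:  # thin slice of normal traffic
            sample.append(record)
    return sample
```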
Labeling System
For some types of training, data needs to be tagged and labeled by a human or a machine. This could be a classification of the data against a known taxonomy, or a tagging of specific spans in the data that contain interesting features. These features can be mapped to attributes, and the positional labels can then be extracted. The attributes may or may not be mapped into a schema that describes some business object (e.g. a document classified as a resume may have a person and many experiences to extract).
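A sketch of the span-label idea: each label marks a character span in the text plus the schema attribute it maps to, and extraction is just slicing the spans into attribute/value pairs. The person/experience attribute names mirror the resume example above and are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class SpanLabel:
    start: int      # character offset where the span begins
    end: int        # character offset where the span ends (exclusive)
    attribute: str  # schema attribute this span maps to

def extract(text: str, labels: list[SpanLabel]) -> dict:
    """Turn positional labels into attribute -> value pairs."""
    extracted: dict[str, list[str]] = {}
    for label in labels:
        extracted.setdefault(label.attribute, []).append(text[label.start:label.end])
    return extracted

resume = "Jane Doe. Senior Engineer at Acme, 2019-2023."
labels = [SpanLabel(0, 8, "person.name"), SpanLabel(10, 33, "experience.title")]
print(extract(resume, labels))
```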
Validation System
Think of this as a test harness that evaluates the performance of a newly trained machine learning model before it is fully operational in production. This system compares the results of the new model to the old model, so an informed judgment can be made about its suitability for release.
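A minimal comparison harness, with models as plain callables and accuracy as the only metric; a real harness would also compare latency, calibration, and per-segment results before recommending a release.

```python
def validate(old_model, new_model, eval_set):
    """eval_set is a list of (input, expected_output) pairs."""
    def accuracy(model):
        hits = sum(1 for x, y in eval_set if model(x) == y)
        return hits / len(eval_set)

    old_acc, new_acc = accuracy(old_model), accuracy(new_model)
    # Naive release gate: the new model must not regress overall accuracy.
    return {"old": old_acc, "new": new_acc, "ship": new_acc >= old_acc}
```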
Automation Workflow
Pre-Training
Data is automatically sampled from the data warehouse and copied to a training set
Synthetic data sets are automatically generated by a machine, for training
Data sets are automatically labeled by a machine, for training
Data sets are automatically labeled and manually corrected by a human, for training
Labels are automatically extracted, for training
Changes to schemas are automatically versioned
Changes to data sets (records or labels) are automatically versioned
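The last two items, versioned schemas and data sets, can be as simple as content-addressing: hash each snapshot so any change to records or labels yields a new immutable version id. A minimal sketch, assuming JSON-serializable records; real systems would reach for a tool like DVC or lakeFS, but the principle is the same.

```python
import hashlib
import json

def dataset_version(records: list[dict]) -> str:
    """Derive an immutable version id from the data set's content."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

v1 = dataset_version([{"text": "Jane Doe", "label": "person.name"}])
v2 = dataset_version([{"text": "Jane Doe", "label": "person"}])  # label edit
assert v1 != v2  # any record or label change produces a new version
```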
Training
Training jobs are automatically queued and slotted into a schedule to make full use of the training servers
Features, or transformed data points, are automatically saved and versioned
Notebooks on training hosts are automatically saved to version control
Hyperparameters used in training are automatically saved and timestamped
Completed training runs are automatically deployed in a model serving container, and versioned in the artifact repository (see the sketch after this list)
Evaluation reports on test data are automatically saved and versioned
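A sketch of capturing a run's metadata: the hyperparameters, a timestamp, and the artifact's hash are written next to the model so the run is reproducible and the registry entry is self-describing. The directory layout and file names are assumptions.

```python
import hashlib
import json
import time
from pathlib import Path

def register_run(model_bytes: bytes, hyperparameters: dict, registry: Path) -> Path:
    """Store a trained artifact plus its run metadata under a content hash."""
    digest = hashlib.sha256(model_bytes).hexdigest()[:12]
    run_dir = registry / digest
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "model.bin").write_bytes(model_bytes)
    (run_dir / "run.json").write_text(json.dumps({
        "hyperparameters": hyperparameters,
        "trained_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "artifact_sha256": digest,
    }, indent=2))
    return run_dir

register_run(b"\x00weights", {"lr": 0.01, "epochs": 100}, Path("registry"))
```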
Post-Training
Validation on candidate models is automatically re-run, and the report saved, when the model artifact is added to the registry
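A sketch of that trigger, following the registry layout assumed above: poll for new artifacts and run validation (passed in as a callable, e.g. the harness from the Validation System section) when one appears, saving the report beside the artifact. A real system would use registry webhooks rather than polling.

```python
import json
import time
from pathlib import Path

def watch_registry(registry: Path, run_validation, interval: float = 30.0):
    """Re-run validation whenever a new model artifact lands in the registry."""
    seen: set[str] = set()
    while True:
        for run_dir in registry.iterdir():
            if run_dir.name in seen or not (run_dir / "model.bin").exists():
                continue
            # Compare the candidate against the current production model.
            report = run_validation(run_dir / "model.bin")
            (run_dir / "validation.json").write_text(json.dumps(report))
            seen.add(run_dir.name)
        time.sleep(interval)
```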
Deployment
The production system can automatically roll back to the previous version of a model
Saved predictions from the models are automatically tagged with the model version, for traceability
Runtime errors are automatically collected in a data set with the original input data and the failed prediction
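A sketch of the last two items together: every prediction is tagged with the serving model's version, and runtime failures are appended to a data set with the original input, ready to feed back into the sampling system. The version string and log file layout are assumptions.

```python
import json

MODEL_VERSION = "2024-01-01.1"  # hypothetical; stamped into the container at build

def serve(features: dict, model, failure_log: str = "failed_predictions.jsonl") -> dict:
    prediction = None
    try:
        prediction = model(features)
        # Tag every successful prediction with the model version for traceability.
        return {"prediction": prediction, "model_version": MODEL_VERSION}
    except Exception as error:
        # Failures join a data set that the sampling system can pick up later.
        with open(failure_log, "a") as log:
            log.write(json.dumps({
                "input": features,
                "prediction": prediction,  # None if inference itself failed
                "error": str(error),
                "model_version": MODEL_VERSION,
            }) + "\n")
        raise
```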