Fundamentals of Machine Learning Model Evaluation


Evaluating machine learning models is a crucial part of building an intelligent system that brings value to a business through the application of various algorithms. Evaluation of a model's performance or accuracy is usually split into two kinds: offline and online.

Offline evaluation is done with data that is already available, using well-known methods such as splitting the data into training and test samples, cross-validating, and recording the performance on those sets. The goal is to measure the performance of a model on data points that are present in databases and to deploy the best possible model given what is known at the time of training. Online evaluation, on the other hand, is done with data that does not yet exist at training time. The goal is to measure the performance of a model on data points that are not yet available in databases. In other words, the deployed model affects users we do not know, and we would like to measure how well it works for them and for the business overall. The metrics used are heavily influenced by the problem the model is trying to solve and/or by business KPIs.

One such metric can be CTR (click-through rate) or CR (conversion rate). This is usually the case with implicit feedback data, such as that coming from a recommendation system (bought/not bought, viewed/not viewed, etc.). An example of a metric not tied to the business would be the continuous comparison of the predicted value of an item with its real, changing value, such as a predicted movie rating versus the constantly changing actual movie rating.

A good machine learning model should improve on both offline and online evaluation over time. Even though improving online performance metrics is considered a higher priority for any business, since it affects users and company KPIs directly, offline performance should not be disregarded, as this might lead to false positives when comparing models. The main reason is that online performance metrics can also be affected by external factors unknown to the model, such as better marketing research, more investment, or a smarter and more efficient change of strategy.

Consequently, the ultimate goal is to design a proper infrastructure to measure offline and online model performance. This becomes even more important if the company has a data-centric strategy and/or its value is strongly tied to the performance of its artificial intelligence systems. Thus, I would like to present a few fundamentals for evaluating machine learning models in production. Note that no external evaluation tools such as MLflow are needed. The evaluation results can be stored on any infrastructure of choice.

Offline Evaluation

Offline evaluation of a newly trained model is performed on available historical data that the model has not seen. The well-known way to meet this requirement is to split the available data into training and test samples, usually in a 75% to 25% or 70% to 30% ratio.

The model is trained on the training samples, and its performance is then evaluated by predicting the targets of the test samples and measuring the accuracy of those predictions against the observed values (e.g. predicted movie rating versus known movie rating). In addition to recording the model's performance on the training and test samples, the average cross-validation error should also be recorded if the hyper-parameters of the model were tuned.
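The split-and-measure procedure above can be sketched with nothing but the standard library. This is a minimal illustration under stated assumptions: the synthetic data, the toy threshold "model" and all function names (`train_test_split`, `error_rate`, `fit_threshold`, `cv_error_rate`) are hypothetical and stand in for whatever model and data a real project uses.

```python
import random

def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(predict, rows):
    """Fraction of (x, label) rows where predict(x) differs from label."""
    wrong = sum(1 for x, y in rows if predict(x) != y)
    return wrong / len(rows)

def fit_threshold(rows):
    """Toy model (assumption): pick the threshold on x with the lowest training error."""
    candidates = sorted({x for x, _ in rows})
    best = min(candidates, key=lambda t: error_rate(lambda x: int(x >= t), rows))
    return lambda x: int(x >= best)

def cv_error_rate(rows, fit, k=5):
    """Average validation error over k folds of the given rows."""
    folds = [rows[i::k] for i in range(k)]
    errors = []
    for i, fold in enumerate(folds):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        errors.append(error_rate(fit(train), fold))
    return sum(errors) / k

# Synthetic data (assumption): the label is 1 when x >= 50.
data = [(x, int(x >= 50)) for x in range(100)]
train, test = train_test_split(data, test_ratio=0.25)
model = fit_threshold(train)

stats = {
    "training_error_rate": error_rate(model, train),
    "test_error_rate": error_rate(model, test),
    "cv_error_rate": cv_error_rate(train, fit_threshold),
    "training_size": len(train),
    "test_size": len(test),
}
```

The cross-validation error is computed on the training samples only, so the test samples stay untouched until the final measurement.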

With training, test and possibly cross-validation error rates available, model accuracy and performance can already be estimated at this offline step. Those metrics should be recorded in a file, such as YAML or JSON, that lists all those values and is stored together with the model. In addition, it is good to record other metadata, such as the date of training, the number of training and test data points, and the tuned/tunable hyper-parameters. An example of a stats.yaml or stats.json file looks as follows:


training_error_rate: 0.15
test_error_rate: 0.25
cv_error_rate: 0.22
date: 20190220
model_id: movie_ratings_model_20190220
training_size: 30000
test_size: 10000
alpha: 0.2231
beta: 0.3212
gamma: 0.55

This file contains trackable and easily observable metadata of a model. The metadata helps in evaluating model performance and, most importantly, in comparing two different models over time.
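As a rough sketch of how such a file could be written next to a model and read back, the snippet below uses JSON from the standard library. The helper names `save_stats` and `load_stats`, and the directory layout (one folder per model id), are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

def save_stats(model_dir, stats):
    """Write the statistics dictionary to <model_dir>/stats.json."""
    path = Path(model_dir) / "stats.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(stats, indent=2, sort_keys=True))
    return path

def load_stats(model_dir):
    """Read the statistics dictionary stored next to a model."""
    return json.loads((Path(model_dir) / "stats.json").read_text())

# Values copied from the example file above.
stats = {
    "training_error_rate": 0.15,
    "test_error_rate": 0.25,
    "cv_error_rate": 0.22,
    "date": 20190220,
    "model_id": "movie_ratings_model_20190220",
    "training_size": 30000,
    "test_size": 10000,
    "alpha": 0.2231,
    "beta": 0.3212,
    "gamma": 0.55,
}

# A temporary directory stands in for the real model storage (assumption).
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / stats["model_id"]
    save_stats(model_dir, stats)
    restored = load_stats(model_dir)
```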

Whenever a new model is trained and becomes a candidate to replace the existing model (the current model), user-defined functions can be applied to compare the metadata of the current model with that of the newly trained model (the candidate model) and decide whether to substitute the current model with the candidate and archive the current model, or to keep the current model. Some of those user-defined functions can look as follows:
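As a minimal sketch of what such checks might look like (the function names and the criteria here are illustrative assumptions, not a prescribed set), each function takes the two statistics dictionaries and returns a boolean:

```python
def has_lower_test_error(current, candidate):
    """Candidate must beat the current model on the held-out test samples."""
    return candidate["test_error_rate"] < current["test_error_rate"]

def has_lower_cv_error(current, candidate):
    """Candidate must beat the current model on average cross-validation error."""
    return candidate["cv_error_rate"] < current["cv_error_rate"]

def is_not_more_overfit(current, candidate):
    """The gap between test and training error should not grow."""
    def gap(stats):
        return stats["test_error_rate"] - stats["training_error_rate"]
    return gap(candidate) <= gap(current)

def has_enough_test_data(current, candidate):
    """Candidate should be evaluated on at least as many test points."""
    return candidate["test_size"] >= current["test_size"]

CHECKS = [
    has_lower_test_error,
    has_lower_cv_error,
    is_not_more_overfit,
    has_enough_test_data,
]

def should_substitute(current, candidate, checks=CHECKS):
    """Green light only if every user-defined check passes."""
    return all(check(current, candidate) for check in checks)

# Current model: values from the example statistics file.
current = {"training_error_rate": 0.15, "test_error_rate": 0.25,
           "cv_error_rate": 0.22, "test_size": 10000}
# Candidate model: hypothetical values for illustration.
candidate = {"training_error_rate": 0.14, "test_error_rate": 0.22,
             "cv_error_rate": 0.20, "test_size": 12000}

decision = should_substitute(current, candidate)
```

Keeping each criterion in its own small function makes it easy to add, remove or relax a check without touching the decision logic.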


If all those user-defined functions pass and give a green light to publish/upload the candidate model, the current model can be substituted with the candidate model, and the current model can be archived together with its statistics file.

However, the results of offline evaluation should not be taken as absolute truth, since an improvement based only on offline evaluation does not necessarily mean that the model will perform better online. Rather, offline evaluation is a first step in deciding whether to consider a candidate model as a possible substitute, in case it is better than the current one on one or more offline metrics. After that step, the candidate model can be put through progressive online evaluation, where the final decision is made.

Online Evaluation

Online evaluation is a harder problem, since the data cannot be split into training and test samples. All data can be considered test data without a fixed ground truth, and the evaluation metrics depend on the problem the model is trying to solve and/or on business KPIs and goals. For example, when the model predicts movie ratings, it is necessary to track the predicted rating against the current movie rating over time, i.e. to measure how well the model is making recommendations. Since users usually do not rate items or leave explicit reviews, implicit methods should be developed. Most of those tracking methods are binary, i.e. a user clicked (1) or did not click (0) on an item, or a user bought (1) or did not buy (0) an item.

Let's take the example of a job recommendation system whose goal is to recommend the most appropriate jobs to users. Implicit feedback is collected per job for the users it was recommended to, for example:

job_1: [user1, user2, user3]
job_2: [user2, user5, user6]

Let's say user1 and user3 clicked on job_1, and user2 and user5 clicked on job_2. Then the click vectors will look as follows:

job_1: [1, 0, 1]
job_2: [1, 1, 0]

These vectors can be used to calculate the CTR (click-through rate) of a model, in other words the fraction of recommended jobs that users clicked on. Moreover, let's say user1 applied to job_1 and user5 applied to job_2. Then the conversion vectors will look as follows:

job_1: [1, 0, 0]
job_2: [0, 1, 0]

These vectors can be used to calculate a more important metric, CR (conversion rate), in other words the fraction of recommended jobs that users applied to. On aggregate, these metrics can be calculated with a well-known metric, Mean Average Precision (MAP).
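To make the arithmetic concrete, the sketch below computes CTR and CR from the vectors above as plain ratios of positive events to recommendations shown. It deliberately does not compute MAP, which would additionally need the ranked order of recommendations per user; the helper name `rate` is an assumption.

```python
# Click and conversion vectors from the job recommendation example.
clicks = {"job_1": [1, 0, 1], "job_2": [1, 1, 0]}
applications = {"job_1": [1, 0, 0], "job_2": [0, 1, 0]}

def rate(vectors):
    """Positive events divided by the total number of recommendations shown."""
    shown = sum(len(v) for v in vectors.values())
    positives = sum(sum(v) for v in vectors.values())
    return positives / shown

ctr = rate(clicks)        # 4 clicks out of 6 recommendations
cr = rate(applications)   # 2 applications out of 6 recommendations

# Per-job rates, useful for spotting items that drag the aggregate down.
ctr_per_job = {job: sum(v) / len(v) for job, v in clicks.items()}
```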

Another important metric, which is not straightforward to implement and not possible for some models, is ROI (return on investment): the difference between the revenue obtained from the model and its expenses, often expressed relative to those expenses. The expenses of a model usually include the cost of the virtual machines used to train it and evaluate results, and the cost of model maintenance.

Finally, it is necessary to keep track of the volume of data points (users, items, etc.) the model has affected. Other metrics can then be penalised by the inverse of that volume, i.e. the more data points a model has affected, the more credible other metrics such as CTR and CR are. The volume metric is also of paramount importance when setting up an A/B test to compare two models on two random user populations (control group (A) and test group (B)) over a specific time window.
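One common way to see how volume affects credibility in an A/B test is a two-proportion z-test on the CTR of the two groups. The article does not prescribe a specific test, so this is a swapped-in technique shown under that assumption; the function name and the click counts are hypothetical.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for the difference in CTR between B and A."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Same CTRs (10% for A, 15% for B), but very different volumes.
z_small, p_small = two_proportion_z_test(10, 100, 15, 100)
z_large, p_large = two_proportion_z_test(1000, 10000, 1500, 10000)
```

With 100 users per group the difference is not statistically significant at the usual 5% level, while with 10,000 users per group the same CTRs give a very small p-value.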

It is important to understand that those metrics are just a general guideline and not necessarily a de facto standard for evaluating online model performance. However, each of them makes sense for most business problems, since it helps to directly measure the impact of the model on the business.

All those metrics need to be stored as a vector per unique model id so that the performance of a model can be evaluated continuously. An example vector looks as follows:

[MODEL_ID, CTR, CR, ROI (optional), VOLUME]
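A lightweight way to hold such vectors in code is a named tuple per measurement, appended to a history keyed by model id. The class name `OnlineMetrics`, the `record` helper and the metric values below are assumptions for illustration; any storage backend could replace the in-memory dictionary.

```python
from collections import defaultdict
from typing import NamedTuple, Optional

class OnlineMetrics(NamedTuple):
    model_id: str
    ctr: float
    cr: float
    roi: Optional[float]  # optional: not every model has a measurable ROI
    volume: int

# In-memory history of measurements per model id (assumption).
history = defaultdict(list)

def record(metrics):
    """Append one measurement to the history of its model."""
    history[metrics.model_id].append(metrics)

# Hypothetical measurements taken at two points in time.
record(OnlineMetrics("movie_ratings_model_20190220", 0.12, 0.03, None, 5000))
record(OnlineMetrics("movie_ratings_model_20190220", 0.14, 0.04, None, 9000))

latest = history["movie_ratings_model_20190220"][-1]
```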

Not all of those metrics are necessarily relevant for every model. Rather, the choice of online metrics is problem-specific. For example, there are cases where CTR is not needed at all or is a weak metric. Thus, the metrics of online evaluation should be aligned with business objectives.

Model Substitution

Since the ultimate goal is to have a model in production that improves over time and delivers the best results, the current (production) model should be substituted whenever a better candidate model is available. The candidate model should perform better on both offline and online metrics, or at least perform similarly. When a candidate model does not improve on one set of metrics but delivers similar results, the current model can still be substituted for other reasons (e.g. the candidate model has lower complexity). Keep in mind that the candidate model is trained on fresher data, whose trends and patterns might change over time, and its performance may change with them.


Cover image: Blue vector created by vectorjuice