Coffee Break Series: High-level ML system design

Got a coffee break? Great! Interested in learning how to approach an ML problem while sipping your coffee? We are here with an article that gives a high-level introduction to ML system design.

Feb 17, 2021

Wikipedia defines systems design as the process of defining the architecture, modules, interfaces, and data for a system to satisfy specified requirements.

Machine learning system design is the high-level blueprint of the data infrastructure, machine learning components, hardware and interfaces that make up a machine learning system.

The goal is to impart the following attributes to the machine learning system:

Scalable: Ability to handle growth in the number of transactions or the volume of data
Reliable: Ability to perform consistently according to its specifications
Maintainable: Ability of the system to support changes

Machine learning system design helps you to get a top-down view of your requirements and dependencies before you dive deeper into the intricacies.

After reading this post, you will be able to:

Understand and Define Machine learning System Design

Let’s get started before your coffee gets cold!

Steps in ML System Design

Alright, let’s see what this picture talks about and visit every step in detail:

Step 1: Problem statement

You can only find solutions to the problems you define accurately. The first step is usually the most important one as it helps you answer the “why?”.

Why do you want to design this system?

For example, while creating a LinkedIn feed, the problem statement can be expressed as “Create a personalized feed for every user to maximize engagement”.

Step 2: Identify metrics

Once the problem statement is defined, the next step is to decide the metrics on which the evaluation will be done.

In the ML world, statistical metrics such as RMSE (Root Mean Squared Error) and MAPE (Mean Absolute Percentage Error) can be used for regression problems, while precision and recall, AUC and the confusion matrix can be used for classification problems.
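To make these metrics concrete, here is a minimal sketch in plain Python. The function names (rmse, mape, precision_recall) are our own illustrative choices, not part of any particular library, and in practice you would likely reach for an established metrics package instead.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent.

    Assumes no true value is zero, since the ratio is undefined there.
    """
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

RMSE penalizes large errors more heavily than MAPE does, while MAPE is easier to explain to a business audience because it is a percentage.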

It is equally important to identify the business metrics (for example, KPIs and KBIs) that the model is effectively optimizing, i.e. the “y” variable. Going back to the LinkedIn example, one measure of engagement on LinkedIn is click-through rate (CTR). It can be expressed as:

CTR = number of clicks / number of times the content is shown
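As a quick illustration, the CTR formula translates into a one-line function. The name click_through_rate and the choice to return 0.0 when nothing has been shown yet are our assumptions for this sketch.

```python
def click_through_rate(clicks, impressions):
    """CTR = clicks / impressions, where impressions is how often the content was shown."""
    if impressions == 0:
        # Assumption: content that was never shown gets a CTR of 0 rather than an error.
        return 0.0
    return clicks / impressions
```

For example, a post shown 1,000 times and clicked 30 times has a CTR of 3%.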

Step 3: Identify requirements

The requirements in an ML system can be two-fold:

Training requirements: Everything needed to train a model, such as collecting, cleaning and transforming the data, comes under training requirements. If the classes in a classification problem are imbalanced, it is important to resample the data and use appropriate metrics.
Inferencing requirements: The deployed model must have low latency and should be able to serve many users at the same time. In our LinkedIn system, it is desirable that the feed is refreshed regularly so that users encounter fresh content with each refresh.
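The resampling mentioned under training requirements can take several forms. Below is a minimal sketch of one of them, random oversampling, which duplicates randomly chosen minority-class rows until every class is as large as the biggest one. The function name and signature are hypothetical, and this is only one possible resampling strategy.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels
```

Resampling is normally applied to the training split only, so that the evaluation data still reflects the real class distribution.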

Step 4: Train and validate

This is where the majority of the machine learning work is involved. You have collected the data; now it is time for feature engineering, feature selection and choosing the right model to train on the data. Feature engineering is very important, as it can increase the accuracy of your system while keeping the computation feasible.

The choice of model depends on the requirements. A few questions to answer:

Does each input have a corresponding output variable (a label)?
Is your data continuous or discrete?
What is the size of your training data set?
Do you need a shorter training time, or can you afford a longer one?

Validation is an important step. The metrics can be evaluated in both offline and online environments so that the model performs equally well in production with real-world data.

Offline evaluation: This can be done using train-test splits, measuring evaluation metrics such as log loss, MAE and R-squared on the held-out data. The model can then be tweaked further to reduce error and increase accuracy.
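Here is a small sketch of what offline evaluation can look like for a regression problem: hold out part of the data, fit on the rest, and measure MAE and R-squared on the held-out part. The one-feature least-squares line stands in for whatever model you actually train, and all function names are illustrative assumptions.

```python
import random


def train_test_split(xs, ys, test_fraction=0.25, seed=0):
    """Shuffle indices and hold out a fraction of the data for testing."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    return ([xs[i] for i in train], [ys[i] for i in train],
            [xs[i] for i in test], [ys[i] for i in test])


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (a stand-in model)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data that follows y = 2x + 1 exactly, so the fit should be near perfect.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
x_train, y_train, x_test, y_test = train_test_split(xs, ys)
slope, intercept = fit_line(x_train, y_train)
predictions = [slope * x + intercept for x in x_test]
print("MAE:", mae(y_test, predictions), "R^2:", r_squared(y_test, predictions))
```

The key point is that the metrics are computed only on rows the model never saw during fitting.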

Online evaluation: This is when the model is put in production and tested on real-world data. It is initially exposed to a small percentage of real traffic, and if business metrics such as user engagement improve, the model is rolled out to a higher percentage. Here, A/B testing is useful.
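A rough sketch of the two pieces involved in such a rollout follows: deciding which users see the new model, and checking whether the difference in CTR between the two groups is likely to be real. The hash-based bucketing and the two-proportion z-test are common choices that we assume here for illustration; the function names are hypothetical and nothing below describes how any specific company runs its experiments.

```python
import hashlib
import math


def in_treatment(user_id, rollout_percent):
    """Deterministically place a user in one of 100 buckets and expose the
    new model to the first rollout_percent of them."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Compare the CTR of control (a) and treatment (b).

    Returns the z statistic and a two-sided p-value under the normal approximation.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value
```

Hashing the user ID keeps each user in the same group across visits, and raising rollout_percent only ever adds users to the treatment group.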

Step 5: Design High-Level System

The goal in this step is to design a working model which can describe the entire workflow of the project. It is in this step that the individual components are decided as per their utility and it is determined how data will flow through the components.

In the LinkedIn example, when a user visits their homepage, LinkedIn sends a request to the application server. After the features for the model are fetched, content is recommended to the user based on the predicted CTR.
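The flow above (request, feature lookup, CTR prediction, recommendation) can be sketched in a few lines. Everything here is a toy assumption for illustration: the feature names, the weights, the logistic scoring function and the rank_feed helper are hypothetical and do not describe LinkedIn's real system.

```python
import math


def predict_ctr(features, weights, bias):
    """Toy logistic model: map a weighted sum of features to a probability of a click."""
    score = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def rank_feed(user_id, candidate_posts, feature_lookup, weights, bias, top_k=2):
    """Score every candidate post for this user and return the top_k by predicted CTR."""
    scored = []
    for post_id in candidate_posts:
        features = feature_lookup(user_id, post_id)  # e.g. a call to a feature store
        scored.append((predict_ctr(features, weights, bias), post_id))
    scored.sort(reverse=True)  # highest predicted CTR first
    return [post_id for _, post_id in scored[:top_k]]


# Hypothetical precomputed features for one user and three candidate posts.
TOY_FEATURES = {
    ("alice", "p1"): {"author_followed": 1.0, "topic_match": 0.9},
    ("alice", "p2"): {"author_followed": 0.0, "topic_match": 0.2},
    ("alice", "p3"): {"author_followed": 0.0, "topic_match": 0.8},
}
TOY_WEIGHTS = {"author_followed": 1.5, "topic_match": 2.0}


def toy_lookup(user_id, post_id):
    return TOY_FEATURES[(user_id, post_id)]


feed = rank_feed("alice", ["p1", "p2", "p3"], toy_lookup, TOY_WEIGHTS, bias=-3.0)
print(feed)
```

Keeping feature lookup, prediction and ranking as separate components makes it easier to decide later which of them needs to be scaled or replaced.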

Step 6: Scale the Design

LinkedIn has millions of users worldwide who are creating, sharing and interacting with content every passing second. If scalability is not accounted for, then:

Training time will be unimaginably huge
Memory requirement will be pretty high
The cost associated with the project will touch the sky!

I am sure LinkedIn does not want that (no company wants that in fact!)

Clearly, scaling is our friend here. It can make computation budget-friendly while preserving good accuracy, which makes the entire project viable. It also helps to automate the entire process, reducing human touchpoints.

The million-dollar question: How do we scale the system design?

Well, it depends on the project, but a few ways to do it are:

Choose the right framework and programming language
Choose the right processor depending on the requirement
Choose the right hardware
Put real effort into the database management system

I hope you have finished the coffee by now. Great! You have completed the article as well. In this article, we saw ML system design from a bird’s eye view and understood how to approach an ML problem.

Let’s meet when you grab the next cup of coffee. Do not forget to share feedback on this article and share it with other coffee-lovers :)

CafeIO
