The workflow behind a machine learning project is crucial to the project's success. In this article, we will go over some of the essential aspects of an ML workflow: how a standard ML workflow works, the different types of algorithms available, and some of the best practices and tools available.
What is a machine learning workflow?
A machine learning workflow defines the steps involved in executing an ML project.
These steps include:
Data Collection
Data Pre-processing
Building Datasets
Model Training or Selection
Model Deployment
Prediction
Monitoring Models
Maintenance, Diagnosis, and Retraining
While the above is a typical machine learning workflow, a lot depends on the project's scope. So keep your workflow flexible: start small and scale up to production-ready projects.
ML automation, known as AutoML, is available for some parts of the workflow, such as model selection and feature selection. In general, though, its coverage is limited.
What is the goal of machine learning workflow?
The goal of a machine learning workflow is to ensure the project's successful execution. But it is easier said than done.
For instance, take a look at the sample machine learning workflow diagram below:
It shows the entire lifecycle of a general machine learning workflow; let me go into detail about each of its steps.
Different Steps in a Machine Learning workflow
First stage - Defining the problem:
Before any ML project, the questions that need answers are:
'What is the problem I'm trying to solve here?' and 'Is ML the right approach for this problem?'
A well-defined problem allows you to choose the right approach to the solution. Keep in mind that machine learning depends on data, and there is no universal answer to how much data is enough.
The type of problem you decide to take on also determines the kind of data you need for that particular problem. For example, if you're trying to solve an IoT problem, you may need to work with real-time data.
What's next?
Second stage - Data collection and preprocessing:
So far, we've identified our problem statement, defined our project scope, and decided whether ML is the right approach.
Assuming we go the ML path, the next obvious step is to collect data. Data collection is the most crucial step in the entire process because the data's quality is the key. You must retrieve, clean, and prepare the data from the source.
Retrieving the data: Data flows into businesses daily and sits in databases or third-party applications. You can also use public data repositories. Pull all of this data into one single repository.
Clean the data: A large amount of data is not the same as clean data. You must remove duplicates, correct errors, impute missing values, and structure the data. You also need to delete garbage data, filter out unwanted noise, and remove any incorrect information.
Prepare the data: After cleaning, you need to convert the data into a model-ready format. You must split your data into training, validation, and test sets. If required, you might combine several attributes into one and label the data.
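As a rough illustration of the cleaning step, here is a minimal sketch in plain Python. The record layout, the `temp` field, and the `clean_records` helper are hypothetical, and filling gaps with the mean is only one of several possible imputation strategies.

```python
from statistics import mean

def clean_records(records, numeric_field):
    """Drop exact duplicates, then fill missing numeric values with the mean."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))
    # Mean of the values that are present (assumption: mean imputation is acceptable).
    known = [r[numeric_field] for r in unique if r[numeric_field] is not None]
    fill = mean(known) if known else 0.0
    for r in unique:
        if r[numeric_field] is None:
            r[numeric_field] = fill
    return unique

# Hypothetical raw data with one duplicate and one missing value.
raw = [
    {"id": 1, "temp": 20.0},
    {"id": 1, "temp": 20.0},  # exact duplicate
    {"id": 2, "temp": None},  # missing value
    {"id": 3, "temp": 30.0},
]
cleaned = clean_records(raw, "temp")
```

In a real project, a data-frame library would usually handle this, but the logic stays the same: deduplicate first, then impute.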
After data collection and cleaning, build the datasets needed for training.
Some of the datasets you require are:
Training set: The dataset the model learns from by fitting its parameters.
Validation set: Used to verify the model's accuracy and fine-tune the parameters for better results.
Test set: Held out to test the performance of the final model and to reveal any bugs or mis-training.
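The three datasets above can be produced with a simple shuffled split. The sketch below assumes a 70/15/15 ratio and a fixed seed; both are illustrative choices, not recommendations from this article, and `split_dataset` is a hypothetical helper.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the rows and cut them into train, validation, and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]      # whatever remains becomes the test set
    return train, val, test

train, val, test = split_dataset(range(100))
```

Shuffling before splitting matters when the source data is ordered, for example by date or by class.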
After the data collection and preprocessing stage, we move on to training.
Third stage - Training:
This is the exciting part of the workflow: you develop, train, validate, and test your model. A good practice is to choose an algorithm based on the resources available, such as people, computational capabilities, and hardware.
There are many ML algorithms, and they fall under three different types of machine learning techniques.
They are:
Supervised learning:
Labeled data is fed to an algorithm along with the expected outcome. Because the targets are known, training happens in a controlled setting where deviation is easy to measure, and the accuracy of the result is generally higher.
Unsupervised learning:
We feed unlabelled data to an algorithm, which looks for patterns on its own without a pre-defined outcome. This is more complex, and the results are generally harder to verify and less accurate.
Reinforcement learning:
The algorithm learns by being rewarded for good behavior and penalized for bad behavior. Data labeling is not required.
What about the actual algorithms? Below, I am going through some popular algorithms and their use cases.
Supervised learning is further divided into classification and regression.
There are several algorithms we can put to use in supervised learning, such as:
Classification:
K-Nearest Neighbor
Naive Bayes
Decision Trees/Random Forest
Support Vector Machine
Logistic Regression
Regression:
Linear Regression
Support Vector Regression
Decision Trees/Random Forest
Gaussian Processes Regression
Ensemble Methods
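To make the regression side of supervised learning concrete, here is a minimal sketch of simple linear regression fitted with the closed-form least-squares formulas. The data points are made up for illustration, and `fit_line` is a hypothetical helper rather than a library function.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Labeled examples: every input x comes with its expected output y (here y = 2x + 1).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict for an unseen input
```

The labeled pairs are what make this supervised: the model is fitted against known answers and then applied to inputs it has not seen.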
Unsupervised learning is categorized into clustering and association. Several algorithms are available here as well.
Clustering:
Gaussian mixtures
K-Means Clustering
Hierarchical Clustering
Spectral Clustering
Association:
AIS
SETM
Apriori
Latter
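As an example from the clustering family, the sketch below implements a bare-bones K-Means in plain Python. The sample points are invented, and the evenly spaced initialization is a simplification chosen to keep the example deterministic; practical implementations typically use randomized schemes such as k-means++.

```python
def kmeans(points, k, iterations=20):
    """Group points into k clusters by repeatedly reassigning and re-centering."""
    # Simplified initialization: pick k points spread evenly through the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabelled data: two visibly separate groups, but no labels are provided.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(points, k=2)
```

Note that the algorithm never sees a label. It only discovers that the points fall into two groups.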
Reinforcement learning is categorized into value-based, policy-based, and model-based.
Below are the algorithms used in reinforcement learning.
Monte Carlo
Q-learning
SARSA
Q-learning - Lambda
SARSA - Lambda
DQN - Deep Q Network
DDPG - Deep Deterministic Policy Gradient
A3C - Asynchronous Advantage Actor-Critic Algorithm
NAF - Q-Learning with Normalized Advantage Functions
TRPO - Trust Region Policy Optimization
PPO - Proximal Policy Optimization
TD3 - Twin Delayed Deep Deterministic Policy Gradient
SAC - Soft Actor-Critic
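To show what reward-driven learning looks like in code, here is a minimal sketch of tabular Q-learning from the list above. The five-cell corridor environment, the reward of 1.0 at the goal, and all hyperparameter values are assumptions made up for this illustration.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)  # index 0 moves left, index 1 moves right

def step(state, action):
    """Hypothetical environment: move along the corridor, reward 1.0 at the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train_q_table(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                a = rng.randrange(2)  # explore
            else:
                # Exploit the best known action, breaking ties randomly.
                a = max(range(2), key=lambda i: (q[state][i], rng.random()))
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move toward reward plus discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q

q_table = train_q_table()
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
```

No labeled data is involved: the agent only receives a reward when it reaches the goal, and the learned table ends up preferring the action that leads there.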
Based on your project definition, your solution can depend on any of the above algorithms. Let's get into training the model with the said algorithm.
Train the model on the training set. Then tune its parameters using the validation set, and finally test its performance using the test set. Libraries such as TensorFlow, scikit-learn, and PyTorch provide implementations of many of these algorithms.
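The train, tune, and test sequence can be sketched as follows. This example uses a hand-written K-Nearest Neighbor classifier on synthetic data rather than any of the libraries named above; the data, the candidate values of k, and the helper names are all assumptions for illustration.

```python
import random
from collections import Counter

def knn_predict(train, point, k):
    """Predict the majority label among the k nearest training rows."""
    neighbors = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point))
    )[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def accuracy(train, rows, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in rows)
    return hits / len(rows)

# Synthetic two-class data: class "a" around (0, 0), class "b" around (5, 5).
rng = random.Random(0)
data = [((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), label)
        for label, c in (("a", 0.0), ("b", 5.0)) for _ in range(50)]
rng.shuffle(data)
train_set, val_set, test_set = data[:60], data[60:80], data[80:]

# Tune k on the validation set only, then report once on the untouched test set.
best_k = max((1, 3, 5), key=lambda k: accuracy(train_set, val_set, k))
test_accuracy = accuracy(train_set, test_set, best_k)
```

Keeping the test set out of the tuning loop is the point of the exercise: it gives an estimate of performance on data the model selection never touched.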
Once we have our model trained, validated, and tested, we move on to the next stage, evaluation.
Fourth stage - Evaluation:
When the model gives near-accurate results on your test data, we move on to the evaluation stage. Here we identify the model that performs best. From here on, we can retrain, adjust the parameters, or deploy the model into production.
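"Near-accurate" is usually quantified with metrics. The sketch below computes accuracy, precision, recall, and F1 for a binary classifier; the label lists are invented, and `evaluate_binary` is a hypothetical helper.

```python
def evaluate_binary(y_true, y_pred, positive=1):
    """Compute common classification metrics from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical test-set labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = evaluate_binary(y_true, y_pred)
```

Comparing these numbers across candidate models is one way to decide whether to retrain, adjust parameters, or deploy.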
Fifth Stage - Deployment:
After the evaluation stage, your model is generally a proof of concept. Now you need to convert that proof of concept into an actual viable product. Getting the model into production also gives it the opportunity to learn and relearn from real data.
Some of the ways you can make sure to improve your model in production are:
A/B testing: Compare the performance of the existing process or system with that of your current model. You can improve the performance of your model based on the result.
Machine learning APIs: The best way to communicate with data sources and other services is through APIs. This is very important if you're planning to offer your model as a service or product to others.
Detailed documentation: Good documentation goes a long way with any tool or service. It should cover how to use your model, what results to expect, and where to access them.
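For the A/B testing item above, one common way to compare the existing system with the new model is a two-proportion z-test on a success metric. The sketch below is one possible approach, not the only one, and the conversion counts are invented.

```python
from math import erf, sqrt

def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return the z statistic and two-sided p-value for B's rate versus A's rate."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, expressed through erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical results: A is the existing system, B is the new model.
z, p_value = two_proportion_z_test(success_a=120, total_a=1000,
                                   success_b=150, total_b=1000)
```

A small p-value suggests the difference between the two variants is unlikely to be due to chance alone. The threshold you act on is a business decision.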
Sixth and final stage - Prediction, Maintenance, Diagnosis, and Retraining:
This is where all the hard work bears fruit. Real-time data feeds into the model, and it starts sending predictions back to you. To improve performance, track your model, log errors, run diagnoses, and retrain it.
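Tracking a deployed model can start very simply. The sketch below keeps a rolling window of prediction outcomes and flags the model for retraining when accuracy drops below a threshold. The `AccuracyMonitor` class, the window size, and the threshold are hypothetical choices, and the approach assumes the true outcomes eventually become available.

```python
from collections import deque

class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag degradation."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically
        self.threshold = threshold

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    @property
    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_retraining(self):
        # Only judge once the window is full, to avoid reacting to a few samples.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy < self.threshold

monitor = AccuracyMonitor(window=5, threshold=0.8)
for prediction, actual in [(1, 1), (0, 0), (1, 0), (0, 1), (1, 1)]:
    monitor.record(prediction, actual)
```

In this made-up run, three of the five predictions are correct, so the monitor reports that retraining is due.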
We've gone through the entire six-stage process in a machine learning workflow; now, let's discuss best practices.
Best practices in a Machine Learning workflow
Below are some of the best practices you can follow in a machine learning workflow for your project.
Naming conventions:
We all know how important naming is: consistent names give you the ability to go back to a particular point in your workflow.
Code quality checks:
Writing code is tough, but you must go back and run code checks to ensure everything works.
Experiments:
We keep adding to the model: feature after feature, parameter after parameter, and more. Keeping track of your experiments lets you know their status and compare their results.
Data validation:
Unlike your training or test data, production data will look different. So, we need to detect data drift when it occurs and ensure that the model is fed reliable data.
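A very simple drift check compares the mean of incoming values against the training distribution. The sketch below is one basic heuristic among many; the feature values and the idea of alerting above a chosen score are assumptions for illustration.

```python
from statistics import mean, stdev

def drift_score(train_values, live_values):
    """Shift of the live mean, measured in training standard deviations."""
    return abs(mean(live_values) - mean(train_values)) / stdev(train_values)

# Hypothetical values of one feature at training time and in production.
train_values = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
stable_batch = [10.2, 9.8, 10.1]
shifted_batch = [14.0, 15.0, 13.5]

stable_score = drift_score(train_values, stable_batch)    # small: looks like training data
shifted_score = drift_score(train_values, shifted_batch)  # large: distribution has moved
```

A score that stays near zero suggests the feature still resembles the training data, while a large score is a signal to investigate before trusting the predictions.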
Cross-validation across different segments:
There is no one-size-fits-all when it comes to ML models. It is a different scenario compared to reusing software. So, make sure that the model is validated on, and recalibrated to, the scenario it will serve.
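One way to check a model across segments is to break its accuracy down per segment instead of reporting a single overall number. The segment names and rows below are invented, and `accuracy_by_segment` is a hypothetical helper.

```python
from collections import defaultdict

def accuracy_by_segment(rows):
    """Compute accuracy separately for every segment in (segment, prediction, actual) rows."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for segment, prediction, actual in rows:
        totals[segment] += 1
        hits[segment] += prediction == actual
    return {segment: hits[segment] / totals[segment] for segment in totals}

rows = [
    ("mobile", 1, 1), ("mobile", 0, 0),
    ("desktop", 1, 0), ("desktop", 1, 1), ("desktop", 0, 1), ("desktop", 0, 0),
]
per_segment = accuracy_by_segment(rows)
```

In this made-up data, the model is perfect on one segment and no better than a coin flip on the other, which an overall accuracy figure would hide.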
Resource utilization:
We all get carried away with experiments. But it is necessary to understand the system's requirements during each workflow stage so that we can keep track of resource usage.
Open communication:
You must keep your product managers, UX designers, and other stakeholders in the loop. We ML engineers, data scientists, and developers may prefer to be left alone, but the business works as a cohesive team.
So, what next?
What are the automations we can do in a Machine Learning workflow?
Automation in ML eliminates manual experimentation and makes workflow management easier, thus allowing you to deploy models into production faster.
AutoML allows you to start a project faster than setting up everything from scratch. The goal of AutoML is to automate some parts of the workflow, not the entire workflow. It works by separating each stage of the workflow and then managing it.
Some of the stages where AutoML works are:
Data ingestion:
Once you identify the source of your data, a crontab entry or a pipeline orchestration tool can automate data ingestion.
Hyperparameter optimization:
Grid search, random search, and Bayesian methods look for optimal combinations within a pre-defined parameter space. This helps you automate hyperparameter tuning.
Model selection:
The same datasets can be used to train multiple candidate models. This lets us determine the best model for the task at hand.
Feature selection:
Pre-defined selection criteria help us automate the feature selection process.
Also, we can automate experiment tracking, model serving, and monitoring.
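To illustrate the hyperparameter optimization stage, here is a minimal sketch of grid search and random search over a small parameter space. The `validation_score` function is a stand-in for training and validating a real model, and the parameter names and values are assumptions.

```python
import itertools
import random

def validation_score(learning_rate, depth):
    """Stand-in for a real train-and-validate run; peaks at learning_rate=0.1, depth=4."""
    return -((learning_rate - 0.1) ** 2) * 100 - (depth - 4) ** 2

def grid_search(grid):
    """Try every combination in the grid and keep the best-scoring one."""
    names = list(grid)
    best = max(
        itertools.product(*grid.values()),
        key=lambda combo: validation_score(**dict(zip(names, combo))),
    )
    return dict(zip(names, best))

def random_search(grid, n_trials, seed=0):
    """Sample a fixed number of random combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    trials = [{name: rng.choice(values) for name, values in grid.items()}
              for _ in range(n_trials)]
    return max(trials, key=lambda params: validation_score(**params))

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_grid = grid_search(grid)
best_random = random_search(grid, n_trials=5)
```

Grid search is exhaustive and grows quickly with the number of parameters, while random search trades completeness for a fixed budget of trials.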
With that, let us take a look at the tools.
Best tools available for machine learning workflow
There are a lot of tools available on the market for ML workflow, but here are our favorite ones.
1. Kubeflow - Makes deployments of ML workflows on Kubernetes simple, portable, and scalable.
2. MLflow - Manages the machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
3. Metaflow - An open-source Python framework initially created at Netflix to tackle challenges in scalability and version control.
4. DVC - An open-source 'version control system' for machine learning projects.
5. NimbleBox.ai - A full-fledged MLOps platform that lets you train, deploy, and allocate jobs to your model.
Conclusion
In this article, we discussed the various stages of a machine learning workflow, from defining your project scope to predicting the outcome of your model.
Now, if you want to know more about setting up an ML team in your startup, read our article here. And, if you're skeptical about opting for an ML platform, read our article about 'Build vs. Buy' here.
We will be writing in detail about machine learning model training soon. Stay tuned!
Written By
Thinesh Sridhar
Technical Content Writer