Decoding the Machine Learning Algorithm Selection Process

Feb 28, 2023

Machine Learning (ML), an influential subset of Artificial Intelligence, has gained enormous traction over the past few decades. Its core idea is self-learning: data, together with known results where available, is fed to a model, which learns patterns from it and derives the rules needed to make predictions on new data. This inverts the traditional programming paradigm, in which programmers hand-code the rules and feed them to the computer along with the data to generate results.

ML now influences many areas of business. Building a working, impactful ML model, however, involves a long and elaborate pipeline, and a crucial step in it is selecting the appropriate machine learning algorithm. A set of checkpoints helps ensure a good selection, and we will list and discuss them in this blog.

Let us start with a brief overview of the types of machine learning algorithms.

Understanding the types of machine learning algorithms

Broadly, ML algorithms fall into the three main types described here -

1. Supervised Learning - These algorithms are used when the training data is labeled with the correct answers that we want the model to learn to predict. It is analogous to learning under human supervision: the model is given the data along with the correct annotation for each sample. Depending on the type of output, this paradigm is further divided into Regression (the output is a continuous variable, e.g., Linear Regression) and Classification (the output is a discrete variable, e.g., Logistic Regression).

2. Unsupervised Learning - These algorithms are used when the training data is not labeled with answers (there is no response variable). They learn by identifying patterns and hidden structures in the data, with no human supervision involved. The model therefore has no reference answers to train against, and its outputs are based solely on the patterns identified during training. This paradigm can also be divided into categories: Clustering (objects with similar patterns are grouped into the same cluster, e.g., K-Means Clustering) and Dimensionality Reduction (when the data has so many features/dimensions that computation becomes expensive, the number of dimensions is reduced, e.g., Principal Component Analysis).

3. Reinforcement Learning - These algorithms are based on goal-oriented learning and aim to strike a good balance between exploration and exploitation. Unlike supervised and unsupervised learning algorithms, which depend heavily on a fixed dataset, these algorithms learn by acting in an environment with rules and a defined goal. It is a reward-punishment mechanism: the model receives rewards as positive reinforcement for correct behavior in the environment and penalties for wrong behavior. It therefore explores the environment to discover rewarding moves and then exploits those moves to maximize the total reward. E.g., autonomous vehicles.
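
To make the exploration-exploitation idea more concrete, here is a minimal sketch of an epsilon-greedy agent on a toy "multi-armed bandit". The environment, the reward probabilities, and the function name are all assumptions made up for this illustration; real reinforcement learning problems such as driving are far more complex.

```python
import random


def epsilon_greedy_bandit(true_reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn which action pays best purely from rewards (toy example)."""
    rng = random.Random(seed)
    n_actions = len(true_reward_probs)
    counts = [0] * n_actions        # how often each action was tried
    estimates = [0.0] * n_actions   # running average reward per action

    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try a random action to learn more about it.
            action = rng.randrange(n_actions)
        else:
            # Exploitation: pick the action that currently looks best.
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The environment pays reward 1 with the action's hidden probability.
        reward = 1.0 if rng.random() < true_reward_probs[action] else 0.0

        counts[action] += 1
        # Incremental update of the average reward for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]

    return counts, estimates


# Hypothetical environment: three actions with hidden success rates.
counts, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("times each action was tried:", counts)
print("action judged best:", best_action)
```

The agent is never told which action is correct. It only sees rewards, spends a small fraction of its steps (epsilon) exploring, and uses the rest to exploit what it has learned so far.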

How to choose a machine learning algorithm

Selecting a machine learning algorithm that performs well is the crux of the model-building pipeline. Here are a few steps to follow before making a selection -

1. Understanding the problem and defining the objective/goal – A large part of our selection depends on what kind of outcome we expect from the model and what input we have.

Considering our input: if the input data is labeled, it is a supervised learning problem. If the data is unlabeled, it is unsupervised learning. If the model has to optimize its behavior by interacting with an environment, it is reinforcement learning.
Considering our output: if we want the model to predict a continuous value, we can use regression algorithms. If we want the model to predict classes, we can use classification. If we want to discover groups in the data, we can use clustering.
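
The input/output questions above can be written down as a small decision helper. This is only a sketch of the reasoning in this step; the function name, its parameters, and the string labels are hypothetical and not part of any library.

```python
def suggest_algorithm_family(has_labels, target_kind=None, interacts_with_environment=False):
    """Map the input/output questions to a learning type and a task.

    target_kind is one of "continuous", "discrete", "groups" or None
    (hypothetical labels chosen for this sketch).
    """
    if interacts_with_environment:
        # The model improves by acting in an environment and receiving rewards.
        return ("reinforcement learning", None)

    if has_labels:
        if target_kind == "continuous":
            return ("supervised learning", "regression")
        if target_kind == "discrete":
            return ("supervised learning", "classification")
        raise ValueError("labeled data needs a 'continuous' or 'discrete' target")

    # No labels: the model has to find structure on its own.
    if target_kind == "groups":
        return ("unsupervised learning", "clustering")
    return ("unsupervised learning", None)


print(suggest_algorithm_family(has_labels=True, target_kind="continuous"))
print(suggest_algorithm_family(has_labels=True, target_kind="discrete"))
print(suggest_algorithm_family(has_labels=False, target_kind="groups"))
print(suggest_algorithm_family(has_labels=False, interacts_with_environment=True))
```

A helper like this only narrows the choice down to a family of algorithms. Picking a specific algorithm within that family still depends on the data and the constraints discussed in the next steps.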

2. Analysis of the data through preprocessing and visualisation – Data is the foundation on which the ML model will be built.

It is necessary to derive valuable insights from the data during preprocessing, as these help decide which ML algorithm to select. Collecting diverse sets of data from various internal systems and external sources into a common platform (data ingestion) is followed by extensive preprocessing, cleaning, feature engineering, and analysis of the data through visualisation, plots, and statistics. E.g., plotting pair plots can help reveal whether the data contains clusters, linear relationships, etc.
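
A pair plot is visual, but the same question ("is there a linear relationship between these two columns?") can also be checked numerically. The sketch below computes the Pearson correlation for every pair of columns in a tiny, made-up dataset. The column names and values are assumptions for illustration only.

```python
import math
from itertools import combinations


def pearson_correlation(xs, ys):
    """Pearson correlation: +1 or -1 means a perfectly linear relationship."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical housing data: price is constructed as exactly 3 * size.
columns = {
    "size": [50, 60, 70, 80, 90, 100],
    "rooms": [1, 2, 2, 3, 3, 4],
    "age": [30, 5, 22, 14, 40, 9],
    "price": [150, 180, 210, 240, 270, 300],
}

# One correlation per pair of columns, like the panels of a pair plot.
correlations = {
    (a, b): pearson_correlation(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}

for (a, b), value in correlations.items():
    print(f"{a:>5} vs {b:<5} correlation = {value:+.2f}")
```

A correlation close to +1 or -1 suggests that a linear model may be enough for that relationship. A value near 0 only rules out a linear relationship, so a plot is still worth looking at for non-linear patterns.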

3. Realistic constraints to check – It is essential to estimate the potential accuracy, scalability, and complexity of the model. The time it takes to train the model and to predict results for new data should be feasible and fit within the time available.

We must also check the relevance of all features. A large number of features (e.g., in textual data) can result in longer training times. Support Vector Machines can work well on data with many dimensions (many features). Alternatively, Principal Component Analysis can reduce the dimensionality while preserving most of the variance in the data.
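
To show what "reducing dimensionality" means in the simplest possible case, here is a sketch of Principal Component Analysis that compresses two features into one. It uses the closed-form solution for a 2x2 covariance matrix, which is an assumption chosen to keep the example dependency-free; for real data with many features, a library implementation would normally be used. The data points are made up.

```python
import math


def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)

    # Angle of the direction with the largest variance (closed form for 2x2).
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    direction = (math.cos(theta), math.sin(theta))

    # One coordinate per point instead of two.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]

    # Share of the total variance kept by the single new feature.
    largest_eigenvalue = (cxx + cyy) / 2 + math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
    explained_ratio = largest_eigenvalue / (cxx + cyy)
    return direction, projected, explained_ratio


# Hypothetical data: two features that carry almost the same information.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
direction, projected, explained_ratio = pca_2d_to_1d(points)

print("principal direction:", direction)
print("1-D representation:", projected)
print("variance preserved:", round(explained_ratio, 4))
```

When the preserved variance is close to 1, as it should be for strongly correlated features like these, the dropped dimension carried little extra information.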

Tips for implementing and fine-tuning your chosen algorithm

Once we have selected an ML algorithm, our next step is to implement it and train it on our data. After that, we can check the accuracy, classification report, etc., to understand the model’s performance. Then we can apply fine-tuning methods, if required, to optimize the model’s performance.
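
As a small illustration of what a classification report contains, the sketch below computes accuracy, precision, recall, and F1 score for a binary classifier from scratch. The labels and predictions are made up, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute basic binary classification metrics from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much did we find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, value)
```

Accuracy alone can be misleading on imbalanced data, which is why precision and recall are usually reported alongside it.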

Here are a few tips for the same:

Appropriately divide the data into train, validation, and test sets – A common choice is to train the model on the larger portion of the data (around 60-70%), use a smaller portion (15-20%) as a validation set for tuning hyperparameters, and hold out the remaining portion (15-20%) as a test set for the final evaluation.
Using cross-validation methods (K-fold) – Training and testing the model on multiple subsets of the data gives a more reliable estimate of how well the model will predict on unseen data. In K-fold, the dataset is divided into k folds; the model is trained on (k-1) folds and tested on the remaining fold. This is repeated k times, so that each fold serves as the test fold once, and the average score over all iterations is reported as the result.
Transfer learning – This is a very helpful technique for reducing training time and pipeline complexity. Knowledge learned by one pre-trained base model is transferred to a second model, which is then adapted to our own objective. Once we have identified the goal, we can look for an existing model that was trained on a similar task and use it as the base model. We can then replace a few layers according to the output we require and train only those new layers.
Balancing depth, width, and resolution – Taking inspiration from the EfficientNet paper, model performance, mainly in terms of speed and accuracy, can be improved by scaling the depth, width, and input resolution of the network in a balanced way. Each dimension matters and they are interdependent: more depth helps capture more complex features, while more width and higher resolution help detect fine-grained features.
Select an appropriate metric to evaluate the model – there are many, such as accuracy, precision, recall, and F1 score (often read off a confusion matrix) for classification, and the R² score for regression.
Hyperparameters can be tuned using grid search – After settling on a model, a grid search can be used to try every combination of candidate hyperparameter values and identify the combination that gives the best validation score. However, it can be computationally intensive, since the number of combinations grows quickly with the number of hyperparameters.
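
Several of these tips fit together in one small workflow: hold out a test set, use K-fold cross-validation on the rest, and run a grid search over a hyperparameter. The sketch below does this for a very simple k-nearest-neighbors classifier on made-up one-dimensional data. The dataset, the 80/20 split, the candidate grid, and all function names are assumptions for illustration; here the K-fold procedure plays the role of the separate validation set.

```python
import random


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors closest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > n_neighbors else 0


def accuracy(train_x, train_y, test_x, test_y, n_neighbors):
    correct = sum(
        knn_predict(train_x, train_y, x, n_neighbors) == y
        for x, y in zip(test_x, test_y)
    )
    return correct / len(test_x)


def cross_val_accuracy(xs, ys, n_neighbors, k=5):
    """Average accuracy over k folds; each fold is the test fold exactly once."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        scores.append(accuracy(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx],
            [xs[i] for i in test_idx], [ys[i] for i in test_idx],
            n_neighbors,
        ))
    return sum(scores) / len(scores)


# Hypothetical data: class 1 when x > 5, with roughly 10% of labels flipped.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = []
for x in xs:
    label = 1 if x > 5 else 0
    if rng.random() < 0.1:
        label = 1 - label
    ys.append(label)

# Hold out the last 20% as a test set that is never used for tuning.
dev_x, dev_y = xs[:80], ys[:80]
test_x, test_y = xs[80:], ys[80:]

# Grid search: score every candidate value with 5-fold cross-validation.
grid = [1, 3, 5, 7, 9]
cv_scores = {n: cross_val_accuracy(dev_x, dev_y, n, k=5) for n in grid}
best_n = max(cv_scores, key=cv_scores.get)

# Final, one-time evaluation on the untouched test set.
test_accuracy = accuracy(dev_x, dev_y, test_x, test_y, best_n)

print("cross-validation scores:", cv_scores)
print("best n_neighbors:", best_n)
print("test accuracy:", test_accuracy)
```

With more than one hyperparameter, the grid becomes every combination of the candidate values, which is where the computational cost of grid search comes from.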

Common pitfalls to avoid when selecting a machine learning algorithm

So far, we have covered the essential checkpoints to follow when selecting a suitable ML algorithm. Failing to follow any of them can lead to a pitfall and, in turn, to lower accuracy and poorer results than expected. A few other common pitfalls to avoid are:

Not examining relationships (linear, nonlinear) between features – Whether the relationships in the data are linear or non-linear can make a large difference in accuracy if we select an ML algorithm that does not suit them. Linear algorithms are simpler and easier to implement. However, they fail to achieve reasonable accuracy if the data is multidimensional with non-linear relationships. Algorithms such as linear regression, logistic regression, and support vector machines with a linear kernel assume linear relationships in the data. For complex data structures and non-linear relationships, other algorithms such as random forests and neural networks should be selected to handle the data with better accuracy.
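
A quick way to see this pitfall is to fit a straight line to data that is linear and to data that is not, and compare the R² scores. The sketch below uses ordinary least squares on two tiny made-up datasets; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [-3, -2, -1, 0, 1, 2, 3]
linear_ys = [2 * x + 1 for x in xs]    # linear relationship
quadratic_ys = [x * x for x in xs]     # non-linear relationship

r2_linear = r_squared(xs, linear_ys, *fit_line(xs, linear_ys))
r2_quadratic = r_squared(xs, quadratic_ys, *fit_line(xs, quadratic_ys))

print("R^2 of a line on linear data:   ", r2_linear)
print("R^2 of a line on quadratic data:", r2_quadratic)
```

The line explains the linear data perfectly but explains none of the variance in the quadratic data, even though y is fully determined by x in both cases. The relationship is there; the linear model simply cannot represent it.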

Not checking the bias-variance tradeoff in the model (overfitting and underfitting) – If our data has too many features relative to the number of samples, there is a good chance that the model is overfitting, with high variance and low bias. Conversely, if the model has very few features, it might be underfitting, with high bias and low variance. It is essential to perform proper feature selection and parameter selection to ensure the right fit and performance of the model.
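
Overfitting and underfitting show up as a characteristic pattern in training versus validation error. The sketch below illustrates this with three models of different flexibility rather than with different feature counts, which is a simplification chosen for this example: a constant (mean) predictor, a straight line, and a 1-nearest-neighbor model that memorizes the training data. The data is made up as y = 2x with alternating noise of +1 and -1 on the training set.

```python
def mse(model, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_mean_model(train_x, train_y):
    # Too simple: always predicts the average target (high bias).
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


def make_line_model(train_x, train_y):
    # Least-squares line: matches the true structure of this data.
    n = len(train_x)
    mean_x = sum(train_x) / n
    mean_y = sum(train_y) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
        / sum((x - mean_x) ** 2 for x in train_x)
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def make_nearest_neighbor_model(train_x, train_y):
    # Too flexible: copies the label of the closest training point,
    # so it also memorizes the noise (high variance).
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict


# Hypothetical data: y = 2x, with +1/-1 noise on the training targets.
train_x = [0, 1, 2, 3, 4, 5, 6, 7]
train_y = [1, 1, 5, 5, 9, 9, 13, 13]
val_x = [x + 0.4 for x in range(7)]
val_y = [2 * x for x in val_x]

builders = {
    "mean": make_mean_model,
    "line": make_line_model,
    "nearest neighbor": make_nearest_neighbor_model,
}

results = {}
for name, build in builders.items():
    model = build(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, val_x, val_y))
    print(f"{name:>16}: train MSE = {results[name][0]:.3f}, validation MSE = {results[name][1]:.3f}")
```

The mean model is poor on both sets (underfitting). The nearest-neighbor model is perfect on the training set but noticeably worse on validation data (overfitting). The line has a small training error but the lowest validation error, which is the balance we are looking for.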

Hardware competency – Training large ML models depends heavily on the hardware available. If we choose a very ambitious ML model and our hardware is not capable of handling the compute-intensive training, the choice of algorithm turns out to be a significant mistake, and the model's accuracy and performance suffer as a result.

Conclusion: How to confidently choose the correct algorithm for your project?

We have seen some critical checkpoints to keep in mind while selecting an appropriate machine learning algorithm. The ability to choose a suitable ML algorithm while avoiding the pitfalls helps prevent excessive resource and time consumption and improves the overall efficiency of the project. We hope this article helps you make more informed decisions when choosing the ML algorithm for your model and avoid potential mistakes.

Written By

Rusali Saha

Technical Content Writer