Machine Learning

ML Model Training: Data Collection and Preprocessing - Part 1

Jul 26, 2022

This is a three-part series. The first part covers data collection and preprocessing, the second covers model training, and the third covers retraining, best practices, and some of the tools available for training.

In my previous article, I talked about how the entire lifecycle of a machine learning workflow works. In this first part of the three-part series, I’ll go over data collection and preprocessing in machine learning, a crucial step that is often overlooked.

Machine learning has benefited many organizations, especially in identifying patterns, detecting anomalies, or testing correlations using data. From detecting fraud in e-commerce to recommendation engines in OTT, ML has been proven to work.

What is machine learning model training?

Model training in machine learning is the process of feeding a training dataset to an ML algorithm so that it can learn patterns and output predictions. Those predictions are then checked against a validation set, which is data the model did not see during training, and compared with its results on the training data.

The quality of the training and validation datasets plays a vital role in the accuracy of the output.

This iterative process of training and retraining an ML algorithm is called model fitting. A well-fitted model generates more accurate predictions on new data than an over-fitted model, which has learned the noise in its training data, or an under-fitted model, which is too simple to capture the pattern.
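The difference between under-fitted and over-fitted models can be sketched with a toy example. The synthetic data, the polynomial degrees, and the helper name mse below are all assumptions made for illustration; the sketch uses NumPy, which is introduced later in this article.

```python
import numpy as np

# Synthetic data: a sine curve plus a little random noise (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(0, 0.1, size=x.shape)

# Alternate the points between a training set and a validation set
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Fit polynomials of increasing flexibility and record both errors
results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))

for degree, (train_err, val_err) in results.items():
    print(f"degree {degree}: train MSE {train_err:.4f}, validation MSE {val_err:.4f}")
```

The training error can only go down as the polynomial gets more flexible. The validation error is the number to watch: a model that is too simple scores badly on both sets, while an over-fitted one typically scores well on training data and worse on validation data.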

There are several ways to train a machine learning algorithm. We will discuss three of them over the course of this series: supervised learning, unsupervised learning, and reinforcement learning.

Steps involved in machine learning model training:

Data collection and preprocessing
Split datasets
Training
Deploying
Re-training
Prediction
Monitoring
Maintenance and Diagnosis

In this article, we will discuss data collection and preprocessing.

By now, we all know how important data is to machine learning. It's no joke that we call it the new oil. But we can't use data as it is, because raw data is usually dirty, incomplete, noisy, and far from ideal.

At Columbia University, a project set out to cut the healthcare costs of treating people with pneumonia. The team turned to ML to sort patients into lowest and highest risk based on their records, so that low-risk patients could be treated at home and high-risk patients in the hospital.

That should've worked. But the data hid a catch: patients with asthma appeared to be low-risk for pneumonia. Doctors routinely sent pneumonia patients with asthma straight to the ICU, so the collected data had no death cases registered for them, and an algorithm trained on those records would send them back home instead. In reality, asthma patients had the highest risk of pneumonia complications.

Steps in Data Collection and Preprocessing:

1. Acquire the dataset - The first and foremost step is to collect the necessary data. Sources differ: some data is stored in the cloud, some in Excel spreadsheets, and some on paper. The most common dataset format is CSV (.csv); HTML and Excel (.xlsx) files are sometimes used as well.

2. Importing libraries - Next, to process the data in, say, Python, we need to import libraries. Each library serves a particular purpose. There are tons of libraries out there, but here are the most commonly used ones.

NumPy - As the name suggests, it is used for numerical and scientific computing. It is especially useful for large calculations on arrays and matrices.

To add NumPy to Python, you need pip, the tool used to install libraries into Python. Installing this way is easier than compiling from source code, though you're welcome to do that if you prefer. Recent Python installers include pip by default; if yours doesn't, you can get the pip package manager from PyPI.org.

Then run the command below in a terminal.

pip install numpy

If you need to check whether it is properly installed, run the following in Python:

import numpy as np
print(np.__version__)

You should get the installed version of NumPy as output.

Matplotlib - A plotting library used to chart the data in a dataset. As with NumPy, you install Matplotlib with pip. Its pyplot submodule, which most plotting code uses, comes in the same package and does not need a separate install.

Same as before,

pip install matplotlib

And then,

import matplotlib
print(matplotlib.__version__)

Pandas - One of the most popular libraries in Python, pandas is an open-source data manipulation and analysis library used to import and manage datasets.

Like before

pip install pandas

And then,

import pandas
print(pandas.__version__)

3. Importing datasets - After importing libraries, we need to import the dataset. Make sure the directory that contains the dataset is set as the current working directory, or pass the full path to the file.

For example, we can import datasets using pandas as below,

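import pandas as pd

# Raw string so the backslash in the Windows path is not read as an escape
df = pd.ExcelFile(r"C:\visitors.xlsx")
# Read the sheet named "visitors" into a DataFrame
data = df.parse("visitors")
# Show the first 50 rows
print(data.head(50))

The above imports an .xlsx file through pandas and prints the first 50 rows of its visitors sheet, that is, the first 50 visitors of a webpage.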
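Since CSV is the format mentioned most often above, here is a minimal sketch of loading one with pandas. The column names and the inline sample data are made up for illustration; in practice you would pass the path of your own file to pd.read_csv instead of an in-memory string.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a real file such as visitors.csv
csv_text = """visitor_id,page,time_spent_sec
1,home,34
2,pricing,120
3,home,
4,blog,57
"""

# pd.read_csv accepts a file path or any file-like object
visitors = pd.read_csv(io.StringIO(csv_text))

print(visitors.head())   # first rows (up to 5)
print(visitors.shape)    # (number of rows, number of columns)
print(visitors.dtypes)   # column types inferred by pandas
```

Note that the empty field in the third record is read as a missing value (NaN), which is exactly the kind of gap the next step deals with.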

4. Find missing data - Missing values are annoying; even an entire row or column of blanks can throw your dataset into a fit. There are a couple of ways to deal with missing data.

Deleting - The most commonly used approach: delete the rows or columns that contain null data. This can lead to complications of its own, because you may throw away useful data along with the gaps.

Calculating means - Calculate the mean of the column (or row) that contains the gap and fill in the missing value with it. Useful in some datasets, mainly for numeric data.
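Both approaches can be sketched in a few lines of pandas. The DataFrame below is invented for illustration, and the column names are assumptions rather than part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps (NaN marks a missing value)
df = pd.DataFrame({
    "visitors": [120, 95, np.nan, 143, 110],
    "time_spent_sec": [34.0, np.nan, 51.0, 47.0, np.nan],
})

# How many values are missing in each column?
missing_per_column = df.isna().sum()

# Option 1: delete every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill each gap with the mean of its column
filled = df.fillna(df.mean())

print(missing_per_column)
print(dropped)
print(filled)
```

In this toy example, deleting leaves only two of the five rows, which shows the kind of complication mentioned above. Filling with the mean keeps every row, at the cost of inventing values that were never observed.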

5. Encoding categorical data - Let’s say our main data are ‘visitors’ and ‘time spent’ on a page. ML works on mathematics, so if one of the columns in our dataset is a categorical variable, such as a text label, it must be encoded as numbers before training.
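As a rough sketch of what encoding looks like, the example below adds a hypothetical text column called page to the visitors data and converts it in two common ways. The data and column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: "page" is a categorical (text) column
df = pd.DataFrame({
    "page": ["home", "pricing", "blog", "home"],
    "visitors": [120, 95, 143, 110],
    "time_spent_sec": [34, 62, 51, 47],
})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df, columns=["page"], dtype=int)

# Integer codes: one number per category, in order of first appearance
codes, categories = pd.factorize(df["page"])

print(one_hot)
print(codes, list(categories))
```

One-hot encoding is usually the safer choice for categories with no natural order, since integer codes can make a model treat one category as "larger" than another.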

Data Collection and Preprocessing Tips:

1. Make sense of data - ML relies heavily on data, which makes understanding it the most critical process in a machine learning project. The data dictates the project's output and the project's success. Information is abundant in our day and age, but if you're unable to make sense of it yourself, the machine will never be able to produce the desired output.

2. Take time - Well begun is half done; sometimes, the right data collection strategy, tools, and process may take more time than training the actual model. And that is all right, considering everything depends on the data collection and preprocessing stage.

3. Start collecting early - Useful datasets often come from years of collecting data, whether or not it seemed useful at the time. Some companies deal with terabytes of data daily, while others are only starting their data collection journey. There's nothing like owned data when it comes to ML. But there are also publicly available datasets that companies like Google are ready to give away.

4. Work with what you have - If you're a legacy company that still works with .csv files, .xlsx files, or papers and ledgers, you still have a golden opportunity to turn those to your advantage. You may not have a standard ML dataset, but you can still work with the existing data and convert it into ML-ready datasets.

5. Problem statement matters - Define the problem early on; knowing what you want to predict before collecting data is essential. Explore the data with the type of task in mind: classification, clustering, regression, or ranking. It is also a good practice to define a data collection mechanism and set a standard data collection strategy company-wide so that the data science and ML teams can work efficiently.

6. Data quality is gold - Double-check your data quality. To err is human, but machines are not forgiving. Always ask yourself: Is this data correct? Were there any leaks in the data collection process? Are the connected systems free of breaches and malware? Is there data missing? Is your dataset balanced enough for the machine to learn right from wrong?

7. Check the formatting - Make sure your entire dataset is formatted consistently. Double-check decimal points, currency denominations, addresses, date formats, and other mismatches.

8. More data is not always better - More data does not by itself mean more accurate predictions; what matters is that the data is relevant. For example, if you're an e-commerce company trying to reduce your cart abandonment rate, then data such as time spent on a page, bounce rate, items added to the cart, and item categories might be worth feeding to the algorithm to predict why cart abandonment takes place.
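Some of the quality and formatting checks from the tips above can be automated. The sketch below runs a few of them with pandas on a made-up export; the column names, the sample values, and the date format are assumptions, so adapt them to your own data.

```python
import pandas as pd

# Hypothetical raw export with typical problems: a duplicate row,
# an unparseable date, and a price written as a word
raw = pd.DataFrame({
    "order_date": ["2022-07-01", "2022-07-02", "2022-07-02", "not a date"],
    "price": ["19.99", "5", "5", "twelve"],
    "abandoned": ["yes", "no", "no", "no"],
})

# Count exact duplicate rows, then drop them
n_duplicates = int(raw.duplicated().sum())
clean = raw.drop_duplicates().copy()

# Convert text to proper types; values that can't be parsed become NaT/NaN
clean["order_date"] = pd.to_datetime(clean["order_date"], format="%Y-%m-%d", errors="coerce")
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")

# Values that failed conversion now show up as missing
unparsed = clean.isna().sum()

# Class balance of the label column, as fractions of all rows
balance = clean["abandoned"].value_counts(normalize=True)

print(n_duplicates)
print(unparsed)
print(balance)
```

Checks like these won't tell you whether the data is correct, but they quickly surface duplicates, values in the wrong format, and a label column that is heavily skewed toward one class.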

Split Datasets:

Datasets are split into training, validation, and testing sets for a machine learning project.

Training set: The data the model learns from; training adjusts the model's parameters to fit it.
Validation set: Data held back from training and used to fine-tune the learning parameters of the model. It helps with checking accuracy and with selecting a model.
Test set: Data used at the end to evaluate the final model and algorithm on examples it has not seen in the previous two sets.
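A three-way split can be sketched with pandas alone. The function name split_dataset, the 70/15/15 proportions, and the sample data are assumptions chosen for illustration; scikit-learn's train_test_split is a ready-made alternative if you already use that library.

```python
import pandas as pd

def split_dataset(df, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the rows, then cut them into train / validation / test parts."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]   # whatever is left over
    return train, val, test

# Hypothetical dataset of 100 rows
data = pd.DataFrame({"visitors": range(100), "time_spent_sec": range(100, 200)})
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))
```

Shuffling before cutting matters when the rows are stored in some order, such as by date; fixing the seed keeps the split reproducible between runs.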

Once you’re done collecting and preprocessing the data, we’ll move on to model training in the next article.

Written By

Thinesh Sridhar

Technical Content Writer