If a hatch in a space station is off from its required measurements by even a few millimeters, do you know what happens? Explosive decompression: the pressure difference between the cabin and the vacuum of space will not tolerate even a millimeter of error. So why should a machine learning model, or your customer base, be expected to tolerate errors any better?
Anomaly detection in machine learning, whether before or after model development and deployment, is an essential task for keeping the MLOps pipeline running smoothly. Whether the problem is skewed values in the pre-training data or fraud and misuse of your services, anomaly detection goes a long way toward cutting cost and time while boosting performance.
Models nowadays can incorporate ML anomaly detection at a level that was not possible before, and here we will explore some of these methods and tricks. So why fall head over heels for some outlier that decided to mess with your model? Let us take you through this guide, which covers the measures in question.
What is anomaly detection in Machine Learning?
Anomaly detection in machine learning is the complete procedure of dealing with anomalies and irregularities in a dataset. These can be outliers or data points that differ significantly from the trend the other data points follow. Such irregularities tend to become an issue during training, causing unwanted skew in the model's predictions.
In our earlier blog, Machine Learning Performance Metrics, we touched upon over-fitting and covered a number of algorithms and performance metrics used to tweak a model toward the desired results. Anomaly detection with machine learning complements those tools by ensuring that outliers and anomalies are detected and can be dealt with before they distort the model.
Beyond the pre-development of machine learning algorithms, anomaly detection algorithms also surface suspicious and unwanted instances post-deployment. Furthermore, these algorithms can flag anomalies while screening the vast streams of results produced by proper monitoring systems.
Post-deployment, anomaly detection can be used for tasks ranging from fraud detection and medical diagnosis to product defects and equipment malfunction. In layman's terms, anomaly detection is ultimately the task of training a machine to define what is expected. Paired with machine learning, it also ensures that the model does not lose its ability to generalize.
What are the purposes of anomaly detection?
The purposes of anomaly detection, as already mentioned above, can be boiled down to the following:
Product Performance: Anomaly detection paired with machine learning can cross-check against existing data while maintaining generalization, and single out odd-standing products with a complete picture of what makes them anomalous.
Technical Performance: Faults in your deployed system may leave your servers open to active DDoS attacks. Such errors can be proactively avoided and treated at the root by integrating machine learning into the DevOps pipeline.
Training Performance: During the pre-training phase, anomaly detection comes in handy by pointing out irregularities in the dataset that may cause the model to over-fit and, in turn, perform poorly.
There are various types of anomalies a machine learning model can be trained to identify:
Point Anomaly: A single record in the dataset that is an anomaly because it lies far from the trend set by the other data points (a small sketch after this list illustrates the idea).
Contextual Anomaly: A data point that counts as an anomaly only within a particular context and may be perfectly valid in another; heavy network traffic at 3 a.m., for instance, is suspicious, while the same volume at peak hours is not.
Collective Anomaly: A whole collection of data points that, taken together, deviates from the rest of the values, making the subset a rarity even if each individual point looks normal.
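To make the point-anomaly case concrete, here is a minimal sketch that flags values far from the mean with a simple z-score rule. The readings and the two-standard-deviation threshold are our own illustrative assumptions, not a prescription:

```python
import numpy as np

# Hypothetical sensor readings with one obvious point anomaly (42.0).
readings = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2])

# Standardize: how many standard deviations is each reading from the mean?
z_scores = (readings - readings.mean()) / readings.std()

# Flag readings more than two standard deviations away.
print(readings[np.abs(z_scores) > 2])  # [42.]
```

Contextual and collective anomalies need more machinery, since the same rule must be conditioned on context or applied to groups of points rather than individual values.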
Now that we have the purpose and needs of anomaly detection, let us check out how anomaly detection algorithms can work with machine learning.
Anomaly detection techniques in machine learning
With such shortcomings in monitoring techniques and lax human detection comes something every engineer strives toward: solutions! Many algorithms and procedures have been devised to implement anomaly detection in ways that complement machine learning pipelines. All of these methods can be broadly classified into three major categories, which we will discuss further. But first, let us discuss one central assumption these techniques work upon.
Anomalies are rare, out-of-order data points, and while they are best removed, they must also be infrequent relative to the broader selection of data. That is, the dataset should be large enough that removing the anomalies does not leave us with too little data to be usable.
Unsupervised Anomaly Detection
Like the unsupervised learning techniques themselves, i.e., K-means, Gaussian mixture models, K-medians, and so on, anomaly detection in the unsupervised setting deals with unlabelled data. Much like those techniques, it works by figuring out the pattern the unlabelled points follow.
Anomalies are then detected because they stand out from the trend set by the other data points. A large share of unsupervised anomaly detection algorithms, for example, build on clustering techniques.
Let us look at some of the techniques that can be used for anomaly detection in unsupervised learning; short code sketches for these follow the list:
Isolation Forest: Based on concepts derived from the Random Forest classifier, an Isolation Forest processes randomly subsampled data in a tree structure using randomly selected attributes. Samples that travel deeper into a tree are less likely to be anomalies, because they required more cuts to isolate; conversely, samples that end up on shorter branches are likely anomalies, since the tree could separate them more easily.
Local Outlier Factor (LOF): Measures the local density deviation of a given data point relative to its neighbors, and identifies as outliers the samples whose density is significantly lower than that of their neighbors.
Mahalanobis Distance: A useful multivariate distance metric that evaluates the separation between a point and a distribution. This technique is an ideal go-to for one-class classification and highly imbalanced datasets.
Autoencoders: Autoencoders train a neural network to compress and then reconstruct normal data, thereby learning expected behavior. When an outlier data point arrives, the autoencoder cannot encode it well, so the reconstruction is inaccurate and the resulting error exposes the anomaly.
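As a rough sketch of the first two techniques, the snippet below runs scikit-learn's IsolationForest and LocalOutlierFactor on synthetic data; the data, contamination rate, and neighbor count are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Mostly "normal" points around the origin, plus a few far-away outliers.
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.uniform(6, 8, size=(5, 2))])

# Isolation Forest: anomalies are isolated in fewer random splits.
iso_labels = IsolationForest(contamination=0.05,
                             random_state=42).fit_predict(X)

# Local Outlier Factor: anomalies sit in regions of much lower density.
lof_labels = LocalOutlierFactor(n_neighbors=20,
                                contamination=0.05).fit_predict(X)

# Both return -1 for anomalies and 1 for normal points.
print("Isolation Forest flagged:", (iso_labels == -1).sum())
print("LOF flagged:", (lof_labels == -1).sum())
```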
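The Mahalanobis route needs only NumPy and SciPy; here is a minimal sketch in which the Gaussian data and the chi-square cut-off are assumptions made for illustration:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 3))  # assumed "normal" reference data

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(points):
    # Distance of each point from the distribution of X.
    diff = points - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# Under normality, squared distances follow a chi-square distribution,
# so flag anything beyond the 99th percentile.
threshold = np.sqrt(chi2.ppf(0.99, df=X.shape[1]))

new_points = np.array([[0.1, -0.2, 0.3], [8.0, 8.0, 8.0]])
print(mahalanobis(new_points) > threshold)  # [False  True]
```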
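And a bare-bones autoencoder sketch in Keras; the architecture, training budget, and percentile threshold are all illustrative choices rather than recommendations:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(1)
X_train = rng.normal(0, 1, size=(1000, 20)).astype("float32")  # normal data only

# Compress 20 features down to 4 and reconstruct them.
autoencoder = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(20),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=20, batch_size=32, verbose=0)

# Reconstruction error on normal data sets the anomaly threshold.
train_err = np.mean((X_train - autoencoder.predict(X_train, verbose=0)) ** 2, axis=1)
threshold = np.percentile(train_err, 99)

# A suspicious batch drawn from a much wider distribution.
X_new = rng.normal(0, 5, size=(10, 20)).astype("float32")
new_err = np.mean((X_new - autoencoder.predict(X_new, verbose=0)) ** 2, axis=1)
print("anomalies:", (new_err > threshold).sum())
```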
Supervised Anomaly Detection
Since supervised learning relies on labeled data, so do the techniques used to detect anomalies in such models. Detecting anomalies in labeled data can be much easier than in unsupervised datasets, and these techniques hold great potential to be automated and made more efficient.
As arduous as collecting labeled data may be, in most applications where these techniques are used, the labels come bundled with various other parameters and variables. Studying these parameters makes the methods in this category much more effective at dealing with unseen data.
Let us look at some of the techniques that can be used to detect anomalies in supervised machine learning; code sketches for both follow the list:
Support Vector Machines: SVMs employ multidimensional hyperplanes to segregate observations and can also solve multi-class classification problems. For anomaly detection, the one-class variant is widely used when the training data belongs to a single class: the algorithm learns the boundary of what is "normal" and determines whether a new point belongs in the group.
K Nearest Neighbors: The underlying premise of the nearest-neighbor family is that similar observations lie close to one another, while outliers are typically solitary observations situated far from any cluster of similar observations; measuring neighbor distances therefore exposes the anomalies.
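A rough sketch of the one-class SVM idea with scikit-learn's OneClassSVM follows; the training data, kernel, and nu value are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
X_train = rng.normal(0, 1, size=(300, 2))  # training data: the "normal" class only

# nu roughly bounds the fraction of training points treated as outliers.
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
clf.fit(X_train)

X_new = np.array([[0.2, -0.1], [5.0, 5.0]])
print(clf.predict(X_new))  # 1 = normal, -1 = anomaly
```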
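And a nearest-neighbor sketch that scores each point by its distance to the k-th nearest neighbor; the data, k, and the three-sigma threshold are assumptions for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               [[9.0, 9.0]]])  # one solitary point far from the cluster

# Distance to the k-th nearest neighbor serves as the anomaly score.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
distances, _ = nn.kneighbors(X)
scores = distances[:, -1]

# Flag points with an unusually large k-NN distance.
threshold = scores.mean() + 3 * scores.std()
print(np.where(scores > threshold)[0])  # index of the solitary outlier
```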
Semi-supervised Anomaly Detection
Semi-supervised anomaly detection works in a relatively simple yet interesting way. It is a blend of the supervised and unsupervised approaches we dealt with earlier. Typically, it applies when there is labeled normal input data but no identified outliers. The model learns the trends of the standard data from the labeled training set and then flags points in the unlabeled data that deviate beyond the learned threshold.
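Here is a minimal sketch of that workflow, fitting a single Gaussian density to labeled normal data and thresholding the likelihood of unlabeled points; the data and the percentile cut-off are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
X_normal = rng.normal(0, 1, size=(500, 2))  # labeled: all known-normal points

# Model "normal" behaviour as a single Gaussian fitted to the labeled data.
dist = multivariate_normal(mean=X_normal.mean(axis=0),
                           cov=np.cov(X_normal, rowvar=False))

# Threshold: the 1st percentile of likelihoods seen on the normal data.
threshold = np.percentile(dist.pdf(X_normal), 1)

# Unlabeled stream with one oddity at the end.
X_unlabeled = np.vstack([rng.normal(0, 1, size=(50, 2)),
                         [[6.0, 6.0]]])
print(np.where(dist.pdf(X_unlabeled) < threshold)[0])
```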
Although this setting can often be handled with techniques like Local Outlier Factor and Isolation Forest, the question of efficiency comes up again given the sheer volume of "semi-supervised" data usually encountered. Nevertheless, semi-supervised anomaly detection can be considered the closest to the real-life conditions encountered in a deployed machine learning pipeline.
Most methods devised to deal with this problem are heavily engineered and take inspiration from the techniques we just looked at. These methods can be explored via the benchmark showcased on Papers With Code: Semi-supervised Anomaly Detection.
Conclusion
We just discussed the definition of and need for anomaly detection in your MLOps pipeline, and how it works in collaboration with the monitoring and scaling aspects of post-deployment. We also covered techniques that can be applied to the different types of data you may encounter when dealing with anomaly detection.
We hope this article helps you identify the anomalies discussed and make your model more resilient to them for better performance.
Written By
Aryan Kargwal
Data Evangelist