With the tech giants in head-to-head competition launching ChatGPT and Google Bard, and the earthquake hitting Turkey, it's safe to say our world is changing every day. In such a fast-paced world, businesses using machine learning and AI benefit little from stale, outdated data. You have to pull out the big guns: real-time data that can provide accurate insights for decision-making and give you an edge over competitors. But using real-time data is more complicated, as it comes with many challenges and quality issues. In this blog, I'll walk you through the challenges of real-time machine learning and explore solutions and tools to handle them.
The conventional way to build an ML model is to train it on batches of historical data. But in many cases, issues like data distribution drift degrade the model's accuracy over time. As seen below, this can be mitigated by updating the training dataset and re-training the model at regular intervals.
Hence, it would be better to supply continuous live data to our model to keep it robust. This is precisely what data scientists do in real-time machine learning!
Real-time ML is most suitable for situations requiring quick decision-making, where the model needs to learn and adapt to new patterns. For example, say you are booking a cab on Ola/Uber at 9 pm. The app shows an "ETA (Estimated Time of Arrival)" when you enter your pick-up and drop locations. All the data buffs here know there's an ML pipeline under the hood. But to provide an accurate ETA, the model needs information on the current traffic along different routes, weather conditions (rains surge up the fares), availability of drivers near your pick-up location, weekday/weekend rush hours, and much more. This kind of data can be provided only with real-time data streaming and continuous learning. Similarly, transaction data is streamed live into models to capture financial fraud and stop it in real time.
While the adoption of real-time ML is growing rapidly, there are still cases where it is unnecessary and conventional batch predictions are the best fit, such as customer segmentation and demand forecasting, to name a few.
Now that we know why and how we use real-time data, let's also understand the common challenges. Real-time data is a powerful tool which, if used without awareness and responsibility, can result in interruptions, inaccurate results, or biases towards a particular gender, race, or country.
Uncertainty: Continuous data streaming means we cannot be certain of the speed and volume of the incoming flow, and its behavior is often inconsistent. Apart from this, 'real-time' can mean different things in different scenarios and projects; without explicit communication, misaligned understandings can arise between various levels of the organization.
Error handling: In real-time systems, undetected errors can significantly impact performance. Errors can occur at various stages of the model pipeline, making it difficult to pinpoint their source. They are equally difficult to anticipate, as they can arise from unexpected data inputs or changes in the data distribution.
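One common way to keep a stream alive despite bad inputs is a dead-letter pattern: failed records are set aside with their error instead of crashing the pipeline. A minimal stdlib-only sketch (the field name `amount` and the validation rule are illustrative assumptions):

```python
import json

def process_record(raw):
    """Parse and validate one incoming record; raise on bad input."""
    record = json.loads(raw)
    if "amount" not in record:
        raise KeyError("missing 'amount' field")
    return float(record["amount"])

def run_stream(raw_records):
    """Route failures to a dead-letter list instead of crashing the pipeline."""
    results, dead_letter = [], []
    for raw in raw_records:
        try:
            results.append(process_record(raw))
        except (json.JSONDecodeError, KeyError, ValueError) as exc:
            # Keep the bad record and the error message for later inspection,
            # so the source of the failure can be pinpointed offline.
            dead_letter.append({"record": raw, "error": str(exc)})
    return results, dead_letter
```

Inspecting the dead-letter list offline is what makes the source of an error traceable even when it could not be anticipated.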
Inferior data quality: Big data has been a boon for businesses leveraging ML to improve their performance. But it can turn into a bane within minutes due to bad quality. Inferences and predictions from a bad-quality dataset aren't reliable and could result in huge losses. Data quality issues can arise from missing data points, incomplete records, and noise in the data stream, and the time constraint makes it challenging to detect biases present in the data inflow.
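A lightweight in-stream quality gate can catch the most common issues (missing fields, out-of-range values, duplicates) before records reach the model. A sketch, assuming hypothetical field names `id` and `value`:

```python
def quality_check(records, required=("id", "value"), value_range=(0, 100)):
    """Split records into clean ones and flagged ones with a reason."""
    seen_ids, clean, issues = set(), [], []
    for rec in records:
        if any(k not in rec for k in required):
            issues.append((rec, "missing field"))       # incomplete record
        elif not value_range[0] <= rec["value"] <= value_range[1]:
            issues.append((rec, "out of range"))        # likely noise
        elif rec["id"] in seen_ids:
            issues.append((rec, "duplicate"))           # repeated event
        else:
            seen_ids.add(rec["id"])
            clean.append(rec)
    return clean, issues
```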
Multiple sources and formats: Real-time ML systems get their data from different sources with varying formats and standards, making the integration complex. A practical example of this is Google Maps, which I'm sure we all have opened stuck in traffic! Apps like Google Maps gather data from various sources such as GPS devices, sensors, weather stations, and public transportation schedules. These data sources may produce different data formats, such as text, images, videos, or time-series data. New changes in the data schema that misalign with existing specifications can interrupt the data collection pipeline.
Data leakage: This refers to a situation where the model is inadvertently exposed to information outside the intended training dataset. Preventing data leakage is more difficult when data is streamed in real time. For example, consider a real-time fraud detection model that flags a transaction as fraudulent or genuine. Real-time data will come with information about the outcome of a transaction, such as whether it was charged back or not, which can lead to data leakage. The model may unintentionally learn to use the transaction's outcome as a predictor, rather than learning from the available feature values.
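A simple guard is to maintain an explicit denylist of outcome fields and strip them from the feature set before training. The column names below are hypothetical placeholders for whatever outcome fields a real transaction stream carries:

```python
# Fields that describe the transaction's outcome; they are known only after
# the fact, so they must never be used as model inputs (hypothetical names).
OUTCOME_COLUMNS = {"charged_back", "fraud_label", "dispute_opened"}

def split_features(record):
    """Separate legitimate input features from outcome fields before training."""
    features = {k: v for k, v in record.items() if k not in OUTCOME_COLUMNS}
    label = record.get("fraud_label")  # the target, kept only as the label
    return features, label
```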
Data latency: Some data sources may delay producing data, resulting in stale or outdated data that hinders the model's accuracy. This occurs because real-time ML models generally process data generated on remote devices or sent over a network; network delays, data transfer delays, and processing delays due to insufficient computational resources can all introduce latency.
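Latency is usually measured as the gap between a record's event time (when it was generated) and processing time (when it arrives). A minimal sketch, assuming each record carries an `event_ts` epoch timestamp, that drops records staler than a threshold:

```python
import time

def record_lag_seconds(record, now=None):
    """How stale a record is: processing time minus event time."""
    now = time.time() if now is None else now
    return now - record["event_ts"]

def drop_stale(records, max_lag, now=None):
    """Keep only records whose lag is within max_lag seconds."""
    return [r for r in records if record_lag_seconds(r, now) <= max_lag]
```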
Now that we know working with real-time data is no cakewalk, let's look at the best-advised data collection and preprocessing steps you can use.
It is essential for teams to set up data pipelines that can capture, process, and analyze the data in real time and can scale to handle large volumes of data. Regular checks should be in place to ensure the data format is correct. The data collected should undergo preprocessing steps like duplicate removal, normalization, and feature engineering through smoothing, resampling, etc.
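The preprocessing steps above can be sketched in a few lines. This toy version deduplicates a window of readings, min-max normalizes them, and smooths with a 3-point moving average; a production pipeline would do the same transformations with streaming-framework operators:

```python
def preprocess(stream):
    """Deduplicate, min-max normalize, and smooth a window of readings."""
    # 1. Duplicate removal (keep first occurrence, preserve order)
    seen, deduped = set(), []
    for x in stream:
        if x not in seen:
            seen.add(x)
            deduped.append(x)
    # 2. Min-max normalization to [0, 1]
    lo, hi = min(deduped), max(deduped)
    norm = [(x - lo) / (hi - lo) for x in deduped] if hi > lo else [0.0] * len(deduped)
    # 3. Smoothing with a 3-point trailing moving average
    smoothed = []
    for i in range(len(norm)):
        window = norm[max(0, i - 2): i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```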
Teams should automate pipelines to continuously update and retrain machine learning models using the latest data. Many leading companies monitor real-time data to detect unusual patterns that may indicate a data leak or security breach.
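Continuous updating boils down to incremental learning: each new batch refines the model in place instead of triggering a full retrain. The toy model below just tracks a running mean, standing in for `partial_fit`-style learners such as scikit-learn's `SGDRegressor`:

```python
class OnlineMeanModel:
    """Toy continuously-updated model: predicts the running mean of the target.
    Stands in for partial_fit-style incremental learners."""
    def __init__(self):
        self.n, self.total = 0, 0.0

    def update(self, y):
        self.n += 1
        self.total += y

    def predict(self):
        return self.total / self.n if self.n else 0.0

def retrain_loop(batches, model):
    """Each new data batch immediately updates the model, with no full retrain."""
    for batch in batches:
        for y in batch:
            model.update(y)
    return model.predict()
```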
Data governance is another critical component of real-time ML systems. Wondering why? In 2018, it was reported that Amazon had developed an ML-based recruiting tool to analyze resumes and recommend candidates. Unfortunately, the model turned out to be biased against women applicants. The training data consisted of resumes received over the past ten years, which came predominantly from men and were hence biased. This could have been avoided with an efficient data governance policy in place. Clearly defined ownership, accountability, and data access control are essential. To achieve this, organizations should have well-defined AI governance policies focusing on each of the below metrics.
Companies leading in real-time ML, like Uber, also have data isolation and anomaly detection to prevent data leakage. Capital One uses real-time machine learning to detect and prevent fraud on their credit cards. To eliminate vulnerabilities, they have adopted data masking, anonymization, and regular security checks.
To efficiently set up a real-time ML pipeline, you can leverage tools like Apache Kafka, Apache Flink, NimbleBox Jobs, Apache Spark Streaming, Amazon SageMaker, and many more. Apache Kafka is a commonly used distributed streaming platform that can handle large volumes of data. It has low latency for real-time data streaming and is easy to integrate. A simple pipeline using the Kafka API is depicted in the image below.
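The producer/consumer pattern such a pipeline uses can be sketched with a stdlib queue standing in for a Kafka topic (a real client would publish and poll via a library such as `kafka-python`, but the shape of the code is the same):

```python
import json
import queue

topic = queue.Queue()  # stands in for a Kafka topic

def produce(topic, events):
    """Producer side: serialize events onto the topic."""
    for event in events:
        topic.put(json.dumps(event))

def consume(topic, handler):
    """Consumer side: poll the topic and apply a handler to each event."""
    results = []
    while not topic.empty():
        results.append(handler(json.loads(topic.get())))
    return results
```

In production, the consumer loop runs continuously, and the handler feeds features to the model or writes to a feature store.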
If you want a platform that can support complex event processing, you can try Apache Flink. It even provides a unified processing engine for batch and streaming data. The figure below shows a schema of how these tools are used.
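To make "complex event processing" concrete, here is a stdlib-only sketch of a classic CEP pattern: raise an alert when a given event type occurs N times within a sliding time window. The event names and thresholds are illustrative; in Flink you would express the same rule with its CEP library over a keyed stream:

```python
from collections import deque

def detect_bursts(events, key="login_fail", threshold=3, window=60):
    """Alert at timestamps where `threshold` matching events fall
    within `window` seconds. `events` is a list of (timestamp, type)."""
    recent, alerts = deque(), []
    for ts, etype in events:
        if etype != key:
            continue
        recent.append(ts)
        # Evict events that have slid out of the time window.
        while recent and ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            alerts.append(ts)
    return alerts
```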
The manufacturing industry's Industry 4.0 revolution has gone hand-in-hand with the AI revolution. Smart factories are equipped with IoT sensors that provide real-time data. Analyzing this data has helped in predictive maintenance and fault control. For example, the real-time temperature and vibration data of motors attached to conveyor belts can be used to detect anomalies and prevent failure. This has proven to decrease factory downtime and repair costs and increase the lifetime of machinery. Models can analyze data from multiple sources, such as machine sensors and production line sensors, and recommend production schedules in real time to maximize output.
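A common baseline for this kind of sensor anomaly detection is a rolling z-score: flag a reading that deviates too many standard deviations from the recent history. A minimal sketch (window size and threshold are arbitrary choices, not tuned values):

```python
import math
from collections import deque

def detect_anomalies(readings, window=5, z_threshold=3.0):
    """Flag indices where a reading deviates more than z_threshold
    standard deviations from the rolling mean of the prior `window` readings."""
    buf, anomalies = deque(maxlen=window), []
    for i, x in enumerate(readings):
        if len(buf) == window:
            mean = sum(buf) / window
            var = sum((v - mean) ** 2 for v in buf) / window
            std = math.sqrt(var) or 1e-9  # avoid division by zero
            if abs(x - mean) / std > z_threshold:
                anomalies.append(i)
        buf.append(x)
    return anomalies
```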
Can you think of another industry where the trends and patterns constantly change, with a need for real-time data analysis? Of course, the stock markets! High-frequency trading (HFT) firms like Two Sigma use real-time ML algorithms to analyze market trends, news feeds, and other sources to detect patterns, reduce risks, and make better predictions. Real-time ML has become an essential tool for HFT firms to stay competitive in fast-paced and dynamic financial markets.
Consumer-centric apps like Uber, Swiggy, and eBay have also implemented real-time ML. But did you know Uber has developed one of the most robust feedback loops in its real-time ML pipeline?
Uber collects a vast amount of real-time data from its platforms, such as GPS locations, trip data, and user feedback, to update and improve its ML algorithms. An overview of the ML pipeline architecture used by Uber rides and Uber Eats is shown in the image below. In most applications, the ETA's accuracy is essential to customer satisfaction and retention.
Uber has also successfully implemented dynamic pricing, adjusting to demand and supply, and predictive routing with the help of real-time traffic data. The real-time data helps provide accurate ETAs, which drive customer satisfaction, as shown in the image.
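The core idea behind dynamic pricing can be sketched as a demand/supply ratio with a cap. This is an illustrative toy rule, not Uber's actual pricing formula:

```python
def surge_multiplier(active_requests, available_drivers, base=1.0, cap=3.0):
    """Toy surge rule: price scales with the demand/supply ratio,
    never below the base fare multiplier and never above the cap."""
    if available_drivers == 0:
        return cap
    ratio = active_requests / available_drivers
    # Surge kicks in once demand exceeds supply, capped at `cap`.
    return min(cap, max(base, base * ratio))
```

In a real system, the request and driver counts would themselves come from the streaming pipeline, recomputed per geographic zone every few seconds.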
Now we have a good idea of the capabilities of real-time data in various spheres and the challenges that come with it. To implement real-time ML, teams must shift from traditional batch processing to stream data processing. This shift comes with a high initial investment required to update the infrastructure. Kafka and Flink are JVM-based rather than Python-native, making integration into Python workflows more complex. The applications are also expanding into anomaly detection in the medical and cybersecurity fields.