Just in the last month, an article was published showing that more than 30% of the data Google used in one of its commonly used machine learning models was mislabeled. Not only was the model itself riddled with errors, but so was the very training data that model learned from. How can anyone using Google's model hope to trust its results if the underlying data is full of human-caused errors that computers can't fix? And Google isn't alone in mislabeling big data: a 2021 MIT study found that approximately 6% of the images in the industry-standard ImageNet database were misclassified, and furthermore found "label errors in the test sets of 10 of the most commonly used audio, natural language, and computer vision datasets." How can we hope to trust or use these models if the data used to train them is so poor?
The answer is that you can't trust that data or those models. With AI, it's garbage in, garbage out, and AI projects suffer badly from poor data. If Google, ImageNet, and others make this mistake, you are bound to make it as well. Research from Cognilytica shows that more than 80% of an AI project's time is spent managing data, from collecting and aggregating that data to cleaning and labeling it. Even with all that time invested, errors are bound to happen, and that's if the data was good to begin with. Bad data means bad results. This has been true of all kinds of data-oriented projects for decades, and now it's a big problem for AI projects as well, which are in essence big data projects.
Data quality is more than just ‘bad data’
Data is the core of artificial intelligence. What drives AI and machine learning projects is not software code, but the data from which the learning must be derived. All too often, organizations move too quickly on their AI projects, only to realize later that poor data quality is causing their AI systems to fail. If your data isn't in good shape, don't be surprised when your AI projects stumble.
There is more to data quality than just "bad data" such as incorrect labels, missing or erroneous data points, garbled records, or low-quality images. Major data quality issues also arise when data sets are merged or combined, and when data is captured and enriched with third-party datasets. Each of these processes, and more, introduces its own potential sources of data quality problems.
So how do you gauge the quality of your data before you start your AI project? It is important to assess the state of your data up front rather than move forward only to realize in hindsight that you don't have the good data your project needs. Teams need to understand their data sources, such as streaming data, customer data, or third-party data, and then how to successfully aggregate and integrate data from these different sources. Unfortunately, most data does not arrive in a clean, usable state. You need to remove extraneous, incomplete, duplicate, or otherwise unusable data, and you will also need to filter the data to help reduce bias.
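As a minimal sketch, basic cleanup like this can be expressed in a few lines of code. The field names, records, and validity rules below are hypothetical, purely for illustration:

```python
# Sketch of basic data cleaning: drop duplicate, incomplete,
# and implausible records. Fields and rules are hypothetical.

raw_records = [
    {"id": 1, "age": 34, "label": "churn"},
    {"id": 1, "age": 34, "label": "churn"},   # exact duplicate
    {"id": 2, "age": None, "label": "stay"},  # missing value
    {"id": 3, "age": -5, "label": "stay"},    # implausible value
    {"id": 4, "age": 52, "label": "stay"},
]

def clean(records):
    seen = set()
    cleaned = []
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:
            continue  # drop exact duplicates
        if any(v is None for v in r.values()):
            continue  # drop incomplete rows
        if not 0 <= r["age"] <= 120:
            continue  # drop out-of-range values
        seen.add(key)
        cleaned.append(r)
    return cleaned

print(clean(raw_records))  # keeps only records 1 and 4
```

Real pipelines of course involve far more rules and scale, but the point stands: each rule you skip here is an error your model will learn from later.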
But we are not finished yet. You also need to think about how to transform the data to meet your specific requirements. How will you handle data cleansing, data transformation, and data manipulation? Not all data is created equal, and over time you will face data decay and data drift.
Have you thought about how to monitor and evaluate this data to ensure that its quality stays at the level you need? If you need labeled data, how will you get it labeled? There are also data augmentation steps to consider: if you need to augment your data set with additional data, how will you source and monitor it? Yes, there are a lot of steps involved in assuring data quality, and these are all aspects you need to think through for your project to succeed.
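A basic drift monitor can be sketched by comparing incoming data against a training-time baseline. The values and the two-standard-deviation threshold below are arbitrary illustrations, not a recommended production setting:

```python
import statistics

# Sketch of data-drift monitoring: flag an incoming batch whose
# mean has moved more than n_std baseline standard deviations.
# All numbers here are hypothetical.

baseline = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]  # training-time values
incoming = [13.0, 12.7, 13.4, 12.9, 13.1]      # new production values

def drifted(baseline, incoming, n_std=2.0):
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(incoming) - mu) > n_std * sigma

print(drifted(baseline, incoming))  # True: the mean has shifted
```

Production systems typically use richer distribution tests per feature, but even a crude check like this catches the silent shifts that otherwise surface only as degraded model accuracy.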
Data labeling in particular is a common area where many teams get stuck. For supervised learning approaches to work, they must be fed good, clean, well-labeled data so that they can learn from examples. If you are trying to identify images of boats in the ocean, you need to feed the system good, clean, well-labeled images of boats to train your model. That way, when you feed it an image it has never seen before, it can tell you with a high degree of certainty whether the image contains a boat. If you only train your system on boats in the ocean on sunny days with no cloud cover, how is the AI system expected to react when it sees a boat at night, or a boat under 50% cloud cover? If your training data doesn't match real-world data and real-world scenarios, you'll run into problems.
Even when teams spend a lot of time making sure their training data is pristine, that data often does not reflect what the model will see in the real world. For example, AI industry leader Andrew Ng has discussed how the quality of data in the test environment of his project with Stanford Health did not match the quality of real-world medical images, leaving his AI models nearly useless outside the test environment. This stalled the entire project, putting millions of dollars and years of investment at risk.
Planning for project success
All this data quality-centric activity can seem overwhelming, which is why these steps are often skipped. But as noted above, it is bad data that kills AI projects, so the lack of attention to these steps is a major reason AI projects fail in general. This is why organizations are increasingly embracing best-practice approaches such as CRISP-DM, Agile, and CPMAI to make sure they don't miss or skip the critical data quality steps that help avoid AI project failure.
Teams moving forward without planning for project success is all too common. Notably, the second and third phases of both the CRISP-DM and CPMAI methodologies are "data understanding" and "data preparation." These phases come well before any model is built, and they are considered best practice for organizations looking to succeed with AI.
Indeed, if the Stanford Health project had adopted CPMAI or a similar approach, the team would have realized well before the multimillion-dollar, multi-year mark that data quality issues would sink the project. While it may be comforting to know that even luminaries like Andrew Ng and companies like Google make big data quality mistakes, you still don't want to join that club unnecessarily and let data quality issues plague your AI projects.