Let's start right from the beginning with a disclaimer: artificial intelligence still doesn't know how to figure out data quality deficiencies on its own. It can't even search for them very well by itself. You need to show it some love for it to manage. But even so, machine learning as it is today can offer plenty of benefits for the task at hand.
Data profiling is one of my all-time favorite data development tools. A few years ago, I got to know the Pandas Profiling Python library, which does much of the work I previously had to do manually, mainly with SQL and Python. Data profiling can catch a wide variety of problems, but when the cause-and-effect relationship is not simple, it is of little use for deeper investigation. The cause of a problem often has to be dug up more or less manually after profiling. So, I set out to investigate whether ML, and the technologies used to develop it, could somehow help me find the causes of quality problems.
ML visualizations to help find the culprit
The first way to dig a bit deeper into a data set's quality is to make use of different graphical representations of the data. These visualizations are more commonly used in machine learning development, during the data set preparation phase. A single attribute can be evaluated with dot, box, and scatter plots, among others. These graphs add depth to data profiling, especially when trying to identify things in the data that are not quite clear. Just like in ML development. 😊
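As a minimal sketch of these plot types, assuming a pandas DataFrame with hypothetical columns "category" and "value", box and scatter plots for a single attribute could look like this:

```python
# A minimal sketch: box and scatter plots for one attribute.
# The column names "category" and "value" are illustrative, not from the post.
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C", "C"],
    "value": [1.0, 1.2, 5.0, 4.8, 2.1, 9.9],  # 9.9 stands out as an outlier
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
df.boxplot(column="value", by="category", ax=ax1)  # distribution per category
ax2.scatter(range(len(df)), df["value"])           # raw values in row order
ax2.set_xlabel("row index")
ax2.set_ylabel("value")
fig.savefig("profile_plots.png")
```

In a real investigation you would of course plot the actual profiled columns rather than toy data.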
How, then, could these same methods be harnessed to find the causes of an error discovered in the profiling results? The first step, as in ML development, is to transform and prepare the data so that it can be plotted. Dates can be converted to integers, and numeric codes can be created for categorical data stored as text. To visualize a specific error, the incorrect rows should also be flagged: for example, the value 1 for a row where the error occurs and 0 for a row that is OK.
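These preparation steps can be sketched in pandas roughly as follows (the column names and the chosen "error" are hypothetical examples, not from the actual data set):

```python
# Sketch of preparing data for error visualization.
# Columns "updated" and "state" are illustrative placeholders.
import pandas as pd

df = pd.DataFrame({
    "updated": ["2021-03-01", "2021-03-02", None, "2021-03-04"],
    "state": ["NY", "CA", "CA", "TX"],
})

# Dates to integers: days since the Unix epoch, -1 for missing values
days = (pd.to_datetime(df["updated"]) - pd.Timestamp("1970-01-01")).dt.days
df["updated_int"] = days.fillna(-1).astype(int)

# Categorical text to numeric codes
df["state_code"] = df["state"].astype("category").cat.codes

# Flag the error under study: 1 for an incorrect row, 0 for an OK row.
# Here the "error" is simply a missing update timestamp.
df["is_error"] = df["updated"].isna().astype(int)
print(df[["updated_int", "state_code", "is_error"]])
```

With the data in this fully numeric form, any column can be plotted against the error flag, for example by coloring scatter points by `is_error`.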
A good practice is to check for only one type of error at a time. Otherwise, the cause-and-effect relationships can be challenging to find, as there may be more than one possible reason for the errors.
Even from this quickly built visualization, strong cause-and-effect relationships could be spotted at a glance. In my example data set, however, there were so few incorrect rows that the reasons are difficult to find. Still, some correlations were noticeable, and in many real data sets they can often be found easily. For example, the values produced by faulty devices could appear in the first graph type, grouped under certain categories. Similarly, errors formed at a certain time of day might be visible in the middle one, concentrated in a certain part of the distribution.
About the data used
Hidden features of AutoML
The basic visualizations used in data science already provide a lot of additional information about the data set being processed, but this can be taken a step further. And above all, it can be done with reasonably little effort. Databricks introduced its AutoML functionality a few years ago. AutoML, as the name suggests, automates ML development: it preprocesses values, selects variables, and tests different models against the given problem. AutoML may not quite perform at the level of good AI developers, but it performs significantly better than an AI beginner like me. At least when given the same amount of time, say 15 minutes.
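The core of what AutoML automates can be pictured with a hand-rolled sketch: try several model types on the same problem and keep the best one. This is illustrative scikit-learn code on synthetic data, not the Databricks AutoML API:

```python
# A rough, hand-rolled stand-in for AutoML's model search:
# fit several candidate models and keep the one with the best score.
# (Sketch only; Databricks AutoML does far more preprocessing and tuning.)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}

scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The real service also generates editable notebooks for each trial, which is what makes the results inspectable afterwards.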
But how can the results of AutoML help in solving a quality problem in the data set? The forecast model itself is not useful here, because we already know how to identify the incorrect data. However, conclusions can be drawn from the final results produced while the model was being built.
Behind the link, you can find the unedited ROC curve and confusion matrix. In practice, these graphs present the results of the previous table in graphical form. In my example (Databricks: samples data.gov/farmers_markets_geographic_data), I tried to find "reasons" why the updatedTime column had incorrect values. There are probably no real reasons to be found in this data, but the results showed that the missing values correlated strongly with the values of other fields. The ROC AUC of the best model was 0.982, which means the errors are highly predictable from the remaining fields.
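For readers less familiar with these metrics, here is how ROC AUC and a confusion matrix are computed for a model that predicts which rows are erroneous. The labels and scores below are made up for illustration:

```python
# Sketch: the metrics AutoML reports, computed on a toy error-prediction task.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [0, 0, 0, 1, 1, 0, 1, 0]                    # 1 = row flagged as incorrect
y_score = [0.1, 0.2, 0.3, 0.9, 0.8, 0.4, 0.7, 0.2]   # model's error probability

auc = roc_auc_score(y_true, y_score)  # 1.0 here: errors fully separable
cm = confusion_matrix(y_true, [int(s >= 0.5) for s in y_score])
print(auc)
print(cm)  # rows: actual 0/1, columns: predicted 0/1
```

An AUC near 1.0, like the 0.982 above, says the other columns carry almost all the information needed to tell good rows from bad ones, which is exactly the hint a data quality investigator is after.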
AutoML and SHAP values
It is also worth mentioning that the example data initially had a total of 59 variables. By this point, the model had eliminated about two-thirds of them.
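This kind of variable elimination can be sketched with scikit-learn's built-in feature importances: keep only the columns whose importance clears a threshold. This is a lightweight stand-in for illustration, not the AutoML or SHAP implementation:

```python
# Sketch of variable elimination: drop features whose importance is
# below the mean. (Illustrative only; AutoML and SHAP work differently.)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# 59 features, of which only a handful are actually informative,
# mirroring the example data set's column count
X, y = make_classification(n_samples=400, n_features=59, n_informative=5,
                           random_state=0)

selector = SelectFromModel(RandomForestClassifier(random_state=0),
                           threshold="mean").fit(X, y)
kept = int(selector.get_support().sum())
print(f"kept {kept} of 59 features")
```

SHAP values go further than plain importances: they attribute each individual prediction to specific feature values, which is what makes them useful for pointing at the likely cause of a single bad row.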
Fast results and missed opportunities
If such analyses come up in the work of a data engineer or a similar role, they are worth sharing with others: business users, analysts, and ML developers may all benefit from them. Personally, I would like to see profiling carried out for each integrated data set and, above all, the results of the developer's own analysis presented alongside it. As a way of working, in my opinion, this would considerably increase the quality of many implementations by eliminating unnecessary investigation work. In a suitable scope, and by utilizing automation, data profiling makes it possible to reduce the total workload of data development. And with ML techniques, even this work phase can be deepened and accelerated further.