Have you ever come across a scenario where a team of data scientists in a large organization is working rigorously to build an AI model that solves a business problem? What follows is a series of back-and-forth cycles in which the team accesses data, analyzes it, identifies data quality issues, cleans it, and builds the features the model requires. Yet despite all the investment of resources, time, and money, the model remains inaccurate. Sound familiar?
An analysis of this problem revealed a glaring gap: DATA QUALITY! The team believes the entire modelling process would have been more impactful, with a faster turnaround time, had a comprehensive data quality report been shared with them. Many data scientists have faced a quandary like this one, yet what is still lacking is the acceptance that data has always been an underemphasised factor in AI work.
The Significance of Data Quality
Going by our opening example, if a system is fed poor-quality or inaccurate data, the real-time decisions made on the basis of that data can have grave consequences. One reason data issues are discovered only through trial and error is the paucity of automated tools and methods that let AI developers evaluate data quality, maintain a log of all the changes applied to the data, and write programs to fix every issue found. The challenge continues to be effectively examining multiple data sources, analyzing the relevant data, and then transforming it into the form the model requires.
In today’s age of Artificial Intelligence and automated decisions, data quality is critical and a prerequisite to building a strong automated system. Data needs to have integrity, accuracy, validity, and consistency, and systems today must be aware of the issues they could run into when strong data is lacking. Data scientists also need a systematic way to study and improve data quality before the data moves to the modelling stage.
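To make the idea of a data quality report concrete, here is a minimal sketch in Python of what such a report could check before data reaches the modelling stage. It is an illustration only, not a description of any particular toolkit: the function name quality_report, the sample records, and the age rule are all assumptions made up for this example.

```python
from collections import Counter


def quality_report(rows, required, validators):
    """Summarise completeness, validity, and duplicates for a list of dict records.

    Hypothetical helper for illustration; not part of any real library.
    """
    report = {"rows": len(rows), "missing": {}, "invalid": {}, "duplicate_rows": 0}

    # Completeness: count records where a required field is absent or empty.
    for field in required:
        report["missing"][field] = sum(
            1 for row in rows if row.get(field) in (None, "")
        )

    # Validity: count values that are present but fail the field's rule.
    for field, is_valid in validators.items():
        report["invalid"][field] = sum(
            1
            for row in rows
            if row.get(field) not in (None, "") and not is_valid(row[field])
        )

    # Consistency: count records that exactly repeat an earlier record.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    report["duplicate_rows"] = sum(count - 1 for count in seen.values())
    return report


# Made-up sample records, used only to exercise the checks above.
records = [
    {"id": 1, "age": 34, "country": "IN"},
    {"id": 2, "age": -5, "country": "IN"},
    {"id": 3, "age": None, "country": ""},
    {"id": 1, "age": 34, "country": "IN"},
]

report = quality_report(
    records,
    required=["id", "age", "country"],
    validators={"age": lambda value: 0 <= value <= 120},  # assumed plausibility rule
)
print(report)
```

A real report would cover far more, such as cross-source consistency and drift over time, but even counts like these give a team an early signal before any model is trained.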
The Problem of Data Bias
Data bias is an error in machine learning in which certain data is weighted or represented more heavily than the rest, so that the data represents the problem poorly and the model produces skewed outcomes, errors, or low accuracy. This also implies that data is the oxygen the model needs to do its job accurately. Data bias can be of various kinds, some of which include:
- Sample bias: occurs when the data does not reflect the actual environment in which the model will run
- Measurement bias: occurs when the data collected for training differs from the data seen in the real world
- Association bias: usually occurs when data for a machine learning model reinforces cultural bias
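Sample bias, the first item above, is one of the easier kinds to check for automatically. The sketch below, in Python, compares how often each group appears in a training sample with its share in the environment where the model will run. The function name representation_gaps, the 10-point tolerance, and the urban and rural figures are assumptions chosen purely for illustration.

```python
from collections import Counter


def representation_gaps(sample, reference_shares, tolerance=0.10):
    """Return groups whose share of the sample is off from the reference by more than tolerance.

    Hypothetical helper for illustration; the tolerance is an arbitrary assumption.
    """
    total = len(sample)
    if total == 0:
        raise ValueError("sample must not be empty")

    counts = Counter(sample)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        # Positive gap: over-represented in the sample; negative: under-represented.
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 2)
    return gaps


# Made-up figures: the training data is 80% urban, the deployment population 55%.
training_regions = ["urban"] * 80 + ["rural"] * 20
deployment_shares = {"urban": 0.55, "rural": 0.45}

gaps = representation_gaps(training_regions, deployment_shares)
print(gaps)
```

A check like this only flags a mismatch in representation. Measurement and association bias are harder to detect and usually need domain knowledge on top of any automated test.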
Data scientists currently try to resolve these issues through manual analysis, which remains an extremely time-consuming and challenging process and delays the development of AI models. This calls for automation: algorithms that assess data across different modalities, recommend ways to improve data quality, and auto-generate the code to carry out those recommendations.
Making Data a First-Class Citizen in the AI Journey
It is well established that data is the backbone of an AI model and critical to its success. It is now time for industry and academia to raise the bar, elevate data quality to a first-class citizen, and build an ecosystem that accepts it as one.
- B2B: Commercial products in the market need to include data quality metrics for AI so that their clients can effectively use these methods to improve their data
- Organisations and researchers should make their APIs and toolkits available to general developers and student communities, so that they can experience data issues first-hand and understand how to resolve them
- Most importantly, it is essential to engage academia to include topics such as data quality, data preparation, and the data lifecycle in the core curriculum of its AI courses, to train and develop the right talent
AI has its own lifecycle, and sometimes AI fails. However, many issues and failures can be avoided if AI is built on a strong foundation of data quality. It is time for CIOs to revive the conversation, bring data quality to the table, and make the case for strong data to avoid shortfalls in their AI journey.
Views are personal. The author is an IBM Distinguished Engineer and Lead, Data and AI Platforms, IBM Research India.
Copyright©2022 Living Media India Limited. For reprint rights: Syndications Today