Paper Summary: “Everyone wants to do the model work, not the data work”
High-stakes domains like health and wildlife conservation demand high-quality data. The current emphasis on model accuracy, while neglecting domain expertise and data quality, is causing data cascades.
Data science has been considered the next big thing in computer science, and AI is touted as the most important driver of business and social growth in the 21st century. However, recent trends in AI show a heavy emphasis on modelling and improving accuracy while neglecting data engineering. Likewise, domain expertise is not considered as important as modelling and accuracy are.
The paper, titled “‘Everyone wants to do the model work, not the data work’: Data Cascades in High-Stakes AI”, was published by Google researchers Nithya Sambasivan, Shivani Kapania, et al. It compiles findings from semi-structured interviews with 53 AI practitioners based in India, East and West African countries, and the USA. This post summarizes a few of the key points discussed in the paper.
Data Cascades and Their Properties
Data cascades have been defined by the authors as:
“Compounding events causing negative, downstream effects from data issues, that result in technical debt over time”
The discussion around technical debt has been gathering pace in recent times, shifting attention to the scalability and maintainability of models rather than only to modelling and improving accuracy. In the paper “Hidden Technical Debt in Machine Learning Systems”, D. Sculley et al. argue that while developing and deploying ML systems is fast and cheap, maintaining them is expensive. This incurs “technical debt”, the long-term cost of moving too fast in software engineering.
The properties of Data Cascades include:
They are opaque in diagnosis and manifestation
Their impact is amplified when they occur in high-stakes domains
They can negatively impact the intended outcomes
They can be avoided if proper checks and balances are put in place
Shifting focus away from data issues in high-stakes domains is particularly risky because it directly impacts human lives. If data cascades are not accounted for, the result is expensive iterations, difficult maintenance, loss of human life and, if things don’t work out as intended, the project being discarded altogether. Therefore, it is essential to estimate and evaluate the quality of the data being used. While there are metrics like the F1 score and mean absolute error (MAE) to measure the performance of models, there are currently no standard tools to measure the quality of data. The next step forward is “from goodness-of-fit to goodness-of-data”.
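To make that gap concrete, here is a minimal sketch (my illustration, not from the paper) contrasting standard goodness-of-fit metrics with an ad-hoc goodness-of-data check. The `basic_data_quality_report` helper and the fields it reports are assumptions made for the example.

```python
# Illustrative sketch (not from the paper): model metrics are standardized,
# while "goodness-of-data" checks are usually hand-rolled and ad hoc.
import pandas as pd
from sklearn.metrics import f1_score, mean_absolute_error

# Goodness-of-fit: well-established, one-line metrics.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print("F1:", f1_score(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))

# Goodness-of-data: no standard equivalent; a hypothetical helper that
# reports a few basic quality signals for a labelled DataFrame.
def basic_data_quality_report(df: pd.DataFrame, label_col: str) -> dict:
    return {
        "rows": len(df),
        "missing_ratio": df.isna().mean().mean(),    # overall share of missing cells
        "duplicate_ratio": df.duplicated().mean(),   # share of exact duplicate rows
        "label_balance": df[label_col].value_counts(normalize=True).to_dict(),
    }

df = pd.DataFrame({"feature": [1.0, 2.0, None, 2.0], "label": [0, 1, 1, 1]})
print(basic_data_quality_report(df, "label"))
```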
Machine Learning in production
The paper mentions that most of the issues with machine learning in production revolve around data acquisition and management. Practitioners are often trained on what the authors call “toy datasets” obtained from Kaggle or UCI’s open data repositories. On top of that, most computer scientists are unaware of the context of the problem statement, which hinders the machine learning system from getting into production.
Furthermore, there is often a skew between training and serving data, which degrades the performance of machine learning models once they are exposed to real-world data. To avoid this, the quality of the data being used must be high, and the training dataset should be as close to real-world scenarios as possible.
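As an illustration of how such skew might be caught early (my sketch, not a method from the paper), one could compare feature distributions between training and serving data, for example with a two-sample Kolmogorov–Smirnov test. The feature name, the synthetic data and the 0.05 threshold below are assumptions for the example.

```python
# Illustrative sketch (not from the paper): flag possible training/serving
# skew by comparing feature distributions with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = {"temperature": rng.normal(25, 3, 1000)}    # training data
serving = {"temperature": rng.normal(31, 3, 1000)}  # shifted live data

for feature in train:
    result = ks_2samp(train[feature], serving[feature])
    if result.pvalue < 0.05:  # threshold chosen for illustration only
        print(f"Possible skew in '{feature}': "
              f"KS={result.statistic:.3f}, p={result.pvalue:.4f}")
```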
Why domain expertise with AI is what we should be talking about!
Until now, the focus has been on designing and implementing AI systems, and computer scientists have been doing this all by themselves. But lately, there has been a push for domain expertise with AI (AI+X), which means combining AI with expertise in a particular domain.
In the paper, the authors write that interviewees revealed domain experts were involved only in data collection and troubleshooting, instead of being engaged in the entire end-to-end process. The authors explain this with the example of a model that aims to curb poaching:
“As patrollers were already resource-constrained, the mispredictions of the model ran the risk of leading to over patrolling in specific areas, leading to poaching at other places.”
Only when the team collaborated with the patrollers did they realize that most of the poaching areas were not included in the dataset! The problem is further amplified in medicine, where a computer scientist may not understand human anatomy and mistakes can cause a huge loss of human life.
So, AI+X is the future!
Incentivize Data Collection
While data scientists earn huge salaries, the data collectors working on the ground are rarely paid anything extra to collect good-quality data. Take India, for example: ASHA workers, the frontline health workers, are the most important link in collecting quality health data. Yet they are heavily underpaid and rarely trained in how to collect data.
In other fields too, the people on the ground face information asymmetry coupled with a lack of data literacy, which keeps them from realizing the importance of the data they are collecting. When they are not properly incentivized and educated, they tend to introduce a lot of noise, even fabricating data for the sake of completing the task.
Organizations must focus on proper data-collection methods with proportionate incentives, invest in data documentation and dataset repairs to establish a sound ML pipeline, and partner with subject-matter experts throughout the project life cycle.
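As a rough illustration of what lightweight data documentation could look like (my sketch, not the paper’s proposal), here is a minimal “data card” structure; the field names and example values are hypothetical.

```python
# Illustrative sketch (not from the paper): a minimal "data card" recording
# how a dataset was collected. Real datasheets are far more detailed.
from dataclasses import dataclass, field

@dataclass
class DataCard:
    name: str
    collected_by: str              # who collected it, and how they were incentivized
    collection_method: str         # instrument, protocol, sampling strategy
    known_gaps: list[str] = field(default_factory=list)  # known blind spots
    last_repaired: str = "never"   # when the dataset was last audited or repaired

# Hypothetical example inspired by the poaching case above.
poaching_data = DataCard(
    name="patrol-observations-v2",
    collected_by="field patrollers",
    collection_method="GPS-tagged patrol logs",
    known_gaps=["remote areas outside regular patrol routes"],
)
print(poaching_data)
```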
Conclusion
Access to resources is unequal in today’s world, and this will translate into data inequities in the future, placing the poor and vulnerable at a disadvantage. To offset this, data cascades need to be avoided by thinking more about data quality and involving subject-matter experts. In high-stakes domains, it is imperative to lay down rules on data quality and establish proper feedback loops across the entire AI data lifecycle.
I believe this is the reason companies think twice before investing time and money in adopting AI, at least in non-tech sectors like customer support.