Data-Centric AI

You might have heard of the amazing breakthroughs in AI over the past two or three years, such as DALL-E 2 and GPT-3 by OpenAI and AlphaFold by DeepMind. This is great progress by the AI Community at large and is, quite frankly, very exciting!

However, there has been another revolution taking place in the AI Community recently, and that is the movement toward Data-Centric AI, led by LandingAI and Andrew Ng.

What is Data-Centric AI and why does it matter?

Traditionally, we've approached Machine Learning problems by collecting data, preprocessing it, and then doubling down on the Machine Learning Model to improve accuracy (as well as other Key Performance Indicators). In terms of the equation below, we focused on Code (Models).

AI = Code + Data

With the Data-Centric approach, we invest in improving the data as much as we invest in improving our models. In terms of the equation above, we focus more on Data.

Tips for Data-Centric AI Development

Use multiple labelers to spot inconsistencies. For example, in computer vision problems, both the size and the number of bounding boxes matter. Consensus labeling techniques help surface any inconsistencies that arise.
Repeatedly clarify labeling instructions: track down ambiguous examples, decide how they should be labeled, and document that decision in your labeling instructions.
Toss out bad examples. More data is not always better, especially with smaller datasets, where a few bad apples can strongly corrupt your models.
Use error analysis tools to find the subset of the data that most needs improvement.
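As a rough illustration of the last tip, here is a minimal sketch of error analysis by data slice. The slice names, labels, and the helper functions accuracy_by_slice and worst_slice are all hypothetical, made up for this example; a real project would plug in its own evaluation results and probably a dedicated tool.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Compute accuracy per slice.

    records: iterable of (slice_name, true_label, predicted_label) tuples.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        totals[slice_name] += 1
        if y_true == y_pred:
            correct[slice_name] += 1
    return {name: correct[name] / totals[name] for name in totals}


def worst_slice(records):
    """Return the slice with the lowest accuracy: a candidate to improve first."""
    scores = accuracy_by_slice(records)
    return min(scores, key=scores.get)


# Hypothetical evaluation results, tagged by lighting condition (assumed data).
records = [
    ("daylight", "cat", "cat"),
    ("daylight", "dog", "dog"),
    ("daylight", "cat", "cat"),
    ("low_light", "cat", "dog"),
    ("low_light", "dog", "dog"),
    ("low_light", "dog", "cat"),
]

scores = accuracy_by_slice(records)
print(scores)
print("Focus on:", worst_slice(records))
```

In this toy data the model does worse on the low_light slice, which suggests reviewing labels or gathering more examples for that slice rather than changing the model.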

We can improve data via:

Using multiple labelers to measure consistency
Improving label definitions and relabeling more consistently
Tossing out noisy examples, or improving the quality of input data (X)
Getting more data, either through collection or data augmentation
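The first and third items can be sketched in a few lines: measure how much the labelers agree on each example, keep the majority label where agreement is high, and send the rest back for review or removal. This is only an illustration under assumptions: the example IDs, the labels, the 0.66 threshold, and the function names consensus and split_by_agreement are hypothetical, and simple majority voting is just one of several consensus techniques.

```python
from collections import Counter


def consensus(labels):
    """Return (majority_label, agreement) for one example's labels.

    agreement is the fraction of labelers who chose the majority label.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


def split_by_agreement(annotations, min_agreement=0.66):
    """Split examples into confidently labeled ones and ones needing review.

    annotations: dict mapping example id -> list of labels from different labelers.
    min_agreement: assumed threshold; tune it for your own dataset.
    """
    keep, review = {}, []
    for example_id, labels in annotations.items():
        label, agreement = consensus(labels)
        if agreement >= min_agreement:
            keep[example_id] = label
        else:
            review.append(example_id)
    return keep, review


# Hypothetical labels from three labelers per image (assumed data).
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["cat", "dog", "bird"],
}

keep, review = split_by_agreement(annotations)
print("Keep:", keep)
print("Review or toss:", review)
```

Examples that land in the review list are good candidates for clarifying the labeling instructions before relabeling.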

It is important to note that improving the data is not a preprocessing step that you do once. It is part of the iterative process of model development, and it continues through deployment, monitoring, and maintenance.

Why is the Data-Centric approach important?

The model-centric approach was important in the early days of AI, as we had to make models that could perform and thus be applied to various problems. We have come a long way with the model-centric approach and we now have more than capable models.

Despite the vast amount of data that we have today, certain fields and industries still suffer from a lack of data. This is mostly due to legislation and stringent laws on certain data, but it can also result from the absence of digital records, especially in African countries, where access to digital sensors and other digital data collection techniques wasn't available for a long time. This makes data precious, and the data-centric approach suits this situation well.

Andrew Ng predicts that Data-Centric AI will be the next great AI revolution after the shift from classical machine learning models to deep learning models.

Lenny Ng'ang'a