“It’s the dataset, stupid!” – What makes a good data set

Autonomous vehicles that stop in the sleet. Voice-controlled smart home assistants that only understand male voices. Despite a surge in innovation and refined technology, machine learning (ML) often faces unexpected challenges. Much has been written and said about the damage bad data sets can do in machine learning. But what actually makes a good data set?


Daniel Kondermann asked this pragmatic question over ten years ago. A computer vision researcher at Heidelberg University, he has specialized in the quality assurance of data sets and has developed a system that generates high-quality data sets. In an interview with the author, he also explains how this system could contribute to an ethically responsible, safer and more transparent application of AI and ML systems. To understand the relevance of this question, it is first necessary to take a closer look at what is currently inhibiting machine learning.

[Image: Dr. Daniel Kondermann, founder and CEO of Quality Match]

Daniel Kondermann has been researching the question of what makes a good data set in the field of computer vision since 2009. In 2016 he completed his habilitation in the field. During this time, he and various teams have published numerous papers on good data sets.

As part of his first start-up, Pallas Ludens GmbH, he also made a significant contribution to the “Cityscapes” data set and to an expansion of the “KITTI” data set of the Karlsruhe Institute of Technology (KIT): the KITTI Benchmark Suite, Semantic Segmentation Evaluation.

Since 2019, Kondermann and his team at the start-up Quality Match have been helping companies ask the right questions in order to find good data set examples. The aim is to filter out errors, inconsistencies and ambiguities to make the data set as representative, accurate and difficult as possible – RAD. His motto is, “If data is the new oil, Quality Match is the refinery.”

The article explains what the RAD method (Representativeness, Accuracy, Difficulty) is all about.

With the development of neural networks, the Gordian knot in the field of artificial intelligence (AI) was cut. Suddenly everything seemed possible: A new global industry has emerged, and well-known companies are applying AI in increasingly innovative machine learning projects, from self-driving vehicles to smart home assistants, from spam filters to translation software. With the help of machine learning, stock market charts can be analyzed and heart rhythms checked for irregularities.

And yet the machine learning market is stagnating. According to the IT market research company Gartner, only 53 percent of all AI prototypes make it into production, and that’s probably a flattering assessment: According to the VentureBeat IT portal, the vast majority of data science projects do not reach production maturity.

Why is that? Why does Waymo’s self-driving taxi stop at a few construction site cones, perplexed? Why can’t a self-driving vehicle cope with rain, snow or sleet? After a lot of thought and energy has been put into code and models, the focus is now on the data set.

At a workshop in 2013, Daniel Kondermann was still an outsider with his opinion that one should also pay attention to good training data sets. Why bother, if you could just as easily optimize the machine learning methods? At the time of this workshop, all methods of data generation for computer vision were still in their infancy, whether special measuring technology, computer graphics simulation or annotation. The annotation of texts, among other things, had already arrived in the industry, as the example of the language learning app Duolingo shows.

The ML methods are now precise, but equally good results are still a long time coming. Only now is it becoming clear in the industry that you have to think about good data sets. Companies are increasingly developing systems that improve data sets incrementally, for example with the help of automatic and manual quality assurance steps.
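How an automatic step and a manual step can work together is easiest to see in a small example. The following Python sketch is purely illustrative and is not the system Kondermann or any particular company uses. It assumes that each image has been labeled by several annotators; the function name `triage`, the threshold `min_agreement` and the sample data are hypothetical. Labels on which the annotators largely agree are accepted automatically, while ambiguous items are queued for a human reviewer.

```python
from collections import Counter


def triage(labels_by_item, min_agreement=0.8):
    """Split items into auto-accepted labels and a manual review queue.

    labels_by_item maps an item id to the list of labels that
    different annotators assigned to it (hypothetical data layout).
    """
    accepted, review_queue = {}, []
    for item_id, labels in labels_by_item.items():
        if not labels:
            # Nothing to vote on: a human has to look at this item.
            review_queue.append(item_id)
            continue
        # Automatic step: take the majority label and measure agreement.
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[item_id] = label
        else:
            # Manual step: ambiguous items go to a reviewer.
            review_queue.append(item_id)
    return accepted, review_queue


# Made-up annotations for three images.
annotations = {
    "img_001": ["car", "car", "car"],
    "img_002": ["car", "truck", "van"],
    "img_003": ["pedestrian", "pedestrian", "cyclist"],
}

accepted, review_queue = triage(annotations)
print("accepted:", accepted)
print("needs review:", review_queue)
```

Once the queued items have been reviewed and corrected, they can be fed back into the data set and the check repeated, which is what makes the improvement incremental.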

To make this clear, it is first necessary to understand exactly how machine learning works.