Data Labeling: An Empirical Investigation into Industrial Challenges and Mitigation Strategies

Teodor Fredriksson, David Issa Mattos, Jan Bosch, Helena Holmström Olsson
DOI: https://doi.org/10.1007/978-3-030-64148-1_13
2020-01-01
Abstract: Labeling is a cornerstone of supervised machine learning. In industrial applications, however, data is often unlabeled, which complicates its use for machine learning. Although there are well-established labeling techniques such as crowdsourcing, active learning, and semi-supervised learning, these still do not provide accurate and reliable labels for every industrial machine learning use case. As a result, industry still relies heavily on manual annotation and labeling of its data. This study investigates the challenges that companies experience when annotating and labeling their data. We performed a case study using semi-structured interviews with data scientists at two companies to explore the problems they face when labeling and annotating their data. This paper provides two contributions: we identify industry challenges in the labeling process, and we propose mitigation strategies for these challenges.