Dealing with "Very Large" Datasets - An Overview of a Promising Research Line: Distributed Learning

D. Peteiro-Barral, B. Guijarro-Berdiñas, Beatriz Pérez-Sánchez
DOI: https://doi.org/10.5220/0003288804760481
Abstract: Traditionally, a bottleneck preventing the development of more intelligent systems was the limited amount of data available. Nowadays, however, in many domains of machine learning the datasets are so large that the limiting factor is the inability of learning algorithms to learn from all the data in a reasonable time. To handle this problem, a new field in machine learning has emerged: large-scale learning, where learning is limited by computational resources rather than by the availability of data. Moreover, in many real applications, “very large” datasets are naturally distributed, and it is necessary to learn locally at each of the workstations where the data are generated. However, the great majority of well-known learning algorithms do not provide an admissible solution to both problems: learning from “very large” datasets and learning from distributed data. In this context, distributed learning seems to be a promising line of research for dealing with both situations, since “very large” concentrated datasets can also be partitioned among several workstations. This paper provides some background on distributed environments as well as an overview of distributed learning for dealing with “very large” datasets.