One of the main challenges facing our society in the 21st century is the enormous capacity to generate and store data, linked to the economy and the digital society. This is known as Big Data. In recent years, however, the popularity of this term seems to be declining, while a related term, "Data Science", stands out as a generic way of referring to an emerging scientific discipline, placed between mathematics, computer science and statistics, which aims to extract important information from data. Many statisticians recognize our own discipline, Statistics, in that definition. However, the term Data Science is associated with concepts and techniques that have been developed outside the field of statistics: Big Data, machine learning, predictive algorithms and complex data, among others. From the point of view of statistics as a scientific discipline, the emergence of the concept of data science is both a challenge and an opportunity that the community of statisticians must not let pass. We must make clear our centuries-old experience in the treatment and analysis of data sets of very different types and sizes, which legitimately allows us to participate in the new discipline. But it is also imperative for statisticians to acquire the skills needed to efficiently solve the problems that data analysts currently face: large volumes of data, distributed databases (Hadoop file systems, for example) requiring parallel computation (the Map-Reduce programming model, for instance), streaming data, new types of problems and complex data (images, graphs, ...), to name a few. The ultimate goal of our project is to advance the convergence between Statistics and Data Science.
We are aware that the path is not easy: we cannot assume that the techniques we have developed so far and applied to standard data sets (in the form of a data matrix, with sample sizes not exceeding tens of thousands of individuals, which can be loaded into the memory of a personal computer) are applicable to more complex data sets, or to millions of individuals distributed across multiple computers and impossible to load into memory. We know that we will have to use computing techniques different from the ones we currently use, in which parallel computing is key. In the medium and long term we set ourselves two main objectives. On the one hand, the development, under the direction of P. Delicado and Professor Eddie Schrevens, of the doctoral thesis of F. de Mendiburu (just begun), which will analyze data from images captured by drones. The proposed methodology is Functional Data Analysis. On the other hand, to explore research topics at the intersection of Statistics and Data Science, and specifically to study the implementation of Multidimensional Scaling in Spark, following the Map-Reduce programming model. The GorgoBase database, which contains about 3 million chess games, will be used as a test set.
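As a rough illustration of what a Map-Reduce formulation of multidimensional scaling might look like, the sketch below runs classical (metric) scaling on a toy data set in plain Python. It is not the authors' implementation and does not use Spark: the row blocks stand in for distributed partitions, and `map` and `functools.reduce` stand in for the corresponding distributed operations. All function names are hypothetical. The full distance matrix is built in memory, which would not scale to millions of individuals, and the data are assumed to contain at least two distinct points.

```python
from functools import reduce
import math

def squared_distances(points):
    """Full matrix of squared Euclidean distances (toy data only)."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]

def partition_rows(matrix, n_parts):
    """Split rows into contiguous blocks that mimic distributed partitions."""
    size = math.ceil(len(matrix) / n_parts)
    return [(start, matrix[start:start + size])
            for start in range(0, len(matrix), size)]

def map_column_sums(block):
    """Map step: partial column sums for one block of rows."""
    _, rows = block
    return [sum(col) for col in zip(*rows)]

def reduce_add(u, v):
    """Reduce step: elementwise sum of two partial results."""
    return [a + b for a, b in zip(u, v)]

def double_center(blocks, n):
    """Compute B = -1/2 * J D2 J block by block from reduced column sums."""
    col_sums = reduce(reduce_add, map(map_column_sums, blocks))
    col_means = [s / n for s in col_sums]
    grand_mean = sum(col_sums) / (n * n)
    centered = []
    for start, rows in blocks:
        new_rows = []
        for i, row in enumerate(rows, start):
            # D2 is symmetric, so the mean of row i equals col_means[i].
            new_rows.append([-0.5 * (d - col_means[i] - col_means[j] + grand_mean)
                             for j, d in enumerate(row)])
        centered.append((start, new_rows))
    return centered

def block_matvec(blocks, v):
    """Each block maps to its slice of B @ v; the reduce concatenates slices."""
    parts = map(lambda blk: [sum(b * x for b, x in zip(row, v)) for row in blk[1]],
                blocks)
    return reduce(lambda a, b: a + b, parts)

def top_coordinate(blocks, n, iterations=200):
    """Power iteration for the leading eigenpair; returns 1-D coordinates."""
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = block_matvec(blocks, v)
        eigenvalue = math.sqrt(sum(x * x for x in w))
        v = [x / eigenvalue for x in w]
    return [math.sqrt(eigenvalue) * x for x in v]

def mds_1d(points, n_parts=2):
    """One-dimensional classical scaling of a small list of points."""
    n = len(points)
    blocks = partition_rows(squared_distances(points), n_parts)
    return top_coordinate(double_center(blocks, n), n)

# Five collinear points split across two partitions.
coords = mds_1d([(0.0,), (1.0,), (3.0,), (6.0,), (10.0,)], n_parts=2)
```

Only the column sums and the matrix-vector products cross partition boundaries here, which is the part a distributed version would need to express as map and reduce operations. Further dimensions would require additional eigenvectors, for example by deflation, which this sketch omits.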
de Mendiburu, F.; Delicado, P.; Morales, R.; Lozoya, H.; Quiroz, R.; Schrevens, E. Congreso Nacional de Estadística e Investigación Operativa p. 182 Presentation's date: 2018-06-01 Presentation of work at congresses