
Bridging the gap between Statistics and Data Science

Total activity: 2
Type of activity
Competitive project
Funding entity
Funding entity code
5.566,00 €
Start date
End date
Data science, Functional Data Analysis, Map-Reduce, Spark, drones, images, multidimensional scaling
One of the main challenges our society faces in the 21st century is the enormous capacity to generate and store data,
linked to the economy and the digital society. This phenomenon is known as Big Data. In recent years, however, the popularity of this term seems to be
declining, while a related term, "Data Science", stands out as a generic way of referring to an emerging scientific discipline, placed
between mathematics, computer science and statistics, which aims to extract relevant information from data. Many statisticians
recognize our own discipline, Statistics, in that definition. However, the term Data Science is associated with concepts and techniques that
have been developed outside the field of statistics: Big Data, machine learning, predictive algorithms and complex data, among others.
From the point of view of statistics as a scientific discipline, the emergence of data science is both a challenge and an
opportunity that statisticians as a community must not let pass. We must make clear our centuries-old experience in the treatment and
analysis of data sets of very different types and sizes, which legitimately entitles us to participate in the new discipline. But it is also
imperative for statisticians to acquire the skills needed to efficiently solve the problems that data analysts currently face: large
volumes of data, distributed databases (Hadoop file systems, for example) requiring parallel computation (the Map-Reduce programming
model, for instance), streaming data, new problems and complex data (images, graphs, ...), to name a few.
The ultimate goal of our project is to advance the convergence between Statistics and Data Science. We are aware that the path is not
easy: we cannot assume that the techniques we have developed so far and applied to standard data sets (in the form of a data
matrix, with sample sizes not exceeding tens of thousands of individuals, which can be loaded into the memory of a personal computer)
are applicable to more complex data sets, or to millions of individuals, distributed across multiple computers and impossible to load into
memory. We know that we will have to use computational techniques different from the ones we currently use, in which parallel computing is key.
In the medium and long term we set ourselves two main objectives. The first is the development, under the direction of P. Delicado
and Professor Eddie Schrevens, of the doctoral thesis of F. de Mendiburu (just begun), which will analyze data from
images captured by drones. The proposed methodology is Functional Data Analysis. The second is to explore research topics
at the interface between Statistics and Data Science, and specifically to study the implementation of Multidimensional Scaling in Spark, following the
Map-Reduce programming model. The GorgoBase database, which contains about 3 million chess games, will be used as a test set.
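
To make the Multidimensional Scaling objective more concrete, the following is a minimal in-memory sketch of classical (Torgerson) MDS in Python with NumPy. It is not the project's implementation, which the description above does not detail. It only illustrates why the method fits the Map-Reduce model: the double-centering step needs nothing more than sums over blocks of rows of the squared-distance matrix, which can be computed independently per block (map) and then added together (reduce). The function name classical_mds and the parameters k and n_blocks are assumptions made for illustration, and the final eigendecomposition is done here on a single machine.

```python
from functools import reduce

import numpy as np


def classical_mds(dist, k=2, n_blocks=4):
    """Classical MDS on a symmetric distance matrix (illustrative sketch).

    The centering statistics are computed block by block, in a
    map/reduce style, to mimic what a distributed system would do.
    """
    n = dist.shape[0]
    sq = dist ** 2

    # Split the squared distances into row blocks, as if each block
    # lived on a different worker (hypothetical partitioning).
    blocks = np.array_split(sq, n_blocks, axis=0)

    # Map: each block yields its column sums and its overall sum.
    mapped = map(lambda b: (b.sum(axis=0), b.sum()), blocks)

    # Reduce: partial sums are simply added together.
    col_sums, total = reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]), mapped
    )

    col_means = col_sums / n
    grand_mean = total / n ** 2

    # Double centering; row means equal column means by symmetry.
    gram = -0.5 * (sq - col_means[None, :] - col_means[:, None] + grand_mean)

    # Keep the k largest eigenpairs of the centered Gram matrix.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:k]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale
```

For a data set with millions of individuals, such as the chess games mentioned above, the full distance matrix would not fit in memory, so the eigendecomposition step would also have to be replaced by a distributed or approximate alternative. That part is outside the scope of this sketch.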

