In data analysis it is increasingly common for the sample under study to consist of sampling units that do not match the classical statistical concept of a physical individual on which one or more variables of interest have been measured. It is now usual for each sampling unit to correspond to a subset of lower-level individual data whose information has been aggregated to generate the sampling units that make up the sample under study. Over the last eight years the members of the research team have gained experience in the statistical analysis of situations that fall into this hierarchical scheme of samples of samples. In particular, we have worked on the analysis of functional data, where functions such as probability densities summarize the information in the lower-level samples, and on the analysis of electoral results at the small-area level and of textual data, where the higher-level data are discrete distributions and the lower-level sampling units are the individuals voting in each area or the words used in each chapter. The analysis of social networks or web page graphs, for example, also fits into the context of the analysis of samples of samples. The statistical objects that arise as summaries of the lower-level samples and make up the higher-level sample can be very complex and difficult to handle. In particular, it will not always be possible to define a scalar product on them, although in most cases it should be possible to define a distance measure. We also find that data sets of very different origins, with completely different lower-level individuals, give rise to higher-level samples of the same kind. We consider the analysis of samples of samples to be a good starting point for framing the analysis of Big Data.
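As an illustration of how a distance can be defined between higher-level units that are discrete distributions, the following minimal sketch summarizes two lower-level samples of words as empirical distributions and compares them with the total variation distance. The choice of total variation, the function names and the toy "chapters" are assumptions made for this example only; they are not taken from the project.

```python
from collections import Counter


def empirical_distribution(tokens):
    """Summarize a lower-level sample (a list of words) as relative frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# Hypothetical lower-level samples: the words of two short "chapters".
chapter_a = "the cat sat on the mat".split()
chapter_b = "the dog sat on the log".split()

# Each chapter becomes one higher-level datum (a discrete distribution).
p = empirical_distribution(chapter_a)
q = empirical_distribution(chapter_b)

# Distance between the two higher-level data.
d = total_variation(p, q)
print(round(d, 4))
```

Other distances between distributions (Hellinger, for instance) could be substituted without changing the structure of the example.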
The treatment of Big Data often requires the aggregation of individual data, either because the database is not stored on a single server or because the size of the database (say N) makes it advisable to partition it into subsamples (say K) of smaller size (say n, with N = K*n). Analyzing each of these K subsamples individually and transforming the information it contains into a hyper-datum allows one to go from a sample of N individuals to a sample of K higher-level statistical units. Our goal is to build on our work on the analysis of samples of samples and move towards the statistical analysis of Big Data. For that transition to be smooth, we plan to widen the range of applications in which we test our proposals (adding data on income distribution and medium-sized graphs), to work with new databases (demographic data, income micro-data, species counts in ecology and graph repositories), and to tackle problems with a more ambitious structure than those dealt with so far (functional data that depend on two arguments, the joint observation of r discrete distributions in each higher-level sampling unit, and temporal and spatial dependencies among the higher-level data).
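The partition of N individual records into K subsamples of size n, each summarized as one hyper-datum, can be sketched as follows. This is only an illustrative sketch under stated assumptions: the data are simulated uniform values in [0, 1), the hyper-datum is taken to be a histogram of relative frequencies, and the names partition and histogram are hypothetical helpers rather than part of the project's methodology.

```python
import random


def partition(data, K):
    """Split data into K consecutive subsamples of equal size n (requires N = K*n)."""
    n, remainder = divmod(len(data), K)
    if remainder != 0:
        raise ValueError("N must be an exact multiple of K")
    return [data[i * n:(i + 1) * n] for i in range(K)]


def histogram(sample, bins):
    """Relative-frequency histogram; assumes every value lies in [0, 1)."""
    counts = [0] * bins
    for x in sample:
        # Clamp the index so a value numerically equal to 1.0 falls in the last bin.
        counts[min(int(x * bins), bins - 1)] += 1
    return [c / len(sample) for c in counts]


random.seed(0)
N, K, bins = 1000, 10, 5

# Simulated lower-level individuals (an assumption made for the example).
individuals = [random.random() for _ in range(N)]

# One hyper-datum (here, a histogram) per subsample: N individuals become K units.
subsamples = partition(individuals, K)
hyper_data = [histogram(s, bins) for s in subsamples]

print(len(individuals), "individuals ->", len(hyper_data), "higher-level units")
```

Any other summary of a subsample (a kernel density estimate, a vector of quantiles, a fitted model) could play the role of the hyper-datum; the histogram is used here only because it needs no external library.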
Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016
Programa Estatal de I+D+i Orientada a los Retos de la Sociedad
Retos de Investigación: Proyectos de I+D+i
Gobierno de España. Ministerio de Economía y Competitividad (MINECO)
Chen, K.; Delicado, P.; Müller, H.-G. Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 79, No. 1, pp. 177-196. DOI: 10.1111/rssb.12160. Date of publication: 2017. Journal article.