When exploring a sample composed of bivariate density functions, visualising the data requires choosing the relevant level set(s). The approach proposed in this paper defines the optimal level set(s) as the one(s) allowing the best reconstruction of the whole density. A fully data-driven procedure is developed to estimate the link between the level set(s) and their corresponding density, to construct optimal level set(s), and to choose automatically the number of relevant level set(s). The method is based on recent advances in functional data analysis when both response and predictors are functional. After a detailed description of the methodology, finite-sample studies are presented (including both real and simulated data), while theoretical results are deferred to a final appendix.
The final publication is available at link.springer.com
Chen, K.; Delicado, P.; Müller, H-G. Journal of the Royal Statistical Society. Series B, statistical methodology Vol. 79, num. 1, p. 177-196 DOI: 10.1111/rssb.12160 Publication date: 2017 Journal article
We introduce a simple and interpretable model for functional data analysis for situations where the observations at each location are functional rather than scalar. This new approach is based on a tensor product representation of the function-valued process and utilizes eigenfunctions of marginal kernels. The resulting marginal principal components and product principal components are shown to have nice properties. Given a sample of independent realizations of the underlying function-valued stochastic process, we propose straightforward fitting methods to obtain the components of this model and to establish asymptotic consistency and rates of convergence for the estimates proposed. The methods are illustrated by modelling the dynamics of annual fertility profile functions for 17 countries. This analysis demonstrates that the approach proposed leads to insightful interpretations of the model components and interesting conclusions.
Functional Data Analysis (FDA) is a statistical field that has gained importance thanks to progress in modern science, mainly the ability to measure the results of an experiment in continuous time and to record them. Many methods used on vector spaces for classification, dimension reduction and modelling, such as discriminant analysis, principal components analysis and regression analysis, have been adapted to the functional case. FDA is concerned with variables that are defined on a continuum or that have a continuous structure. It therefore plays an important role in the analysis of spectral data sets and images, which are mostly recorded in chemometrics, medicine and ecology. In ecology especially, the analysis of images recorded by satellite sensors informs us quickly and economically about land use, crop production, water pollution and the amount of minerals in the water. The aim of this study is to propose the use of the FDA approach and to predict the amount of Total Suspended Solids (TSS) in the estuary of the Guadalquivir river in Cadiz from remote sensing data by using different Functional Linear Regression Models (FLRM). In addition, the study aims to compare in practice the results obtained from various FLRMs and classical statistical methods, to design a simulation study to support the findings, and to determine the best prediction model.
This paper introduces local distance-based generalized linear models. These models extend (weighted) distance-based linear models first to the generalized linear model framework. Then, a nonparametric version of these models is proposed by means of local fitting. Distances between individuals are the only predictor information needed to fit these models. Therefore, they are applicable, among others, to mixed (qualitative and quantitative) explanatory variables or when the regressor is of functional type. An implementation is provided by the R package dbstats, which also implements other distance-based prediction methods. Supplementary material for this article is available online, which reproduces all the results of this article.
In bivariate density representation there is an extensive literature on level set estimation when the level is fixed, but much less on choosing which level is (or which levels are) of most interest. This is an important practical question which depends on the kind of problem one has to deal with as well as the kind of feature one wishes to highlight in the density. Answering it requires both a definition of what the optimal level is and a method for finding it. We consider two scenarios for this problem. The first one corresponds to situations in which one has just a single density function to be represented. However, as a result of technical progress in data collection, problems are emerging in which one has to deal with a sample of densities. In these situations, the need arises to develop a joint representation for all these densities, and this is the second scenario considered in this paper. For each case, we provide consistency results for the estimated levels and present extensive Monte Carlo simulation experiments illustrating the interest and feasibility of the proposed method. (C) 2015 Elsevier Inc. All rights reserved.
The paper under discussion is a very well-written and interesting piece of work by Secchi et al. (2015) dealing with spatio-temporal data on mobile phone use in the area of Milan. I congratulate the authors for such a stimulating and interesting paper. It clearly points out that Erlang data on mobile phone use contain a large amount of rich information. The paper is an excellent example of statistical analysis of Big Data. I discuss briefly two alternative ways of dimension reduction of spatio-temporal data and illustrate them with artificial data that has been simulated according to the scheme proposed by the authors.
We deal with the problem of representing a bivariate density function by level sets. The choice of which levels are used in this representation is commonly arbitrary (the most usual choices being those with probability contents .25, .5 and .75). Choosing which level is (or which levels are) of most interest is an important practical question which depends on the kind of problem one has to deal with as well as the kind of feature one wishes to highlight in the density. The approach we develop is based on minimum distance ideas.
Given n independent, identically distributed random vectors in R^d, drawn from a common density f, one wishes to find out whether the support of f is convex or not. In this paper we describe a decision rule which decides correctly for sufficiently large n, with probability 1, whenever f is bounded away from zero in its compact support. We also show that the assumption of boundedness is necessary. The rule is based on a statistic that is a second-order U-statistic with a random kernel. Moreover, we suggest a way of approximating the distribution of the statistic under the hypothesis of convexity of the support. The performance of the proposed method is illustrated on simulated data sets. As an example of its potential statistical implications, the decision rule is used to automatically choose the tuning parameter of ISOMAP, a nonlinear dimensionality reduction method.
The assembly of iron-sulfur clusters (ISCs) in eukaryotes involves the protein Frataxin. Deficits in this protein have been associated with iron accumulation inside the mitochondria and impair ISC biogenesis, as the protein is postulated to act as the iron donor for ISC assembly in this organelle. A pronounced lack of Frataxin causes Friedreich’s Ataxia, a hereditary human neurodegenerative disease mainly affecting equilibrium, coordination, muscles and heart. Moreover, it is the most common autosomal recessive ataxia. High similarities between the human and yeast molecular mechanisms that involve Frataxin have been suggested, making yeast a good model to study that process. In yeast, the protein complex that forms the central assembly platform for the initial step of ISC biogenesis is composed of yeast frataxin homolog, Nfs1–Isd11 and Isu. It is commonly accepted that protein function involves interaction with other protein partners, but in this case not enough is known about the structure of the protein complex and, therefore, about exactly how it functions. The objective of this work is to model the protein complex in order to gain insight into the structural details that underlie its biological function. To achieve this goal, several bioinformatics tools, modeling techniques and protein docking programs have been used. As a result, the structure of the protein complex and the dynamic behavior of its components, along with that of the iron and sulfur atoms required for the ISC assembly, have been modeled. This hypothesis will help to better understand the function and molecular properties of Frataxin as well as those of its ISC assembly protein partners.
Spatially correlated curves are present in a wide range of applied disciplines. In this paper we describe the R package geofd, which implements ordinary kriging prediction for this type of data. Initially the curves are pre-processed by fitting Fourier or B-spline basis functions. After that, the spatial dependence among curves is estimated by means of the trace-variogram function. Finally, the parameters for performing prediction by ordinary kriging at unsampled locations are estimated by solving a linear system based on the estimated trace-variogram. We illustrate the software by analyzing real and simulated data.
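The trace-variogram step can be illustrated with a minimal Python sketch. This is not the geofd API: the function names, the raw-grid representation of the curves (instead of basis coefficients) and the distance binning are all assumptions made for illustration. It uses the usual empirical form, in which the semivariance for a distance class is half the average integrated squared difference between pairs of curves whose sites fall in that class.

```python
import math


def trapezoid(values, grid):
    """Integrate sampled values over grid with the trapezoidal rule."""
    return sum(
        0.5 * (values[k] + values[k + 1]) * (grid[k + 1] - grid[k])
        for k in range(len(grid) - 1)
    )


def empirical_trace_variogram(sites, curves, grid, bin_edges):
    """Hypothetical helper: one semivariance per distance bin.

    sites     -- list of coordinate tuples, one per curve
    curves    -- list of curves, each sampled on the common grid
    bin_edges -- increasing distances defining half-open bins [a, b)
    Returns None for bins that contain no pair of sites.
    """
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(sites)):
        for j in range(i + 1, len(sites)):
            h = math.dist(sites[i], sites[j])
            sq_diff = [(a - b) ** 2 for a, b in zip(curves[i], curves[j])]
            integral = trapezoid(sq_diff, grid)
            for b in range(n_bins):
                if bin_edges[b] <= h < bin_edges[b + 1]:
                    sums[b] += integral
                    counts[b] += 1
                    break
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


# Toy data: three constant curves observed at sites on a line.
sites = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
grid = [0.0, 0.5, 1.0]
curves = [[0.0] * 3, [1.0] * 3, [2.0] * 3]
print(empirical_trace_variogram(sites, curves, grid, [0.0, 1.5, 2.5]))
# prints [0.5, 2.0]
```

In practice a parametric variogram model would be fitted to these binned values before the kriging system is solved; that step is omitted here.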
Classification problems of functional data arise naturally in many applications. Several approaches have been considered for solving the problem of finding groups based on functional data. In this paper, we are interested in detecting groups when the functional data are spatially correlated. Our methodology makes it possible to find spatially homogeneous groups of sites when the observations at each sampling location consist of samples of random functions. In univariate and multivariate geostatistics, various methods of incorporating spatial information into the clustering analysis have been considered. Here, we extend these methods to the functional context to fulfil the task of clustering spatially correlated curves. In our approach, we initially use basis functions to smooth the observed data, and then we weight the dissimilarity matrix among curves by either the trace-variogram or the multivariate variogram calculated with the coefficients of the basis functions. This paper contains a simulation study as well as the analysis of a real data set corresponding to average daily temperatures measured at 35 Canadian weather stations.
A bivariate density can be represented by a density level set containing a fixed amount of probability (0.75, for instance). A functional dataset whose observations are bivariate density functions can then be analyzed as if the functional data were density level sets. We compute distances between sets and perform standard Multidimensional Scaling. This methodology is applied to analyze electoral results.
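As a rough illustration of the first two steps, the Python sketch below finds the level whose set holds a given probability content for a density tabulated on a regular grid, and then measures how far apart two such level sets are. The grid representation, the function names and the choice of the symmetric-difference area as the distance between sets are assumptions; the abstract does not say which set distance is used.

```python
def level_for_probability(density, cell_area, prob):
    """Return the largest level c whose set {f >= c} holds at least prob mass.

    density is a 2D list of density values on a regular grid; the mass of a
    cell is approximated by value * cell_area and renormalised to sum to 1.
    """
    values = sorted((v for row in density for v in row), reverse=True)
    total = sum(values) * cell_area
    mass = 0.0
    for v in values:
        mass += v * cell_area / total
        if mass >= prob:
            return v
    return values[-1]


def level_set(density, level):
    """Grid cells (row, column) where the density is at least level."""
    return {
        (i, j)
        for i, row in enumerate(density)
        for j, v in enumerate(row)
        if v >= level
    }


def symmetric_difference_area(set_a, set_b, cell_area):
    """Assumed set distance: area of the cells in exactly one of the sets."""
    return len(set_a ^ set_b) * cell_area


# Two toy densities on a 2 x 2 grid with cells of area 0.125.
f = [[4.0, 2.0], [1.0, 1.0]]
g = [[1.0, 1.0], [2.0, 4.0]]
area = 0.125
set_f = level_set(f, level_for_probability(f, area, 0.75))
set_g = level_set(g, level_for_probability(g, area, 0.75))
print(symmetric_difference_area(set_f, set_g, area))
# prints 0.5
```

The matrix of such pairwise distances is what a standard multidimensional scaling routine would take as input.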
Functional Data Analysis deals with samples where a whole function is observed for each individual. A relevant case of FDA is when the observed functions are density functions. Among the particular characteristics of density functions, we make the most of the fact that they are an example of infinite dimensional compositional data (parts of some whole which only carry relative information). Several dimensionality reduction methods for this particular type of data are compared: functional principal components analysis with or without a previous data transformation, and multidimensional scaling for different inter-density distances, one of them taking into account the compositional nature of density functions. The emphasis is on the steps before and after the application of a particular dimensionality reduction method: care must be taken in choosing the right density function transformation and/or the appropriate distance between densities before performing dimensionality reduction; subsequently, the graphical representation of dimensionality reduction results must take into account that the observed objects are density functions. The different methods are applied to artificial and real data (population pyramids for 223 countries in year 2000). As a global conclusion, the use of multidimensional scaling based on compositional distance is recommended.
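One common way to build a distance that respects the compositional nature of densities is the centred log-ratio (clr) transform followed by a Euclidean distance, which gives an Aitchison-type distance. The Python sketch below applies it to densities discretised on a common grid. It is an assumption that this matches the compositional distance used in the paper, and the function names are hypothetical. All density values must be strictly positive.

```python
import math


def clr(density_values):
    """Centred log-ratio transform of strictly positive values."""
    logs = [math.log(v) for v in density_values]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def compositional_distance(f, g):
    """Euclidean distance between the clr transforms of two densities."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(f), clr(g))))


# Rescaling a density does not change its clr transform, so the distance
# between f and a multiple of f is zero up to rounding error.
f = [0.2, 0.5, 0.3]
scaled = [2.0 * v for v in f]
print(compositional_distance(f, scaled) < 1e-12)
# prints True
```

The scale invariance shown in the example is the property that makes this distance depend only on relative information, which is the point the abstract makes about densities as compositional data.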
In various scientific fields properties are represented by functions varying over space. In this paper, we present a methodology to make spatial predictions at non-data locations when the data values are functions. In particular, we propose both an estimator of the spatial correlation and a functional kriging predictor. We adapt an optimization criterion used in multivariable spatial prediction in order to estimate the kriging parameters. The curves are pre-processed by a non-parametric fitting, where the smoothing parameters are chosen by cross-validation. The approach is illustrated by analyzing real data based on soil penetration resistances.
Functional data analysis (FDA) is a relatively new branch of statistics. Experiments where a complete function is observed for each individual give rise to functional data. In this work we focus on the case of functional data presenting spatial dependence. The three classic types of spatial data structures (geostatistical data, point patterns, and areal data) can be combined with functional data, as shown in the examples of each situation provided here. We also review some contributions in the literature on spatial functional data.
Giraldo, R.; Delicado, P.; Mateu, J. Journal of agricultural biological and environmental statistics Vol. 15, num. 1, p. 66-82 DOI: 10.1007/s13253-009-0012-z Publication date: 2010-03 Journal article
The problem of nonparametrically predicting a scalar response variable from a functional predictor is considered. A sample of pairs (functional predictor and response) is observed. When predicting the response for a new functional predictor value, a semi-metric is used to compute the distances between the new and the previously observed functional predictors. Then each pair in the original sample is weighted according to a decreasing function of these distances. A Weighted (Linear) Distance-Based Regression is fitted, where the weights are as above and the distances are given by a possibly different semi-metric. This approach can be extended to nonparametric predictions from other kinds of explanatory variables (e.g., data of mixed type) in a natural way.
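The weighting step described above can be sketched in Python as follows. This is a simplified stand-in, not the method of the paper: the weighted distance-based linear regression is replaced by a plain kernel-weighted mean of the responses (a local constant fit), and the L2 semi-metric, the quadratic kernel, the bandwidth and all function names are assumptions made for illustration.

```python
import math


def l2_semimetric(x, y, grid):
    """L2 distance between two curves sampled on grid (trapezoidal rule)."""
    sq = [(a - b) ** 2 for a, b in zip(x, y)]
    integral = sum(
        0.5 * (sq[k] + sq[k + 1]) * (grid[k + 1] - grid[k])
        for k in range(len(grid) - 1)
    )
    return math.sqrt(integral)


def kernel_weights(new_curve, curves, grid, bandwidth):
    """Weights that decrease with distance: max(0, 1 - (d / bandwidth)^2)."""
    weights = []
    for curve in curves:
        u = l2_semimetric(new_curve, curve, grid) / bandwidth
        weights.append(max(0.0, 1.0 - u * u))
    return weights


def local_constant_prediction(new_curve, curves, responses, grid, bandwidth):
    """Kernel-weighted mean of the responses (assumed simplification)."""
    weights = kernel_weights(new_curve, curves, grid, bandwidth)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observed curve within the bandwidth")
    return sum(w * y for w, y in zip(weights, responses)) / total


# Toy sample: constant curves on [0, 1] with scalar responses.
grid = [0.0, 1.0]
curves = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
responses = [0.0, 2.0, 10.0]
print(local_constant_prediction([0.5, 0.5], curves, responses, grid, 1.0))
# prints 1.0
```

In the toy example the third curve lies beyond the bandwidth and receives zero weight, so the prediction is the average of the two nearby responses. The approach in the abstract would instead pass these weights to a distance-based linear fit, possibly with a second semi-metric.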
Several methods of constructing confidence intervals for the median survival time of recurrent event data are developed. One of them is based on asymptotic variances estimated using some transformations. Others are based on bootstrap techniques. Two types of recurrent event models are considered: the first one is a model where the inter-event times are independent and identically distributed, and the second one is a model where the inter-event times are associated, with the association arising from a gamma frailty model. Bootstrap and asymptotic confidence intervals are studied through simulation. These methods are applied and compared on two real data sets from biomedical and public health settings, using an available R package. The first example uses data from a study of small bowel motility, where an independent model may be assumed. The second example involves hospital readmissions in patients diagnosed with colorectal cancer. In this example the interoccurrence times are correlated.
We propose new dependence measures for two real random variables not necessarily linearly related. Covariance and linear correlation are expressed in terms of principal components and are generalized for variables distributed along a curve. Properties of these measures are discussed. The new measures are estimated using principal curves and are computed for simulated and real data sets. Finally, we present several statistical applications for the new dependence measures.
The final publication is available at link.springer.com
Functional data analysis (FDA) is a relatively new branch of statistics devoted to describing and modelling data that are complete functions. Many relevant aspects of musical performance and perception can be understood and quantified as dynamic processes evolving as functions of time. In this paper, we show that FDA is a statistical methodology well suited for research into the field of quantitative musical performance analysis. To demonstrate this suitability, we consider tempo data for 28 performances of Schumann's Träumerei and analyse them by means of functional principal component analysis (one of the most powerful descriptive tools included in FDA). Specifically, we investigate the commonalities and differences between different performances regarding (expressive) timing, and we cluster similar performances together. We conclude that musical data considered as functional data reveal performance structures that might otherwise go unnoticed.
Cedano, J.; Huerta, M.; Estrada, I.; Ballllosera, F.; Conchillo, O.; Delicado, P.; Querol, E. Computers in biology and medicine Vol. 37, num. 11, p. 1672-1675 DOI: 10.1016/j.compbiomed.2007.03.008 Publication date: 2007-11 Journal article
Motivation: This application aims at assisting researchers with the extraction of significant medical and biological knowledge from data sets with complex relationships among their variables.
Results: Non-hypothesis-driven approaches such as Principal Curves of Oriented Points (PCOP) are well suited to this objective. PCOP makes it possible to obtain a representative pattern from a very large quantity of data of independent variables in a flexible and direct way. A web server has been designed to automatically perform ‘non-linear pattern’ analysis, ‘hidden-variable-dependent’ clustering, and ‘local-dispersion-dependent’ classification of new samples from the data, using new statistical techniques based on the PCOP calculus. The tools facilitate the management, comparison and visualization of results in a user-friendly graphical interface.