Chen, K.; Delicado, P.; Müller, H-G. Journal of the Royal Statistical Society. Series B, statistical methodology Vol. 79, num. 1, p. 177-196 DOI: 10.1111/rssb.12160 Publication date: 2017 Journal article
We introduce a simple and interpretable model for functional data analysis for situations where the observations at each location are functional rather than scalar. This new approach is based on a tensor product representation of the function-valued process and utilizes eigenfunctions of marginal kernels. The resulting marginal principal components and product principal components are shown to have nice properties. Given a sample of independent realizations of the underlying function-valued stochastic process, we propose straightforward fitting methods to obtain the components of this model and to establish asymptotic consistency and rates of convergence for the estimates proposed. The methods are illustrated by modelling the dynamics of annual fertility profile functions for 17 countries. This analysis demonstrates that the approach proposed leads to insightful interpretations of the model components and interesting conclusions.
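The tensor product representation mentioned above can be sketched as follows. The notation is an assumption made here for illustration and does not appear in the abstract: X(s,t) is the function-valued process, \mu its mean, C its covariance, and \psi_j, \phi_k are eigenfunctions of the marginal kernels G_S and G_T.

```latex
% Marginal kernels (assumed notation): integrate the covariance over the other argument
G_S(s,u) = \int_{\mathcal{T}} C\big((s,t),(u,t)\big)\,dt , \qquad
G_T(t,v) = \int_{\mathcal{S}} C\big((s,t),(s,v)\big)\,ds .

% Marginal principal components: expand in the eigenfunctions \psi_j of G_S
X(s,t) = \mu(s,t) + \sum_{j \ge 1} \xi_j(t)\,\psi_j(s), \qquad
\xi_j(t) = \int_{\mathcal{S}} \big(X(s,t) - \mu(s,t)\big)\,\psi_j(s)\,ds .

% Product principal components: use the eigenfunctions \phi_k of G_T as well
X(s,t) = \mu(s,t) + \sum_{j \ge 1}\sum_{k \ge 1} \chi_{jk}\,\phi_k(t)\,\psi_j(s), \qquad
\chi_{jk} = \int_{\mathcal{T}}\int_{\mathcal{S}} \big(X(s,t) - \mu(s,t)\big)\,\psi_j(s)\,\phi_k(t)\,ds\,dt .
```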
Functional Data Analysis (FDA) is a statistical field that has gained importance with the progress of modern science, mainly the ability to measure the results of an experiment in continuous time and to record them. Many methods used on vector spaces for classification, dimension reduction and modelling, such as discriminant analysis, principal components analysis and regression analysis, have been adapted to the functional case. FDA is concerned with variables that are defined on a continuum or that have a continuous structure. Therefore, FDA plays an important role in the analysis of spectral data sets and images, which are mostly recorded in the fields of chemometrics, medicine and ecology. In ecology especially, the analysis of images recorded by satellite sensors informs us quickly and economically about land use, crop production, water pollution and the amount of minerals contained in the water. The aim of this study is to propose the use of the FDA approach to predict the amount of Total Suspended Solids (TSS) in the estuary of the Guadalquivir river in Cadiz from remote sensing data, using different Functional Linear Regression Models (FLRM). In addition, the study aims to compare in practice the results obtained from various FLRMs and from classical statistical methods, to design a simulation study to support the findings, and to determine the best prediction model.
This paper introduces local distance-based generalized linear models. These models extend (weighted) distance-based linear models first to the generalized linear model framework. Then, a nonparametric version of these models is proposed by means of local fitting. Distances between individuals are the only predictor information needed to fit these models. Therefore, they are applicable, among others, to mixed (qualitative and quantitative) explanatory variables or when the regressor is of functional type. An implementation is provided by the R package dbstats, which also implements other distance-based prediction methods. Supplementary material for this article is available online, which reproduces all the results of this article.
In bivariate density representation there is an extensive literature on level set estimation when the level is fixed, but much less on choosing which level is (or which levels are) of most interest. This is an important practical question which depends on the kind of problem one has to deal with as well as the kind of feature one wishes to highlight in the density. Answering it requires both a definition of what the optimal level is and a method for finding it. We consider two scenarios for this problem. The first one corresponds to situations in which one has just a single density function to be represented. However, as a result of technical progress in data collection, problems are emerging in which one has to deal with a sample of densities. In these situations, the need arises to develop a joint representation for all these densities, and this is the second scenario considered in this paper. For each case, we provide consistency results for the estimated levels and present extensive Monte Carlo simulation experiments illustrating the interest and feasibility of the proposed method. (C) 2015 Elsevier Inc. All rights reserved.
The paper under discussion is a very well-written and interesting piece of work by Secchi et al. (2015) dealing with spatio-temporal data on mobile phone use in the area of Milan. I congratulate the authors for such a stimulating and interesting paper. It clearly points out that Erlang data on mobile phone use contain a large amount of rich information. The paper is an excellent example of statistical analysis of Big Data. I discuss briefly two alternative ways of dimension reduction of spatio-temporal data and illustrate them with artificial data that has been simulated according to the scheme proposed by the authors.
Given n independent, identically distributed random vectors in R^d, drawn from a common density f, one wishes to find out whether the support of f is convex or not. In this paper we describe a decision rule which decides correctly for sufficiently large n, with probability 1, whenever f is bounded away from zero in its compact support. We also show that the assumption of boundedness is necessary. The rule is based on a statistic that is a second-order U-statistic with a random kernel. Moreover, we suggest a way of approximating the distribution of the statistic under the hypothesis of convexity of the support. The performance of the proposed method is illustrated on simulated data sets. As an example of its potential statistical implications, the decision rule is used to automatically choose the tuning parameter of ISOMAP, a nonlinear dimensionality reduction method.
The assembly of iron-sulfur clusters (ISCs) in eukaryotes involves the protein Frataxin. Deficits in this protein have been associated with iron inside the mitochondria and impair ISC biogenesis, as it is postulated to act as the iron donor for ISC assembly in this organelle. A pronounced lack of Frataxin causes Friedreich’s Ataxia, a hereditary human neurodegenerative disease mainly affecting equilibrium, coordination, muscles and heart. Moreover, it is the most common autosomal recessive ataxia. High similarities between the human and yeast molecular mechanisms that involve Frataxin have been suggested, making yeast a good model to study that process. In yeast, the protein complex that forms the central assembly platform for the initial step of ISC biogenesis is composed of the yeast frataxin homolog, Nfs1–Isd11 and Isu. It is commonly accepted that protein function involves interaction with other protein partners, but in this case not enough is known about the structure of the protein complex and, therefore, about how exactly it functions. The objective of this work is to model the protein complex in order to gain insight into the structural details that underlie its biological function. To achieve this goal, several bioinformatics tools, modeling techniques and protein docking programs have been used. As a result, the structure of the protein complex and the dynamic behavior of its components, along with that of the iron and sulfur atoms required for ISC assembly, have been modeled. This hypothesis will help to better understand the function and molecular properties of Frataxin as well as those of its ISC assembly protein partners.
Spatially correlated curves are present in a wide range of applied disciplines. In this paper we describe the R package geofd, which implements ordinary kriging prediction for this type of data. Initially the curves are pre-processed by fitting Fourier or B-spline basis functions. After that, the spatial dependence among curves is estimated by means of the trace-variogram function. Finally, the parameters for performing prediction by ordinary kriging at unsampled locations are estimated by solving a linear system based on the estimated trace-variogram. We illustrate the software by analyzing real and simulated data.
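The trace-variogram step can be illustrated with a minimal sketch. It is not the geofd implementation: it assumes the curves are already smoothed and evaluated on a common, equally spaced grid, and it groups site pairs into distance bins. The function names and the binning rule are hypothetical.

```python
import math

def integrated_sq_diff(f, g, dt):
    """Trapezoidal approximation of the integral of (f(t) - g(t))^2 on an equally spaced grid."""
    d2 = [(a - b) ** 2 for a, b in zip(f, g)]
    return dt * (sum(d2) - 0.5 * (d2[0] + d2[-1]))

def empirical_trace_variogram(sites, curves, dt, bin_width):
    """Map each distance bin to half the mean integrated squared difference of its site pairs."""
    sums, counts = {}, {}
    n = len(sites)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(sites[i], sites[j])   # Euclidean distance between sites
            b = int(h // bin_width)             # hypothetical binning rule
            sums[b] = sums.get(b, 0.0) + integrated_sq_diff(curves[i], curves[j], dt)
            counts[b] = counts.get(b, 0) + 1
    return {b: sums[b] / (2 * counts[b]) for b in sorted(sums)}

# Three sites on a line, each with a constant curve observed at t = 0, 0.5, 1
sites = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
curves = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
variogram = empirical_trace_variogram(sites, curves, dt=0.5, bin_width=1.0)
```

In practice a parametric variogram model would then be fitted to these binned values before solving the kriging system.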
Classification problems of functional data arise naturally in many applications. Several approaches have been considered for solving the problem of finding groups based on functional data. In this paper, we are interested in detecting groups when the functional data are spatially correlated. Our methodology allows us to find spatially homogeneous groups of sites when the observations at each sampling location consist of samples of random functions. In univariable and multivariable geostatistics, various methods of incorporating spatial information into the clustering analysis have been considered. Here, we extend these methods to the functional context in order to cluster spatially correlated curves. In our approach, we initially use basis functions to smooth the observed data, and then we weight the dissimilarity matrix among curves by either the trace-variogram or the multivariable variogram calculated with the coefficients of the basis functions. This paper contains a simulation study as well as the analysis of a real data set corresponding to average daily temperatures measured at 35 Canadian weather stations.
In various scientific fields properties are represented by functions varying over space. In this paper, we present a methodology to make spatial predictions at non-data locations when the data values are functions. In particular, we propose both an estimator of the spatial correlation and a functional kriging predictor. We adapt an optimization criterion used in multivariable spatial prediction in order to estimate the kriging parameters. The curves are pre-processed by a non-parametric fitting, where the smoothing parameters are chosen by cross-validation. The approach is illustrated by analyzing real data based on soil penetration resistances.
Functional Data Analysis deals with samples where a whole function is observed for each individual. A relevant case of FDA is when the observed functions are density functions. Among the particular characteristics of density functions, the most is made of the fact that they are an example of infinite-dimensional compositional data (parts of some whole which only carry relative information). Several dimensionality reduction methods for this particular type of data are compared: functional principal components analysis with or without a previous data transformation, and multidimensional scaling for different inter-density distances, one of them taking into account the compositional nature of density functions. The emphasis is on the steps before and after the application of a particular dimensionality reduction method: care must be taken in choosing the right density function transformation and/or the appropriate distance between densities before performing dimensionality reduction; subsequently, the graphical representation of dimensionality reduction results must take into account that the observed objects are density functions. The different methods are applied to artificial and real data (population pyramids for 223 countries in year 2000). As a global conclusion, the use of multidimensional scaling based on compositional distance is recommended.
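A distance that respects the compositional nature of densities can be sketched as follows. The sketch assumes an Aitchison-type distance computed through a centred log-ratio (clr) transform of densities discretized on a common, equally spaced grid with strictly positive values. The abstract does not spell out the distance, so this choice and the function names are illustrative assumptions.

```python
import math

def clr(density):
    """Centred log-ratio transform: log-density minus its mean over the grid."""
    logs = [math.log(v) for v in density]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def compositional_distance(f, g, dx):
    """Discretized L2 distance between the clr transforms of two positive densities."""
    cf, cg = clr(f), clr(g)
    return math.sqrt(dx * sum((a - b) ** 2 for a, b in zip(cf, cg)))

f = [0.2, 0.3, 0.5]
g = [0.5, 0.3, 0.2]
d_fg = compositional_distance(f, g, dx=1.0)
# Only relative information matters: rescaling a density leaves the distance at zero
d_scaled = compositional_distance(f, [2.0 * v for v in f], dx=1.0)
```

A matrix of such pairwise distances could then be passed to a multidimensional scaling routine.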
Functional data analysis (FDA) is a relatively new branch of statistics. Experiments where a complete function is observed for each individual give rise to functional data. In this work we focus on the case of functional data presenting spatial dependence. The three classic types of spatial data structures (geostatistical data, point patterns, and areal data) can be combined with functional data, as shown in the examples of each situation provided here. We also review some contributions in the literature on spatial functional data.
Giraldo, R.; Delicado, P.; Mateu, J. Journal of agricultural biological and environmental statistics Vol. 15, num. 1, p. 66-82 DOI: 10.1007/s13253-009-0012-z Publication date: 2010-03 Journal article
The problem of nonparametrically predicting a scalar response variable from a functional predictor is considered. A sample of pairs (functional predictor and response) is observed. When predicting the response for a new functional predictor value, a semi-metric is used to compute the distances between the new and the previously observed functional predictors. Then each pair in the original sample is weighted according to a decreasing function of these distances. A Weighted (Linear) Distance-Based Regression is fitted, where the weights are as above and the distances are given by a possibly different semi-metric. This approach can be extended to nonparametric predictions from other kinds of explanatory variables (e.g., data of mixed type) in a natural way.
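The weighting step can be sketched as follows. This is a deliberately simplified stand-in: instead of fitting the weighted distance-based regression described above, it uses the weights in a local-constant (weighted mean) prediction. The L2 semi-metric, the Epanechnikov-type kernel, the bandwidth and all function names are assumptions made for illustration.

```python
import math

def l2_semimetric(f, g, dt):
    """Discretized L2 distance between two curves on a common, equally spaced grid."""
    return math.sqrt(dt * sum((a - b) ** 2 for a, b in zip(f, g)))

def local_weights(new_curve, curves, dt, bandwidth):
    """Weights that decrease with the distance to the new curve (Epanechnikov-type kernel)."""
    weights = []
    for c in curves:
        u = l2_semimetric(new_curve, c, dt) / bandwidth
        weights.append(max(0.0, 1.0 - u * u))
    return weights

def local_constant_prediction(new_curve, curves, responses, dt, bandwidth):
    """Weighted mean of the responses; a simpler substitute for the weighted regression fit."""
    w = local_weights(new_curve, curves, dt, bandwidth)
    total = sum(w)
    if total == 0.0:
        raise ValueError("no observed curve within the bandwidth")
    return sum(wi * yi for wi, yi in zip(w, responses)) / total

curves = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
responses = [0.0, 2.0, 100.0]
prediction = local_constant_prediction([0.0, 0.0], curves, responses, dt=1.0, bandwidth=2.0)
```

The distant third curve receives zero weight, so only the two nearby observations contribute to the prediction.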
Several methods of constructing confidence intervals for the median survival time with recurrent event data are developed. One of them is based on asymptotic variances estimated using some transformations. Others are based on bootstrap techniques. Two types of recurrent event models are considered: the first one is a model where the inter-event times are independent and identically distributed, and the second one is a model where the inter-event times are associated, with the association arising from a gamma frailty model. Bootstrap and asymptotic confidence intervals are studied through simulation. These methods are applied and compared using two real data sets arising in biomedical and public health settings, using an available R package. The first example uses data from a study concerning small bowel motility, where an independent model may be assumed. The second example involves hospital readmissions in patients diagnosed with colorectal cancer. In this example the interoccurrence times are correlated.
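The bootstrap idea can be sketched in its simplest form. The sketch assumes complete (uncensored) inter-event times and resamples whole subjects, then takes a percentile interval of the pooled sample median. It ignores the right censoring that recurrent event data normally present, so it is not the estimator studied in the paper. The function name and the data are hypothetical.

```python
import random
import statistics

def bootstrap_median_ci(gap_times_by_subject, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the median inter-event time, resampling subjects."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    n = len(gap_times_by_subject)
    medians = []
    for _ in range(n_boot):
        # Draw n subjects with replacement and pool their inter-event times
        resample = [gap_times_by_subject[rng.randrange(n)] for _ in range(n)]
        pooled = [t for subject in resample for t in subject]
        medians.append(statistics.median(pooled))
    medians.sort()
    alpha = 1.0 - level
    lower = medians[int((alpha / 2) * (n_boot - 1))]
    upper = medians[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper

# Hypothetical inter-event times (one list per subject, no censoring)
data = [[1.0, 2.0], [3.0], [2.5, 4.0, 1.5], [5.0], [2.0, 2.0]]
lower, upper = bootstrap_median_ci(data)
```

Resampling subjects rather than individual times keeps the events of one subject together, which matters when inter-event times within a subject are associated.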
Motivation: This application aims at assisting researchers with the extraction of significant medical and biological knowledge from data sets with complex relationships among their variables.
Results: Non-hypothesis-driven approaches like Principal Curves of Oriented Points (PCOP) are very suitable for this objective. PCOP allows a representative pattern to be obtained from a huge quantity of data on independent variables in a very flexible and direct way. A web server has been designed to automatically perform ‘non-linear pattern’ analysis, ‘hidden-variable-dependent’ clustering, and ‘local-dispersion-dependent’ classification of new samples from the data, involving new statistical techniques that use the PCOP calculus. The tools facilitate the management, comparison and visualization of results in a user-friendly graphical interface.
This paper deals with the k-sample problem for functional data when the observations are density functions. We introduce test procedures based on distances between pairs of density functions (L1 distance and Hellinger distance, among others). A simulation study is carried out to compare the practical behaviour of the proposed tests. Theoretical derivations have been done in order to allow weighted samples in the test procedures. The paper ends with a real data example: for a collection of European regions we estimate the regional relative income densities and then we test the significance of the country effect.
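The two distances named above can be sketched for densities evaluated on a common, equally spaced grid. The Riemann-sum discretization and the factor 1/2 in the Hellinger distance (one common convention) are assumptions, as are the function names.

```python
import math

def l1_distance(f, g, dx):
    """Riemann-sum approximation of the integral of |f - g|."""
    return dx * sum(abs(a - b) for a, b in zip(f, g))

def hellinger_distance(f, g, dx):
    """Square root of half the integral of (sqrt(f) - sqrt(g))^2, so values lie in [0, 1]."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(f, g))
    return math.sqrt(0.5 * dx * total)

# Two piecewise-constant densities on [0, 1] with two cells of width 0.5
f = [1.0, 1.0]   # uniform density
g = [2.0, 0.0]   # all mass on the first cell
d1 = l1_distance(f, g, dx=0.5)
dh = hellinger_distance(f, g, dx=0.5)
```

A k-sample test statistic could then be built from such pairwise distances, for example by comparing within-group and between-group distances.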
Survival analysis arises when we are interested in studying the statistical properties of a variable that describes the time to a single event. In some situations, we may observe that the event of interest occurs repeatedly in the same individual, such as when a patient diagnosed with cancer relapses over time or when a person is repeatedly readmitted to a hospital. In this case we speak of survival analysis with recurrent events. The recurrent nature of the events makes it necessary to use techniques other than those used to analyze survival times for a single event. In this dissertation we deal with this type of analysis, mainly motivated by two cancer research studies that were set up specially for this research. One of them is a study on hospital readmissions in patients diagnosed with colorectal cancer, while the other deals with patients diagnosed with non-Hodgkin's lymphoma. This last study is especially relevant since we include information about the effect of treatment after relapses, and some authors have stated the need to develop a specific model for relapsing patients in cancer settings. Our first contribution to univariate analysis is to propose a method to construct confidence intervals for the median survival time in recurrent event settings. Two different approaches are developed. One of them is based on asymptotic variances derived from two existing estimators of the survival function, while the other uses bootstrap techniques. This last approach is useful since one of the estimators used does not yet have a closed form for its variance. The new contribution of this work is the examination of how to do bootstrapping in the presence of recurrent event data arising from a sum-quota accrual scheme and of the informativeness of the right censoring mechanism. Weak convergence is proved and asymptotic confidence intervals are built according to this result.
On the other hand, multivariate analysis addresses the problem of how to incorporate more than one covariate in the analysis. In recurrent event settings, we also need to take into account that, apart from covariates, the heterogeneity, the number of occurrences or, especially, the effect of interventions after reoccurrences may modify the probability of observing a new event in a patient. This last point is very important since it has not yet been taken into consideration in biomedical studies. To address this problem, we base our work on a new model for recurrent events proposed by Peña and Hollander. Our contribution to this topic is to adapt this model to the situation of cancer relapses; in it, the effect of interventions is represented by an effective age process acting on the baseline hazard function. We call this model the dynamic cancer model. We also address the problem of estimating the parameters of the general class of models for recurrent events proposed by Peña and Hollander, 2004, of which the dynamic cancer model may be seen as a special case. Two general approaches are developed. The first approach is based on semiparametric inference, where the baseline hazard function is nonparametrically specified, and uses the EM algorithm. The second one is a penalized likelihood approach where two different strategies are adopted. One of them is based on penalizing the partial likelihood, where the penalization bears on the regression coefficients. The second strategy penalizes the full likelihood and gives a nonparametric estimate of the baseline hazard function using a continuous estimator. The solution is then approximated using splines. The main advantage of this method is that we can easily obtain smooth estimates of the hazard function and an estimate of the variance of the frailty variance, which is not possible with the other approaches.
In addition, this last approach has a considerably lower computational cost than the others. The results obtained using the dynamic cancer model on real data sets indicate that the flexibility of this method provides a safeguard for analyzing data where patients relapse over time and interventions are performed after tumoral reoccurrences. The computational aspect is another important contribution of this work to recurrent event settings. We have developed three R packages called survrec, gcmrec, and frailtypack that are available at CRAN, http://www.r-project.org/. These packages allow users to compute the median survival time and its confidence intervals, to estimate the parameters involved in Peña and Hollander's model (in particular in the dynamic cancer model) using the EM algorithm, and to estimate these parameters using the penalized approach, respectively.