Seminars of the 2007-2008 academic year
- 9 October 2007. Anne Ruiz-Gazen, Institut de Mathématiques de Toulouse and GREMAQ, Université de Toulouse, France
Exploratory spatial data analysis: interactive implementation and new tools
The aim of this seminar is to present exploratory tools suited to the analysis of spatial data. Such data consist of observations whose geographic location is known and arise in particular in economics, ecology, geography and geochemistry. Geographic information systems (GIS) are generally powerful tools for mapping spatial data, but they do not allow interactive exploratory analysis and include few, if any, appropriate statistical methods. In the first part of this presentation, we introduce the GeoXp software, which interactively links statistical plots to a map. The most recent version of GeoXp, developed in collaboration with T. Laurent and C. Thomas-Agnan, is an R package available on CRAN. The package contains standard statistical tools such as the histogram, the boxplot, the Lorenz curve, the scatterplot and principal component analysis, but it also integrates tools dedicated to spatial data analysis. The main objectives of exploratory spatial data analysis are the detection of spatial trends, the detection of atypical observations and the analysis of spatial autocorrelation. To address these objectives, we present several tools implemented in GeoXp, such as the neighbour plot, the angle plot, the Moran plot and the variogram cloud. In the second part, we present recent developments of exploratory tools for spatial data. In particular, we introduce robust Moran indices as well as new tools for the detection of multivariate spatial outliers. This work is in progress, in collaboration with M. Genton for the Moran's I and with P. Filzmoser and C. Thomas-Agnan for the detection of multivariate outliers.
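As a rough illustration of one of the tools mentioned above, the following base-R sketch computes a Moran scatterplot and Moran's I for simulated data; the site locations, the distance-based neighbourhood weights and the variable are illustrative assumptions, and the code is not taken from the GeoXp package.

set.seed(1)
n <- 100
coords <- cbind(runif(n), runif(n))   # hypothetical site locations
x <- rnorm(n)                         # variable observed at each site

# Row-standardised neighbourhood weights: sites within distance 0.2
d <- as.matrix(dist(coords))
W <- (d > 0 & d < 0.2) * 1
W <- W / pmax(rowSums(W), 1)

z <- x - mean(x)
lagz <- as.vector(W %*% z)            # spatial lag of the centred variable
I <- sum(z * lagz) / sum(z^2)         # Moran's I (the slope of the scatterplot)

plot(z, lagz, xlab = "centred variable", ylab = "spatial lag",
     main = sprintf("Moran scatterplot, I = %.2f", I))
abline(lm(lagz ~ z))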
- 16 October 2007. Olivier Renaud, Université de Genève
Prediction and State-Space Filtering of Time Series.
We present a new method for the prediction of time series and for the filtering of processes that are measured with noise (state-space models). It is based on a special (overcomplete) multiscale decomposition of the signal called the à trous wavelet transform.
In the case of prediction, we use this decomposition to fit either a very simple autoregressive model or a more complex neural network; virtually any prediction scheme can be adapted to this representation. Even with the simplest autoregressive model, the method captures short- and long-memory components of the series in an adaptive and very efficient way, and we show in this case the convergence of the method towards the optimal prediction. Simulations show that it can capture fractional ARIMA series as well as series with very short-term dependence. The number of parameters to be estimated adapts to the structure of the series and always stays moderate, yet thanks to the multiresolution the model captures long dependencies.
In the filtering case, the same prediction scheme can be used, and the usual trade-off inherent in the Kalman filter is replaced by multiscale entropy filtering. The entropy method, adapted from denoising, proves very powerful in this multiscale framework and has several advantages. It is competitive in the cases where the Kalman filter is known to be optimal, but it is much more useful when the transition equation is no longer linear. Moreover, multiscale entropy filtering is robust to departures from Gaussianity of the transition noise.
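As a small illustration of the kind of decomposition the talk builds on, here is a base-R sketch of a causal Haar variant of the à trous transform, in which each smoothed coefficient uses only past values, which is what makes the decomposition convenient for prediction; the simulated series and the number of scales are illustrative assumptions, not the speaker's code.

haar_atrous <- function(x, J = 4) {
  n <- length(x)
  c_prev <- x
  W <- matrix(0, n, J)                # detail coefficients, one column per scale
  for (j in 1:J) {
    lag <- 2^(j - 1)
    c_new <- c_prev
    idx <- (lag + 1):n
    c_new[idx] <- (c_prev[idx] + c_prev[idx - lag]) / 2   # causal smoothing
    W[, j] <- c_prev - c_new          # detail at scale j
    c_prev <- c_new
  }
  list(details = W, smooth = c_prev)  # x equals rowSums(details) + smooth
}

set.seed(2)
x <- as.numeric(arima.sim(list(ar = 0.8), n = 256))   # hypothetical example series
dec <- haar_atrous(x)
max(abs(x - (rowSums(dec$details) + dec$smooth)))     # close to 0: exact reconstruction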
- 23 October 2007. Christian Mazza, Université de Fribourg
Robustness and the empirical finite-time ruin probability
We consider the classical model of ruin theory when the claim-size distribution is the empirical distribution of a sample. We present a sensitivity study, computing for instance the associated influence function, and establish the convergence of this empirical ruin probability to a Gaussian limit.
- 30 October 2007. Tanya Garcia, Université de Neuchâtel
Smoothing and Bootstrapping the PROMETHEUS Fire Growth Model
The PROMETHEUS model is a spatially explicit, deterministic fire growth model, praised for being beneficial in various aspects of fire management. Our goal is to build on this success, applying statistical smoothing to alleviate some computational difficulties and to increase accuracy; a pleasant by-product is the opportunity to introduce stochasticity into the model using a residual-based block bootstrap.
- 13 November 2007. Hans Rudolf Kuensch, ETH Zurich
Statistics in climate research: Two examples
Deterministic models are predominant in climate research because they represent knowledge about physical processes and because their parameters have a clear physical interpretation. However, measurement errors are clearly not the only source of uncertainty, and statistics can and should play a bigger role in assessing these different sources of uncertainty within the framework of deterministic models. One approach is to reduce and quantify uncertainty by combining the outputs from different models. I will illustrate this with an example from regional climate predictions; it turns out that the crucial issue is the assumptions about the prediction biases. A different approach uses time-varying inputs or parameters to diagnose deficits of deterministic models. This will be illustrated by a simple hemispherically averaged energy balance model for global climate.
- 20 November 2007. Ludovic Lebart, ENST Paris, France
Resampling techniques for assessing the visualisations of multivariate data
Multivariate descriptive techniques involving singular value decomposition (such as Principal Components Analysis and Simple and Multiple Correspondence Analysis) may provide misleading visualisations of the data. We briefly show that several types of resampling techniques can be carried out to assess the quality of the obtained visualisations: a) partial bootstrap, which treats the replications as supplementary variables, without diagonalisation of the replicated moment-product matrices; b) total bootstrap type 1, which performs a new diagonalisation for each replicate, with corrections limited to possible changes of signs of the axes; c) total bootstrap type 2, which adds to the preceding one a correction for possible exchanges of axes; d) total bootstrap type 3, which applies Procrustean transformations to all the replicates, striving to take into account both rotations and exchanges of axes; e) specific bootstrap, involving resampling at different levels (case of a hierarchy of statistical units). Examples are presented and discussed for each technique.
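A minimal R sketch of the first of these options, the partial bootstrap, under illustrative assumptions (the USArrests data shipped with R, 200 replicates, variable coordinates taken as correlations with the components): replicated variable positions are obtained by projection onto the original axes, without re-diagonalising each replicated matrix.

X <- scale(USArrests)                        # built-in example data, standardised
pca <- prcomp(X, center = FALSE, scale. = FALSE)
orig <- cor(X, pca$x[, 1:2])                 # original coordinates of the variables

B <- 200
rep_coords <- array(NA, dim = c(ncol(X), 2, B))
for (b in 1:B) {
  Xb <- X[sample(nrow(X), replace = TRUE), ]
  # supplementary projection: correlate resampled variables with scores on the ORIGINAL axes
  rep_coords[, , b] <- cor(Xb, Xb %*% pca$rotation[, 1:2])
}

# The spread of each cloud of replicated points around its original position
# indicates how stable that variable is on the first factorial plane.
plot(orig, xlim = c(-1, 1), ylim = c(-1, 1), pch = 16, xlab = "PC1", ylab = "PC2")
for (v in 1:ncol(X)) points(rep_coords[v, 1, ], rep_coords[v, 2, ], col = v, cex = 0.3)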
- 27 November 2007. Gerda Claeskens, Catholic University of Leuven, Belgium
Lack-of-fit tests and order selection in inverse regression models.
We propose two test statistics for use in inverse regression problems where only noisy, indirect observations for the mean function are available. Both test statistics have a counterpart in classical hypothesis testing, where they are called the order selection test and the data-driven Neyman smooth test.
We also introduce two model selection criteria which extend the classical AIC and BIC to inverse regression problems. In a simulation study we show that the inverse order selection and Neyman smooth tests outperform their direct counterparts in many cases.
The methods are applied to data arising in confocal fluorescence microscopy. Here, images are observed with blurring (modeled as deconvolution) and stochastic error at subsequent times. The aim is then to reduce the noise level by averaging over the distinct images. In this context it is relevant to test whether the images are still equal (or have changed through outside influences such as movement of the object table).
This is joint work with N. Bissantz, H. Holzmann and A. Munk.
- 4 December 2007. Peter Buehlmann, ETH Zurich
Variable selection for high-dimensional data: with applications in molecular biology
In many application areas, the number of covariates is very large (e.g. in the thousands) while the sample size is quite small (e.g. in the dozens). Standard exhaustive search methods for variable selection quickly become computationally infeasible, and forward selection methods are typically very unstable.
We will show that in generalized linear models, L1-penalty methods (Lasso) can be very powerful as a first step: with high probability, the (mathematical) true model is a subset of the estimated model. Moreover, some adaptations correct Lasso's overestimation behavior, yielding consistent variable selection schemes, and their exhaustive computation can be done very efficiently.
Our illustrations cover both theory and methodology as well as concrete applications in molecular biology.
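As a brief illustration of the screening step described above, the following R sketch uses the glmnet package (a standard implementation of L1-penalised generalised linear models, not necessarily the software used in the talk) on simulated high-dimensional logistic-regression data; the dimensions and the choice of the penalty by cross-validation are illustrative assumptions.

library(glmnet)

set.seed(3)
n <- 100; p <- 1000                     # many more covariates than observations
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -2, 1.5, rep(0, p - 3))    # only the first three covariates are active
y <- rbinom(n, 1, plogis(X %*% beta))   # binary response from a logistic model

cvfit <- cv.glmnet(X, y, family = "binomial")    # cross-validated Lasso path
b <- as.matrix(coef(cvfit, s = "lambda.min"))    # coefficients at the selected penalty
selected <- which(b[-1, 1] != 0)
selected                                # typically contains 1, 2, 3 plus a few noise variables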
- 11 December 2007. Chris Skinner, University of Southampton, UK
Estimation of a Distribution Function from Survey Data with Nonresponse
The estimation of a finite population distribution function from sample survey data is considered for the case when nonresponse is present. It is assumed that information is available for all sample units on auxiliary variables which are predictive of the variable of interest. Two broad approaches are considered: an imputation/regression approach (in particular, a fractional nearest neighbour method) and a propensity score weighting approach. The paper is motivated by an application to the estimation of the distribution of hourly pay using data from the Labour Force Survey in the United Kingdom. In this case the main auxiliary variable is a proxy measure of the variable of interest. Some theoretical and numerical comparisons of the approaches will be presented.
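A small base-R sketch of the second of the two approaches, propensity-score weighting, on simulated data; the response mechanism, the logistic propensity model and the evaluation point are illustrative assumptions, and the fractional nearest-neighbour imputation approach is not shown.

set.seed(7)
N <- 2000
x <- rnorm(N)                                  # auxiliary variable, known for all sample units
y <- x + rnorm(N)                              # study variable (e.g. log hourly pay)
r <- rbinom(N, 1, plogis(0.5 + 0.8 * x))       # response indicator, depending on x

# Estimate response propensities, then weight respondents by their inverse
phat <- fitted(glm(r ~ x, family = binomial))
w <- 1 / phat[r == 1]

Fhat <- function(t) sum(w * (y[r == 1] <= t)) / sum(w)    # weighted distribution function
naive <- function(t) mean(y[r == 1] <= t)                 # respondents-only estimate

t0 <- 0
c(weighted = Fhat(t0), naive = naive(t0), population = mean(y <= t0))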
- 19 February 2008. Lutz Duembgen, Institut für mathematische Statistik und Versicherungslehre, University of Bern, Switzerland
P-Values for Computer-Intensive Classifiers
The first part of the talk presents p-values for classification in general. These are an interesting alternative to classifiers or to posterior distributions of class labels. Their purpose is to quantify the uncertainty when classifying a single observation, even if no information on the prior distribution of class labels is available.
After illustrating this concept with some examples and procedures, we focus on computational issues and discuss p-values involving regularization, in particular LASSO-type penalties, to cope with high-dimensional data.
(Part of this talk is based on joint work with Axel Munk, Goettingen, and Bernd-Wolfgang Igl, Luebeck.)
- 26 February 2008. Marc Hallin, Free University of Brussels, Belgium
The General Dynamic Factor Model: Determining the Number of Factors
In this talk we briefly present the general dynamic factor model developed by Forni et al. (2000) for the analysis of large panels of time series data. Although developed in an econometric context, the method is likely to apply in all fields where a very large number of interrelated time series or signals are observed simultaneously. We then consider the problem of identifying the number q of factors driving the panel. The criterion we propose is based on the fact that this number q is also the number of diverging eigenvalues of the spectral density matrix of the observations as the cross-sectional dimension n goes to infinity. We provide sufficient conditions for consistency of the criterion for large n and T (where T is the series length). We show how the method can be implemented, and provide simulations and empirical examples illustrating its excellent finite-sample performance. An application to real data contributes to the ongoing debate on the number of factors driving the US economy. (A simplified numerical illustration of the diverging-eigenvalue idea is sketched after the references below.)
References
Forni, M., M. Hallin, M. Lippi, and L. Reichlin (2000). The generalized dynamic factor model: identification and estimation. The Review of Economics and Statistics 82, 540-554.
Hallin, M. and R. Liska (2007). The generalized dynamic factor model: determining the number of factors. Journal of the American Statistical Association 102, 603-617.
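The following heavily simplified base-R sketch only illustrates the eigenvalue phenomenon the criterion exploits; the simulated panel, the crude smoothed-periodogram estimate of the spectral density at low frequencies, and the dimensions are all illustrative assumptions, and this is not the Hallin-Liska procedure itself. With q = 2 common factors, only the two largest eigenvalues of the estimated spectral density matrix should stand out.

set.seed(4)
Tobs <- 200; n <- 60; q <- 2
f <- matrix(rnorm(Tobs * q), Tobs, q)                    # common factors with AR(1) dynamics
for (t in 2:Tobs) f[t, ] <- 0.7 * f[t - 1, ] + f[t, ]
Lambda <- matrix(rnorm(n * q), n, q)                     # factor loadings
X <- f %*% t(Lambda) + matrix(rnorm(Tobs * n), Tobs, n)  # observed panel, Tobs x n

# Crude smoothed-periodogram estimate of the spectral density matrix
# at a few low frequencies, followed by its eigenvalues
Z <- mvfft(scale(X, scale = FALSE)) / sqrt(2 * pi * Tobs)
S <- matrix(0 + 0i, n, n)
for (k in 2:6) S <- S + Z[k, ] %*% Conj(t(Z[k, ])) / 5
ev <- sort(Re(eigen(S, only.values = TRUE)$values), decreasing = TRUE)
round(ev[1:6], 2)                                        # the first q = 2 eigenvalues dominate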
- 4 March 2008. Anthony Davison, EPFL, Lausanne
Accurate confidence intervals for wavelet estimates of curves
Wavelets are widely used for reconstruction of the signals underlying noisy data.
The talk will discuss how saddlepoint methods can be used to obtain highly accurate Bayesian posterior probability intervals for the signals.
The work is joint with David Hinkley, Daniel Mastropietro, and Claudio Semadeni.
- 11 March 2008. Jacques Zuber, Haute Ecole d'Ingénierie et de Gestion du Canton de Vaud (HEIG-VD)
Design of experiments in industry: from theory to practice
Designed experiments play a major role in various industrial sectors such as chemistry, food processing, the automotive industry and electronics. They make it possible to optimise processes or products during development, production or quality control. Using designed experiments, one can uncover the synergies existing between the factors involved while minimising the experimental effort and obtaining maximum accuracy, and hence better productivity.
In this talk, we first present the issues involved in optimising processes and products, and then successively address the three stages used in industrial protocols: screening, modelling and optimisation. For each stage, we describe the main experimental designs and the associated statistical analysis methods.
- 1 April 2008. Guosheng Yin, MD Anderson Cancer Center, University of Texas, Houston, USA
Power-Transformed Linear Quantile Regression with Censored Data
We propose a class of power-transformed linear quantile regression models for survival data subject to random censoring. The estimation procedure follows two sequential steps: first, for a given transformation parameter, we can easily obtain the estimates for the regression coefficients associated with covariates by minimizing a well-defined convex objective function; and second, we can estimate the transformation parameter based on a model discrepancy measure by constructing cumulative sum processes. We show that both the regression and transformation parameter estimates are strongly consistent and asymptotically normal. The variance-covariance matrix depends on the unknown density function of the error term. To avoid nonparametric functional estimation, the variance can be naturally estimated by the usual bootstrap method. We examine the performance of the proposed method for finite sample sizes through simulation studies, and illustrate it with a real data example.
(Joint work with Donglin Zeng and Hui Li.)
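As a small illustration of the first estimation step (uncensored case, fixed transformation), the R sketch below fits a median regression with quantreg::rq and, for comparison, minimises the check-loss objective directly with optim(); the simulated data and the direct minimisation are only meant to make the convex objective explicit and are not the authors' procedure.

library(quantreg)

set.seed(5)
n <- 300
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)                       # simulated linear model
tau <- 0.5

check_loss <- function(beta, y, X, tau) {
  u <- y - X %*% beta
  sum(u * (tau - (u < 0)))                      # convex objective of quantile regression
}

X <- cbind(1, x)
fit1 <- rq(y ~ x, tau = tau)                    # standard quantile regression fit
fit2 <- optim(c(0, 0), check_loss, y = y, X = X, tau = tau)

rbind(rq = coef(fit1), optim = fit2$par)        # the two solutions should agree closely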
- 15 April 2008. Tony Rossini, Novartis, Basel
Statistical Challenges for Modeling and Simulation
We review the current state of statistical support for Modeling and Simulation (M&S) in clinical drug development. While M&S has a strong mathematical modeling component, there are challenging statistical issues that need to be solved. We discuss some progress on these issues and what the impact could be in terms of the re-usability of statistical models that incorporate scientific "knowledge" (current beliefs and data-supported opinions) about biological, pharmacological, and clinical processes.
- 22 April 2008. Elvezio Ronchetti, Université de Genève
Robust Second-Order Accurate Inference for Generalized Linear Models
Generalized linear models have become the most commonly used class of regression models in the analysis of a large variety of data. In particular, generalized linear models can be used to model the relationship between predictors and a function of the mean of a continuous or discrete response variable.
The estimation of the parameters of the model can be carried out by maximum likelihood or quasi-likelihood methods, which are equivalent if the link is canonical. Standard asymptotic inference based on likelihood ratio, Wald and score tests is then readily available for these models. However, two main problems can potentially invalidate p-values and confidence intervals based on standard classical techniques.
First of all, the models are ideal approximations to reality and deviations from the assumed distribution can have important effects on classical estimators and tests for these models (nonrobustness). Secondly, even when the model is exact, standard classical inference is based on (first order) asymptotic theory.
This can lead to inaccurate p-values and confidence intervals when the sample size is moderate to small or when probabilities in the extreme tails are required.
The nonrobustness of classical estimators and tests for the parameters is a well known problem and alternative methods have been proposed in the literature. These methods are robust and can cope with deviations from the assumed distribution. However, they are based on first order asymptotic theory and their accuracy in moderate to small samples is still an open question.
In this paper we propose a test statistic which combines robustness and good accuracy for small sample sizes. We combine results from Cantoni and Ronchetti (2001) and Robinson, Ronchetti and Young (2003) to obtain a new test statistic for hypothesis testing and variable selection which is asymptotically chi-squared distributed, like the three classical tests, but with a relative error of order O(1/n). Moreover, the accuracy of the new test statistic is stable in a neighborhood of the model distribution, and this leads to robust inference even in moderate to small samples.
This is joint work with S. N. Lo.
- 13 May 2008. Chris Jones, The Open University, UK
The t family and their close and distant relations
Student's t distribution arose, of course, and has huge application, as a normal-based sampling distribution. In modern times, the t distribution has also found considerable use as a symmetric heavy-tailed empirical distribution. It is in that role that I will explore some of its properties, special cases, extensions and super-extensions. By the latter I mean that I will also look into a variety of three- and four-parameter families of distributions allowing skewness and tailweights from (t-like) heavy tails to (normal-like and) lighter tails. Indeed, I'll talk about three new general families of distributions that I have recently been looking into. Inter alia, the t distribution on 2 degrees of freedom and the logistic and log F distributions will have particular roles.
- 27 May 2008. Christophe Croux, Catholic University of Leuven, Belgium
Robust online estimation of scale
This paper presents variance extraction procedures for univariate time series. The volatility of a time series is monitored, allowing for non-linearities, jumps and outliers in the level. The volatility is measured using the height of triangles formed by consecutive observations of the time series. The statistical properties of the new methods are derived and their finite-sample behaviour is examined. A financial and a medical application illustrate the use of the procedures.
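A minimal base-R sketch of the triangle idea mentioned above: the scale at time t is estimated from the heights of the triangles formed by three consecutive observations in a moving window, using a robust median-based summary. The correction factor and the window width below are illustrative assumptions, not the authors' exact procedure.

set.seed(6)
T <- 500
y <- cumsum(rnorm(T, sd = 0.1)) + rnorm(T, sd = 1)     # slowly varying level plus noise
y[200] <- y[200] + 10                                  # an isolated outlier

# Vertical height of the triangle through the points at times t-1, t, t+1
h <- abs(y[2:(T - 1)] - (y[1:(T - 2)] + y[3:T]) / 2)

# For i.i.d. N(0, sigma^2) noise, y_t - (y_{t-1} + y_{t+1})/2 has standard
# deviation sigma * sqrt(3/2), so a median-based estimate needs rescaling.
cfac <- 1 / (sqrt(1.5) * qnorm(0.75))
scale_t <- cfac * runmed(h, k = 41)                    # robust moving estimate of sigma

plot(scale_t, type = "l", ylab = "estimated scale",
     main = "Online scale from triangle heights (sketch)")
abline(h = 1, lty = 2)                                 # true noise standard deviation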
- 24 June 2008. Alina Matei, Université de Neuchâtel
Free software for survey processing
How and why can one obtain results about a very large population by observing only a small part of it? Behind the media coverage of sample surveys lies a fairly complex statistical theory: sampling theory, which justifies extrapolating results from a sample to the whole population.
Over the past three years, the Institut de Statistique of the Université de Neuchâtel has been developing free software for processing sample surveys using the most modern statistical methods. Initially, the project was created to serve as a teaching tool for advanced courses on sampling methods organised by the Swiss Federal Statistical Office under the aegis of Eurostat and the European Free Trade Association (EFTA). These courses were intended for statisticians from the statistical institutes of European and Mediterranean countries.
Today this project, pursued in collaboration with the Swiss Federal Statistical Office, has outgrown its purely educational scope. The interest of the software (called 'sampling') is that it is written in the free software environment R. It makes it possible to select samples by several methods, to handle nonresponse, to adjust survey data to census data, and to assess the precision of the resulting estimates. A brief usage sketch is given after the reference below.
Reference:
Tillé, Y. and A. Matei, The 'sampling' package, contributed R package,
http://cran.r-project.org/web/packages/sampling/index.html
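A brief usage sketch of the package; the example data set, sample sizes and allocation are illustrative assumptions, while srswor(), strata() and getdata() are functions documented in the 'sampling' package.

library(sampling)

data(swissmunicipalities)                                    # example data shipped with the package
d <- swissmunicipalities[order(swissmunicipalities$REG), ]   # strata() expects data sorted by stratum
N <- nrow(d)

# Simple random sampling without replacement
s <- srswor(200, N)                                          # 0/1 inclusion vector of length N
srs <- d[s == 1, ]

# Stratified sampling by region, roughly proportional allocation
sizes <- pmax(1, round(200 * table(d$REG) / N))
st <- strata(d, stratanames = "REG", size = as.numeric(sizes), method = "srswor")
sample_st <- getdata(d, st)
head(sample_st)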