Colloquia

When: Thursday, September 4, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Anru Zhang, Department of Biostatistics & Bioinformatics and Department of Computer Science, Duke University

Abstract: The increasing availability of electronic health records (EHRs) and other biomedical data calls for methodologies that can generate high-quality synthetic data while preserving privacy, correcting bias, and addressing complex data structures. In this talk, I will present a series of recent advances in generative modeling for synthetic health data. First, using denoising diffusion probabilistic models, we develop a framework for generating realistic, privacy-preserving EHR time series that achieve superior fidelity and lower privacy risk than existing methods. Second, to address irregularly observed functional data, we introduce Smooth Flow Matching (SFM), a semiparametric copula flow framework capable of generating smooth, infinite-dimensional trajectories under irregular sampling and non-Gaussian structures. Finally, we propose a bias-corrected data synthesis strategy for imbalanced learning, which mitigates distortions introduced by synthetic samples and enhances predictive performance in rare-event classification. Collectively, these methods provide a principled foundation for generative modeling of synthetic health data, enabling privacy-preserving, bias-reduced analysis and broader utilization of sensitive biomedical datasets.

When: Thursday, September 4, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Cong Ma, Department of Statistics, University of Chicago

Abstract: Integrative data analysis often requires separating shared from individual variations across multiple datasets, typically using the Joint and Individual Variation Explained (JIVE) model. Despite its popularity, theoretical insights into JIVE methods remain limited, particularly in the context of multiple matrices and varying degrees of subspace misalignment. In this talk, I will present new theoretical results on the Angle-based JIVE (AJIVE) method—a two-stage spectral algorithm. Specifically, we establish that AJIVE achieves decreasing estimation error with an increasing number of matrices in high signal-to-noise ratio (SNR) regimes. In contrast, AJIVE faces inherent limitations in low-SNR conditions, where estimation error remains persistently high. Complementary minimax lower bounds confirm AJIVE’s optimal performance at high SNR, while analysis of an oracle estimator highlights fundamental limitations of spectral methods at low SNR.

When: Thursday, September 18, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Christopher Wikle, Department of Statistics, University of Missouri

Abstract: The world is full of extreme events. For example, a central question in public health planning might be to assess the likelihood of extreme exposures (meteorological conditions, air pollution, social stress, etc.). Such extreme events typically occur in spatial and/or temporal clusters. Yet, the principal methodologies that statisticians use to deal with spatially dependent processes (Gaussian processes and Markov random fields) are not suitable for complex tail dependence structures. This is particularly true of simulation model emulation. More flexible spatial extremes models exhibit appealing extremal dependence properties but are often prohibitively expensive to fit and simulate from in high dimensions. Here I present recent work where we develop a new spatial extremes model that has flexible and non-stationary dependence properties, and we integrate it in the encoding-decoding structure of a variational autoencoder (XVAE), whose parameters are estimated via variational Bayes combined with deep learning. The XVAE can be used to analyze high-dimensional data or as a spatio-temporal emulator that characterizes the distribution of potential mechanistic model output states and produces outputs that have the same statistical properties as the inputs, especially in the tail. Through extensive simulation studies, we show that our XVAE is substantially more time-efficient than traditional Bayesian inference while also outperforming many spatial extremes models with a stationary dependence structure. We demonstrate our method applied to a high-resolution satellite-derived dataset of sea surface temperature in the Red Sea and to a high-resolution simulation model of a turbulent plume, such as one would find in a wildfire. We note, however, that these methods can be applied to any data set or simulation model that exhibits extremes.

When: Thursday, September 25, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Seungchul Baek, Department of Mathematics and Statistics, University of Maryland, Baltimore County

Abstract: I introduce two projects related to high-dimensional classification. The first project focuses on developing a classifier using random partitioning. Specifically, we split the original high-dimensional data ($p>n$) into multiple low-dimensional subsets, making sure the number of selected covariates is less than the sample size. Using these partitioned datasets, we apply linear discriminant analysis (LDA) to each subset and propose a method to aggregate the results. We provide theoretical justification for our approach by comparing its misclassification rates to those of LDA in high dimensions. The second project concerns variable selection in high-dimensional classification. By utilizing the recently proposed mirror statistic, we first identify significant variables and then develop a new classifier based on a modified version of the $\epsilon$-greedy algorithm.

When: Tuesday, October 14, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Philip Ernst, Department of Mathematics, Imperial College London

Abstract: In 1926, G. Udny Yule considered the following problem: given two i.i.d. random walks independent from each other, what is the distribution of their empirical correlation coefficient? Yule empirically observed the distribution of this statistic to be heavily dispersed and frequently large in absolute value, leading him to call it “nonsense correlation.” This unexpected finding led to his formulation of two concrete questions, each of which would remain open for more than ninety years: (i) Find (analytically) the variance of the empirical correlation coefficient and (ii) Find (analytically) the higher order moments and the density of the empirical correlation coefficient. Ernst, Shepp, and Wyner (Annals of Statistics, 2017) considered the empirical correlation coefficient of two independent Wiener processes, the limit to which the empirical correlation for two independent random walks converges weakly. Using tools from integral equation theory, we closed question (i) by explicitly calculating the second moment of the empirical correlation coefficient to be .240522. This talk begins where Ernst et al. (2017) leave off. I shall explain how we finally succeeded in closing question (ii) by explicitly calculating all moments of the empirical correlation coefficient (up to order 16). This leads, for the first time, to an approximation to the density of Yule's nonsense correlation. I shall then proceed to explain how we were able to explicitly compute higher moments of the empirical correlation coefficient when the two independent Wiener processes are replaced by two correlated Wiener processes, two independent Ornstein-Uhlenbeck processes, and two independent Brownian bridges. I will conclude by stating a Central Limit Theorem for the case of two independent Ornstein-Uhlenbeck processes. This result shows that Yule's “nonsense correlation” is indeed not “nonsense” for stochastic processes which admit stationary distributions. This work is joint with L.C.G. Rogers (Cambridge) and Quan Zhou (Texas A&M) and recently appeared in Bernoulli in February 2025. We shall conclude with a discussion of some concrete applications of our work to the study of weather and climate extremes. The latter is part of our ongoing collaboration with the U.S. Office of Naval Research (2018-present).
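
For readers who want a feel for the phenomenon before the talk, the following is a minimal Monte Carlo sketch, not the speaker's analytical method. It simulates pairs of independent Gaussian random walks and averages the squared empirical correlation, which should land in the neighborhood of the second moment quoted in the abstract. The function names, path counts, step counts, and seed are all illustrative assumptions.

```python
import random


def pearson(xs, ys):
    """Empirical (Pearson) correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def random_walk(n_steps, rng):
    """Partial sums of i.i.d. standard normal increments."""
    pos, path = 0.0, []
    for _ in range(n_steps):
        pos += rng.gauss(0.0, 1.0)
        path.append(pos)
    return path


def simulate_second_moment(n_paths=1000, n_steps=300, seed=0):
    """Monte Carlo estimate of E[rho^2] for two independent random walks.

    n_paths, n_steps, and seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        rho = pearson(random_walk(n_steps, rng), random_walk(n_steps, rng))
        total += rho * rho
    return total / n_paths


second_moment = simulate_second_moment()
print(f"Monte Carlo estimate of E[rho^2]: {second_moment:.3f}")
```

Because the walks are finite and the estimate is random, the printed value only approximates the Wiener-process limit; increasing `n_paths` and `n_steps` should tighten it.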

When: Thursday, October 16, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Jason Klusowski, Department of Operations Research and Financial Engineering, Princeton University

Abstract: Statisticians often work in settings with limited labeled data and abundant unlabeled data. During training, they may even have access to extra side information (some labeled, some not) that won’t be available once the model is deployed. When can this side information actually improve performance? I’ll present a simple framework where a rich-view model that sees the extra features generates pseudo-labels on the large unlabeled data, and a deployment model that only sees the standard features is trained on both real and pseudo-labels. The two are trained iteratively: each deployment model update calibrates the next round of pseudo-labels, and those refined pseudo-labels in turn guide the deployment model. Our theory shows that side information helps precisely when the rich-view and deployment models make different kinds of errors. We formalize this with a decorrelation score that quantifies how independent those errors are; the more independent, the greater the performance gains.

When: Thursday, October 30, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Tingting Zhang, Department of Statistics, University of Pittsburgh

Abstract: The human brain is a high-dimensional directed network system of brain regions involving directed connectivity. Seizures are a directed network phenomenon, as abnormal neuronal activities start from a seizure onset zone (SOZ) and propagate to otherwise healthy regions. To localize the SOZ of an epileptic patient, clinicians use intracranial EEG (iEEG) to record the patient’s brain activity in many small regions. iEEG data are high-dimensional multivariate time series. To model the underlying directed brain network, we build a state-space multivariate autoregression (SSMAR) model for iEEG data. To produce scientifically meaningful network results, we incorporate prior knowledge that brain networks tend to exhibit modular organization. Specifically, we assign a stochastic-blockmodel-motivated prior to the SSMAR parameters, which encourages modularity in the estimated networks. We develop a Bayesian framework to estimate the SSMAR model, infer directed connections, and identify network modules. The method is robust to violations of model assumptions and outperforms existing network approaches. When applied to iEEG data from an epileptic patient, the model reveals patterns of seizure initiation and propagation and uncovers a distinct connectivity profile of the SOZ. We also extend this Bayesian approach to fMRI data, identifying functionally specialized modules and directed interactions between them.

When: Thursday, November 6, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Nathaniel Josephs, Department of Statistics, North Carolina State University

Abstract: The graph-matching problem is a classic task that involves finding the correspondence between the vertices of two graphs. A new class of nonparametric priors is introduced for permutations by borrowing ideas from the extensive literature on partition structures. This enables a Bayesian approach to graph matching that combines the position-aware Chinese restaurant process with a correlated stochastic block model likelihood. A node-wise blocked Gibbs sampler is proposed for posterior inference, as well as an efficient posterior summary technique that leverages variation of information (VI) summaries for partitions.

When: Thursday, November 13, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Yichao Wu, Department of Mathematics, Statistics, and Computer Science, University of Illinois Chicago

Abstract: The first part of the talk will focus on the general partially linear model without any structural assumption on the nonparametric component. For such a model with both linear and nonlinear predictors being multivariate, we propose a new variable selection method. Our new method is a unified approach in the sense that it can select both linear and nonlinear predictors simultaneously by solving a single optimization problem. We prove that the proposed method achieves consistency. The second part of the talk will be based on an ongoing research project. In this project, we are extending the above variable selection method to partially global Fréchet regression (Tucker and Wu, 2025 Statistica Sinica).

When: Thursday, November 20, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Gemma Moran, Department of Statistics, Rutgers University

Abstract: High-dimensional data often exhibit variation that can be captured by lower dimensional factors. For high-dimensional data from multiple studies or environments, one goal is to understand which underlying factors are common to all studies, and which factors are study or environment-specific. As a particular example, we consider platelet gene expression data from patients in different disease groups. In this data, factors correspond to clusters of genes which are co-expressed; we may expect some clusters (or biological pathways) to be active for all diseases, while some clusters are only active for a specific disease. To learn these factors, we consider a nonlinear multi-study factor model, which allows for both shared and specific factors. To fit this model, we propose a multi-study sparse variational autoencoder. The underlying model is sparse in that each observed feature (i.e. each dimension of the data) depends on a small subset of the latent factors. In the genomics example, this means each gene is active in only a few biological processes. Further, the model implicitly induces a penalty on the number of latent factors, which helps separate the shared factors from the group-specific factors. We prove that the shared factors are identified, and demonstrate our method recovers meaningful factors in the platelet gene expression data.

When: Thursday, February 26, 2026—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Jay Bartroff, Department of Statistics and Data Sciences, University of Texas at Austin

Abstract: A novel method for fixed-width confidence intervals -- called the Push Algorithm -- for the binomial success probability appeared in Asparouhov's PhD thesis, which cited an unknown manuscript by Lorden. In this talk I'll discuss the little-known method, and our extension of it to any bounded parameter in a monotone likelihood ratio family. The method produces the shortest possible fixed-width confidence interval for a given confidence level, and if the Push interval does not exist for a given width and level then no such interval exists. We demonstrate it on the binomial, hypergeometric, and normal distributions with our available R package, where it outperforms the standard intervals, including the venerable z-interval in the normal case. This is joint work with undergraduate student Asmit Chakraborty.

When: Thursday, February 26, 2026—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Ciprian Crainiceanu, Department of Biostatistics, Johns Hopkins University

Abstract: Wearable devices, such as accelerometers and heart monitors, are used in health research because they provide objective, continuous, unbiased, and detailed information about human activity either in the laboratory or the free-living environment. In this talk I will explore the different resolutions of the data, ways to summarize it, and inferential methods for exploring the associations with health outcomes. We will illustrate these methods using large, publicly available datasets, including the NHANES and UK Biobank. We will also show that objectively measured physical activity is the strongest predictor of mortality and cardiovascular mortality and the strongest modifiable risk factor of Multiple Sclerosis, Parkinson's Disease, and Alzheimer's Disease.

Department of Statistics

2025 – 2026 Department of Statistics Colloquium Speakers

Past colloquium talks are archived here.