2024 – 2025 Department of Statistics Colloquium Speakers
When: Thursday, August 22, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Melissa Smith, Department of Biostatistics, University of Alabama at Birmingham
Abstract: A causal decomposition analysis allows researchers to determine whether the difference in a health outcome between two groups can be attributed to a difference in each group's distribution of one or more modifiable mediator variables. With this knowledge, researchers and policymakers can focus on designing interventions that target these mediator variables. In this talk, I will discuss the similarities and differences between a causal mediation analysis and a causal decomposition analysis. I will then present our recent work on a method for performing causal decomposition analyses with multiple correlated mediator variables. Existing methods for causal decomposition analysis either focus on one mediator variable or assume that each mediator variable is conditionally independent given the group label and the mediator-outcome confounders. Our Monte Carlo-based causal decomposition analysis method is designed to accommodate multiple correlated and interacting mediator variables, while identifying path-specific effects through individual mediators. I will illustrate our method through a simulation study and an application examining potential reasons for Black-White differences in incident diabetes using data from a national cohort study.
When: Thursday, September 12, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Will Cipolli, Department of Mathematics, Colgate University
Abstract: Much work has been done in "robustifying" standard statistical approaches with mixtures of multivariate Polya trees (MMPTs). In this talk, I will present a FAST Markov chain Monte Carlo (MCMC) sampling technique for MMPTs that overcomes difficulties in traditional sampling procedures and is completed in a fraction of the time. This new technique permits time-feasible Bayesian nonparametric solutions to contexts requiring many or repeated density estimates. The efficacy of this approach will be demonstrated via simulation and biomedical applications.
When: Thursday, September 19, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Ian Dryden, Department of Statistics, University of South Carolina
Abstract: Complex object data such as networks and shapes are becoming increasingly available, and so there is a need to develop suitable methodology for statistical analysis. Networks can be represented as graph Laplacian matrices, which are a type of manifold-valued data. Shapes of 3D objects are also a type of manifold-valued data, invariant to translation, rotation and scale. Our main objective is to estimate a regression curve from a sample of graph Laplacian matrices or 3D shapes conditional on a set of Euclidean covariates, for example in dynamic objects where the covariate is time. We develop an adapted Nadaraya-Watson estimator which has uniform weak consistency for estimation using Euclidean and power Euclidean metrics, and we also explore splines on shape spaces.
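The adapted estimator builds on the classical Euclidean Nadaraya-Watson kernel regression estimator. A minimal sketch of the Euclidean version follows, with illustrative simulated data rather than the talk's manifold-valued graph Laplacians or shapes:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth):
    """Classical Nadaraya-Watson kernel regression with a Gaussian kernel.

    The talk adapts this estimator to manifold-valued responses; here the
    response is Euclidean purely for illustration.
    """
    # Gaussian kernel weight for each training covariate
    w = np.exp(-0.5 * ((x_train - x_query) / bandwidth) ** 2)
    # Locally weighted average of the responses
    return np.sum(w * y_train) / np.sum(w)

# Simulated regression curve: y = sin(2*pi*x) observed on a grid
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)
print(nadaraya_watson(x, y, 0.25, 0.05))  # should be close to sin(pi/2) = 1
```

In the manifold setting the weighted average is replaced by a weighted Fréchet mean under the chosen (Euclidean or power Euclidean) metric.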
When: Thursday, September 26, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Kimberly Sellers, Department of Statistics, North Carolina State University
Abstract: While the Poisson distribution is a classical statistical model for count data, it hinges on the constraining equi-dispersion property (i.e., that the mean and variance are equal). This assumption, however, does not usually hold for real count data: over-dispersion (i.e., variance greater than the mean) is the more common phenomenon, although data under-dispersion is also prevalent in various settings. It is more convenient to work with a distribution that can effectively model data (over- or under-) dispersion because it offers more flexibility (and, thus, more appropriate inference) in the statistical methodology. This talk introduces the Conway-Maxwell-Poisson distribution along with several associated statistical methods motivated by this model to better analyze count data under various scenarios (e.g., distributional theory, generalized linear modeling, control chart theory, and count processes). As time permits, this talk will likewise acquaint the audience with available associated tools for statistical computing.
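The dispersion behavior can be seen directly from the Conway-Maxwell-Poisson pmf, P(X = k) ∝ λ^k/(k!)^ν, where ν = 1 recovers the Poisson. A small sketch (truncating the infinite normalizing sum, which is an approximation):

```python
import math

def cmp_pmf(lam, nu, kmax=100):
    """Conway-Maxwell-Poisson pmf, P(X = k) proportional to lam**k / (k!)**nu,
    computed in log space and truncated at kmax for numerical stability."""
    logw = [k * math.log(lam) - nu * math.lgamma(k + 1) for k in range(kmax + 1)]
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]
    z = sum(w)
    return [wk / z for wk in w]

def mean_var(pmf):
    """Mean and variance of a pmf given as a list of probabilities on 0..kmax."""
    m = sum(k * p for k, p in enumerate(pmf))
    v = sum((k - m) ** 2 * p for k, p in enumerate(pmf))
    return m, v

# nu = 1 recovers the Poisson (equi-dispersed); nu < 1 gives over-dispersion,
# nu > 1 gives under-dispersion
for nu in (0.5, 1.0, 2.0):
    m, v = mean_var(cmp_pmf(3.0, nu))
    print(f"nu={nu}: mean={m:.2f}, variance={v:.2f}")
```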
When: Thursday, October 03, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Rahul Ghosal, Department of Epidemiology and Biostatistics, University of South Carolina
Abstract: Modern clinical and epidemiological studies widely employ wearables to record parallel streams of real-time data on human physiology and behavior. With recent advances in distributional data analysis, these high-frequency data are now often treated as distributional observations, resulting in novel regression settings. Motivated by these modeling setups, we develop a distributional outcome regression via quantile functions (DORQF) that expands the existing literature with three key contributions: i) handling both scalar and distributional predictors, ii) ensuring a jointly monotone regression structure without enforcing monotonicity on individual functional regression coefficients, and iii) providing statistical inference via asymptotic projection-based joint confidence bands and a statistical test of global significance to quantify uncertainty of the estimated functional regression coefficients. The method is motivated by and applied to the Actiheart component of the Baltimore Longitudinal Study of Aging, which collected one week of minute-level heart rate (HR) and physical activity (PA) data on 781 older adults, to gain a deeper understanding of age-related changes in daily life heart rate reserve, defined as a distribution of daily HR, while accounting for the daily distribution of physical activity, age, gender, and body composition. Intriguingly, the results provide novel insights into the epidemiology of daily life heart rate reserve.
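The distributional-observation idea can be illustrated by turning a day of minute-level readings into an empirical quantile function on a fixed probability grid, so that days (and subjects) become comparable functional objects. The data below are simulated, not the BLSA Actiheart data:

```python
import numpy as np

# Hypothetical minute-level heart-rate readings for one day (1440 minutes)
rng = np.random.default_rng(0)
hr = rng.normal(75, 10, size=1440)

# Represent the day's HR distribution by its empirical quantile function,
# evaluated on a fixed probability grid so curves align across days
p_grid = np.linspace(0.01, 0.99, 99)
q = np.quantile(hr, p_grid)

# q is now one functional (distributional) observation: a curve that is
# monotone nondecreasing in p; q[49] is the day's median HR
print(q[49])
```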
When: Thursday, October 10, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Hongtu Zhu, Department of Biostatistics, University of North Carolina at Chapel Hill
Abstract: This talk provides an insightful overview of integrating artificial intelligence (AI) and statistical methods in medical data analysis. It is structured into three key sections:
Introduction to Medical Image Data Analysis: This section sets the stage by outlining the fundamentals and significance of medical image analysis in healthcare, charting its evolution and current applications.
State-of-the-Art AI Applications and Statistical Challenges: Here, we explore the impact of AI, particularly deep learning, on medical imaging, and address the accompanying statistical challenges, such as data quality and model interpretability.
Opportunities for Statisticians: The final section highlights the critical role of statisticians in refining AI applications in medical imaging, focusing on opportunities for advancing algorithmic accuracy and integrating statistical rigor.
The talk aims to demonstrate the crucial synergy between AI and statistics in enhancing medical data analysis, emphasizing the evolving challenges and the vital contributions of statisticians in this domain.
When: Tuesday, October 15, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Weijie Su, Department of Statistics and Data Science, University of Pennsylvania
Abstract: Large language models (LLMs) have rapidly emerged as a transformative innovation in machine learning. However, their increasing influence on human decision-making processes raises critical societal questions. In this talk, we will demonstrate how statistics can help address two key challenges: ensuring fairness for minority groups through alignment and combating misinformation through watermarking. First, we tackle the challenge of creating fair LLMs that equitably represent and serve diverse populations. We derive a regularization term that is both necessary and sufficient for aligning LLMs with human preferences, ensuring equitable outcomes across different demographics. Second, we introduce a general statistical framework to analyze the efficiency of watermarking schemes for LLMs. We develop optimal detection rules for an important watermarking scheme recently developed at OpenAI and empirically demonstrate its superiority over the existing detection method. Throughout the talk, we will showcase how statistical insights can not only address pressing challenges posed by LLMs but also unlock substantial opportunities for the field of statistics to drive responsible generative AI development. This talk is based on arXiv:2405.16455 and arXiv:2404.01245.
When: Thursday, October 24, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Whitney Huang, School of Mathematical and Statistical Sciences, Clemson University
Abstract: The class of max-stable models is commonly used for modeling multivariate and spatial extremes. Despite recent advancements in model construction and implementation, a fundamental limitation persists in incorporating timing information for extreme events due to the "component-wise maximum" data selection process. This limitation can lead to inaccurate assessments of multivariate and spatial extreme risk. In this talk, I will present a conditional approach to model multivariate extremes, aiming to capture extremes at the event level by conditioning on the timing and corresponding vector values when at least one variable is extreme. The proposed approach shares some similarities with the conditional extreme value models developed by Jonathan Tawn and his collaborators, but it treats the modeling of the conditional distribution of the concomitant variable(s) differently when the conditioning variable is extreme. Specifically, the conditional distribution function is modeled by a composition of distribution functions, where an extreme value base distribution is enriched by a conditional beta distribution. Simulated examples and an application to bivariate concurrent wind and precipitation extremes will illustrate the proposed approach.
When: Thursday, October 31, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Sanat Sarkar, Department of Statistics, Temple University
Abstract: Simultaneous testing of multivariate Gaussian means against two-sided alternatives is considered under two different scenarios: (i) when the correlation matrix is known, and (ii) when the correlation matrix is unknown but estimated from an independent Wishart matrix. New methods, capturing dependence among the variables and with theoretically proven finite-sample control of the false discovery rate (FDR), are presented. When the correlation matrix is known, two methods, referred to as shifted-BH methods, are produced. Each of them is developed by shifting the p-values and applying a BH-type step-up procedure to the shifted p-values. The amount of shift for each p-value is appropriately determined from the correlation matrix to achieve the desired FDR control. Simulation studies and a real-data application show favorable performance of the shifted-BH methods compared with their relevant competitors available in the literature. When the correlation matrix is estimated using an independent Wishart matrix, no method with theoretically proven finite-sample FDR control is available in the literature, as far as we know. This talk will present some new results in this context, addressing the long-standing open question: Can the Benjamini-Hochberg method in its original form theoretically control FDR?
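For context, the standard Benjamini-Hochberg step-up procedure that the shifted-BH methods modify can be sketched as follows (the p-values are illustrative):

```python
import numpy as np

def bh_stepup(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k is
    the largest index with sorted p_(k) <= k * alpha / m."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    # Step up: find the largest k whose sorted p-value clears its threshold
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

print(bh_stepup([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.9]))
```

The shifted-BH idea, per the abstract, replaces the raw p-values with correlation-informed shifted p-values before this step-up pass; that shift is not shown here.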
When: Thursday, November 07, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Andee Kaplan, Department of Statistics, Colorado State University
Abstract: With the ubiquity of data, linking data sets has become crucial for myriad applications, including healthcare, official statistics, ecology, fraud detection, and national security. Record linkage is the task of resolving duplicates in two or more partially overlapping sets of records, or files, from noisy data sources without a unique identifier. In any field where multiple sources of messy data are available to answer a scientific problem, record linkage is critical in the analysis pipeline. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. We approach the problem from a Bayesian perspective, with estimates in the form of posterior samples of parameters, and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. In this talk, we present a Bayesian linkage model for the multi-file case formulated specifically for the streaming data context and propose computational methods to perform streaming updates that achieve near-equivalent posterior inference at a small fraction of the compute time.
When: Thursday, November 14, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Yuqi Gu, Department of Statistics, Columbia University
Abstract: Mixed membership models are popular individual-level mixture models widely used in various fields including network analysis, topic modeling, and multivariate categorical data analysis. This work focuses on mixed membership models for multivariate categorical data, which are also called Grade of Membership (GoM) models. GoM models drastically increase the modeling flexibility of latent class models by allowing each individual to partially belong to multiple extreme latent profiles. However, such flexibility also comes with challenging identifiability and estimation issues, especially for high-dimensional polytomous (categorical with more than two categories) data. Such data take the form of a three-way tensor, with N subjects responding to J items each with C categories. Existing estimation methods based on maximum likelihood or Bayesian MCMC inference are not computationally efficient and lack high-dimensional theoretical guarantees. We propose an SVD-based spectral method for high-dimensional GoM models with potential local dependence. We innovatively flatten the three-way tensor into a “fat” matrix and exploit the singular subspace geometry based on the matrix SVD for estimation. We establish fine-grained finite-sample entrywise error bounds for all the parameters. Moreover, we develop a novel two-to-infinity singular subspace perturbation theory under arbitrary local dependent noise, which is of independent interest. Simulations and applications to real-world data in genetics, political science, and single-cell sequencing demonstrate the merit of the proposed method.
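The flattening step can be illustrated in a few lines. The dimensions and random responses below are hypothetical, and this is only the reshape-and-SVD skeleton, not the talk's full estimator:

```python
import numpy as np

# Hypothetical polytomous responses: N subjects, J items, C categories,
# one-hot encoded as an N x J x C binary tensor
N, J, C = 500, 20, 4
rng = np.random.default_rng(1)
responses = rng.integers(0, C, size=(N, J))
tensor = np.eye(C)[responses]          # shape (N, J, C)

# Flatten the three-way tensor into a "fat" N x (J*C) matrix
fat = tensor.reshape(N, J * C)

# Spectral step: with K extreme latent profiles, the leading K-dimensional
# singular subspace of the flattened matrix carries the mixed-membership
# structure that the method goes on to exploit
K = 3
U, s, Vt = np.linalg.svd(fat, full_matrices=False)
subspace = U[:, :K]                    # subject-side singular subspace
print(subspace.shape)                  # (500, 3)
```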
When: Thursday, November 21, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Dana Tudorascu, Department of Psychiatry and Biostatistics, University of Pittsburgh
Abstract: Multisite imaging studies increase statistical power and enable the generalization of research outcomes; however, variation in imaging acquisition, differing PET tracer properties, and inter-scanner variability hinder the direct comparability of multi-scanner PET data. The PET imaging field is lagging behind in terms of harmonization methods due to the complexity associated with combining different tracers and different scanners. MRI presents similar challenges, mainly due to scanner differences. In this study we investigate samples of cognitively normal participants, subjects with mild cognitive impairment, and Alzheimer's disease subjects in two major multisite studies of Alzheimer's disease. We present challenges and solutions associated with different MRI scanners and PET tracers, as well as analysis and harmonization techniques including simple imaging standardization, ComBat, and deep learning methods. We show region-of-interest differences in PET outcome measures before and after harmonization in multisite studies of Alzheimer's disease, as well as voxel-level harmonization along with summary measures before and after harmonization in MRI studies.
When: Thursday, December 05, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224
Speaker: Dr. Sumanta Basu, Department of Statistics and Data Science, Cornell University
Abstract: With advances in data collection and storage, statistical learning algorithms are becoming increasingly popular for structure learning and prediction with large-scale data sets that exhibit temporal or spatial dependence. Most approaches in the literature use off-the-shelf machine learning algorithms that ignore the dependent nature of the data. In this talk, we aim to demonstrate the merit of incorporating classical statistical wisdom for scale and dependence modeling into the statistical learning framework through two algorithms that we developed. The first, called RF-GLS, extends random forests (RF) for dependent error processes in the same way Generalized Least Squares (GLS) fundamentally extends Ordinary Least Squares (OLS) for linear models under dependence. The second algorithm, called AutoTune, offers an automatic tuning parameter selection algorithm for LASSO, by revisiting the well-known problem of scale estimation and adjustment for high-dimensional regression. We illustrate the benefit of these algorithms on simulated data sets, and provide some theoretical analysis to shed light on their asymptotic properties.
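The OLS-to-GLS contrast underlying RF-GLS can be sketched with simulated AR(1) errors; all data and parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])

# AR(1) errors: serially dependent noise that plain OLS ignores
rho = 0.8
e = np.zeros(n)
for t in range(1, n):
    e[t] = rho * e[t - 1] + rng.normal()
y = X @ beta_true + e

# OLS: (X'X)^{-1} X'y, which weights all observations equally
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# GLS: (X' S^{-1} X)^{-1} X' S^{-1} y, whitening by the AR(1)
# covariance S with entries rho^|i-j|
i, j = np.indices((n, n))
S = rho ** np.abs(i - j)
Sinv = np.linalg.inv(S)
beta_gls = np.linalg.solve(X.T @ Sinv @ X, X.T @ Sinv @ y)
print(beta_ols, beta_gls)  # both should be near [1, 2]; GLS is more efficient
```

RF-GLS applies the same whitening idea to the split criterion and fitted values of a random forest rather than to a linear model.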