Research in Statistics and Mathematics …
I have extensive interests and expertise in Statistics and Mathematics. I have conducted research on simultaneous inference for high-dimensional
dependent data and for high-dimensional discrete data, and on exponential families of distributions, Itô diffusion processes, and time series
models. I am shifting my research efforts to statistical modeling and inference for non-Euclidean data and to the mathematical foundations of neural network models. These research topics draw on a variety of tools in mathematics, including real/complex/functional/harmonic analysis, measure/probability theory, (stochastic) differential/integral equations, differential geometry, and topology.
Information on Some Projects
On this web page, papers marked "Preprint" can be downloaded from arXiv.org; those marked "Published" can be obtained from their publishing journals; those marked "Manuscript" are neither on arXiv nor published; and those marked "In preparation" are not yet finished.
A. Current Projects by Topics
- Interaction between algebra, geometry, topology, and statistics: Euclidean space is simultaneously a topological space, an additive group, a metric space, a manifold, and a vector space, and these 5 structures are compatible with each other. This is why we can do such nice statistics and probability theory in Euclidean space. If we remove the vector space structure, then linear models can no longer be defined (globally); if we remove the additive group structure, then additive models and location-shift distributions can no longer be defined; if we remove the manifold structure, then the central limit theorem may no longer hold and the second pillar of statistics collapses; if we remove the metric structure, then the Glivenko–Cantelli theorem may no longer hold and the first pillar of statistics collapses. In fact, without a metric or group structure, very little statistics can be done. However, there are many real-world data sets whose modeling data spaces lack one or several of the 5 structures mentioned above. So, a natural question is "To conduct sensible statistical inference and/or modeling, what are the minimal requirements on the algebraic, topological, and/or geometric structures of the data space?"
- Statistics for neural networks (in particular, deep convolutional neural networks (CNNs)) and the study of "big models": Deep CNNs have demonstrated remarkable accuracy in binary classification (and other prediction) tasks. However, they belong to a class of models, which I call "big models", whose complexity far exceeds that of the "small models" that have dominated statistics. Little is known about the non-asymptotic statistical properties of CNNs, e.g., their performance in terms of the FDR. In general, statistics has only just started studying "big models" in the era of "big data", and there is a lot to be done for them.
- The Tukey-Kramer conjecture for multiple testing: For details on this conjecture, please read "John W. Tukey's contributions to multiple comparisons" by Yoav Benjamini and Henry Braun. The conjecture claims that using the average correlation among a set of test statistics in multiple testing for FDR control actually works (conservatively), so that the full dependence structure does not need to be used or taken into account.
- Control of multiple error criteria in multiple testing: Most work in multiple testing controls a single error criterion, such as the k-FDR or k-FWER, without necessarily ensuring power at a prespecified level. In practice, however, there are situations where we need to control several error criteria simultaneously, or control the same error criterion on different "layers" or "groups" of hypotheses in order to take the structure of a set of hypotheses into account, while at the same time ensuring a prespecified power level. Needless to say, this is a very challenging task, since there are examples of procedures that optimize an error criterion and a power criterion but behave unstably or nonsensically. Further, this is related to multiple testing of structured hypotheses, a current trend in multiple testing.
- FDR control under dependence: To date, we are aware of only two types of dependence, namely PRDS and reverse martingale dependence, under which the famous Benjamini-Hochberg (BH) procedure is conservative. It is known that the BH procedure is not conservative when, e.g., the PRDS condition is reversed. So, natural questions are "Can we identify another type of dependence under which the BH procedure is conservative?" and "How can we modify the step-up critical constants of the BH procedure nontrivially to account for dependence while maintaining FDR control?" On the other hand, there is considerable numerical evidence that some adaptive FDR procedures are conservative under positive dependence, even though this has not been proven theoretically. So, natural questions are "Can we prove that they are indeed conservative under such dependence?" and "Can we classify distributions or nontrivial dependence structures that satisfy conditional PRDS?" (A small sketch of the BH step-up procedure is given below.)
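For reference, here is a minimal sketch of the classical BH step-up procedure mentioned above, written in Python with NumPy; the function name bh_rejections, the nominal level, and the example p-values are illustrative choices, not taken from any specific paper.

```python
import numpy as np

def bh_rejections(pvals, alpha=0.05):
    """Classical Benjamini-Hochberg step-up procedure.

    Rejects H_(1), ..., H_(k), where k is the largest index i with
    p_(i) <= i * alpha / m and p_(1) <= ... <= p_(m) are the ordered p-values.
    Returns a boolean mask of rejected hypotheses (in the original order).
    """
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)                  # indices sorting p-values ascending
    sorted_p = pvals[order]
    crit = alpha * np.arange(1, m + 1) / m     # BH step-up critical constants
    below = np.nonzero(sorted_p <= crit)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        k = below.max()                        # largest i with p_(i) <= i * alpha / m
        reject[order[: k + 1]] = True
    return reject

# Illustrative usage with made-up p-values
pvals = [0.001, 0.008, 0.039, 0.041, 0.2, 0.6, 0.9]
print(bh_rejections(pvals, alpha=0.05))
```

One known way to retain FDR control under arbitrary dependence is to shrink the critical constants alpha*i/m, e.g., dividing them by the harmonic sum 1 + 1/2 + ... + 1/m as in the Benjamini-Yekutieli procedure, at a cost in power; the questions above ask for less drastic, nontrivial modifications.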
B. Past Projects in Statistics by Topics
- The proportion of true null hypotheses or false null hypotheses: The proportions of true null and false null hypotheses play important roles in statistical modeling and inference (e.g., they are key components of the Bayesian two-component mixture model, its extensions, and the decision rules for these models), and they appear in upper bounds on the false discovery rate (FDR) and false nondiscovery rate (FNR) of a multiple testing procedure (MTP). These proportions have to be estimated in order to implement the decision rules associated with the Bayesian two-component mixture models, and information on these proportions helps to better control the FDR and FNR of an MTP. For the latter, estimators of these proportions can be used to construct adaptive FDR and FNR procedures that are more powerful than their non-adaptive counterparts. A good starting point is to read #d below; a small sketch of a classical proportion estimator is given after the paper list.
- Xiongzhi Chen (2025): Uniformly consistent proportion estimation for composite hypotheses via integral equations: "the case of Gamma random variables". (To appear)
- Xiongzhi Chen (2025): Uniformly consistent proportion estimation for composite hypotheses via integral equations: "the case of location-shift families". (Preprint. The manuscript in Item 5 has been split into Item 1 and Item 2.)
- Xiongzhi Chen (2021+): Consistent estimation of the proportion of false nulls and FDR for adaptive multiple testing Normal means under weak dependence. (Preprint)
- Xiongzhi Chen (2019): Uniformly consistently estimating the proportion of false null hypotheses via Lebesgue-Stieltjes integral equations. (Published)
- Xiongzhi Chen (2019): Uniformly consistently estimating the proportion of false null hypotheses for composite null hypotheses via Lebesgue-Stieltjes integral equations. (Preprint)
- Xiongzhi Chen and R.W. Doerge (2014): A consistent estimator of the proportion of nonzero Normal means under certain strong covariance dependence. (Preprint)
- Xiongzhi Chen and John D. Storey (2014): Estimating the proportion of true null hypotheses via goodness of fit. (Manuscript)
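As an entry point to this topic, the following is a minimal Python sketch of the classical Storey-type estimator of the proportion of true nulls based on a tuning parameter lambda. It is only a baseline illustration, not the integral-equation or goodness-of-fit estimators developed in the papers above, and the function name and simulated p-values are illustrative.

```python
import numpy as np

def storey_pi0(pvals, lam=0.5):
    """Storey-type estimator of the proportion of true null hypotheses.

    Under the null, p-values are (approximately) uniform on [0, 1], so the
    fraction of p-values exceeding lambda, rescaled by 1 / (1 - lambda),
    conservatively estimates the proportion pi_0 of true nulls.
    """
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    pi0 = np.sum(pvals > lam) / ((1.0 - lam) * m)
    return min(pi0, 1.0)   # pi_0 is a proportion, so truncate at 1

# Illustrative usage: 80% uniform null p-values, 20% small p-values from alternatives
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=800), rng.beta(0.2, 5.0, size=200)])
print(storey_pi0(pvals, lam=0.5))
```

Larger values of lambda reduce the bias of such an estimator but increase its variance; the papers above address, among other things, uniform consistency of proportion estimators under composite nulls, discreteness, and dependence.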
- False discovery rate (FDR) procedures: The FDR is a modern criterion on the "overall type I error" of simultaneously testing from tens up to millions of hypotheses. FDR procedures are MTPs whose FDRs are controlled at a level prespecified by the user. A recent trend in research on FDR procedures is to utilize additional information beyond p-values, or structural information possessed by the hypotheses, so as to make more scientific discoveries while controlling the FDR at a prespecified level. On the other hand, conventional FDR procedures may lose power when applied to discrete statistics or discrete p-values, and there is a need to develop FDR procedures, or improve existing ones, for discrete data. There are various approaches to each of these two tasks, and any of the following papers can be a starting point; a small sketch of a mid p-value for a discrete test is given after the paper list.
- Xiongzhi Chen, R.W. Doerge and Sanat K. Sarkar (2020): A weighted FDR procedure under discrete and heterogeneous null distributions. (Published; R package "fdrDiscreteNull" on CRAN).
- Xiongzhi Chen, R.W. Doerge and Joseph F. Heyse (2018): Multiple testing with discrete data: proportion of true null hypotheses and two adaptive FDR procedures. (Published; R package "fdrDiscreteNull" on CRAN).
- Xiongzhi Chen, David G. Robinson and John D. Storey (2019): Functional false discovery rate with application to genomics. (Published)
- Xiongzhi Chen (2020): False discovery rate control for multiple testing based on discrete p-values. (Published)
- Xiongzhi Chen and Sanat K. Sarkar (2019): On Benjamini-Hochberg procedure applied to mid p-values. (Published)
- Shinjini Nandi, Sanat K. Sarkar and Xiongzhi Chen (2021): Adapting to one- and two-way classified structures of hypotheses while controlling the false discovery rate. (Published)
- Xiongzhi Chen and R.W. Doerge (2012): Towards better FDR procedures for discrete test statistics. (Published)
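To illustrate why discreteness matters for the papers above, here is a minimal Python sketch of a mid p-value for a one-sided binomial test: with a discrete test statistic the ordinary p-value P(X >= x_obs) includes the whole point mass at the observed value, while the mid p-value counts only half of it. The function name and the numbers in the usage line are illustrative.

```python
from scipy.stats import binom

def one_sided_pvalues(x_obs, n, p0):
    """Ordinary and mid p-values for testing H0: p = p0 against H1: p > p0."""
    # Ordinary p-value: P(X >= x_obs) under H0
    p_ordinary = binom.sf(x_obs - 1, n, p0)
    # Mid p-value: P(X > x_obs) + 0.5 * P(X = x_obs) under H0
    p_mid = binom.sf(x_obs, n, p0) + 0.5 * binom.pmf(x_obs, n, p0)
    return p_ordinary, p_mid

# Illustrative usage: 8 successes in 10 trials, null success probability 0.5
print(one_sided_pvalues(8, 10, 0.5))
```

Mid p-values are stochastically smaller than ordinary p-values under the null, which is why procedures applied to them, such as the BH procedure in the paper above, require a separate analysis.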
- Modeling via a latent, low-dimensional linear space: Due to the experimental design that produced a high-dimensional data set, the expectations (or variances) of the random variables used to model the observations may possess a low-dimensional structure. A simple approach is to assume that the expectations lie in the linear space spanned by a few latent vectors, leading to a low-dimensional, linear latent space for the mean (or variance) space. This strategy can be generalized so that the means (or variances), as a point set in an ambient Euclidean space, lie on a low-dimensional manifold. These low-dimensional structures not only provide an efficient means of dimension reduction (and assist downstream analysis of the data) but can also be used to assess whether certain parametric assumptions on the modeling random variables are valid. It turns out that when the modeling random variables are members of an exponential family with a specific mean-variance relationship, this low-dimensional, latent linear space can be consistently estimated even when each such random variable has only one observation. Identifying exponential families for which such a space can be consistently estimated when each modeling random variable has only one observation leads to the study of reduction functions of exponential families and to the classification of such families via the existence of reduction functions. A good starting point is to read #a below; a small sketch of estimating a latent linear space via a truncated SVD is given after the paper list.
- Xiongzhi Chen and John D. Storey (2015): Consistent estimation of low-dimensional latent structure in high-dimensional data. (Preprint)
- John D. Storey, Keyur H. Desai and Xiongzhi Chen (2013): Empirical Bayes inference of dependent high-dimensional data. (Manuscript)
- Xiongzhi Chen and John D. Storey (2014): Nonparametric empirical Bayes estimation of the surrogate variable analysis model. (Manuscript)
- Xiongzhi Chen, Wei Hao and John D. Storey: Regression herding. (In preparation.)
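The following sketch illustrates the latent linear space idea in its simplest form, assuming the row means of a data matrix lie in the span of a few latent vectors and estimating that span with a truncated SVD. It is only a toy illustration under Gaussian noise, not the estimators studied in the papers above, and all dimensions and names are made up.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(1)

# Simulate: the rows of the m x n mean matrix lie in the span of r latent vectors
m, n, r = 500, 50, 3
basis = rng.normal(size=(r, n))                      # latent row vectors spanning the mean space
loadings = rng.normal(size=(m, r))
means = loadings @ basis                             # each row mean lies in span(basis)
data = means + rng.normal(scale=0.5, size=(m, n))    # one noisy observation per entry

# Estimate the latent linear space by the top-r right singular vectors of the data
_, _, vt = np.linalg.svd(data, full_matrices=False)
estimated_basis = vt[:r]

# Compare the true and estimated r-dimensional subspaces via principal angles (in degrees)
q_true = np.linalg.qr(basis.T)[0]
q_est = np.linalg.qr(estimated_basis.T)[0]
print(np.rad2deg(subspace_angles(q_true, q_est)))
```

For non-Gaussian observations from an exponential family with only one observation per random variable, recovering this span is much more delicate, which is where the reduction functions mentioned above come in.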
- Multiple testing and stochastic processes under dependence: For inference on high-dimensional data, the associated test statistics are often dependent and can have a complicated dependence structure. In this setting, the false discovery proportion (FDP), whose expectation is the FDR, can be very unstable even though its expectation may be close to a prespecified value. Since an experiment is often carried out only once, we observe only one instance of the FDP when we want to control the FDR. So, it is somewhat dangerous to claim control of the FDR when we have actually observed only one value of the FDP. One line of research on controlling the FDR under dependence is to investigate the convergence of three key empirical processes induced by an MTP, namely, the process of the number of false rejections, that of the total number of rejections, and the FDP process. Among the several notions of convergence of probability measures, the strong law of large numbers (SLLN) is perhaps the strongest. If an empirical process satisfies the SLLN, then we can be very certain about its asymptotic stability (or instability). The following works are along this line, and most of them use orthogonal polynomials associated with bivariate Lancaster distributions (including, e.g., the bivariate Gaussian and bivariate Gamma distributions) and asymptotics of special functions such as the Gamma and Bessel functions. A good starting point is to read #a below; a small simulation sketch of the instability of the FDP under dependence is given after the paper list.
- Xiongzhi Chen and R.W. Doerge (2020): A strong law of large numbers related to multiple testing normal means. (Published)
- Xiongzhi Chen (2020): A strong law of large numbers for simultaneously testing parameters of Lancaster bivariate distributions. (Published)
- Xiongzhi Chen and R.W. Doerge (2015): Stopping time property of thresholds of Storey-type FDR procedures. (Preprint)
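The following small simulation, under purely illustrative settings (equicorrelated one-sided normal test statistics generated by a single common factor), shows the phenomenon described above: the realized FDP of the BH procedure varies widely across replications even when its average stays near the nominal level. All function and variable names are made up for the illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    m = pvals.size
    order = np.argsort(pvals)
    crit = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(pvals[order] <= crit)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size:
        reject[order[: below.max() + 1]] = True
    return reject

m, m1, rho, delta, alpha, reps = 1000, 100, 0.8, 3.0, 0.1, 200
is_false_null = np.zeros(m, dtype=bool)
is_false_null[:m1] = True

fdps = []
for _ in range(reps):
    shared = rng.normal()                               # common factor inducing equicorrelation
    z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.normal(size=m)
    z[is_false_null] += delta                           # add signal to the false nulls
    pvals = norm.sf(z)                                  # one-sided p-values
    rej = bh_reject(pvals, alpha)
    fdps.append(np.sum(rej & ~is_false_null) / max(rej.sum(), 1))   # realized FDP

fdps = np.asarray(fdps)
print("mean FDP (approx. FDR):", fdps.mean(), "| std of FDP:", fdps.std())
```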
C. Current Projects in Mathematics by Topics
- Classification of measures on the 2-sphere with a unique Fréchet mean: This is a part of Project 1 in Current Projects. (A small numerical sketch of a Fréchet mean on the sphere is given at the end of this section.)
- Classification of measures on symmetric spaces with a unique Fréchet mean: This is also a part of Project 1 in Current Projects.
- Xiongzhi Chen (2015+): On Samuel Karlin's problem of geometric probability and a variant. (In preparation.)
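For concreteness only, here is a minimal numerical sketch of the object appearing in the two classification projects above: the Fréchet (Karcher) mean of a sample on the unit 2-sphere, i.e., the minimizer of the sum of squared geodesic distances, computed by a simple Riemannian gradient descent. The step size, iteration count, and the projection retraction used in place of the exact exponential map are ad hoc choices, and the function names are illustrative.

```python
import numpy as np

def log_map(base, x):
    """Riemannian log map on the unit 2-sphere: tangent vector at `base` pointing to `x`."""
    cos_t = np.clip(base @ x, -1.0, 1.0)
    theta = np.arccos(cos_t)                         # geodesic distance from base to x
    v = x - cos_t * base                             # component of x orthogonal to base
    norm_v = np.linalg.norm(v)
    return np.zeros(3) if norm_v < 1e-12 else (theta / norm_v) * v

def frechet_mean_sphere(points, step=1.0, n_iter=100):
    """Gradient descent for the Frechet mean: minimize the sum of squared geodesic distances."""
    mu = points[0] / np.linalg.norm(points[0])       # initialize at a sample point
    for _ in range(n_iter):
        grad = np.mean([log_map(mu, p) for p in points], axis=0)   # mean log direction
        mu = mu + step * grad                        # move along the mean log direction...
        mu = mu / np.linalg.norm(mu)                 # ...and project back onto the sphere
    return mu

# Illustrative usage: points clustered around the north pole (mean is unique here)
rng = np.random.default_rng(3)
pts = rng.normal(size=(50, 3)) * 0.2 + np.array([0.0, 0.0, 1.0])
pts = pts / np.linalg.norm(pts, axis=1, keepdims=True)
print(frechet_mean_sphere(pts))
```

The classification questions above ask for which measures on the sphere (or on more general symmetric spaces) this minimizer is unique, something the sketch simply assumes by clustering the sample in one hemisphere.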
D. Past Projects in Mathematics by Topics
- Theory of one-parameter natural exponential families (NEFs): Natural exponential families are probability distributions that are widely used in statistical modeling and inference due to their simplicity and flexibility. However, the mathematics behind NEFs is often quite difficult and requires strong skills in analysis (both real and complex analysis), measure theory, and harmonic analysis, since NEFs are often studied and analyzed via their Laplace transforms on the complex domain. We often use generalized linear models to model observations, trying to capture the mean-variance relationship of the underlying but unknown data-generating distributions via an NEF. However, if a postulated mean-variance relationship does not exist, then the resulting model is meaningless. The following #a partially resolves this issue. On the other hand, for modeling with a low-dimensional, latent linear space in the mean space (mentioned earlier) of random variables from an NEF, the existence of a reduction function is needed in order to estimate this space consistently. The following #b partially resolves this issue. There are several open problems related to NEFs that were communicated to me by Prof. Gérard Letac, who has provided constant help and guidance on my research on NEFs. (The standard definitions of the cumulant and variance functions are recalled after the paper list below.)
- Xiongzhi Chen (2016): Resolution of a conjecture on variance functions for one-parameter natural exponential family. (Published.)
- Xiongzhi Chen (2018): Reduction functions for the variance function of one-parameter natural exponential family. (Published.)
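For readers unfamiliar with NEFs, the following recalls the standard textbook definitions behind the two papers above (these are classical facts, not results of the papers): the cumulant function, the mean parametrization, and the variance function, with the Poisson family as an example.

```latex
% A one-parameter NEF generated by a non-degenerate positive measure \nu on \mathbb{R}
% consists of the densities (with respect to \nu)
\[
  f_\theta(x) = \exp\{\theta x - \kappa(\theta)\},
  \qquad
  \kappa(\theta) = \log \int_{\mathbb{R}} e^{\theta x}\,\nu(dx),
  \qquad \theta \in \Theta,
\]
% where \Theta is the interior of the set on which \kappa is finite.
% The mean and variance come from derivatives of the cumulant function:
\[
  \mu(\theta) = \kappa'(\theta),
  \qquad
  \operatorname{Var}_\theta(X) = \kappa''(\theta).
\]
% The variance function expresses the variance as a function of the mean,
\[
  V(\mu) = \kappa''\bigl(\theta(\mu)\bigr), \qquad \mu \in \mu(\Theta),
\]
% and the pair (V, \mu(\Theta)) characterizes the NEF.
% Example (Poisson): \kappa(\theta) = e^{\theta}, so \mu = e^{\theta} and V(\mu) = \mu.
```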
- Xiongzhi Chen (2015): Explicit solutions to a vector time series model and its induced model for business cycles. (Preprint.)
- Diffusion processes: modeling and estimation
- Master's Thesis
- Xiongzhi Chen and Changlin Cai (2006): A new architecture for multilayer perceptrons as function approximators. Journal of Sichuan University (Natural Science Edition), Vol. 2.
- Changlin Cai, Zhongzhi Shi and Xiongzhi Chen (2006): The Fisher information matrix on neural manifolds of multilayer perceptrons. Journal of Sichuan University (Natural Science Edition), Accepted.