- T. S. Cheng, A. Lucchi, A. Kratsios, D. Belius - Characterizing Overfitting in Kernel Ridgeless Regression Through the Eigenspectrum, Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
[ Abstract Proceedings link Conference PDF ]
We derive new bounds for the condition number of kernel matrices, which we then use to enhance existing non-asymptotic test error bounds for kernel ridgeless regression (KRR) in the over-parameterized regime for a fixed input dimension. For kernels with polynomial spectral decay, we recover the bound from previous work; for exponential decay, our bound is non-trivial and novel. Our contribution is two-fold: (i) we rigorously prove the phenomena of tempered overfitting and catastrophic overfitting under the sub-Gaussian design assumption, closing an existing gap in the literature; (ii) we identify that the independence of the features plays an important role in guaranteeing tempered overfitting, raising concerns about approximating KRR generalization using the Gaussian design assumption in previous literature.
- D. Belius, F. Concetti, G. Genovese - On the determinant in Bray-Moore's TAP complexity formula, preprint.
[ Abstract arxiv ]
In the computation of the TAP complexity, originally carried out by Bray and Moore, a fundamental step is to calculate the determinant of a random Hessian. As the replica method does not give a clear prescription, physicists debated how to perform this computation and its consequences on the TAP complexity for a long time. In this paper we prove the original Bray and Moore formula for the behaviour of the determinant at exponential scale to be correct, and compute an important prefactor coming from a small outlier in the spectrum.
- T. S. Cheng, A. Lucchi, I. Dokmanic, A. Kratsios, D. Belius - A Theoretical Analysis of the Test Error of Finite-Rank Kernel Ridge Regression, Proceedings of the 36th Conference on Neural Information Processing Systems (NIPS), 2023.
[ Abstract Proceedings link Conference PDF ]
Existing statistical learning guarantees for general kernel regressors often yield loose bounds when used with finite-rank kernels. Yet, finite-rank kernels naturally appear in a number of machine learning problems, e.g. when fine-tuning a pre-trained deep neural network's last layer to adapt it to a novel task when performing transfer learning. We address this gap for finite-rank kernel ridge regression (KRR) by deriving sharp non-asymptotic upper and lower bounds for the KRR test error of any finite-rank KRR. Our bounds are tighter than previously derived bounds on finite-rank KRR and, unlike comparable results, they also remain valid for any regularization parameters.
- D. Belius, L. Fröber - Fluctuations of the ground state of the spiked spherical Sherrington-Kirkpatrick model, submitted.
[ Abstract arxiv ]
The Sherrington-Kirkpatrick Hamiltonian is a random quadratic function on the high-dimensional sphere. This article studies the ground state (i.e. maximum) of this Hamiltonian with external field, or more generally with a non-linear "spike" term. We compute the level of the maximum to leading order, and under appropriate condition its first- and second-order fluctuations. The equivalent results are also derived for the maximum of the model's TAP free energy on the ball.
- D. Belius, L. Fröber, J. Ko - TAP variational principle for the constrained overlap multiple spherical Sherrington-Kirkpatrick model, submitted.
[ Abstract arxiv ]
Spin glass models involving multiple replicas with constrained overlaps have been studied in [FPV92; PT07; Pan18a]. For the spherical versions of these models [Ko19; Ko20] showed that the limiting free energy is given by a Parisi type minimization. In this work we show that for Sherrington-Kirkpatrick (i.e. 2-spin) interactions, it can also be expressed in terms of a Thouless-Andersson-Palmer (TAP) variational principle. This is only the second spin glass model where a mathematically rigorous TAP computation of the free energy at all temperatures and external fields has been achieved. The variational formula we derive here also confirms that the model is replica symmetric, a fact which is natural but not obviously deducible from its Parisi formula.
- A. Maillard, A. S. Bandeira, D. Belius, I. Dokmanić, S. Nakajima - Injectivity of ReLU networks: perspectives from statistical physics, submitted.
[ Abstract arxiv ]
When can the input of a ReLU neural network be inferred from its output? In other words, when is the network injective?
We consider a single layer, \(x \mapsto \mathrm{ReLU}(Wx)\), with a random Gaussian \(m \times n\) matrix \(W\),
in a high-dimensional setting where \(n, m \to \infty\).
Recent work connects this problem to spherical integral geometry giving rise to a conjectured sharp injectivity threshold for \(\alpha = \frac{m}{n}\) by studying the expected Euler characteristic of a certain random set.
We adopt a different perspective and show that injectivity is equivalent to a property of the ground state of the spherical perceptron, an important spin glass model in statistical physics. By leveraging the (non-rigorous) replica symmetry-breaking theory, we derive analytical equations for the threshold whose solution is at odds with that from the Euler characteristic. Furthermore, we use Gordon's min--max theorem to prove that a replica-symmetric upper bound refutes the Euler characteristic prediction.
Along the way we aim to give a tutorial-style introduction to key ideas from statistical physics in an effort to make the exposition accessible to a broad audience.
Our analysis establishes a connection between spin glasses and integral geometry but leaves open the problem of explaining the discrepancies.
- D.Belius, M. Schmidt - Complexity of local maxima of given radial derivative for mixed p-spin Hamiltonians, submitted.
[ Abstract arxiv ]
We study the number of local maxima with given radial derivative of spherical
mixed p-spin models and prove that the second moment matches the square of the first moment
on exponential scale for arbitrary mixtures and any radial derivative. This is surprising, since
for the number of local maxima with given radial derivative and given energy the corresponding
result is only true for specific mixtures [Sub17; BSZ20].
We use standard Kac-Rice computations to derive formulas for the first and second moment
at exponential scale, and then find a remarkable analytic argument that shows that the second
moment formula is bounded by twice the first moment formula in this general setting.
This also leads to a new proof of a central inequality used to prove concentration of the
number critical points of pure p-spin models of given energy in [Sub17] and removes the need
for the computer assisted argument used in that paper for 3 ≤ p ≤ 10.
- D.Belius, M. Schmidt - Phase diagram for the TAP energy of the p-spin spherical mean field spin glass model, preprint.
[ Abstract arxiv ]
We solve the Thouless-Anderson-Palmer (TAP) variational principle associated to the spherical pure p-spin mean field spin glass Hamiltonian and present a detailed phase diagram.
In the high temperature phase the maximum of variational principle is the annealed free energy of the model. In the low temperature phase the maximum, for which we give a formula, is strictly smaller.
The high temperature phase consists of three subphases. (1) In the first phase m=0 is the unique relevant TAP maximizer. (2) In the second phase there are exponentially many TAP maximizers, but m=0 remains dominant. (3) In the third phase, after the so called dynamic phase transition, m=0 is no longer a relevant TAP maximizer, and exponentially many non-zero relevant TAP solutions add up to give the annealed free energy.
Finally in the low temperature phase a subexponential number of TAP maximizers of near-maximal TAP energy dominate.
- D. Belius - High temperature TAP upper bound for the free energy of mean field spin glasses, submitted.
[ Abstract arxiv ]
This work proves an upper bound for the free energy of the Sherrington-Kirkpatrick model and its generalizations
in terms of the Thouless-Anderson-Palmer (TAP) energy. The result applies to models with spherical or Ising spins and
any mixed p-spin Hamiltonian with external field or with a non-linear spike term. The bound is expected to be tight to
leading order at high temperature, and is non-trivial in the presence of an external field. For the proof a geometric
microcanonical method is employed, in which one covers the spin space with sets, each of which is centered at a magnetization
vector m and whose contribution to the partition function is bounded in terms of the TAP energy at m.
- D. Banerjee, D. Belius - Fluctuations of the free energy of the mixed p-spin mean field spin glass model, preprint.
[ Abstract
arxiv ]
We prove the convergence in distribution of the fluctuations of the free energy of the mixed p-spin Sherrington-Kirkpatrick model with non-vanishing 2-spin component at high enough temperature. The limit is Gaussian, and the fluctuations are seen to arise from weighted cycle counts in the complete graph on the spin indices weighted by the 2-spin interaction matrix.
- D. Belius, J. Černý, S. Nakajima, M. Schmidt - Triviality of the geometry of mixed p-spin spherical Hamiltonians with external field, Journal of Statistical Physics 186.1 (2022): 1-3.
[ Abstract
Journal link
arxiv ]
We study isotropic Gaussian random fields on the high-dimensional sphere with an added deterministic linear term, also known as mixed p-spin Hamiltonians with external field. We prove that if the external field is sufficiently strong, then the resulting function has trivial geometry, that is only two critical points. This contrasts with the situation of no or weak external field where these functions typically have an exponential number of critical points.
We give an explicit threshold hc for the magnitude of the external field necessary for trivialization and conjecture hc to be sharp.
The Kac-Rice formula is our main tool.
Our work extends [Fyo15], which identified the trivial regime for the special case of pure p-spin Hamiltonians with random external field.
- T. Liu, A. Chaman, D. Belius, I. Dokmanic - Interpreting U-Nets via Task-Driven Multiscale Dictionary Learning,
IEEE Transactions on Computational Imaging 8 (2022): 425–37. https://doi.org/10.1109/TCI.2022.3175309
[ Abstract
arxiv ]
U-Nets have been tremendously successful in many imaging inverse problems. In an effort to understand the source of this success, we show that one can reduce a U-Net to a tractable, well-understood sparsity-driven dictionary model while retaining its strong empirical performance. We achieve this by extracting a certain multiscale convolutional dictionary from the standard U-Net. This dictionary imitates the structure of the U-Net in its convolution, scale-separation, and skip connection aspects, while doing away with the nonlinear parts. We show that this model can be trained in a task-driven dictionary learning framework and yield comparable results to standard U-Nets on a number of relevant tasks, including CT and MRI reconstruction. These results suggest that the success of the U-Net may be explained mainly by its multiscale architecture and the induced sparse representation.
- M. Samarin, V. Roth, D. Belius, Feature Learning and Random Features in Standard Finite-Width Convolutional Neural Networks: An Empirical Study, Conference on Uncertainty in Artificial Intelligence (UAI) 2022.
[ Abstract
pdf ]
The Neural Tangent Kernel is an important milestone in the ongoing effort to build a theory for deep learning. Its prediction that sufficiently wide neural networks behave as kernel methods, or equivalently as random feature models arising from linearized networks, has been confirmed empirically for certain wide architectures. In this paper, we compare the performance of two common finite-width convolutional neural networks, LeNet and AlexNet, to their linearizations on common benchmark datasets like MNIST and modified versions of it, CIFAR-10 and an ImageNet subset. We demonstrate empirically that finite-width neural networks, generally, greatly outperform the finite-width linearization of these architectures. When increasing the problem difficulty of the classification task, we observe a larger gap which is in line with common intuition that finite-width neural networks perform feature learning which finite-width linearizations cannot. At the same time, finite-width linearizations improve dramatically with width, approaching the behavior of the wider standard networks which in turn perform slightly better than their standard width counterparts. Therefore, it appears that feature learning for non-wide standard networks is important but becomes less significant with increasing width. We furthermore identify cases where both standard and linearized networks match in performance, in agreement with NTK theory, and a case where a wide linearization outperforms its standard width counterpart.
- The TAP-Plefka variational principle for the spherical SK model (with N. Kistler; Communications in Mathematical Physics volume 367, pages 991–1017, 2019)
[ Abstract
arxiv
Journal pdf ]
We reinterpret the Thouless-Anderson-Palmer approach to mean field spin glass models as a variational principle in the spirit of the Gibbs variational principle and the Bragg-Williams approximation. We prove this TAP-Plefka variational principle rigorously in the case of the spherical Sherrington-Kirkpatrick model.
- Tightness for the Cover Time of compact two dimensional manifolds (with J. Rosen and O. Zeitouni; Probability Theory and Related Fields volume 176, pages1357–1437, 2020)
[ Abstract
arxiv ]
Let Cε,S2 denote the cover time of the two dimensional sphere S2 by a Wiener sausage of radius ε. We prove that
√Cε,S2−2√2(logε-1-(1/4)loglog ε-1) is tight.
- Barrier estimates for a critical Galton-Watson process and the cover time of the binary tree
(with J. Rosen and O. Zeitouni; Annales de l'Institut Henri Poincaré, Probabilités et Statistiques,
Volume 55, Number 1 (2019), 127-154.)
[ Abstract
arxiv ]
For the critical Galton--Watson process with geometric offspring distributions we
provide sharp barrier estimates for barriers which are (small) perturbations of linear barriers. These are
useful in analyzing the cover time of finite graphs in the critical regime by random walk, and the Brownian
cover times of compact two dimensional manifolds. As an application of the barrier estimates, we prove that if
CL denotes the cover time of the binary tree of depth L by simple walk, then
√CL/(2L+1) - (√2log2) L + log L/(√2 log2) is tight. The latter
improves results of Aldous (1991), Bramson and Zeitouni (2009) and Ding and Zeitouni (2012). In a subsequent
article we use these barrier estimates to prove tightness of the Brownian cover time for the two-dimensional
sphere.
- Maximum of the Riemann zeta function on a short interval of the critical line
(with L-P. Arguin, P. Bourgade, M. Radziwill and K. Soundararajan; Communications in Pure and Applied Mathematics
72: 500-535 (2017) https://doi.org/10.1002/cpa.21791.)
[ Abstract
arxiv ]
We prove the leading order of a conjecture by Fyodorov, Hiary and Keating, about the
maximum of the Riemann zeta function on random intervals along the critical line. More precisely, if t is
uniformly distributed in [T,2T], then max |t - u| ≤ log |ζ(1/2 + iu)| =(1 + o(1))loglogT
with probability converging to 1 as T → ∞.
- Maximum of the Ginzburg-Landau fields
(with W. Wu; Ann. Probab. 48 (6) 2647 - 2679, November 2020. https://doi.org/10.1214/19-AOP1416)
[ Abstract
Journal
link
arxiv ]
We study two dimensional massless field in a box with potential
V(∇φ(·)) and zero boundary condition, where V is any symmetric and uniformly convex function.
Naddaf-Spencer and Miller proved the macroscopic averages of this field converge to a continuum Gaussian free
field. In this paper we prove the distribution of local marginal φ(x), for any x in the bulk, has a Gaussian
tail. We further characterize the leading order of the maximum and dimension of high points of this field, thus
generalize the results of Bolthausen-Deuschel-Giacomin and Daviaud for the discrete Gaussian free field.
- Maximum of the characteristic polynomial of random unitary matrices
(with L-P. Arguin and P. Bourgade; Communications in Mathematical Physics, Volume 349 (2017), Issue 2, pp
703-751.)
[ Abstract
Journal pdf
arxiv ]
It was recently conjectured by Fyodorov, Hiary and Keating that the maximum of the characteristic polynomial on
the unit circle of a N×N random unitary matrix sampled from the Haar measure grows like CN/(logN)3/4
for some random variable C. In this paper, we verify the leading order of this conjecture, that is, we prove
that with high probability the maximum lies in the range [N1−ε,N1+ε], for arbitrarily
small ε. The method is based on identifying an approximate branching random walk in the Fourier decomposition of
the characteristic polynomial, and uses techniques developed to describe the extremes of branching random walks
and of other log-correlated random fields. A key technical input is the asymptotic analysis of Toeplitz
determinants with dimension-dependent symbols. The original argument for these asymptotics followed the general
idea that the statistical mechanics of 1/f-noise random energy models is governed by a freezing transition. We
also prove the conjectured freezing of the free energy for random unitary matrices.
- Maxima of a randomized Riemann Zeta function, and branching random walks
(with L-P. Arguin and A. Harper; Annals of Applied Probability, Volume 27, Number 1 (2017), pp 178-215.)
[ Abstract
Journal
arxiv ]
A recent conjecture of Fyodorov--Hiary--Keating states that the maximum of the absolute value of the Riemann
zeta function on a typical bounded interval of the critical line is exp(loglog T − (3/4) logloglogT + O(1)), for
an interval at (large) height T. In this paper, we verify the first two terms in the exponential for a model of
the zeta function, which is essentially a randomized Euler product. The critical element of the proof is the
identification of an approximate tree structure, present also in the actual zeta function, which allows us to
relate the maximum to that of a branching random walk.
- The subleading order of two dimensional cover times (with N. Kistler; Probability Theory and Related Fields, Volume 167 (2017), Issue 1, pp 461-552.)
[ Abstract
Journal pdf
arxiv
]
The epsilon-cover time of the two dimensional torus by Brownian motion is the time it takes for the process to
come within distance epsilon>0 from any point. Its leading order in the small epsilon-regime has been
established by Dembo, Peres, Rosen and Zeitouni [Ann. of Math., 160 (2004)]. In this work, the second order
correction is identified. The approach relies on a multi-scale refinement of the second moment method, and draws
on ideas from the study of the extremes of branching Brownian motion.
- Gumbel fluctuations for cover times in the discrete torus (Probability Theory and Related Fields, Volume 157 (2013), Issue 3-4, pp 635-689.)
[ Abstract
Journal pdf
arxiv ]
This work proves that the fluctuations of the cover time of simple random walk in
the discrete torus of dimension at least three with large side-length are governed by the Gumbel
extreme value distribution. This result was conjectured for example in the book by Aldous & Fill. We also
derive some
corollaries which qualitatively describe how covering happens. In addition, we develop a new
and stronger coupling of the model of random interlacements, introduced by Sznitman,
and random walk in the torus. This coupling is used to prove the cover time result and is also
of independent interest.
- Cover times in the discrete cylinder. (Preprint)
[ Abstract
arxiv ]
This article proves that, in terms of local times, the properly rescaled and re-centered
cover times of finite subsets of the discrete cylinder by simple random walk
converge in law to the Gumbel distribution, as the cardinality of the set goes to
infinity. As applications we obtain several other results related to covering in the
discrete cylinder. Our method is new and involves random interlacements, which
were introduced in [22]. To enable the proof we develop a new stronger coupling
of simple random walk in the cylinder and random interlacements, which is also of
independent interest.
- Cover levels and random interlacements (Annals of Applied Probability, Volume 22, Number 2 (2012), pp 522-540.)
[ Abstract
Journal
arxiv ]
This note investigates cover levels of finite sets in the random interlacements model, that is the
least level such that the set is completely contained in the random interlacement at that level. It
proves that as the cardinality of a set goes to infinity, the rescaled and recentered cover level
tends in distribution to the Gumbel distribution with cumulative distribution function exp(-exp(-z)).