David Belius

D. Belius, F. Concetti, G. Genovese - Variational formula for the logarithmic potential of free additive convolutions, submitted.
[ Abstract arxiv ]

We establish a general variational formula for the logarithmic potential of the free additive convolution of two compactly supported probability measures on ℝ. The formula is given in terms of the R-transform of the first measure, and the logarithmic potential of the second measure. The result applies in particular to the additive convolution with the semicircle or Marchenko-Pastur laws, for which the formula simplifies. The logarithmic potential of additive convolutions appears for instance in estimates of the determinant of sums of independent random matrices.

G. Vlassis, D. Belius, V. Fomichov - A thorough reproduction and evaluation of μP, Transactions on Machine Learning Research, 2025.
[ Abstract OpenReview ]

This paper is an independent empirical reproduction of the claimed benefits of the μP parametrization proposed by Yang et al. (2020, 2021). Under the so-called Standard Parametrization (SP), the weights of neural networks are initialized from the Gaussian distribution with variance scaling as the inverse of 'fan-in', with the learning rate being the same for every layer. While this guarantees that (pre)activations are O(1) at initialization with respect to width, it causes their scale to be width-dependent during training. To address this, the Maximal Update Parametrization (μP) was proposed, which is also claimed to make the optimal value of various hyperparameters independent of width. However, despite its alleged benefits, μP has not gained much traction among practitioners. Possibly, this could stem from a lack of thorough independent evaluation of μP against SP. We address this by independently reproducing the empirical claims of the original works. At the same time, we substantially increase the scale of the experiments, by training 16000 neural networks of sizes from 500 to 1B parameters, and empirically investigate μP's effect on outputs, gradient updates, weights, training loss and validation loss. We find that generally μP indeed delivers on its promises, even though this does not always translate to improved generalization.

T. S. Cheng, A. Lucchi, A. Kratsios, D. Belius - A Comprehensive Analysis on the Learning Curve in Kernel Ridge Regression, Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS), 2024.
[ Abstract Proceedings link Conference PDF arxiv ]

This paper conducts a comprehensive study of the learning curves of kernel ridge regression (KRR) under minimal assumptions. Our contributions are three-fold: 1) we analyze the role of key properties of the kernel, such as its spectral eigen-decay, the characteristics of the eigenfunctions, and the smoothness of the kernel; 2) we demonstrate the validity of the Gaussian Equivalent Property (GEP), which states that the generalization performance of KRR remains the same when the whitened features are replaced by standard Gaussian vectors, thereby shedding light on the success of previous analyzes under the Gaussian Design Assumption; 3) we derive novel bounds that improve over existing bounds across a broad range of settings such as (in)dependent feature vectors and various combinations of eigen-decay rates in the over/underparameterized regimes.

T. S. Cheng, A. Lucchi, A. Kratsios, D. Belius - Characterizing Overfitting in Kernel Ridgeless Regression Through the Eigenspectrum, Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
[ Abstract Proceedings link Conference PDF arxiv ]

We derive new bounds for the condition number of kernel matrices, which we then use to enhance existing non-asymptotic test error bounds for kernel ridgeless regression (KRR) in the over-parameterized regime for a fixed input dimension. For kernels with polynomial spectral decay, we recover the bound from previous work; for exponential decay, our bound is non-trivial and novel. Our contribution is two-fold: (i) we rigorously prove the phenomena of tempered overfitting and catastrophic overfitting under the sub-Gaussian design assumption, closing an existing gap in the literature; (ii) we identify that the independence of the features plays an important role in guaranteeing tempered overfitting, raising concerns about approximating KRR generalization using the Gaussian design assumption in previous literature.

D. Belius, F. Concetti, G. Genovese - On the determinant in Bray-Moore's TAP complexity formula, preprint.
[ Abstract arxiv ]

In the computation of the TAP complexity, originally carried out by Bray and Moore, a fundamental step is to calculate the determinant of a random Hessian. As the replica method does not give a clear prescription, physicists debated how to perform this computation and its consequences on the TAP complexity for a long time. In this paper we prove the original Bray and Moore formula for the behaviour of the determinant at exponential scale to be correct, and compute an important prefactor coming from a small outlier in the spectrum.

T. S. Cheng, A. Lucchi, I. Dokmanic, A. Kratsios, D. Belius - A Theoretical Analysis of the Test Error of Finite-Rank Kernel Ridge Regression, Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), 2023.
[ Abstract Proceedings link Conference PDF arxiv ]

Existing statistical learning guarantees for general kernel regressors often yield loose bounds when used with finite-rank kernels. Yet, finite-rank kernels naturally appear in a number of machine learning problems, e.g. when fine-tuning a pre-trained deep neural network's last layer to adapt it to a novel task when performing transfer learning. We address this gap for finite-rank kernel ridge regression (KRR) by deriving sharp non-asymptotic upper and lower bounds for the KRR test error of any finite-rank KRR. Our bounds are tighter than previously derived bounds on finite-rank KRR and, unlike comparable results, they also remain valid for any regularization parameters.

D. Belius, L. Fröber - Fluctuations of the ground state of the spiked spherical Sherrington-Kirkpatrick model, submitted.
[ Abstract arxiv ]

The Sherrington-Kirkpatrick Hamiltonian is a random quadratic function on the high-dimensional sphere. This article studies the ground state (i.e. maximum) of this Hamiltonian with external field, or more generally with a non-linear "spike" term. We compute the level of the maximum to leading order, and under appropriate condition its first- and second-order fluctuations. The equivalent results are also derived for the maximum of the model's TAP free energy on the ball.

D. Belius, L. Fröber, J. Ko - TAP variational principle for the constrained overlap multiple spherical Sherrington-Kirkpatrick model, submitted.
[ Abstract arxiv ]

Spin glass models involving multiple replicas with constrained overlaps have been studied in [FPV92; PT07; Pan18a]. For the spherical versions of these models [Ko19; Ko20] showed that the limiting free energy is given by a Parisi type minimization. In this work we show that for Sherrington-Kirkpatrick (i.e. 2-spin) interactions, it can also be expressed in terms of a Thouless-Andersson-Palmer (TAP) variational principle. This is only the second spin glass model where a mathematically rigorous TAP computation of the free energy at all temperatures and external fields has been achieved. The variational formula we derive here also confirms that the model is replica symmetric, a fact which is natural but not obviously deducible from its Parisi formula.

A. Maillard, A. S. Bandeira, D. Belius, I. Dokmanić, S. Nakajima - Injectivity of ReLU networks: perspectives from statistical physics, Applied and Computational Harmonic Analysis 76 (2025), 101736.
[ Abstract Journal link arxiv ]

When can the input of a ReLU neural network be inferred from its output? In other words, when is the network injective? We consider a single layer, \(x \mapsto \mathrm{ReLU}(Wx)\), with a random Gaussian \(m \times n\) matrix \(W\), in a high-dimensional setting where \(n, m \to \infty\). Recent work connects this problem to spherical integral geometry giving rise to a conjectured sharp injectivity threshold for \(\alpha = \frac{m}{n}\) by studying the expected Euler characteristic of a certain random set. We adopt a different perspective and show that injectivity is equivalent to a property of the ground state of the spherical perceptron, an important spin glass model in statistical physics. By leveraging the (non-rigorous) replica symmetry-breaking theory, we derive analytical equations for the threshold whose solution is at odds with that from the Euler characteristic. Furthermore, we use Gordon's min--max theorem to prove that a replica-symmetric upper bound refutes the Euler characteristic prediction. Along the way we aim to give a tutorial-style introduction to key ideas from statistical physics in an effort to make the exposition accessible to a broad audience. Our analysis establishes a connection between spin glasses and integral geometry but leaves open the problem of explaining the discrepancies.

D.Belius, M. Schmidt - Complexity of local maxima of given radial derivative for mixed p-spin Hamiltonians, submitted.
[ Abstract arxiv ]

We study the number of local maxima with given radial derivative of spherical mixed p-spin models and prove that the second moment matches the square of the first moment on exponential scale for arbitrary mixtures and any radial derivative. This is surprising, since for the number of local maxima with given radial derivative and given energy the corresponding result is only true for specific mixtures [Sub17; BSZ20].
We use standard Kac-Rice computations to derive formulas for the first and second moment at exponential scale, and then find a remarkable analytic argument that shows that the second moment formula is bounded by twice the first moment formula in this general setting. This also leads to a new proof of a central inequality used to prove concentration of the number critical points of pure p-spin models of given energy in [Sub17] and removes the need for the computer assisted argument used in that paper for 3 ≤ p ≤ 10.

D.Belius, M. Schmidt - Phase diagram for the TAP energy of the p-spin spherical mean field spin glass model, preprint.
[ Abstract arxiv ]

We solve the Thouless-Anderson-Palmer (TAP) variational principle associated to the spherical pure p-spin mean field spin glass Hamiltonian and present a detailed phase diagram. In the high temperature phase the maximum of variational principle is the annealed free energy of the model. In the low temperature phase the maximum, for which we give a formula, is strictly smaller. The high temperature phase consists of three subphases. (1) In the first phase m=0 is the unique relevant TAP maximizer. (2) In the second phase there are exponentially many TAP maximizers, but m=0 remains dominant. (3) In the third phase, after the so called dynamic phase transition, m=0 is no longer a relevant TAP maximizer, and exponentially many non-zero relevant TAP solutions add up to give the annealed free energy. Finally in the low temperature phase a subexponential number of TAP maximizers of near-maximal TAP energy dominate.

D. Belius - High temperature TAP upper bound for the free energy of mean field spin glasses, submitted.
[ Abstract arxiv ]

This work proves an upper bound for the free energy of the Sherrington-Kirkpatrick model and its generalizations in terms of the Thouless-Anderson-Palmer (TAP) energy. The result applies to models with spherical or Ising spins and any mixed p-spin Hamiltonian with external field or with a non-linear spike term. The bound is expected to be tight to leading order at high temperature, and is non-trivial in the presence of an external field. For the proof a geometric microcanonical method is employed, in which one covers the spin space with sets, each of which is centered at a magnetization vector m and whose contribution to the partition function is bounded in terms of the TAP energy at m.

D. Banerjee, D. Belius - Fluctuations of the free energy of the mixed p-spin mean field spin glass model, preprint.
[ Abstract arxiv ]

We prove the convergence in distribution of the fluctuations of the free energy of the mixed p-spin Sherrington-Kirkpatrick model with non-vanishing 2-spin component at high enough temperature. The limit is Gaussian, and the fluctuations are seen to arise from weighted cycle counts in the complete graph on the spin indices weighted by the 2-spin interaction matrix.

D. Belius, J. Černý, S. Nakajima, M. Schmidt - Triviality of the geometry of mixed p-spin spherical Hamiltonians with external field, Journal of Statistical Physics 186.1 (2022): 1-3.
[ Abstract Journal link arxiv ]

We study isotropic Gaussian random fields on the high-dimensional sphere with an added deterministic linear term, also known as mixed p-spin Hamiltonians with external field. We prove that if the external field is sufficiently strong, then the resulting function has trivial geometry, that is only two critical points. This contrasts with the situation of no or weak external field where these functions typically have an exponential number of critical points. We give an explicit threshold h_c for the magnitude of the external field necessary for trivialization and conjecture h_c to be sharp. The Kac-Rice formula is our main tool. Our work extends [Fyo15], which identified the trivial regime for the special case of pure p-spin Hamiltonians with random external field.

T. Liu, A. Chaman, D. Belius, I. Dokmanic - Interpreting U-Nets via Task-Driven Multiscale Dictionary Learning, IEEE Transactions on Computational Imaging 8 (2022): 425–37. https://doi.org/10.1109/TCI.2022.3175309
[ Abstract arxiv ]

U-Nets have been tremendously successful in many imaging inverse problems. In an effort to understand the source of this success, we show that one can reduce a U-Net to a tractable, well-understood sparsity-driven dictionary model while retaining its strong empirical performance. We achieve this by extracting a certain multiscale convolutional dictionary from the standard U-Net. This dictionary imitates the structure of the U-Net in its convolution, scale-separation, and skip connection aspects, while doing away with the nonlinear parts. We show that this model can be trained in a task-driven dictionary learning framework and yield comparable results to standard U-Nets on a number of relevant tasks, including CT and MRI reconstruction. These results suggest that the success of the U-Net may be explained mainly by its multiscale architecture and the induced sparse representation.

M. Samarin, V. Roth, D. Belius, Feature Learning and Random Features in Standard Finite-Width Convolutional Neural Networks: An Empirical Study, Conference on Uncertainty in Artificial Intelligence (UAI) 2022.
[ Abstract pdf arxiv ]

The Neural Tangent Kernel is an important milestone in the ongoing effort to build a theory for deep learning. Its prediction that sufficiently wide neural networks behave as kernel methods, or equivalently as random feature models arising from linearized networks, has been confirmed empirically for certain wide architectures. In this paper, we compare the performance of two common finite-width convolutional neural networks, LeNet and AlexNet, to their linearizations on common benchmark datasets like MNIST and modified versions of it, CIFAR-10 and an ImageNet subset. We demonstrate empirically that finite-width neural networks, generally, greatly outperform the finite-width linearization of these architectures. When increasing the problem difficulty of the classification task, we observe a larger gap which is in line with common intuition that finite-width neural networks perform feature learning which finite-width linearizations cannot. At the same time, finite-width linearizations improve dramatically with width, approaching the behavior of the wider standard networks which in turn perform slightly better than their standard width counterparts. Therefore, it appears that feature learning for non-wide standard networks is important but becomes less significant with increasing width. We furthermore identify cases where both standard and linearized networks match in performance, in agreement with NTK theory, and a case where a wide linearization outperforms its standard width counterpart.

The TAP-Plefka variational principle for the spherical SK model (with N. Kistler; Communications in Mathematical Physics volume 367, pages 991–1017, 2019)
[ Abstract arxiv Journal pdf ]

We reinterpret the Thouless-Anderson-Palmer approach to mean field spin glass models as a variational principle in the spirit of the Gibbs variational principle and the Bragg-Williams approximation. We prove this TAP-Plefka variational principle rigorously in the case of the spherical Sherrington-Kirkpatrick model.

Tightness for the Cover Time of compact two dimensional manifolds (with J. Rosen and O. Zeitouni; Probability Theory and Related Fields volume 176, pages1357–1437, 2020)
[ Abstract arxiv ]

Let C_ε,S² denote the cover time of the two dimensional sphere S² by a Wiener sausage of radius ε. We prove that √C_ε,S²−2√2(logε^-1-(1/4)loglog ε^-1) is tight.

Barrier estimates for a critical Galton-Watson process and the cover time of the binary tree (with J. Rosen and O. Zeitouni; Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, Volume 55, Number 1 (2019), 127-154.)
[ Abstract arxiv ]

For the critical Galton--Watson process with geometric offspring distributions we provide sharp barrier estimates for barriers which are (small) perturbations of linear barriers. These are useful in analyzing the cover time of finite graphs in the critical regime by random walk, and the Brownian cover times of compact two dimensional manifolds. As an application of the barrier estimates, we prove that if C_L denotes the cover time of the binary tree of depth _L by simple walk, then √C_L/(2^L+1) - (√2log2) L + log L/(√2 log2) is tight. The latter improves results of Aldous (1991), Bramson and Zeitouni (2009) and Ding and Zeitouni (2012). In a subsequent article we use these barrier estimates to prove tightness of the Brownian cover time for the two-dimensional sphere.

Maximum of the Riemann zeta function on a short interval of the critical line (with L-P. Arguin, P. Bourgade, M. Radziwill and K. Soundararajan; Communications in Pure and Applied Mathematics 72: 500-535 (2017) https://doi.org/10.1002/cpa.21791.)
[ Abstract arxiv ]

We prove the leading order of a conjecture by Fyodorov, Hiary and Keating, about the maximum of the Riemann zeta function on random intervals along the critical line. More precisely, if t is uniformly distributed in [T,2T], then max |t - u| ≤ log |ζ(1/2 + iu)| =(1 + o(1))loglogT with probability converging to 1 as T → ∞.

Maximum of the Ginzburg-Landau fields (with W. Wu; Ann. Probab. 48 (6) 2647 - 2679, November 2020. https://doi.org/10.1214/19-AOP1416)
[ Abstract Journal link arxiv ]

We study two dimensional massless field in a box with potential V(∇φ(·)) and zero boundary condition, where V is any symmetric and uniformly convex function. Naddaf-Spencer and Miller proved the macroscopic averages of this field converge to a continuum Gaussian free field. In this paper we prove the distribution of local marginal φ(x), for any x in the bulk, has a Gaussian tail. We further characterize the leading order of the maximum and dimension of high points of this field, thus generalize the results of Bolthausen-Deuschel-Giacomin and Daviaud for the discrete Gaussian free field.

Maximum of the characteristic polynomial of random unitary matrices (with L-P. Arguin and P. Bourgade; Communications in Mathematical Physics, Volume 349 (2017), Issue 2, pp 703-751.)
[ Abstract Journal pdf arxiv ]

It was recently conjectured by Fyodorov, Hiary and Keating that the maximum of the characteristic polynomial on the unit circle of a N×N random unitary matrix sampled from the Haar measure grows like CN/(logN)^3/4 for some random variable C. In this paper, we verify the leading order of this conjecture, that is, we prove that with high probability the maximum lies in the range [N^1−ε,N^1+ε], for arbitrarily small ε. The method is based on identifying an approximate branching random walk in the Fourier decomposition of the characteristic polynomial, and uses techniques developed to describe the extremes of branching random walks and of other log-correlated random fields. A key technical input is the asymptotic analysis of Toeplitz determinants with dimension-dependent symbols. The original argument for these asymptotics followed the general idea that the statistical mechanics of 1/f-noise random energy models is governed by a freezing transition. We also prove the conjectured freezing of the free energy for random unitary matrices.

Maxima of a randomized Riemann Zeta function, and branching random walks (with L-P. Arguin and A. Harper; Annals of Applied Probability, Volume 27, Number 1 (2017), pp 178-215.)
[ Abstract Journal arxiv ]

A recent conjecture of Fyodorov--Hiary--Keating states that the maximum of the absolute value of the Riemann zeta function on a typical bounded interval of the critical line is exp(loglog T − (3/4) logloglogT + O(1)), for an interval at (large) height T. In this paper, we verify the first two terms in the exponential for a model of the zeta function, which is essentially a randomized Euler product. The critical element of the proof is the identification of an approximate tree structure, present also in the actual zeta function, which allows us to relate the maximum to that of a branching random walk.

The subleading order of two dimensional cover times (with N. Kistler; Probability Theory and Related Fields, Volume 167 (2017), Issue 1, pp 461-552.)
[ Abstract Journal pdf arxiv ]

The epsilon-cover time of the two dimensional torus by Brownian motion is the time it takes for the process to come within distance epsilon>0 from any point. Its leading order in the small epsilon-regime has been established by Dembo, Peres, Rosen and Zeitouni [Ann. of Math., 160 (2004)]. In this work, the second order correction is identified. The approach relies on a multi-scale refinement of the second moment method, and draws on ideas from the study of the extremes of branching Brownian motion.

Gumbel fluctuations for cover times in the discrete torus (Probability Theory and Related Fields, Volume 157 (2013), Issue 3-4, pp 635-689.)
[ Abstract Journal pdf arxiv ]

This work proves that the fluctuations of the cover time of simple random walk in the discrete torus of dimension at least three with large side-length are governed by the Gumbel extreme value distribution. This result was conjectured for example in the book by Aldous & Fill. We also derive some corollaries which qualitatively describe how covering happens. In addition, we develop a new and stronger coupling of the model of random interlacements, introduced by Sznitman, and random walk in the torus. This coupling is used to prove the cover time result and is also of independent interest.

Cover times in the discrete cylinder. (Preprint)
[ Abstract arxiv ]

This article proves that, in terms of local times, the properly rescaled and re-centered cover times of finite subsets of the discrete cylinder by simple random walk converge in law to the Gumbel distribution, as the cardinality of the set goes to infinity. As applications we obtain several other results related to covering in the discrete cylinder. Our method is new and involves random interlacements, which were introduced in [22]. To enable the proof we develop a new stronger coupling of simple random walk in the cylinder and random interlacements, which is also of independent interest.

Cover levels and random interlacements (Annals of Applied Probability, Volume 22, Number 2 (2012), pp 522-540.)
[ Abstract Journal arxiv ]

This note investigates cover levels of finite sets in the random interlacements model, that is the least level such that the set is completely contained in the random interlacement at that level. It proves that as the cardinality of a set goes to infinity, the rescaled and recentered cover level tends in distribution to the Gumbel distribution with cumulative distribution function exp(-exp(-z)).

David Belius

Minicourse Budapest

Contact details

Group

Former members

Papers

Spin glass workshop 2022

Previous affiliations