
News

News and events: we keep you up to date.

Current announcements

No current announcements are available at the moment.

Current events

No events are available at the moment.

MIP Group Seminar (past events)


Gradient Descent in Linear Diagonal Networks: Analyzing the Implicit Bias

by Wiebke Bartolomaeus (LMU Munich)

In deep learning, one often operates in a (highly) overparameterized regime, i.e. with significantly more trainable parameters than available training data. Nevertheless, experiments show that the generalization error after training with (stochastic) gradient descent remains small, although one would expect overfitting, i.e. a small training error and a relatively large test error. Implicit bias is the hypothesis that the underlying training algorithm (a gradient method) implicitly selects 'good' solutions among all possible ones, where 'good' is often connected to sparse or low-rank, i.e. simple, solutions. To shed some first light on this phenomenon, we analyze the training dynamics of diagonal linear networks, a simplification of neural networks with identity activation function and diagonal weight matrices. We focus on vanilla gradient descent as the training algorithm, whereas several previous works have considered its continuous analogue, gradient flow. Our analysis technique identifies gradient descent as a perturbed mirror descent and allows us to bound the influence of the step size.
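
The setting is easy to reproduce. Below is a minimal NumPy sketch (not the speaker's code) of vanilla gradient descent on a diagonal linear network for overparameterized sparse regression; the dimensions, initialization scale alpha, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized sparse regression: far more parameters d than samples n.
n, d, k = 50, 200, 4
X = rng.standard_normal((n, d)) / np.sqrt(n)
w_star = np.zeros(d)
w_star[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
y = X @ w_star

# Diagonal linear network: predict X @ (u * v), i.e. identity activation and
# diagonal weight layers u, v, started from a small initialization alpha.
alpha, lr, steps = 1e-3, 0.02, 100_000
u = np.full(d, alpha)
v = np.full(d, alpha)

for _ in range(steps):
    residual = X @ (u * v) - y
    grad_w = X.T @ residual                    # gradient w.r.t. the effective weights w = u * v
    grad_u, grad_v = grad_w * v, grad_w * u    # chain rule for the two diagonal layers
    u -= lr * grad_u                           # vanilla gradient descent on (u, v)
    v -= lr * grad_v

w = u * v
print("training error:          ", np.linalg.norm(X @ w - y))
print("largest coordinates of w:", np.sort(np.argsort(-np.abs(w))[:k]))
print("support of w_star:       ", np.sort(np.where(w_star != 0)[0]))
```

For small alpha, the weights found by gradient descent in this sketch tend to concentrate on the support of the sparse target, which is the implicit bias the talk analyzes through the perturbed-mirror-descent viewpoint.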

General Tail Bounds for Non-Smooth Stochastic Mirror Descent

by Andrea Paudice (Aarhus University)

We study the problem of minimizing a convex, non-smooth Lipschitz function over a convex domain when only noisy stochastic subgradient estimates are available. We analyze the classical Stochastic Mirror Descent (SMD) algorithm and derive new tail bounds on its optimization error, for both the averaged and the last iterate. Our results extend existing analyses - traditionally limited to light-tailed, sub-Gaussian noise - to heavier-tailed noise distributions. We specialize our general bounds to two important families of noise: one with exponential tails and another with polynomial tails. Notably, our bounds for the averaged iterate reveal a distinct two-regime behavior, highlighting new insights into the interplay between noise tails and convergence rates.
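
As a concrete instance of the algorithm under analysis, here is a small NumPy sketch of stochastic mirror descent with the negative-entropy mirror map (exponentiated gradient) on a non-smooth objective over the probability simplex, with heavy-tailed Student-t subgradient noise. The objective, noise model, and step-size schedule are illustrative choices, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Non-smooth convex Lipschitz objective over the probability simplex:
# f(x) = ||A x - b||_1, minimized with stochastic mirror descent (SMD).
d, m = 50, 30
A = rng.standard_normal((m, d))
x_target = rng.dirichlet(np.ones(d))
b = A @ x_target

def noisy_subgradient(x, scale=0.5, df=3):
    g = A.T @ np.sign(A @ x - b)                    # exact subgradient of ||Ax - b||_1
    return g + scale * rng.standard_t(df, size=d)   # heavy-tailed (Student-t) noise

T = 5_000
x = np.full(d, 1.0 / d)        # start at the center of the simplex
x_avg = np.zeros(d)
for t in range(1, T + 1):
    eta = 0.1 / np.sqrt(t)     # standard O(1/sqrt(t)) step-size schedule
    g = noisy_subgradient(x)
    x = x * np.exp(-eta * g)   # mirror step for negative entropy ...
    x /= x.sum()               # ... i.e. multiplicative update plus renormalization
    x_avg += (x - x_avg) / t   # running average of the iterates

f = lambda z: np.abs(A @ z - b).sum()
print("f(last iterate)    =", f(x))
print("f(averaged iterate)=", f(x_avg))
```

Repeating such a run many times gives an empirical picture of the tails of the optimization error for the last and the averaged iterate, which is what the talk's bounds describe.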

Gradients and Groups

by Daniel Kunin (UC Berkeley)

I will begin with a recently published work introducing Alternating Gradient Flows (AGF)—an analytic framework describing feature learning in two-layer networks trained from small initialization. In this regime, gradient flow exhibits a staircase-like loss curve: neurons slowly align to useful directions, then rapidly grow in norm. AGF models this process as alternating maximization and minimization steps, unifying prior saddle-to-saddle analyses and provably converging to gradient flow in diagonal linear networks. Applied to quadratic networks trained on modular addition, AGF yields the first complete characterization of training dynamics, showing that networks learn Fourier features in decreasing order of coefficient magnitude.
I will then discuss ongoing work extending this analysis from modular addition to group composition over sequences. We develop an analytic theory for how networks learn to compose a sequence of elements from a finite group encoded in a real vector space. Using AGF we show that neural networks learn non-abelian finite groups via alternating maximization and minimization phases. With each iteration the network acquires one additional irreducible representation of the group, with irreps learned in decreasing order of their contribution to the target.
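
To make the modular-addition setting concrete, here is a small NumPy sketch (not the authors' code) of a two-layer quadratic-activation network trained on (a + b) mod p with full-batch gradient descent from small initialization. The modulus, width, initialization scale, and step size are illustrative guesses and may need tuning before the staircase behaviour described above becomes visible.

```python
import numpy as np

rng = np.random.default_rng(2)

# Modular addition: predict (a + b) mod p from one-hot encodings of a and b.
p = 7
pairs = [(a, b) for a in range(p) for b in range(p)]
X = np.zeros((p * p, 2 * p))
Y = np.zeros((p * p, p))
for i, (a, b) in enumerate(pairs):
    X[i, a] = X[i, p + b] = 1.0
    Y[i, (a + b) % p] = 1.0
N = len(X)

# Two-layer network with quadratic activation, trained with full-batch
# gradient descent from small initialization (hyperparameters are guesses).
m, init_scale, lr, steps = 32, 0.1, 0.02, 100_000
W1 = init_scale * rng.standard_normal((m, 2 * p))
W2 = init_scale * rng.standard_normal((p, m))

for t in range(steps):
    H = X @ W1.T               # pre-activations       (N, m)
    A = H ** 2                 # quadratic activation  (N, m)
    Out = A @ W2.T             # network outputs       (N, p)
    dOut = (Out - Y) / N       # gradient of the mean squared error
    dW2 = dOut.T @ A
    dH = 2.0 * H * (dOut @ W2)
    dW1 = dH.T @ X
    W1 -= lr * dW1
    W2 -= lr * dW2
    if t % 10_000 == 0:
        loss = 0.5 * np.sum((Out - Y) ** 2) / N
        print(f"step {t:6d}  loss {loss:.4f}")
# From small initialization the loss typically decreases in a staircase of
# plateaus, with one Fourier feature being picked up per drop.
```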