Interesting problems in DL theory

2021-05-01

Here are some research directions I worked on with before my PhD. Mathematics of machine learning is a very cool field, although these days I think empirical research is a bit more impactful.

When are ReLU neural networks injective?

Update Nov 2021: I managed to solve some of the problems below in my Master thesis.

Update Oct 2022: Antoine Maillard and others have apparently resolved the injectivity phase transition in an upcoming publication!

Some basic properties of neural networks are yet to be fully understood. A natural question discussed in my Master thesis is: Given a “typical” neural network going from \mathbb{R}^n to \mathbb{R}^m, is it injective?

The paper [Puthawala et al., 2020] considers injectivity of random neural networks, and proves some basic results on injectivity of random ReLU networks with a single hidden layer.

My thesis (building on joint work with Afonso Bandeira, Charles Clum and Dustin Mixon) contains strong results in several directions: - The first theorem on injectivity of deep ReLU neural networks at initialization; - Large progress on the injectivity phase transition, using the machinery of random polyhedral geometry to approximate it via the expected Euler characteristic of a random nonconvex cone; - The connection of injectivity of deep ReLU neural networks with the rank collapse phenomenon described above.

Here is a nice and natural conjecture that we haven’t managed to prove.

Conjecture: Injectivity of networks of depth independent of expansivity

Let C be a large enough constant (expansivity), and let L be a positive integer (depth). Take a neural network f with L fully connected layers of widths (n, Cn, Cn, \ldots, Cn) and the ReLU activation. Initialize the weights as independent standard Gaussians, so

f(x) = \operatorname{ReLU}(W_L \operatorname{ReLU}( W_{L-1} \operatorname{ReLU}(\ldots W_1 x) \ldots ))

where W_1, \ldots, W_L are matrices with independent \mathcal{N}(0, 1) entries.

Question: Is f injective with high probability, as n increases?

The tools developed in Chapter 5 of my thesis can prove this conjecture when C is a log-linear function of L. The general case will likely need different methods.

How wide should one-hidden-layer networks be to be robust (when interpolating random data)?

The general idea of the tradeoff between neural network robustness and the layer widths is well understood¹. Of course, one should only compare networks with the same accuracy; for simplicity, consider networks that interpolate a random dataset.

The paper [Bubeck et al., 2019] proposes a simple conjecture: the Lipschitz constant of a network with one hidden layer should be on the order of \sqrt{n/k}, where n is the number of datapoints, and k is the number of hidden neurons.

Recently, [Husain et al., 2021] made progress on the problem, connecting the width-robustness tradeoff to the Rademacher complexity of the model class.

Update Nov 2021: Bubeck and Sellke have now famously resolved a part of this problem, winning an Outstanding Paper Award at NeurIPS 2021. As long as the weights are bounded, the Lipschitz constant is large if there no overparameterization. The main theoretical question now is to advance from the Lipschitz constant to more realistic proxies for robustness.

In adversarial training of robust models, why do tighter relaxations perform worse?

Training neural networks for which adversarial robustness can be certified is an active research area. See Lectures 9-13 here for an introduction to certified robustness.

As noted in [Gowal et al., 2019], you get more robustness by training with loose relaxations such as Box, than with the strong polytope relaxations. Box is essentially just interval arithmetic, and it loses a lot of precision. The tight relaxations are certainly closer to the actual robustness problem than the Box relaxation.

There is nothing magical going on. It’s not that the Box relaxation is unexpectedly great, it’s just that tight relaxations fail to deliver in training.

Recently, [Jovanovic et al., 2021] made huge strides on this problem, showing that the most naive Box relaxation has nice continuity and smoothness properties that smarter relaxations such as symbolic intervals and Zonotope don’t have.

This means the abstract transformer can become very nonsmooth in the weights. However, several questions remain:

Although the bounds of the abstract transformer can jump quickly, does it really happen with the abstract loss?
Can we characterize the “jumps” in the abstract loss? (Easy for symbolic intervals, harder for others.)
How to give a proof that the jumps make the robustness optimization harder? It would be interesting to see this proven formally on random data as a stepping stone.
What are the theoretical limits on the smoothness? There are results on the exact adversarial loss landscape. Is there an actual tradeoff between the smooth and the tight? Can we find the best relaxation somewhere in between?
Finally, can the information about the loss landscape be used to make the optimization problem easier? Some regularization techniques come to mind. I don’t know much nonconvex optimization, though.

What’s the deal with mean shift, rank collapse and signal pathologies?

Consider deep ReLU neural networks (or ResNets) at initialization. What happens to the input data when passed through the network?

Answer: the output gets kind of crunched into a single line. The angles become smaller. Let’s call this the contraction phenomenon.

There are several parallel works that correctly predict the behaviour, but they all give different explanations why:

[Labatie, 2019] calls it the pathology of one-dimensional signal;
[Daneshmand et al., 2020] calls it rank collapse;
[Brock et al., 2021] calls it mean shift;
The mean field theory papers describe the angle evolution in the infinite-width limit.

Question 1: Can those be reconciled somehow?

What is needed is a solid theory of compositions of random matrices and nonlinearities acting on the angle between vectors, in the spirit of [Pennington et al., 2017].

Question 2: Is this phenomenon really relevant for training?

The papers above claim that this phenomenon prevents training, and that the benefit of the normalization techniques is in avoiding this contraction behaviour. However, you can just initialize differently (e.g. by orthogonal matrices as affine layers) and then there’s no rank collapse. Is this the whole story?

As [Husain et al., 2021] phrases it, there is a “price to pay for robustness”.↩︎