Here are some research directions I sometimes play with. I have spent between one hour and one month thinking about each of these questions.

How wide should one-hidden-layer networks be to be robust (when interpolating random data)?

The general idea of the tradeoff between

  • neural network robustness; and
  • neural network layer widths

is well known1. Of course, one should compare networks with the same accuracy; for simplicity, consider networks that interpolate a random dataset.

The paper [Bubeck et al., 2019] proposes a simple conjecture: the Lipschitz constant of a network with one hidden layer should be on the order of $\sqrt{n/k}$, where $n$ is the number of datapoints, and $k$ is the number of hidden neurons.
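
As a toy sanity check (nothing rigorous), one can train a one-hidden-layer ReLU network to interpolate random labels and compare a crude empirical estimate of its Lipschitz constant against the conjectured $\sqrt{n/k}$ scale. The sketch below uses PyTorch; the architecture, training budget and probe scheme are arbitrary choices of mine, and the largest input-gradient norm over random probe points only lower-bounds the true Lipschitz constant.

```python
# Toy experiment (illustrative only): train a one-hidden-layer ReLU network to
# interpolate random labels, then lower-bound its Lipschitz constant by the
# largest input-gradient norm over random probe points.
import torch

torch.manual_seed(0)
n, d, k = 200, 20, 1000                  # datapoints, input dimension, hidden width
X = torch.randn(n, d)
y = torch.sign(torch.randn(n))           # random +/-1 labels

model = torch.nn.Sequential(
    torch.nn.Linear(d, k),
    torch.nn.ReLU(),
    torch.nn.Linear(k, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(5000):                 # train to (near-)interpolation
    opt.zero_grad()
    loss = ((model(X).squeeze() - y) ** 2).mean()
    loss.backward()
    opt.step()

# Crude Lipschitz estimate: largest gradient norm over random probe points.
probes = torch.randn(2000, d, requires_grad=True)
(grad,) = torch.autograd.grad(model(probes).sum(), probes)
lip_lower_bound = grad.norm(dim=1).max().item()

print(f"final train MSE       : {loss.item():.3e}")
print(f"empirical Lipschitz  >= {lip_lower_bound:.2f}")
print(f"conjectured sqrt(n/k) : {(n / k) ** 0.5:.2f}")
```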

Recently, [Husain et al., 2021] made progress on the problem, connecting the width-robustness tradeoff to the Rademacher complexity of the model class.

In adversarial training of provably robust models, why do tighter relaxations perform worse?

(See Lectures 9-13 here for an introduction to certified robustness.2)

As noted in [Gowal et al., 2019], training with loose relaxations such as Box yields more robust models than training with the strong polytope relaxations, even though the optimization problems built on the tight relaxations are certainly closer to the actual robustness problem than the naive relaxation.

There is nothing magical going on. It’s not that the loose relaxation is unexpectedly great; it’s just that the tight relaxations fail to deliver during training.
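
For concreteness, here is a minimal sketch of the Box relaxation (interval bound propagation) through a single affine + ReLU layer; certified training with the loose relaxation essentially backpropagates through bounds computed like this. The weights, input and $\epsilon$ below are made up purely for illustration.

```python
# Minimal interval bound propagation (the "Box" relaxation) through one
# affine + ReLU layer: given elementwise input bounds [lo, hi], compute
# elementwise bounds on the layer output.
import numpy as np

def ibp_affine(lo, hi, W, b):
    """Bounds of W @ x + b over the box lo <= x <= hi."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    mid = W @ center + b
    dev = np.abs(W) @ radius             # worst-case deviation, coordinatewise
    return mid - dev, mid + dev

def ibp_relu(lo, hi):
    """ReLU is monotone, so it maps boxes to boxes exactly."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(8, 4)), rng.normal(size=8)
x = rng.normal(size=4)
eps = 0.1                                # L_inf perturbation budget

lo, hi = ibp_affine(x - eps, x + eps, W, b)
lo, hi = ibp_relu(lo, hi)
print(np.stack([lo, hi], axis=1))        # certified per-neuron output bounds
```

The looseness is visible in the code: every coordinate is bounded independently, so all correlations between coordinates are discarded at each layer; the tighter relaxations keep some of that information, at the cost of a more complicated dependence on the weights.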

Recently, [Jovanovic et al., 2021] have made huge strides on this problem, showing that the most naive Box relaxation has nice continuity and smoothness properties that smarter relaxations such as symbolic intervals and Zonotope don’t have.

This means the abstract transformers of the tighter relaxations can become very nonsmooth in the weights. However, several questions remain:

  • Although the bounds produced by the abstract transformer can change abruptly, does this really happen to the abstract loss? (This one seems straightforward. Maybe remove it from the questions.)
  • Can we characterize the “jumps” in the abstract loss? (Easy for symbolic intervals, harder for others.)
  • Can we prove that the jumps make the robust optimization harder? It would be interesting to see this shown formally on random data as a stepping stone.
  • What are the theoretical limits on smoothness? There are results on the landscape of the exact adversarial loss. Is there an actual tradeoff between smoothness and tightness? Can we find the best relaxation somewhere in between?
  • Finally, can the information about the loss landscape be used to make the optimization problem easier? Some regularization techniques come to mind. I don’t know much nonconvex optimization, though.

When are ReLU neural networks injective?

Neural networks are very widely used, but we still don’t understand some of their basic properties. As was discussed in [Pennington et al., 2017], the analysis of random neural networks is important, and should be similar to the analysis of random matrices, but much harder.

The paper [Puthawala et al., 2020] considers injectivity of random neural networks, and proves some basic results on injectivity of random ReLU networks with a single hidden layer.

My Master’s thesis (to be published) improves upon their results significantly, including the first result on the injectivity of deep ReLU neural networks at initialization.
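
To make the single-layer question concrete, here is a small heuristic probe (my own illustration, not taken from either work). If, for some input $x$, the rows of $W$ active at $x$ do not span the input space, then $x$ can be moved within their orthogonal complement without changing $\mathrm{ReLU}(Wx)$, so the layer is not injective. The sketch only checks this necessary condition at finitely many random points, so passing it is evidence rather than proof.

```python
# Heuristic injectivity probe for a single random ReLU layer x -> relu(W @ x).
# If the rows of W active at some x (<w_i, x> >= 0) do not span R^n, then x can
# be perturbed inside their orthogonal complement without changing the output.
import numpy as np

def spanning_violated(W, x):
    active_rows = W[W @ x >= 0]
    return np.linalg.matrix_rank(active_rows) < W.shape[1]

def probe_injectivity(m, n, num_probes=2000, seed=0):
    """True if no violation was found among the probes (evidence, not proof)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, n))
    return not any(spanning_violated(W, rng.normal(size=n)) for _ in range(num_probes))

n = 20
for ratio in (1.5, 2.0, 3.0, 4.0, 6.0):
    ok = probe_injectivity(int(ratio * n), n)
    print(f"m/n = {ratio}: {'no violation found' if ok else 'provably not injective'}")
```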

What’s the deal with mean shift, rank collapse and signal pathologies?

Consider deep ReLU neural networks (or ResNets) at initialization. What happens to the input data when passed through the network?

Answer: the outputs get crunched towards a single line; the angles between distinct inputs shrink with depth. Let’s call this the contraction phenomenon.
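
A quick way to see the contraction (my own minimal simulation, with the standard He-style $\sqrt{2/\text{fan-in}}$ Gaussian initialization) is to push two random inputs through a deep random ReLU network and watch the angle between their hidden representations shrink:

```python
# Track how the angle between two inputs shrinks with depth in a random
# ReLU network at (He-style) initialization.
import numpy as np

def angle_deg(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

rng = np.random.default_rng(0)
width, depth = 1000, 30
h1, h2 = rng.normal(size=width), rng.normal(size=width)   # nearly orthogonal inputs

for layer in range(1, depth + 1):
    W = rng.normal(size=(width, width)) * np.sqrt(2.0 / width)   # He init
    h1, h2 = np.maximum(W @ h1, 0.0), np.maximum(W @ h2, 0.0)
    if layer % 5 == 0:
        print(f"depth {layer:2d}: angle = {angle_deg(h1, h2):6.2f} deg")
```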

There are several parallel works that correctly predict this behaviour, but they each give a different explanation of why it happens.

Question 1: Can these explanations be reconciled somehow?

What is needed is a solid theory of compositions of random matrices and nonlinearities acting on the angle between vectors, in the spirit of [Pennington et al., 2017].
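
One building block of such a theory already exists in the infinite-width limit: for He-initialized ReLU layers, the correlation between two inputs evolves by iterating the arc-cosine kernel map $\rho \mapsto \big(\sqrt{1-\rho^2} + (\pi - \arccos\rho)\,\rho\big)/\pi$, which pushes every correlation towards $\rho = 1$. A few iterations of this map reproduce the contraction seen in the finite-width simulation above:

```python
# Infinite-width correlation map for ReLU layers at He initialization.
import numpy as np

def relu_correlation_map(rho):
    return (np.sqrt(1.0 - rho ** 2) + (np.pi - np.arccos(rho)) * rho) / np.pi

rho = 0.0                         # start from orthogonal inputs (90 degrees)
for depth in range(1, 31):
    rho = relu_correlation_map(rho)
    if depth % 5 == 0:
        print(f"depth {depth:2d}: angle = {np.degrees(np.arccos(rho)):6.2f} deg")
```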

Question 2: Is this phenomenon really relevant for training?

The papers above claim that this phenomenon prevents training, and that the benefit of normalization techniques lies in avoiding this contraction behaviour. However, you can just initialize differently (e.g. with orthogonal matrices as the affine layers), and then there is no rank collapse. Is this the whole story?

 

If you want to work on some problem together, feel free to reach out.


  1. As [Husain et al., 2021] phrases it, there is a “price to pay for robustness”. ↩︎

  2. I should write a shorter introduction, with an emphasis on differentiable abstract interpretation. ↩︎