In Zurich, early in my PhD. Large language models (LLMs) are clearly a big deal, and I do not understand why most of ML academia still assumes we will go back to the pre-2020 ML paradigm.
Working on empirical research into how language models lose chain-of-thought interpretability when trained with reinforcement learning or other outcome-based optimization methods, such as RLHF.
Reading most of the LLM failure-mode and safety papers published recently, and summarizing some of them for my Twitter newsletter. Thinking about cross-posting to Mastodon.
I still believe that we passed peak data relevance in 2020, and that future models will draw most of their training signal from some form of reinforcement learning or self-distillation. I hope to be wrong.
Submitted two papers in the first month of my PhD:
- A law of adversarial risk, interpolation, and label noise;
- Red-Teaming the Stable Diffusion Safety Filter.
Last updated November 2022.