Now
In Zurich, in the second year of my PhD at ETH.
Working on whether and how LLMs represent text authorship. Apart from that, leading a remote research collaboration on evaluations for better-than-human LLM forecasters (a follow-up to this paper), and occasionally thinking about situational awareness.
Reading a lot of recently published safety papers and summarizing the most important ones on my Substack and Twitter.
I continue to believe that we passed peak data relevance some time ago, and that future models will draw most of their training signal from some form of reinforcement learning or self-distillation.
Six papers in my PhD so far, more soon:
- Stealing part of a production language model
- Evaluating Superhuman Models with Consistency Checks
- ARB: Advanced Reasoning Benchmark for Large Language Models
- Poisoning Web-Scale Training Datasets is Practical
- Red-Teaming the Stable Diffusion Safety Filter
- A law of adversarial risk, interpolation, and label noise
Last updated Mar 2024.