Now
In Zurich, in the second year of my PhD at ETH.
Working on evaluations for better-than-human LLM forecasters (a follow-up to this paper), and sometimes thinking about reverse-engineering model details.
Reading a lot of recently published safety papers, and summarizing the most important ones in my Substack newsletter and on Twitter.
I continue to believe that we passed peak data relevance some time ago, and that future models will draw most of their training signal from some kind of reinforcement learning or self-distillation.
Nine papers in my PhD so far, with more coming soon:
- Refusal in Language Models Is Mediated by a Single Direction
- Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
- Stealing part of a production language model (Best Paper Award at ICML 2024)
- Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- Evaluating Superhuman Models with Consistency Checks
- ARB: Advanced Reasoning Benchmark for Large Language Models
- Poisoning Web-Scale Training Datasets is Practical
- Red-Teaming the Stable Diffusion Safety Filter
- A law of adversarial risk, interpolation, and label noise
Last updated July 2024.