Now
I'm in the third year of my PhD at ETH Zurich, and in Berkeley from January to March 2025.
Thinking about what to do next. Some ideas:
- Red-teaming an LM by training an “output -> prompt” LM (rough sketch after this list);
- Demoing LLM agents doing weird stuff on the Internet on their own;
- Consistency as training signal on difficult tasks;
- “Thought that faster” as training signal on easy tasks.
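As a rough sketch of the first idea above (purely illustrative; the model, data, and hyperparameters below are placeholders, not something I've run): fine-tune a small seq2seq model on (output, prompt) pairs in the reverse direction, then sample candidate prompts for an output you'd like to elicit from the target model.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Placeholder inverse model; any small seq2seq model would do for a first pass.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# Toy (output, prompt) pairs; in practice, logged generations from the target LM.
pairs = [
    {"output": "Paris is the capital of France.", "prompt": "What is the capital of France?"},
    {"output": "2 + 2 = 4.", "prompt": "What is two plus two?"},
]

def preprocess(ex):
    # Inverted direction: the LM's output becomes the input, its prompt the label.
    enc = tokenizer(ex["output"], truncation=True, max_length=256)
    enc["labels"] = tokenizer(ex["prompt"], truncation=True, max_length=64)["input_ids"]
    return enc

train_ds = Dataset.from_list(pairs).map(preprocess, remove_columns=["output", "prompt"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="inverse_lm", num_train_epochs=1,
                                  per_device_train_batch_size=2),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    train_dataset=train_ds,
)
trainer.train()

# Red-teaming step: decode candidate prompts for an output we want to elicit.
target = tokenizer("Sure, here is how to do the bad thing: ...", return_tensors="pt")
candidates = model.generate(**target, do_sample=True, num_return_sequences=4, max_new_tokens=64)
print(tokenizer.batch_decode(candidates, skip_special_tokens=True))
```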
Reading many recently published safety papers and summarizing the most important ones on my Substack newsletter and Twitter.
I continue to believe that we passed peak data relevance some time ago, and that future models will draw most of their training signal from some kind of reinforcement learning or self-distillation. I’ve had this sentence on my website since 2022.
Ten papers in my PhD so far, more soon:
- Consistency Checks for Language Model Forecasters
- Refusal in Language Models Is Mediated by a Single Direction
- Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
- Stealing part of a production language model (Best Paper Award at ICML 2024)
- Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- Evaluating Superhuman Models with Consistency Checks
- ARB: Advanced Reasoning Benchmark for Large Language Models
- Poisoning Web-Scale Training Datasets is Practical
- Red-Teaming the Stable Diffusion Safety Filter
- A law of adversarial risk, interpolation, and label noise
Last updated Jan 2025.