Now
In the third year of my PhD at ETH Zurich.
Thinking about what to do next.
Some research ideas:
- Figure out what LLMs behaviorally want to do
- Red-teaming an LLM by training an “output -> prompt” LLM
- Consistency as a training signal on difficult tasks (a minimal sketch follows this list)
- “Thought that faster” as a training signal on easy tasks
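To make the consistency idea concrete, here is a minimal sketch (illustrative only; `sample_answers` and the majority-vote reward are my assumptions, not a worked-out method): score a batch of sampled answers by agreement with the modal answer, so self-consistent behavior can be reinforced even when no ground truth is available.

```python
# Sketch of "consistency as a training signal": reward sampled answers by
# how much they agree with each other, giving a signal on questions where
# no ground-truth label exists (hard or unresolved tasks).
import random
from collections import Counter

def sample_answers(question: str, k: int = 8) -> list[str]:
    # Hypothetical stub: in practice, k temperature>0 generations from the
    # policy model, parsed down to a canonical answer string.
    return [random.choice(["A", "B"]) for _ in range(k)]

def consistency_reward(answers: list[str]) -> float:
    # Fraction of samples agreeing with the modal answer.
    # 1.0 = fully self-consistent; 1/k = maximally inconsistent.
    modal_answer, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

if __name__ == "__main__":
    answers = sample_answers("Will X happen by 2030?")
    print(answers, "->", consistency_reward(answers))
```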
Reading recently published safety papers and summarizing the most important ones on my Substack newsletter and Twitter.
The following sentence has been on my website since 2022. Time to retire it?

> I continue to believe that we passed peak data relevance some time ago, and that future models will draw most of their training signal from some kind of reinforcement learning or self-distillation.
Recent posts:
- GPT-4o draws itself as a consistent type of guy
- You should delay engineering-heavy research in light of R&D automation
Eleven papers in my PhD so far:
- Pitfalls in Evaluating Language Model Forecasters
- Consistency Checks for Language Model Forecasters (ICLR 2025 Oral)
- Refusal in Language Models Is Mediated by a Single Direction
- Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
- Stealing part of a production language model (Best Paper Award at ICML 2024)
- Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- Evaluating Superhuman Models with Consistency Checks
- ARB: Advanced Reasoning Benchmark for Large Language Models
- Poisoning Web-Scale Training Datasets is Practical
- Red-Teaming the Stable Diffusion Safety Filter
- A law of adversarial risk, interpolation, and label noise
Last updated June 2025.