Now
I'm in the third year of my PhD at ETH Zurich, and in Berkeley from January to March 2025.
Thinking about what to do next. Some ideas:
- Red-teaming an LM by training an “output -> prompt” LM (rough sketch after this list);
- Demoing LLM agents doing weird stuff on the Internet on their own;
- Consistency as training signal on difficult tasks;
- “Thought that faster” as training signal on easy tasks.
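As a rough sketch of the first idea above (purely illustrative; the model, data, and hyperparameters below are placeholders, not something I've run): fine-tune a small seq2seq model on (output, prompt) pairs in the reverse direction, then sample candidate prompts for an output you'd like to elicit from the target model.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Placeholder inverse model; any small seq2seq model would do for a first pass.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# Toy (output, prompt) pairs; in practice, logged generations from the target LM.
pairs = [
    {"output": "Paris is the capital of France.", "prompt": "What is the capital of France?"},
    {"output": "2 + 2 = 4.", "prompt": "What is two plus two?"},
]

def preprocess(ex):
    # Inverted direction: the LM's output becomes the input, its prompt the label.
    enc = tokenizer(ex["output"], truncation=True, max_length=256)
    enc["labels"] = tokenizer(ex["prompt"], truncation=True, max_length=64)["input_ids"]
    return enc

train_ds = Dataset.from_list(pairs).map(preprocess, remove_columns=["output", "prompt"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="inverse_lm", num_train_epochs=1,
                                  per_device_train_batch_size=2),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    train_dataset=train_ds,
)
trainer.train()

# Red-teaming step: decode candidate prompts for an output we want to elicit.
target = tokenizer("Sure, here is how to do the bad thing: ...", return_tensors="pt")
candidates = model.generate(**target, do_sample=True, num_return_sequences=4, max_new_tokens=64)
print(tokenizer.batch_decode(candidates, skip_special_tokens=True))
```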
Reading many recently published safety papers and summarizing the most important ones on my Substack newsletter and Twitter.
I continue to believe that we passed peak data relevance some time ago, and that future models will draw most of their training signal from some kind of reinforcement learning or self-distillation. I’ve had this sentence on my website since 2022.
Ten papers in my PhD so far, more soon:
- Consistency Checks for Language Model Forecasters
- Refusal in Language Models Is Mediated by a Single Direction
- Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
- Stealing part of a production language model (Best Paper Award at ICML 2024)
- Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- Evaluating Superhuman Models with Consistency Checks
- ARB: Advanced Reasoning Benchmark for Large Language Models
- Poisoning Web-Scale Training Datasets is Practical
- Red-Teaming the Stable Diffusion Safety Filter
- A law of adversarial risk, interpolation, and label noise
Last updated Jan 2025.