I'm currently a member of technical staff at a stealth AI startup. Previously, I worked as an AI researcher at Scale AI, and before that I was a PhD candidate at Harvard, advised by David Parkes and supported by the NSF Graduate Research Fellowship Program and a Kempner Institute Graduate Fellowship. Even before that, I was a software engineer at Asana (gap year before college).
Outside of work, I'm a lifelong Go player (in fact, watching AlphaGo beat Lee Sedol sparked my interest in AI). I also co-founded The Gradient, a digital magazine focusing on AI.
My recent work focuses on evals, test-time compute, and post-training for LLMs. Previously, I worked on multi-agent reinforcement learning and game theory. * denotes equal contribution or alphabetical ordering.
Reconstructed o1 test-time scaling laws using public API access to o1-mini.
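For intuition, here is a minimal sketch of one standard way to trade test-time compute for accuracy: sample a model repeatedly and take a majority vote over the answers, tracing accuracy as a function of sample count. The `sample_answer` stub is hypothetical (standing in for an API call to a model like o1-mini), and this is not necessarily the paper's reconstruction method.

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stub for one API call to a reasoning model
    that returns a parsed final answer."""
    # Simulate a model that is right 60% of the time and otherwise
    # scatters its errors across a few wrong answers.
    return "7" if random.random() < 0.6 else random.choice(["5", "6", "8"])

def majority_vote_accuracy(question: str, truth: str, k: int, trials: int = 500) -> float:
    """Accuracy of a majority vote over k independent samples."""
    wins = 0
    for _ in range(trials):
        votes = Counter(sample_answer(question) for _ in range(k))
        wins += votes.most_common(1)[0][0] == truth
    return wins / trials

# Accuracy climbs with k, tracing out a test-time scaling curve.
for k in (1, 4, 16, 64):
    print(f"k={k:3d}  acc={majority_vote_accuracy('3+4?', '7', k):.2f}")
```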
Humanity's Last Exam
Long Phan*, Alice Gatti*, Ziwen Han*, Nathaniel Li*, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, and 1109 others
Nature, 2026
website
We demonstrate that multi-turn human jailbreaks achieve >70% success rates against LLM defenses that report single-digit rates for automated single-turn attacks.
Training on chain-of-thoughts that lead to a correct answer can help an LLM self-improve and generalize far beyond its original capabilities in the toy environment of addition.
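As a toy illustration of that filter-and-finetune loop (sample chain-of-thoughts, keep only those ending in the correct answer, train on the survivors), here is a self-contained sketch. Every moving part here (`generate_cot`, `extract_answer`, `finetune`, the dict-based `model`) is a hypothetical stub, not the paper's implementation.

```python
import random

def generate_cot(model: dict, prompt: str) -> str:
    """Hypothetical stub: sample one chain-of-thought ending in 'Answer: X'."""
    a, b = map(int, prompt.rstrip("=").split("+"))
    guess = a + b if random.random() < model["accuracy"] else a + b + 1
    return f"{a} plus {b}, step by step... Answer: {guess}"

def extract_answer(cot: str) -> str:
    return cot.rsplit("Answer: ", 1)[-1]

def finetune(model: dict, data: list) -> dict:
    """Hypothetical stub: training on correct traces nudges accuracy up."""
    if data:
        model["accuracy"] = min(1.0, model["accuracy"] + 0.1)
    return model

def self_improvement_round(model: dict, problems: list, k: int = 8) -> dict:
    kept = []
    for prompt, answer in problems:
        for _ in range(k):
            cot = generate_cot(model, prompt)
            if extract_answer(cot) == str(answer):  # keep only correct CoTs
                kept.append((prompt, cot))
    return finetune(model, kept)

model = {"accuracy": 0.3}
problems = [(f"{a}+{b}=", a + b) for a, b in [(12, 34), (56, 78), (9, 90)]]
for r in range(5):
    model = self_improvement_round(model, problems)
    print(f"round {r}: accuracy {model['accuracy']:.2f}")
```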
A unified algorithm for both reinforcement learning and game theory: it solves MDPs as quickly as standard RL methods and imperfect-information games as quickly as CFR, all with a single set of hyperparameters.
A novel no-regret learning procedure that converges to correlated and coarse correlated equilibria several orders of magnitude faster than previous methods in randomly generated normal-form games.
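For context, the classic baseline such methods improve on is regret matching (Hart and Mas-Colell), whose empirical joint play converges to the set of coarse correlated equilibria. Below is a self-contained sketch on a random two-player normal-form game; it shows the baseline only, not the paper's accelerated procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 20000
A = rng.standard_normal((n, n))   # row player's payoff matrix
B = rng.standard_normal((n, n))   # column player's payoff matrix

def rm_strategy(regrets: np.ndarray) -> np.ndarray:
    """Play each action in proportion to its positive cumulative regret."""
    pos = np.maximum(regrets, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(len(regrets), 1.0 / len(regrets))

r1, r2 = np.zeros(n), np.zeros(n)
joint = np.zeros((n, n))          # empirical distribution of joint play

for _ in range(T):
    i = rng.choice(n, p=rm_strategy(r1))
    j = rng.choice(n, p=rm_strategy(r2))
    joint[i, j] += 1
    r1 += A[:, j] - A[i, j]       # external regret vs. each fixed row
    r2 += B[i, :] - B[i, j]       # external regret vs. each fixed column

joint /= T
# CCE gap: best gain from committing to one action against the empirical play.
gap1 = (A @ joint.sum(axis=0)).max() - (A * joint).sum()
gap2 = (joint.sum(axis=1) @ B).max() - (B * joint).sum()
print(f"CCE gap: {max(gap1, gap2):.4f}")  # shrinks toward 0 as T grows
```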
Existing language models can generate utterances that are either high-quality or diverse, but not both at once. How can we capture this tradeoff in a single metric?