Satvik Golechha

Building Better Deception Probes Using Targeted Instruction Pairs

Vikram Natarajan, Devina Jain, Shivam Arora, Satvik Golechha, Joseph Bloom

ICML 2026 (co-mentored at LASR)

Auditing Games for Sandbagging

Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Z-M., Oliver M., Connor K., Kola A., Jacob M., Sam Marks, Chris Cundy, Joseph Bloom

2025, UK AISI (in collaboration with FAR AI)

Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Satvik Golechha, Adrià Garriga-Alonso

NeurIPS 2025 (Spotlight) (MATS)

A is for Absorption: Studying Feature Splitting and Absorption in SAEs

David Chanin, James W.S., Tomáš D., Hardik B., Satvik Golechha, Joseph Bloom

NeurIPS 2025 (Oral) (MATS)

ABBEL: Acting through Belief Bottlenecks Expressed in Language

Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr

NeurIPS 2025 (Spotlight, LAW workshop) (CHAI, UC Berkeley)

Website

Auditing Language Models for Hidden Objectives

Samuel Marks, Johannes Treutlein, ..., Satvik Golechha, ..., Evan Hubinger

2025, Anthropic (external collaboration)

Anthropic

Who’s the Evil Twin? Differential Auditing for Undesired Behavior

Ishwar B., Hasith V., Greta K., Ronan A., Satvik Golechha

Mentored at SPAR 2025

Intricacies of Feature Geometry in Large Language Models

Satvik Golechha, Lucius Bushnaq, Euan Ong, Neeraj Kayal, Nandi Schoots

ICLR 2025 (poster) (best blog award)

ICLR Blog

Studying Cross-cluster Modularity in Neural Networks

Satvik Golechha, Maheep C., Joan V., Alessandro Abate, Nandi Schoots

NeurIPS 2024: Workshop on Science of Deep Learning

Poster

Some Lessons from the OpenAI-FrontierMath Debacle

Satvik Golechha

A piece of investigative journalism that became pretty popular :)

YC HackerNews

Media

Progress Measures for Grokking on Real-world Tasks

Satvik Golechha

ICML 2024: Workshop on High-Dim. Learning Dynamics (independent)

Challenges in Mechanistically Interpreting Harmful Representations

Satvik Golechha, James Dao

ICML 2024: Workshop on Mechanistic Interpretability (independent)

NICE: To Optimize In-Context Examples or Not?

Pragya Srivastava*, Satvik Golechha*, Amit Deshpande, Amit Sharma

ACL 2024 (main, poster) (Microsoft Research)

BYoEB: An LLM-Powered Expert-in-the-Loop Chat System

Pragnya R.*, Bhuvan S.*, Satvik Golechha*, Mohit Jain, and others

UbiComp 2025 (Microsoft Research) (deployed in 3+ hospitals)

Predicting Treatment Adherence of Tuberculosis Patients at Scale

Mihir Kulkarni*, Satvik Golechha*, Rishi R.*, Jithin S.*, Alpan Raval

NeurIPS 2022 (Wadhwani AI) (deployed for 40k+ patients)