• Pyvene

Pyvene is an open-source interpretability tool created by Zhengxuan Wu.

    We aim to utilize and contribute to this fantastic resource!

  • Published Research

    Our primary contribution to the world is academic papers.

    Check out our research timeline below!

  • Causal Abstraction

Mechanistic interpretability research analyzes the internals of AI models.

    We want to ground the field in a formal theory of causal abstraction.

Research Timeline

June 2023

Pr(Ai)²R Group is founded by Atticus Geiger.

September 2023

Rigorously Assessing Natural Language Explanations of Neurons

An audit of OpenAI’s “Language models can explain neurons in language models,” in which GPT-4 provides text explanations of neurons in GPT-2. Jing Huang at Stanford led the effort to evaluate the explanation texts provided by GPT-4, under the supervision of Chris Potts. Atticus Geiger contributed as a Pr(Ai)²R Group member. The project culminated in a Best Paper Award at BlackboxNLP 2023.

October 2023

Linear Representations of Sentiment in Large Language Models

An analysis of how positive and negative sentiment is represented in language models. Curt Tigges and Oskar Hollinsworth led the project as SERI MATS interns. Neel Nanda supervised the project and Atticus Geiger contributed as a Pr(Ai)²R Group member.
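To make the “linear representation” claim concrete, here is a minimal sketch of one common way to extract a candidate sentiment direction: taking the difference of class-mean activations. The tensors and labels below are placeholder assumptions for illustration, not the paper’s actual data or method details.

```python
import torch

# Placeholder assumptions: residual-stream activations of shape
# (n_examples, d_model) collected at some layer, with binary
# sentiment labels (True = positive, False = negative).
acts = torch.randn(200, 768)
labels = torch.randint(0, 2, (200,)).bool()

# A linear representation hypothesis: a single direction whose
# projection separates the classes. The difference of class means
# is a simple estimate of such a direction.
direction = acts[labels].mean(dim=0) - acts[~labels].mean(dim=0)
direction = direction / direction.norm()

# Score activations by projecting onto the candidate direction.
scores = acts @ direction
```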

January 2024

A Reply to Makelov et al. (2023)’s “Interpretability Illusion” Arguments

A critical response challenging the notion of illusion presented in “Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching”. Zhengxuan Wu led the project under the supervision of Noah Goodman. Atticus Geiger contributed as a Pr(Ai)²R Group member.

February 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

A benchmark for evaluating interpretability methods that localize high-level concepts to features inside deep learning models. Jing Huang at Stanford led the effort under the supervision of Atticus Geiger, a Pr(Ai)²R Group member.

March 2024

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

An open-source Python library that supports customizable interventions on a range of PyTorch modules. Created by Zhengxuan Wu, with Atticus Geiger contributing as a Pr(Ai)²R Group member.
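For a flavor of what an intervention on a PyTorch model looks like, here is a minimal sketch using plain PyTorch forward hooks rather than pyvene’s own API; the choice of GPT-2, layer 8, and a zero ablation are illustrative assumptions, not anything prescribed by the library.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative assumptions: GPT-2 as the model, layer 8's MLP
# output as the intervention site.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def zero_intervention(module, inputs, output):
    # Replace the module's activations with zeros (a "zero ablation").
    return torch.zeros_like(output)

# Register the intervention on one MLP block, run the model, clean up.
handle = model.transformer.h[8].mlp.register_forward_hook(zero_intervention)
with torch.no_grad():
    out = model(**tokenizer("The capital of Spain is", return_tensors="pt"))
handle.remove()
```

Libraries like pyvene exist because hand-rolled hooks like this become hard to manage once interventions need to be swapped, composed, or trained; the sketch only shows the underlying mechanism.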

April 2024

ReFT: Representation Finetuning for Language Models

An activation-steering method that learns task-specific interventions on the hidden representations of a frozen base model. ReFT outperforms parameter-efficient finetuning methods such as LoRA on a variety of tasks. Led by Zhengxuan Wu and Aryaman Arora, with Atticus Geiger contributing as a Pr(Ai)²R Group member.
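To illustrate the core idea of a learned intervention on hidden representations, here is a sketch in the spirit of ReFT’s low-rank variant, which edits a hidden state h as h + Rᵀ(Wh + b − Rh) within a low-rank subspace. The module name, dimensions, and initialization below are assumptions for illustration, not the paper’s reference implementation.

```python
import torch
import torch.nn as nn

class LowRankIntervention(nn.Module):
    """Sketch of a learned low-rank edit of a hidden state,
    in the spirit of ReFT: h + R^T (W h + b - R h).
    Names and dimensions here are illustrative assumptions."""
    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        # R projects into a rank-r subspace; W, b define the learned target.
        self.R = nn.Parameter(torch.empty(rank, hidden_dim))
        nn.init.orthogonal_(self.R)
        self.proj = nn.Linear(hidden_dim, rank)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Edit h only inside the subspace spanned by R's rows;
        # directions orthogonal to that subspace pass through unchanged.
        return h + (self.proj(h) - h @ self.R.T) @ self.R

# Only the intervention's parameters train; the base model stays frozen.
intervention = LowRankIntervention(hidden_dim=768, rank=4)
h = torch.randn(1, 5, 768)   # a batch of hidden states (illustrative)
h_edited = intervention(h)   # same shape, edited representation
```

Because only the small intervention module is trained, the number of learned parameters is tiny compared to the frozen base model, which is what makes this family of methods competitive with parameter-efficient finetuning.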