• Pyvene

    Pyvene is an open-source interpretability tool made by Zhengxuan Wu.

    We aim to utilize and contribute to this fantastic resource!

  • Published Research

    Our primary contribution to the world is academic papers.

    Check out our research timeline below!

  • Causal Abstraction

    Mechanistic interpretability research analyzes the internals of AI models.

    We want to ground the field in a formal theory of causal abstraction.

Research Timeline

July 2024

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

A comprehensive treatise arguing that causal abstraction provides the theoretical foundations for mechanistic interpretability. The core contributions are generalizing the theory of causal abstraction to arbitrary mechanism transformations and providing precise yet flexible formalizations of core interpretability methods and concepts. Led by Atticus Geiger, with Amir Zur, Maheep Chaudhary, and Sonakshi Chauhan contributing as Pr(Ai)²R Group members.
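The central operation this theory builds on is the interchange intervention: run a model on a base input while overwriting one intermediate variable with the value it takes on a source input. Below is a minimal sketch on a toy high-level causal model; the task and variable names are illustrative, not taken from the paper.

    def high_level_model(x, y, z, intervene_S=None):
        """Toy causal model: does x + y equal z? S = x + y is the intermediate variable."""
        S = x + y if intervene_S is None else intervene_S   # mechanism for S, possibly overwritten
        return int(S == z)                                   # output mechanism

    def interchange_intervention(base, source):
        """Compute S on the source input, then run the base input with S fixed to that value."""
        S_source = source[0] + source[1]
        return high_level_model(*base, intervene_S=S_source)

    base, source = (1, 2, 5), (2, 3, 7)
    print(high_level_model(*base))                  # 0, since 1 + 2 != 5
    print(interchange_intervention(base, source))   # 1, since the overwritten S = 2 + 3 == 5

A neural network is abstracted by such a high-level model when corresponding interventions on its hidden representations produce the same pattern of outputs.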

June 2024

Updating CLIP to Prefer Descriptions Over Captions

A method for updating the text and image model CLIP to be sensitive to the distinction between a caption, meant to complement an image, and a description, meant to replace the image for blind and low-vision users. We use a training objective based on distributed interchange interventions to instill this concept in the CLIP model. Led by Amir Zur under the supervision of Atticus Geiger, both contributing as Pr(Ai)²R Group members.
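A distributed interchange intervention performs that kind of swap inside a learned linear subspace of a hidden vector rather than on raw neurons. The sketch below shows the core operation only; the dimensions are illustrative, and the paper's CLIP-specific training objective is not reproduced here.

    import torch
    from torch.nn.utils.parametrizations import orthogonal

    hidden_dim, subspace_dim = 8, 2
    # Learned orthogonal rotation of the hidden space; the first subspace_dim rotated
    # coordinates are treated as the subspace that encodes the concept of interest.
    R = orthogonal(torch.nn.Linear(hidden_dim, hidden_dim, bias=False))

    def distributed_interchange(h_base, h_source):
        """Rotate both vectors, copy the source's subspace coordinates into the base, rotate back."""
        rb, rs = h_base @ R.weight.T, h_source @ R.weight.T
        mixed = torch.cat([rs[..., :subspace_dim], rb[..., subspace_dim:]], dim=-1)
        return mixed @ R.weight

    h_base, h_source = torch.randn(hidden_dim), torch.randn(hidden_dim)
    h_counterfactual = distributed_interchange(h_base, h_source)
    # During training, h_counterfactual is fed onward through the model and an ordinary
    # task loss pushes the output toward the counterfactual behavior, updating R.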

July 2023

Pr(Ai)²R Group is founded by Atticus Geiger.

September 2023

Rigorously Assessing Natural Language Explanations of Neurons

An audit of OpenAI's "Language models can explain neurons in language models," in which GPT-4 provides text explanations of neurons in GPT-2. Jing Huang at Stanford led the effort to evaluate the explanation texts provided by GPT-4, under the supervision of Chris Potts. Atticus Geiger contributed as a Pr(Ai)²R Group member. The project culminated in a best paper award at BlackboxNLP 2023.

October 2023

Linear Representations of Sentiment in Large Language Models

An analysis of how positive and negative sentiment is represented in language models. Curt Tigges and Oskar Hollinsworth led the project as SERI MATS interns. Neel Nanda supervised the project and Atticus Geiger contributed as a Pr(Ai)²R Group member.
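A "linear representation" here is a single direction in activation space. As a purely illustrative example (not the paper's exact procedure), one simple candidate direction is the difference of mean activations over positive and negative inputs:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16
    true_dir = rng.normal(size=d)
    true_dir /= np.linalg.norm(true_dir)
    # Stand-ins for hidden states collected on positive vs. negative inputs.
    pos_acts = rng.normal(size=(100, d)) + 2.0 * true_dir
    neg_acts = rng.normal(size=(100, d)) - 2.0 * true_dir

    # Candidate sentiment direction: difference of the two class means.
    sentiment_dir = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    sentiment_dir /= np.linalg.norm(sentiment_dir)

    # Score new activations by projecting onto the direction.
    score = lambda acts: acts @ sentiment_dir
    print(score(pos_acts).mean() > score(neg_acts).mean())   # True: the direction separates the classes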

January 2024

February 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

A benchmark for evaluating interpretability methods that localize high-level concepts to features inside deep learning models. Jing Huang at Stanford led the effort under the supervision of Atticus Geiger, a Pr(Ai)²R Group member.

March 2024

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

An open-source Python library that supports customizable interventions on a range of different PyTorch modules. Created by Zhengxuan Wu with Atticus Geiger contributing as a Pr(Ai)²R Group member.
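The snippet below is not pyvene's own API; it is a bare-bones illustration, using a plain PyTorch forward hook on a toy model, of the kind of activation intervention the library packages in a far more flexible and configurable form.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    stored = {}

    def capture(module, inputs, output):
        stored["act"] = output.detach().clone()     # record the activation from the source run

    def overwrite(module, inputs, output):
        return stored["act"]                        # returning a value replaces the module's output

    x_source, x_base = torch.randn(1, 4), torch.randn(1, 4)

    handle = model[0].register_forward_hook(capture)
    model(x_source)                                 # source run: store the first layer's output
    handle.remove()

    handle = model[0].register_forward_hook(overwrite)
    print(model(x_base))                            # base run with the source activation patched in
    handle.remove()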

April 2024

ReFT: Representation Finetuning for Language Models

An activation steering method that learns task-specific interventions on the hidden representations of a frozen base model. ReFT outperforms parameter-efficient fine-tuning methods such as LoRA on a variety of tasks. Led by Zhengxuan Wu and Aryaman Arora with Atticus Geiger contributing as a Pr(Ai)²R Group member.
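The sketch below is a simplified stand-in for the kind of low-rank representation edit ReFT learns, in the spirit of the paper's LoReFT intervention: the frozen model's hidden state is adjusted only inside a small learned subspace, h + R^T(Wh + b - Rh).

    import torch
    import torch.nn as nn
    from torch.nn.utils.parametrizations import orthogonal

    class LowRankIntervention(nn.Module):
        """Edit a hidden state only inside a small learned subspace: h + R^T(W h + b - R h)."""
        def __init__(self, hidden_dim: int, rank: int):
            super().__init__()
            self.R = orthogonal(nn.Linear(hidden_dim, rank, bias=False))  # rows of R span the edit subspace
            self.proj = nn.Linear(hidden_dim, rank)                       # target values W h + b

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            return h + (self.proj(h) - self.R(h)) @ self.R.weight

    h = torch.randn(2, 768)                   # stand-in for a frozen base model's hidden states
    edit = LowRankIntervention(hidden_dim=768, rank=4)
    print(edit(h).shape)                      # torch.Size([2, 768]); only the intervention is trained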

August 2024

Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations

A counterexample to the linear representation hypothesis using recurrent neural networks trained to store and repeat sequences of tokens. Rather than storing the token at each position in a separate linear subspace, the model stores the token for each position at a particular order of magnitude. We dub these “onion representations”, because the larger-magnitude representations of earlier tokens must be peeled away to reveal the smaller-magnitude representations of later tokens. Led by Róbert Csordás with Atticus Geiger supervising as a Pr(Ai)²R Group member.
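A cartoon of the idea with a single scalar state (illustrative arithmetic only, not the trained RNN's actual mechanism): each successive token is written at a smaller order of magnitude, and decoding peels the largest remaining layer off first.

    def encode(tokens, base=100):
        """Store each token at a successively smaller order of magnitude in one scalar."""
        state = 0.0
        for i, t in enumerate(tokens):
            state += t / base ** i
        return state

    def decode(state, length, base=100):
        """Peel off the largest remaining layer, then rescale to expose the next one."""
        tokens = []
        for _ in range(length):
            t = int(round(state))
            tokens.append(t)
            state = (state - t) * base
        return tokens

    seq = [7, 3, 9, 1]                       # small integer "tokens"
    print(decode(encode(seq), len(seq)))     # [7, 3, 9, 1]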

September 2024

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

We use the RAVEL benchmark to evaluate whether SAEs trained on the hidden representations of GPT-2 small have sets of features that separately mediate knowledge of which country a city is in and which continent it is in. We find that SAE features do not match the performance of a neuron baseline. Led by Maheep Chaudhary with Atticus Geiger supervising, both contributing as Pr(Ai)²R Group members.
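A compressed sketch of the feature-patching check behind this kind of evaluation, with toy dimensions, random weights, and a hypothetical feature set (not the benchmark's code):

    import torch
    import torch.nn as nn

    d_model, d_sae = 16, 64
    encoder, decoder = nn.Linear(d_model, d_sae), nn.Linear(d_sae, d_model)   # toy SAE

    def patch_features(h_base, h_source, feature_idx):
        """Swap only the selected SAE features from the source activation into the base."""
        f_base = torch.relu(encoder(h_base))
        f_source = torch.relu(encoder(h_source))
        f_base[feature_idx] = f_source[feature_idx]
        return decoder(f_base)

    h_base, h_source = torch.randn(d_model), torch.randn(d_model)
    country_features = [3, 17, 42]          # hypothetical features claimed to mediate "country"
    h_patched = patch_features(h_base, h_source, country_features)
    print(h_patched.shape)
    # The evaluation then checks: with h_patched in place, does the model report the source
    # city's country while its continent answer and other attributes are left unchanged?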