-
Pyvene
Pyvene is an open-source interpretability library created by Zhengxuan Wu.
We aim to use and contribute to this fantastic resource!
-
Published Research
Our primary contributions to the world are academic papers.
Check out our research timeline below!
-
Causal Abstraction
Mechanistic interpretability research analyzes the internals of AI models.
We want to ground the field in a formal theory of causal abstraction.
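For readers new to the framework, here is a toy sketch (not taken from any of our papers) of the basic operation that causal abstraction formalizes: an interchange intervention, which runs a model on one input while forcing an intermediate variable to take the value it would have on another input.

# Toy illustration (not from any Pr(Ai)2R paper): an interchange intervention
# on a simple high-level causal model. Causal abstraction asks whether a
# neural network's internal computation can be aligned with a model like this.

def high_level_model(x, y, z, intervene_s=None):
    """Causal model: S = x + y, OUT = S * z. Optionally fix S to a given value."""
    s = x + y if intervene_s is None else intervene_s
    return s * z

# A base input and a second "source" input.
base = (1, 2, 3)      # S = 3, OUT = 9
source = (4, 5, 3)    # S = 9, OUT = 27

# Interchange intervention: run the model on the base input, but overwrite the
# intermediate variable S with the value it takes on the source input.
s_from_source = source[0] + source[1]              # S = 9
counterfactual = high_level_model(*base, intervene_s=s_from_source)
print(counterfactual)  # 27: the output now tracks the source's value of S

Causal abstraction asks when a neural network, subjected to the analogous intervention on its hidden activations, behaves like such a high-level model.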
Research Timeline
July 2023
Pr(Ai)²R Group is founded by Atticus Geiger.
September 2023
Rigorously Assessing Natural Language Explanations of Neurons
An audit of OpenAI's "Language models can explain neurons in language models", in which GPT-4 provides text explanations of neurons in GPT-2. Jing Huang at Stanford led the effort to evaluate the explanation texts provided by GPT-4 under the supervision of Chris Potts. Atticus Geiger contributed as a Pr(Ai)²R Group member. The project culminated in a best paper award at BlackboxNLP 2023.
October 2023
Linear Representations of Sentiment in Large Language Models
An analysis of how positive and negative sentiment is represented in language models. Curt Tigges and Oskar Hollinsworth led the project as SERI MATS interns. Neel Nanda supervised the project and Atticus Geiger contributed as a Pr(Ai)²R Group member.
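As a rough illustration of the kind of linear structure such an analysis looks for (this is a generic recipe, not the paper's exact pipeline), one can take hidden states from positive and negative inputs, form a difference-of-means direction, and score new activations by projecting onto it:

# Generic sketch (not the paper's exact pipeline): find a candidate "sentiment
# direction" as the difference of mean hidden states, then score new activations
# by projecting onto it.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Placeholders for hidden states collected from a language model on positive
# and negative inputs; in practice these would come from the model itself.
pos_acts = rng.normal(loc=+0.5, size=(100, d_model))
neg_acts = rng.normal(loc=-0.5, size=(100, d_model))

direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Projection onto the direction gives a scalar "sentiment score" per activation.
held_out = rng.normal(size=(5, d_model))
scores = held_out @ direction
print(scores)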
January 2024
A Reply to Makelov et al. (2023)’s “Interpretability Illusion” Arguments
A critical response that challenges the notion of an interpretability illusion presented in "Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching". Zhengxuan Wu led the project. Atticus Geiger contributed as a Pr(Ai)²R Group member.
February 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
A benchmark for evaluating interpretability methods that localize high-level concepts to features inside deep learning models. Jing Huang at Stanford led the effort under the supervision of Atticus Geiger, a Pr(Ai)²R Group member.
March 2024
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
An open-source Python library that supports customizable interventions on a range of different PyTorch modules. Created by Zhengxuan Wu with Atticus Geiger contributing as a Pr(Ai)²R Group member.
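pyvene wraps this kind of operation behind a configurable API; the sketch below uses plain PyTorch forward hooks rather than pyvene's own interface, only to show the underlying idea of swapping one module's activations between two runs.

# Sketch of the kind of intervention pyvene makes configurable, written with
# plain PyTorch forward hooks rather than pyvene's own API.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
base = torch.randn(1, 8)
source = torch.randn(1, 8)

# 1) Cache the hidden activation of layer 0 on the source input.
cache = {}
def save_hook(module, inputs, output):
    cache["h"] = output.detach()

handle = model[0].register_forward_hook(save_hook)
model(source)
handle.remove()

# 2) Run the base input, but swap in the cached source activation.
def swap_hook(module, inputs, output):
    return cache["h"]

handle = model[0].register_forward_hook(swap_hook)
intervened_out = model(base)
handle.remove()

print(intervened_out)   # output of the base run with the source's layer-0 activation
print(model(base))      # ordinary base run, for comparison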
April 2024
ReFT: Representation Finetuning for Language Models
An activation steering method that learns task-specific interventions on the hidden representations of a frozen base model. ReFT outperforms parameter-efficient fine-tuning methods such as LoRA on a variety of tasks. Led by Zhengxuan Wu and Aryaman Arora with Atticus Geiger contributing as a Pr(Ai)²R Group member.
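As a rough illustration, the paper's main LoReFT variant (as we understand it) edits a hidden state h as h + R^T (W h + b - R h), where R is a low-rank projection with orthonormal rows; the sketch below applies that update to a single vector with made-up dimensions.

# Illustrative implementation of a LoReFT-style edit (our reading of the ReFT
# paper): h <- h + R^T (W h + b - R h), with low-rank R. Dimensions are made up.
import torch

torch.manual_seed(0)
d_model, rank = 16, 4

h = torch.randn(d_model)                              # frozen model's hidden state
R = torch.linalg.qr(torch.randn(d_model, rank)).Q.T   # rank x d_model, orthonormal rows
W = torch.randn(rank, d_model)                        # learned projection
b = torch.randn(rank)                                 # learned bias

edited_h = h + R.T @ (W @ h + b - R @ h)
print(edited_h.shape)  # torch.Size([16]); only a rank-4 subspace of h was edited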
June 2024
Updating CLIP to Prefer Descriptions Over Captions
A method for updating the text and image model CLIP to be sensitive to the distinction between a caption meant to complement an image and a description meant to replace the image for blind and low vision users. We use a training objective based on distributed interchange interventions to instill this concept in the CLIP model. Led by Amir Zur under the supervision of Atticus Geiger, both contributing as Pr(Ai)²R Group members.
July 2024
Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability
A comprehensive treatise arguing that causal abstraction provides the theoretical foundations for mechanistic interpretability. The core contributions are generalizing the theory of causal abstraction to arbitrary mechanism transformations and providing precise yet flexible formalizations of core interpretability methods and concepts. Led by Atticus Geiger, with Amir Zur, Maheep Chaudhary, and Sonakshi Chauhan contributing as Pr(Ai)²R Group members.
August 2024
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations
A counterexample to the linear representation hypothesis using recurrent neural networks trained to store and repeat sequences of tokens. Rather than storing the token at each position in a separate linear subspace, the model stores the tokens for each position at a particular order of magnitude. We dub these “onion representations”, because larger magnitude representations of earlier tokens must be peeled away to reveal the smaller magnitude representations of later tokens. Led by Róbert Csordás with Atticus Geiger supervising as a Pr(Ai)²R Group member.
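A toy numeric sketch of the idea (not the trained RNN's actual mechanism): stack token values at decreasing orders of magnitude in a single state, then recover later tokens by peeling away the larger-magnitude layers first.

# Toy illustration of the "onion" idea (not the trained RNN's actual mechanism):
# tokens are stacked at decreasing orders of magnitude in a single scalar state,
# and later tokens are read out by peeling away the larger-magnitude layers.
tokens = [3, 7, 2]          # token ids in [0, 9]
scales = [100, 10, 1]       # earlier tokens stored at larger magnitude

state = sum(t * s for t, s in zip(tokens, scales))   # 372

decoded = []
for s in scales:
    digit = state // s      # read the largest remaining layer
    decoded.append(digit)
    state -= digit * s      # peel it away to expose the next token
print(decoded)              # [3, 7, 2]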
September 2024
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
We use the RAVEL benchmark to evaluate whether SAEs trained on hidden representations of GPT-2 small have sets of features that separately mediate knowledge of which country a city is in and which continent it is in. We find that SAE features do not match a neuron baseline. Led by Maheep Chaudhary with Atticus Geiger supervising, both contributing as Pr(Ai)²R Group members.