AI systems have advanced remarkably in recent years, but they have also grown more opaque. For example, ChatGPT and DeepSeek seem intelligent on the surface, but what have they really learned? Why and how do they make the decisions that they do? The ability to answer questions like these is critical in applications like healthcare or legal systems. This research seminar course explores the field of interpretable machine learning, which seeks to understand the internal computational processes of machine learning models—and sometimes, to precisely control these processes. We will cover foundational and cutting-edge topics, including distributed representations, attribution methods, and the emerging field of mechanistic interpretability. We will also cover open problems, such as interpretability illusions and challenges in evaluation. Students will read and present research papers, lead and participate in discussions on these topics, and conduct an interpretability research project.

Prerequisites

There are no required prerequisites, but it will be very useful to have taken at least one of Deep Learning, Natural Language Processing, Computer Vision, or Multimodal Machine Learning. Here are some review materials that may be helpful:

Theme: This semester's general theme is mechanistic interpretability for neural networks. We will focus on understanding neural networks by understanding the computations implemented in their components, rather than just their input-output behaviors.
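
As a concrete preview of what "understanding the computations implemented in components" looks like in practice, below is a minimal sketch of activation patching, a core technique we will cover: run a model on a "clean" and a "corrupted" input, splice activations from the clean run into the corrupted run, and measure how much of the clean behavior is restored. This is an illustrative sketch assuming PyTorch; the toy model and inputs are hypothetical stand-ins, not code from any assigned paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy two-layer network standing in for a real model (hypothetical).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

clean_input = torch.randn(1, 8)    # input on which the model behaves as expected
corrupt_input = torch.randn(1, 8)  # a different input that changes the behavior

# 1) Record the first layer's activation on the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["clean_act"] = output.detach()

handle = model[0].register_forward_hook(save_hook)
clean_logits = model(clean_input)
handle.remove()

# 2) Re-run on the corrupted input, splicing the clean values into a
#    subset of hidden units (the "component" under study).
def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, :8] = cache["clean_act"][:, :8]  # patch units 0-7 only
    return patched  # returning a tensor replaces the layer's output

handle = model[0].register_forward_hook(patch_hook)
patched_logits = model(corrupt_input)
handle.remove()

corrupt_logits = model(corrupt_input)

# 3) The more the patched output moves from the corrupted output toward
#    the clean one, the more those units mediate the behavior under study.
print("clean:  ", clean_logits.detach())
print("corrupt:", corrupt_logits.detach())
print("patched:", patched_logits.detach())
```

Several of the assigned readings (e.g., Vig et al., 2020; Wang et al., 2023) apply this same recipe at scale, to attention heads and MLP blocks in transformers.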

Learning objectives
Students will:

  1. Gain exposure to foundational and cutting-edge research in interpretable machine learning.
  2. Improve their written, visual, and oral scientific communication skills.
  3. Improve their ability to read and constructively comment on technical papers.
  4. Gain hands-on experience in developing and applying interpretability methods.


Logistics



News

- Sep. 4: The presentation schedule has been released! Please see the Piazza for a link.
- Sep. 4: We now have a Piazza! You all should have received an email with a link to sign up. If you didn't (or if you're auditing and I don't have your email), please let me know and I'll add you.



Seminar Schedule

Note: we will almost certainly alter this schedule! The order may also change depending on the availability of guest speakers.

Each entry lists the date, any deadlines, the topic, the assigned reading(s), and the student presentation (if any).

Sep 2, 2025: Course introduction
  • Introduction to interpretability
  • Course logistics
  • Course topics
  Recommended reading: Chen et al. (2024): Designing a Dashboard for Transparency and Control of Conversational AI
  No student presentations

Sep 4, 2025
  • The intellectual history of interpretability
  • Example paper presentation
  Reading: Rumelhart et al. (1986): Learning representations by back-propagating errors
  No student presentations

Sep 9, 2025: Visualization
  • Saliency maps (a minimal code sketch appears after the schedule)
  Reading: Simonyan et al. (2014): Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
  Presentation: Simonyan et al. (2014)

Sep 11, 2025: Feature attribution
  • Grad-CAM
  • Integrated gradients
  Reading: Selvaraju et al. (2019): Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
  Presentation: Selvaraju et al. (2019)

Sep 16, 2025: Feature attribution, cont.
  Reading: Sundararajan et al. (2017): Axiomatic Attribution for Deep Networks
  Presentation: Sundararajan et al. (2017)

Sep 18, 2025: Credit assignment
  • Explanation models
  • Local vs. global explanations
  Reading: Lundberg & Lee (2017): A Unified Approach to Interpreting Model Predictions
  Recommended reading: Ribeiro et al. (2016): "Why Should I Trust You?": Explaining the Predictions of Any Classifier
  Presentation: Lundberg & Lee (2017)

Sep 23, 2025: Influence functions
  Reading: Koh & Liang (2017): Understanding Black-box Predictions via Influence Functions
  Presentation: Koh & Liang (2017)

Sep 25, 2025: Component localization
  • Causal mediation analysis
  • Distributed representations
  Reading: Vig et al. (2020): Investigating Gender Bias in Language Models Using Causal Mediation Analysis
  Recommended reading: Thorpe (1989): Local vs. Distributed Coding
  Presentation: Vig et al. (2020)

Sep 30, 2025: Understanding attention
  • A quick intro to the attention mechanism
  • Attention as explanation: pros and cons
  Readings: Vaswani et al. (2017): Attention Is All You Need; and Clark et al. (2019): What Does BERT Look at? An Analysis of BERT's Attention
  No student presentations

Oct 2, 2025: Understanding attention, cont. (Project proposal due)
  Readings: Jain & Wallace (2019): Attention Is Not Explanation; and Wiegreffe & Pinter (2019): Attention Is Not Not Explanation
  Presentation: Jain & Wallace (2019) or Wiegreffe & Pinter (2019)

Oct 7, 2025: Probing
  • Auxiliary tasks
  • Control tasks
  • Selectivity and expressivity
  Reading: Tenney et al. (2019): What Do You Learn from Context? Probing for Sentence Structure in Contextualized Word Representations
  Presentation: Tenney et al. (2019)

Oct 9, 2025: Probing, cont.
  Reading: Hewitt & Liang (2019): Designing and Interpreting Probes with Control Tasks
  Presentation: Hewitt & Liang (2019)

Oct 14, 2025: No class (Monday schedule)

Oct 16, 2025: Mechanistic Interpretability - Basics (Project proposal revision due)
  • Circuit discovery
  • Path patching
  • Induction heads
  Reading: Wang et al. (2023): Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
  Recommended reading: Saphra & Wiegreffe (2024): Mechanistic?
  Presentation: Wang et al. (2023)

Oct 21, 2025: Mechanistic Interpretability - Basics, cont.
  Reading: Olsson et al. (2022): In-context Learning and Induction Heads
  Presentation: Olsson et al. (2022)

Oct 23, 2025: Mechanistic Interpretability, pt. 2
  • Targeted model editing
  • Unlearning
  Reading: Meng et al. (2022): Locating and Editing Factual Associations in GPT
  Presentation: Meng et al. (2022)

Oct 28, 2025: Mechanistic Interpretability, pt. 2, cont.
  Reading: Ravfogel et al. (2020): Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection
  Presentation: Ravfogel et al. (2020)

Oct 30, 2025: Mechanistic Interpretability, pt. 3
  • Featurization
  • Causal abstraction
  • Steering
  Reading: Wu*, Geiger* et al. (2023): Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
  Presentation: Wu*, Geiger* et al. (2023)

Nov 4, 2025: Mechanistic Interpretability, pt. 3, cont.
  Reading: Marks et al. (2025): Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
  Presentation: Marks et al. (2025)

Nov 6, 2025: Training dynamics
  • Grokking
  • Emergence and phase transitions
  Reading: Power et al. (2022): Grokking: Generalization Beyond Overfitting
  Presentation: Power et al. (2022)

Nov 11, 2025: Training dynamics, cont.
  Reading: Chen et al. (2024): Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs
  Presentation: Chen et al. (2024)

Nov 13, 2025: Inherently interpretable models (Midway report due)
  • Additive models
  • Decision trees
  • Decision sets
  Readings: Rudin (2019): Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead; and Chen et al. (2018): An Interpretable Model with Globally Consistent Explanations for Credit Risk
  No student presentations

Nov 18, 2025: Inherently interpretable models, cont.
  Reading: Lakkaraju et al. (2016): Interpretable Decision Sets: A Joint Framework for Description and Prediction
  Presentation: Lakkaraju et al. (2016)

Nov 20, 2025: Applications
  • Safety
  Reading: Lee et al. (2024): A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
  Presentation: Lee et al. (2024)

Nov 25, 2025: No class (Thanksgiving 🦃)

Nov 27, 2025: No class (Thanksgiving 🦃)

Dec 2, 2025: Applications, cont.
  • Bias and fairness
  Reading: Karvonen & Marks (2025): Robustly Improving LLM Fairness in Realistic Settings via Interpretability
  Presentation: Karvonen & Marks (2025)

Dec 4, 2025: Class-chosen cutting-edge topic
  Reading: TBD
  Presentation: TBD

Dec 9, 2025: Poster day!

Dec 18, 2025: Final report due (midnight)
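
As referenced in the Sep 9 entry, here is a minimal sketch of a vanilla gradient saliency map in the style of Simonyan et al. (2014): backpropagate a class score to the input pixels and visualize the gradient magnitudes. This assumes PyTorch, and the tiny convolutional classifier below is a hypothetical stand-in; in practice you would load a pretrained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny stand-in image classifier (hypothetical); in practice, load a
# pretrained model, e.g. from torchvision.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
model.eval()

image = torch.randn(1, 3, 32, 32, requires_grad=True)
target_class = 3

# Backpropagate the target class score to the input pixels.
score = model(image)[0, target_class]
score.backward()

# Saliency: per-pixel gradient magnitude, maximized over color channels.
# Large values mark pixels whose perturbation most affects the class score.
saliency = image.grad.abs().max(dim=1).values  # shape (1, 32, 32)
print(saliency.shape, saliency.max())
```

Grad-CAM and integrated gradients, covered in the two classes that follow, build on this same gradient signal.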

For our final week of class, let's focus on something that's currently the talk of the field. Possible topics could include:


Class format and preparation

Throughout the semester, 25 classes will be dedicated to discussing research papers, one paper per class; 20 of these discussions will be led by students. For classes with student presentations, 4 students (the student panel) will present the paper and lead the discussion, with follow-up questions from the audience (the questioners). These papers and their presentation dates can be found in the course schedule above.

Grading

The course is graded out of 100 total points.

Presentations and discussion: 50 points


Research Project: 50 points

This is an open-ended project where you will work in teams of 2–3 students to design and execute an interpretability project. The goal is to demonstrate your understanding of the tools, literature, and challenges in the field. Creativity is encouraged! The project has five main milestones, all due at 11:59pm on their respective deadlines.

  1. Project proposal (7 points): A 2-page description of what you intend to do, including datasets, methods, and experiments. Due Oct. 2.
  2. Project proposal revision (3 points): Prof. Mueller will provide feedback to help teams find concrete research ideas. After receiving feedback, you will revise and resubmit your plan. Due Oct. 16.
  3. Midway progress report (10 points): By this point, you should have run a few experiments, and have a fleshed-out plan for the rest of the project based on what did or didn't work. This should be a 4-to-5-page report in NeurIPS format describing the progress you've made, the experiments you've run, the results you've obtained, and how you plan to handle the rest of the project. While this is called “midway”, ideally you should be somewhat beyond halfway by this point! (Pivoting to a new direction from here is ok if things aren't working the way you initially hoped.) Due Nov. 13.
  4. Poster presentation (10 points): All students will present their findings at a poster session on the last day of class, Dec. 9.
  5. Final report (20 points): By the final report, you should have written additional code and carried out further experiments beyond the midway report. The paper should be written in the standard NeurIPS conference paper format (between 5 and 8 pages; longer is not necessarily better!). Use this NeurIPS template. Students in groups are required to include a “Contributions” section at the end concretely listing each author's contributions. References and the Contributions section do not count toward the page limit. The final report should concisely summarize your findings and answer the following questions:
    1. What problem are you addressing?
    2. What approach did you take to address the problem, and why?
    3. How did you evaluate the performance of the approach(es) you investigated?
    4. What worked or didn't work? Do you have any guesses as to why?
    Due Dec. 18.
Grading of the final project will be based on the following:

Can We Publish Our Final Project? It is feasible to convert a course project into an academic publication, but it can take a lot of work! I encourage those interested to discuss this with me after the semester.


Policies and Conduct

Collaboration Policy

In machine learning, research is fundamentally collaborative at every step of the process, which is why this course regularly involves group work and group discussion. In that spirit, collaborative reading is also allowed and encouraged. When you work collaboratively on your nightly reactions, you must acknowledge your collaborators by listing them explicitly when filling out your reaction form. Feel free to ask other students your questions and workshop them before class.


Outside Resources & AI Policy

I strongly encourage you to use any outside source at your disposal when reading the papers and doing your final project. Your diagrams, slides, questions, implementations, and reports should be original, but you may take inspiration from existing resources as long as you give them proper credit. When doing your project, feel free to base your implementations on publicly available code as well (as long as you make significant modifications to accommodate your original idea), but be sure to give proper credit in your report and your GitHub README if you do so.

I support the use of AI systems as tools, but not as crutches or replacements for fundamental learning. What's the difference? AI as a tool includes:

AI as a crutch/replacement includes employing AI to substantially write your reactions or substantially complete your final project; doing so will be considered an academic integrity violation. The line between tool and replacement can be blurry, so if you're unsure, I recommend asking! The waitlist for the course is also quite long and full of students whose work is directly related to the course, so if you had planned to use AI to do most of the assignments for you, please consider dropping to make room for the folks who are enthusiastic to engage deeply with the content!

Failing to properly cite an outside source is equivalent to taking credit for ideas that are not your own, which is plagiarism. This leads us to...


Academic Integrity

Read through BU's Academic Conduct Code. All students are expected to abide by these guidelines. In the context of this class, it's particularly important that you cite the source of your ideas, facts, and/or methods, and do not claim someone else's work as your own. This goes for the final project and for the nightly reactions.


Absence and Late Work Policy

Attendance and participation form a large part of the grade for this course. I understand that students often cannot attend every class, so there is some flexibility baked into the course grading. You will not need to attend every single class to achieve the highest possible grade for class participation—but you will need to attend most of them!

If you do not complete a nightly reaction by midnight before class, you will receive a 0 for that reaction. If you miss a class where you would be in a non-presenting role, then to get nightly reaction credit, you'll need to complete the question assignment and upload it before the start of class. It's crucial that we're all reading the same papers at the same time, so there's no graceful way to accept late work for the readings. If you miss a class where you are in a presenting role, you must find another student to trade presentation slots with. To do this, send me an email (with the student you're trading with included in the email chain) explaining who is trading and which days are being swapped; the other student must confirm the trade at least 2 days before the presentation. If you are joining the class from the waitlist, please let me know ASAP and we'll help you fill your presentation slots.

For the project proposal, proposal revision, and midway report, each late day incurs a 1-point deduction, and no credit will be given more than 5 days after the deadline. Late submissions will not be accepted for the final project report, and the final poster presentation cannot easily be made up.


In-class Conduct

Let's all follow the NeurIPS code of conduct and the Recurse Center Social Rules. As in many research environments, people are coming from many different backgrounds and levels of experience with the material. Therefore, it's especially important for our learning that we maintain respect for everyone's perspective and input. I value the perspectives of individuals from all backgrounds. I broadly define diversity to include race, gender identity, national origin, ethnicity, religion, social class, age, sexual orientation, political background, and physical or learning ability. I will strive to make this classroom an inclusive space for all students; please let me know if there's anything I can do to improve. On that note...


Accommodations

Boston University's policy is to provide reasonable accommodations to students with qualifying disabilities who are enrolled in Boston University courses. Students seeking accommodations must engage in an interactive process with, and provide appropriate documentation of their disability to, Disability & Access Services (DAS). If this applies, please get in touch with me as soon as possible to discuss accommodations; note that students are not required to disclose information regarding their disability, if applicable, but should request approval for such accommodations through DAS beforehand.


Religious Observance

Students are permitted to be absent from class, including classes involving examinations, labs, excursions, and other special events, for purposes of religious observance. In-class, take-home and lab assignments, and other work shall be made up in consultation with the student's instructors. More details on BU's religious observance policy are available here.