AI systems have advanced remarkably in recent years, but they have also grown more opaque. For example, ChatGPT and DeepSeek seem intelligent on the surface, but what have they really learned? Why and how do they make the decisions that they do? The ability to answer questions like these is critical in applications like healthcare or legal systems. This research seminar course explores the field of interpretable machine learning, which seeks to understand the internal computational processes of machine learning models—and sometimes, to precisely control these processes. We will cover foundational and cutting-edge topics, including distributed representations, attribution methods, and the emerging field of mechanistic interpretability. We will also cover open problems, such as interpretability illusions and challenges in evaluation. Students will read and present research papers, lead and participate in discussions on these topics, and conduct an interpretability research project.
Prerequisites
None are strictly required, but it will be very useful to have taken at least one of Deep Learning, Natural Language Processing, Computer Vision, or Multimodal Machine Learning. Here are some review materials that may be helpful:
Theme: This semester's general theme is mechanistic interpretability for neural networks. We will focus on understanding neural networks by understanding the computations implemented in their components, rather than just their input-output behaviors.
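To make this component-level view concrete, here is a minimal sketch (not part of the course materials) of the kind of intervention we will study: ablating a single internal unit of a model and observing how the output changes, rather than only inspecting input-output behavior. The toy architecture, layer index, and unit index below are illustrative assumptions.

```python
# A minimal sketch, assuming a toy PyTorch MLP stands in for a real model.
# We intervene on an internal component (one hidden unit) and compare the
# model's outputs with and without the intervention.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network; its hidden units play the role of "components".
model = nn.Sequential(
    nn.Linear(4, 8),   # hidden layer
    nn.ReLU(),
    nn.Linear(8, 2),   # output logits
)

x = torch.randn(1, 4)
baseline = model(x)

# Forward hook that zeroes out hidden unit 3 (an arbitrary, illustrative choice).
def ablate_unit(module, inputs, output):
    output = output.clone()
    output[:, 3] = 0.0
    return output  # returned value replaces the layer's output

handle = model[0].register_forward_hook(ablate_unit)
ablated = model(x)
handle.remove()

# A large shift in the logits suggests the ablated unit carries information
# the model uses for this input; little change suggests it may be redundant.
print("baseline logits:", baseline.detach())
print("ablated logits: ", ablated.detach())
print("change:         ", (ablated - baseline).detach())
```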
Learning objectives
Students will:
Note: we will almost definitely alter this schedule! Order may also change depending on the availability of guests.
| Date | Note | Topic | Readings | Student Presentation |
|---|---|---|---|---|
| Sep 2, 2025 | | Course introduction | Recommended: Chen et al. (2024): Designing a Dashboard for Transparency and Control of Conversational AI | No student presentations |
| Sep 4, 2025 | | | Rumelhart et al. (1986): Learning representations by back-propagating errors | No student presentations |
| Sep 9, 2025 | | Visualization | Simonyan et al. (2014): Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps | Simonyan et al. (2014) |
| Sep 11, 2025 | | Feature attribution | Selvaraju et al. (2019): Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization | Selvaraju et al. (2019) |
| Sep 16, 2025 | | | Sundararajan et al. (2017): Axiomatic Attribution for Deep Networks | Sundararajan et al. (2017) |
| Sep 18, 2025 | | Credit assignment | Lundberg & Lee (2017): A Unified Approach to Interpreting Model Predictions; Recommended: Ribeiro et al. (2016): "Why Should I Trust You?": Explaining the Predictions of Any Classifier | Lundberg & Lee (2017) |
| Sep 23, 2025 | | Influence functions | Koh & Liang (2017): Understanding Black-box Predictions via Influence Functions | Koh & Liang (2017) |
| Sep 25, 2025 | | Component localization | Vig et al. (2020): Investigating Gender Bias in Language Models Using Causal Mediation Analysis; Recommended: Thorpe (1989): Local vs. Distributed Coding | Vig et al. (2020) |
| Sep 30, 2025 | | Understanding attention | Vaswani et al. (2017): Attention Is All You Need, and Clark et al. (2019): What Does BERT Look at? An Analysis of BERT's Attention | No student presentations |
| Oct 2, 2025 | Project proposal due | | Jain & Wallace (2019): Attention Is Not Explanation, and Wiegreffe & Pinter (2019): Attention Is Not Not Explanation | Jain & Wallace (2019) or Wiegreffe & Pinter (2019) |
| Oct 7, 2025 | | Probing | Tenney et al. (2019): What Do You Learn from Context? Probing for Sentence Structure in Contextualized Word Representations | Tenney et al. (2019) |
| Oct 9, 2025 | | | Hewitt & Liang (2019): Designing and Interpreting Probes with Control Tasks | Hewitt & Liang (2019) |
| Oct 14, 2025 | No Class - Monday schedule | | | |
| Oct 16, 2025 | Project proposal revision due | Mechanistic Interpretability - Basics | Wang et al. (2023): Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small; Recommended: Saphra & Wiegreffe (2024): Mechanistic? | Wang et al. (2023) |
| Oct 21, 2025 | | | Olsson et al. (2022): In-context Learning and Induction Heads | Olsson et al. (2022) |
| Oct 23, 2025 | | Mechanistic Interpretability, pt. 2 | Meng et al. (2022): Locating and Editing Factual Associations in GPT | Meng et al. (2022) |
| Oct 28, 2025 | | | Ravfogel et al. (2020): Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection | Ravfogel et al. (2020) |
| Oct 30, 2025 | | Mechanistic Interpretability, pt. 3 | Wu*, Geiger* et al. (2023): Interpretability at Scale: Identifying Causal Mechanisms in Alpaca | Wu*, Geiger* et al. (2023) |
| Nov 4, 2025 | | | Marks et al. (2025): Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | Marks et al. (2025) |
| Nov 6, 2025 | | Training dynamics | Power et al. (2022): Grokking: Generalization Beyond Overfitting | Power et al. (2022) |
| Nov 11, 2025 | | | Chen et al. (2024): Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs | Chen et al. (2024) |
| Nov 13, 2025 | Midway report due | Inherently interpretable models | Rudin (2019): Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead, and Chen et al. (2018): An Interpretable Model with Globally Consistent Explanations for Credit Risk | No student presentations |
| Nov 18, 2025 | | | Lakkaraju et al. (2016): Interpretable Decision Sets: A Joint Framework for Description and Prediction | Lakkaraju et al. (2016) |
| Nov 20, 2025 | | Applications | Lee et al. (2024): A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity | Lee et al. (2024) |
| Nov 25, 2025 | No Class - Thanksgiving 🦃 | | | |
| Nov 27, 2025 | No Class - Thanksgiving 🦃 | | | |
| Dec 2, 2025 | | Applications, cont. | Karvonen & Marks (2025): Robustly Improving LLM Fairness in Realistic Settings via Interpretability | Karvonen & Marks (2025) |
| Dec 4, 2025 | | Class-chosen cutting-edge topic | TBD | TBD |
| Dec 9, 2025 | Poster day! | | | |
| Dec 18, 2025 | Final report due (midnight) | | | |
For our final week of class, let's focus on something that's currently the talk of the field. Possible topics could include:
Throughout the semester, 25 class sessions will each be dedicated to discussing one of the research papers, and 20 of those discussions will be led by students. For classes with student presentations, 4 students (the student panel) will present the paper and lead the discussion, with follow-up questions from the audience (the questioners). These papers and their presentation dates can be found in the course schedule above.
The course is graded out of 100 total points.
This is an open-ended project where you will work in teams of 2–3 students to design and execute an interpretability project. The goal is to demonstrate your understanding of the tools, literature, and challenges in the field. Creativity is encouraged! The project has five main milestones, all due at 11:59pm on their respective deadlines.
Can We Publish Our Final Project? It is feasible to convert a course project into an academic publication, but it can take a lot of work! I encourage those interested to discuss this with me after the semester.
In machine learning, research is fundamentally collaborative at every step of the process. This is why the course regularly involves group work and group discussion. To this end, collaborative reading is also allowed and encouraged. When you collaboratively work on your nightly reactions, you must acknowledge your collaborators by listing them explicitly when filling out your reaction form. Feel free to ask other students your questions and workshop them before class.
I strongly encourage you to use any outside source at your disposal when reading the papers and doing your final project. Your diagrams, slides, questions, implementations, and reports should be original, but you may take inspiration from existing resources as long as you give them proper credit. When doing your project, feel free to base your implementations on publicly available code as well (as long as you make significant modifications to accommodate your original idea), but be sure to give proper credit in your report and your GitHub README if you do so.
I support the use of AI systems as tools, but not as crutches or replacements for fundamental learning. What's the difference? AI as a tool includes:
Failing to properly cite an outside source is equivalent to taking credit for ideas that are not your own, which is plagiarism. This leads us to...
Read through BU's Academic Conduct Code. All students are expected to abide by these guidelines. In the context of this class, it's particularly important that you cite the source of your ideas, facts, and/or methods, and do not claim someone else's work as your own. This goes for the final project and for the nightly reactions.
Attendance and participation form a large part of the grade for this course. I understand that students often cannot attend every class, so there is some flexibility baked into the course grading. You will not need to attend every single class to achieve the highest possible grade for class participation—but you will need to attend most of them!
If you do not complete a nightly reaction by midnight before class, you will receive a 0 for that reaction. If you miss a class where you would be in a non-presenting role, then to get nightly reaction credit, you'll need to complete the question assignment and upload it before the start of class. It's crucial that we're all reading the same papers at the same time, so there's no graceful way to accept late work for the readings. If you miss a class where you are in a presenting role, you must find another student to trade presentation slots with. To do this, send an email to me (including the student you're trading with on the email chain) that explains who is trading and which dates are being swapped; the other student must confirm the trade at least 2 days before the presentation. If you are joining class from the waitlist, please let me know ASAP and we'll help you fill out your presentation slots.
For the project proposal, proposal revision, and midway report, each late day costs 1 point, and no credit will be given more than 5 days after the deadline. Late submissions will not be accepted for the final project report, and the final poster presentation cannot easily be made up.
Let's all follow the NeurIPS code of conduct and the Recurse Center Social Rules. As in many research environments, people are coming from many different backgrounds and levels of experience with the material. Therefore, it's especially important for our learning that we maintain respect for everyone's perspective and input. I value the perspectives of individuals from all backgrounds. I broadly define diversity to include race, gender identity, national origin, ethnicity, religion, social class, age, sexual orientation, political background, and physical or learning ability. I will strive to make this classroom an inclusive space for all students; please let me know if there's anything I can do to improve. On that note...
Boston University's policy is to provide reasonable accommodations to students with qualifying disabilities who are enrolled in Boston University courses. Students seeking accommodations must engage in an interactive process with, and provide appropriate documentation of their disability to, Disability & Access Services (DAS). If this applies to you, please get in touch with me as soon as possible to discuss accommodations; note that you are not required to disclose details of your disability to me, but you should request approval for accommodations through DAS beforehand.
Students are permitted to be absent from class, including classes involving examinations, labs, excursions, and other special events, for purposes of religious observance. In-class, take-home and lab assignments, and other work shall be made up in consultation with the student's instructors. More details on BU's religious observance policy are available here.