AI systems have advanced remarkably in recent years, but they have also grown more opaque. For example, ChatGPT and DeepSeek seem intelligent on the surface, but what have they really learned? Why and how do they make the decisions that they do? The ability to answer questions like these is critical in applications like healthcare or legal systems. This research seminar course explores the field of interpretable machine learning, which seeks to understand the internal computational processes of machine learning models—and sometimes, to precisely control these processes. We will cover foundational and cutting-edge topics, including distributed representations, attribution methods, and the emerging field of mechanistic interpretability. We will also cover open problems, such as interpretability illusions and challenges in evaluation. Students will read and present research papers, lead and participate in discussions on these topics, and conduct an interpretability research project.
Prerequisites
None are strictly required, but it will be very useful to have taken at least one of Deep Learning, Natural Language Processing, Computer Vision, or Multimodal Machine Learning. Here are some review materials that may be helpful:
Theme: This semester's general theme is mechanistic interpretability for neural networks. We will focus on understanding neural networks by understanding the computations implemented in their components, rather than just their input-output behaviors.
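To make this component-level view concrete, here is a minimal sketch (not part of the course materials) of the kind of intervention we will study: ablating a single internal unit of a model and observing how the output changes, rather than only inspecting input-output behavior. The toy architecture, layer index, and unit index below are illustrative assumptions.

```python
# A minimal sketch, assuming a toy PyTorch MLP stands in for a real model.
# We intervene on an internal component (one hidden unit) and compare the
# model's outputs with and without the intervention.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network; its hidden units play the role of "components".
model = nn.Sequential(
    nn.Linear(4, 8),   # hidden layer
    nn.ReLU(),
    nn.Linear(8, 2),   # output logits
)

x = torch.randn(1, 4)
baseline = model(x)

# Forward hook that zeroes out hidden unit 3 (an arbitrary, illustrative choice).
def ablate_unit(module, inputs, output):
    output = output.clone()
    output[:, 3] = 0.0
    return output  # returned value replaces the layer's output

handle = model[0].register_forward_hook(ablate_unit)
ablated = model(x)
handle.remove()

# A large shift in the logits suggests the ablated unit carries information
# the model uses for this input; little change suggests it may be redundant.
print("baseline logits:", baseline.detach())
print("ablated logits: ", ablated.detach())
print("change:         ", (ablated - baseline).detach())
```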
Learning objectives
Students will:
Note: we will almost definitely alter this schedule! Order may also change depending on the availability of guests.
| Date | Note | Topic | Readings | Student Presentation |
|---|---|---|---|---|
| Sep 2, 2025 | | Course introduction | Recommended: Chen et al. (2024): Designing a Dashboard for Transparency and Control of Conversational AI | No student presentations |
| Sep 4, 2025 | | | Rumelhart et al. (1986): Learning representations by back-propagating errors | No student presentations |
| Sep 9, 2025 | | Visualization | Simonyan et al. (2014): Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps | Simonyan et al. (2014) |
| Sep 11, 2025 | | Feature attribution | Selvaraju et al. (2019): Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization | Selvaraju et al. (2019) |
| Sep 16, 2025 | | | Sundararajan et al. (2017): Axiomatic Attribution for Deep Networks | Sundararajan et al. (2017) |
| Sep 18, 2025 | | Credit assignment | Lundberg & Lee (2017): A Unified Approach to Interpreting Model Predictions; Recommended: Ribeiro et al. (2016): "Why Should I Trust You?": Explaining the Predictions of Any Classifier | Lundberg & Lee (2017) |
| Sep 23, 2025 | | Influence functions | Koh & Liang (2017): Understanding Black-box Predictions via Influence Functions | Koh & Liang (2017) |
| Sep 25, 2025 | | Component localization | Vig et al. (2020): Investigating Gender Bias in Language Models Using Causal Mediation Analysis; Recommended: Thorpe (1989): Local vs. Distributed Coding | Vig et al. (2020) |
| Sep 30, 2025 | | Understanding attention | Vaswani et al. (2017): Attention Is All You Need, and Clark et al. (2019): What Does BERT Look at? An Analysis of BERT's Attention | No student presentations |
| Oct 2, 2025 | Project proposal due | | Jain & Wallace (2019): Attention Is Not Explanation, and Wiegreffe & Pinter (2019): Attention Is Not Not Explanation | Jain & Wallace (2019) or Wiegreffe & Pinter (2019) |
| Oct 7, 2025 | | Probing | Tenney et al. (2019): What Do You Learn from Context? Probing for Sentence Structure in Contextualized Word Representations | Tenney et al. (2019) |
| Oct 9, 2025 | | | Hewitt & Liang (2019): Designing and Interpreting Probes with Control Tasks | Hewitt & Liang (2019) |
| Oct 14, 2025 | No Class - Monday schedule | | | |
| Oct 16, 2025 | Project proposal revision due | Mechanistic Interpretability - Basics | Wang et al. (2023): Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small; Recommended: Saphra & Wiegreffe (2024): Mechanistic? | Wang et al. (2023) |
| Oct 21, 2025 | | | Olsson et al. (2022): In-context Learning and Induction Heads | Olsson et al. (2022) |
| Oct 23, 2025 | | Mechanistic Interpretability, pt. 2 | Meng et al. (2022): Locating and Editing Factual Associations in GPT | Meng et al. (2022) |
| Oct 28, 2025 | | | Ravfogel et al. (2020): Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection | Ravfogel et al. (2020) |
| Oct 30, 2025 | | Mechanistic Interpretability, pt. 3 | Wu*, Geiger* et al. (2023): Interpretability at Scale: Identifying Causal Mechanisms in Alpaca | Wu*, Geiger* et al. (2023) |
| Nov 4, 2025 | | | Marks et al. (2025): Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | Marks et al. (2025) |
| Nov 6, 2025 | | Training dynamics | Power et al. (2022): Grokking: Generalization Beyond Overfitting | Power et al. (2022) |
| Nov 11, 2025 | | | Chen et al. (2024): Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs | Chen et al. (2024) |
| Nov 13, 2025 | Midway report due | Inherently interpretable models | Rudin (2019): Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead, and Chen et al. (2018): An Interpretable Model with Globally Consistent Explanations for Credit Risk | No student presentations |
| Nov 18, 2025 | | | Lakkaraju et al. (2016): Interpretable Decision Sets: A Joint Framework for Description and Prediction | Lakkaraju et al. (2016) |
| Nov 20, 2025 | | Applications | Lee et al. (2024): A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity | Lee et al. (2024) |
| Nov 25, 2025 | No Class - Thanksgiving 🦃 | | | |
| Nov 27, 2025 | No Class - Thanksgiving 🦃 | | | |
| Dec 2, 2025 | | Applications, cont. | Karvonen & Marks (2025): Robustly Improving LLM Fairness in Realistic Settings via Interpretability | Karvonen & Marks (2025) |
| Dec 4, 2025 | | Class-chosen cutting-edge topic | TBD | TBD |
| Dec 9, 2025 | Poster day! | | | |
| Dec 18, 2025 | Final report due (midnight) | | | |
For our final week of class, let's focus on something that's currently the talk of the field. Possible topics could include:
Throughout the semester, 25 class sessions will each be dedicated to discussing one of the research papers, and 20 of those discussions will be led by students. For classes with student presentations, 4 students (the student panel) will present the paper and lead the discussion, with follow-up questions from the audience (the questioners). These papers and their presentation dates can be found in the course schedule above.
The course is graded out of 100 total points.
This is an open-ended project where you will work in teams of 2–3 students to design and execute an interpretability project. The goal is to demonstrate your understanding of the tools, literature, and challenges in the field. Creativity is encouraged! The project has five main milestones, all due at 11:59pm on their respective deadlines.
Can We Publish Our Final Project? It is feasible to convert a course project into an academic publication, but it can take a lot of work! I encourage those interested to discuss this with me after the semester.
In machine learning, research is fundamentally collaborative at every step of the process. This is why the course regularly involves group work and group discussion. To this end, collaborative reading is also allowed and encouraged. When you collaboratively work on your nightly reactions, you must acknowledge your collaborators by listing them explicitly when filling out your reaction form. Feel free to ask other students your questions and workshop them before class.
I strongly encourage you to use any outside source at your disposal when reading the papers and doing your final project. Your diagrams, slides, questions, implementations, and reports should be original, but you may take inspiration from existing resources as long as you give them proper credit. When doing your project, feel free to base your implementations on publicly available code as well (as long as you make significant modifications to accommodate your original idea), but be sure to give proper credit in your report and your GitHub README if you do so.
I support the use of AI systems as tools, but not as crutches or replacements for fundamental learning. What's the difference? AI as a tool includes:
Failing to properly cite an outside source is equivalent to taking credit for ideas that are not your own, which is plagiarism. This leads us to...
Read through BU's Academic Conduct Code. All students are expected to abide by these guidelines. In the context of this class, it's particularly important that you cite the source of your ideas, facts, and/or methods, and do not claim someone else's work as your own. This goes for the final project and for the nightly reactions.
Attendance and participation form a large part of the grade for this course. I understand that students often cannot attend every class, so there is some flexibility baked into the course grading. You will not need to attend every single class to achieve the highest possible grade for class participation—but you will need to attend most of them!
If you do not complete a nightly reaction by midnight before class, you will receive a 0 for that reaction. If you miss a class where you would be in a non-presenting role, then to get nightly reaction credit, you'll need to complete the question assignment and upload it before the start of class. It's crucial that we're all reading the same papers at the same time, so there's no graceful way to accept late work for the readings. If you miss a class where you are in a presenting role, you must find another student to trade presentation slots with. To do this, send an email to me (including the student you're trading with on the email chain) that explains who is trading and which dates are being swapped; the other student must confirm the trade at least 2 days before the presentation. If you are joining class from the waitlist, please let me know ASAP and we'll help you fill out your presentation slots.
For the project proposal, proposal revision, and midway report, each late day costs 1 point, and no credit will be given more than 5 days after the deadline. Late submissions will not be accepted for the final project report, and the final poster presentation cannot easily be made up.
Let's all follow the NeurIPS code of conduct and the Recurse Center Social Rules. As in many research environments, people are coming from many different backgrounds and levels of experience with the material. Therefore, it's especially important for our learning that we maintain respect for everyone's perspective and input. I value the perspectives of individuals from all backgrounds. I broadly define diversity to include race, gender identity, national origin, ethnicity, religion, social class, age, sexual orientation, political background, and physical or learning ability. I will strive to make this classroom an inclusive space for all students; please let me know if there's anything I can do to improve. On that note...
Boston University's policy is to provide reasonable accommodations to students with qualifying disabilities who are enrolled in Boston University courses. Students seeking accommodations must engage in an interactive process with, and provide appropriate documentation of their disability to, Disability & Access Services (DAS). If this applies to you, please get in touch with me as soon as possible to discuss accommodations; note that you are not required to disclose details of your disability to me, but you should request approval for accommodations through DAS beforehand.
Students are permitted to be absent from class, including classes involving examinations, labs, excursions, and other special events, for purposes of religious observance. In-class, take-home and lab assignments, and other work shall be made up in consultation with the student's instructors. More details on BU's religious observance policy are available here.