Toward Cognitive Models in Explainable AI: Mechanistic Interpretability as a Missing Link?


Date
Sep 20, 2023 — Sep 23, 2023
Location
Belgrade, Serbia

Because Explainable AI faces challenges similar to those facing cognitive science, the explanatory strategies of the latter may be useful guides for the former. In cognitive science, top-down and bottom-up strategies are used to create cognitive models, which describe cognitive processes as computational algorithms. Such models, however, remain elusive in XAI. Mechanistic interpretability is an approach that identifies interpretable structure within a network in order to explain its global behavior. By studying a network’s internal parameters, this method aims to determine the algorithm the network has learned. The approach has taken first steps toward explaining models for machine vision and natural language processing. Although preliminary, this work resembles cognitive modeling efforts in cognitive science by combining top-down behavioral observations with bottom-up investigations of the underlying mechanisms. Therefore, mechanistic interpretability deserves closer philosophical scrutiny.

Collaborative work with Carlos Zednik.

Céline Budding
PhD candidate in Philosophy of AI

I am a PhD candidate at Eindhoven University of Technology, where I investigate what large language models know about language.