Because Explainable AI (XAI) faces challenges similar to those facing cognitive science, the explanatory strategies of the latter may serve as useful guides for the former. Cognitive scientists use top-down and bottom-up strategies to create cognitive models, which describe cognitive processes as computational algorithms. In XAI, however, such models remain elusive. Mechanistic interpretability is an approach that identifies interpretable structure within a network in order to explain its global behavior: by studying a network’s internal parameters, it seeks to determine the algorithm the network has learned. This approach has taken first steps toward explaining models in machine vision and natural language processing. Although preliminary, this work resembles cognitive modeling in cognitive science insofar as it combines top-down behavioral observations with bottom-up investigations of the underlying mechanisms. Mechanistic interpretability therefore deserves closer philosophical scrutiny.
Collaborative work with Carlos Zednik.