Mechanistic Interpretability for Large Neural Networks

Motivation

Despite the growing deployment of increasingly large and capable machine learning models in real-world applications, our understanding of how these models function internally remains alarmingly limited. This lack of insight presents challenges in high-stakes scenarios, where trust in the models’ reliability is critical, and makes it difficult to predict or address instances of undesirable behavior.

Mechanistic interpretability offers a promising solution to this problem. By reverse engineering the algorithms that neural networks use, this approach seeks to uncover human-understandable mechanisms behind their capabilities. Through analyzing the weights and activations of neural networks, mechanistic interpretability aims to identify the circuits responsible for specific behaviors, providing a clearer view of how these models operate at a deeper level [Cammarata et al., 2020; Elhage et al., 2021].

Scientific Questions

We study the internal reasoning processes of models in two promising areas: computer vision and language modeling.

In the computer vision domain, we examine the cross-attention mechanism to better understand how text-to-image generation operates. Our goal is to uncover the processes underlying this mechanism, particularly as they relate to hallucination in large vision-language models (VLMs).
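
For readers unfamiliar with the mechanism, the following is a minimal sketch of single-head cross-attention, assuming the standard scaled dot-product formulation in which image-side features form the queries and text tokens supply the keys and values. All names, shapes, and the random weights are illustrative assumptions, not the project's actual models or code. The returned attention map (one row per image position, one column per text token) is the kind of quantity an interpretability analysis would inspect.

```python
import numpy as np

def cross_attention(image_feats, text_feats, W_q, W_k, W_v):
    """Single-head scaled dot-product cross-attention (illustrative sketch).

    Queries come from image features; keys and values come from text tokens.
    Returns the attended output and the attention map.
    """
    Q = image_feats @ W_q                      # (n_positions, d_head)
    K = text_feats @ W_k                       # (n_tokens, d_head)
    V = text_feats @ W_v                       # (n_tokens, d_head)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (n_positions, n_tokens)
    # Numerically stable softmax over the text-token axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ V, attn

# Hypothetical sizes and random inputs standing in for real model activations.
rng = np.random.default_rng(0)
n_positions, n_tokens, d_img, d_txt, d_head = 16, 5, 8, 12, 4
image_feats = rng.normal(size=(n_positions, d_img))
text_feats = rng.normal(size=(n_tokens, d_txt))
W_q = rng.normal(size=(d_img, d_head))
W_k = rng.normal(size=(d_txt, d_head))
W_v = rng.normal(size=(d_txt, d_head))

out, attn = cross_attention(image_feats, text_feats, W_q, W_k, W_v)
print(out.shape, attn.shape)
```

Each row of `attn` sums to one, so it can be read as how strongly a given image position draws on each text token.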

In the language model domain, we aim to understand how large language models (LLMs) function by using sparse autoencoders. This approach allows us to investigate the underlying mechanisms that govern LLM behaviour and reasoning.
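
As a rough illustration of the idea, the sketch below shows the forward pass and training objective of a sparse autoencoder in one common formulation: a ReLU encoder into an overcomplete dictionary, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. The dimensions, the penalty coefficient, and the random "activations" are assumptions made for the example; they are not taken from the project, and a real study would train the weights on activations recorded from an LLM.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU encoder
    x_hat = features @ W_dec + b_dec                # linear decoder
    return features, x_hat

def sae_loss(x, features, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    reconstruction = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return reconstruction + l1_coeff * sparsity

# Hypothetical sizes: 32 activation vectors of width 16, dictionary of 64 features.
rng = np.random.default_rng(0)
batch, d_model, d_dict = 32, 16, 64
x = rng.normal(size=(batch, d_model))               # stand-in for LLM activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, x_hat, l1_coeff=1e-3)
print(features.shape, x_hat.shape, float(loss))
```

The L1 term pushes most feature activations toward zero, which is what makes individual dictionary features candidates for human-interpretable directions.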

Theses

New theses are regularly advertised in the area of Mechanistic Interpretability for Large Neural Networks. In line with the project description above, the project comprises several subproblems that could make interesting topics for Bachelor’s and Master’s theses and can be discussed in a personal meeting.

Partners

Qihui Feng, M.Sc., Knowledge-based Systems Group, Computer Science 5 – Information Systems and Databases, RWTH Aachen University
Yongli Mou, M.Sc., Computer Science 5 – Information Systems and Databases, RWTH Aachen University

Contact

Er Jin, M.Sc.
+49 241 80 27866
Er.Jin@lfb.rwth-aachen.de
