What the DAAM: Interpreting Stable Diffusion Using Cross Attention
Tang, Raphael,
Liu, Linqing,
Pandey, Akshat,
Jiang, Zhiying,
Yang, Gefei,
Kumar, Karun,
Stenetorp, Pontus,
Lin, Jimmy,
and Ture, Ferhan
In Proceedings of the Association for Computational Linguistics (ACL), Best Paper Award,
2023
Diffusion models are a milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce attribution maps, we upscale and aggregate cross-attention maps in the denoising module, naming our method DAAM. We validate it by testing its segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. On two generated datasets, we attain a competitive 58.8–64.8 mIoU on noun segmentation and fair to good mean opinion scores (3.4–4.2) on generalized attribution. Then, we apply DAAM to study the role of syntax in the pixel space across head–dependent heat map interaction patterns for ten common dependency relations. We show that, for some relations, the head map consistently subsumes the dependent, while the opposite is true for others. Finally, we study several semantic phenomena, focusing on feature entanglement; we find that the presence of cohyponyms worsens generation quality by 9%, and descriptive adjectives attend too broadly. We are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future research. Our code is at https://github.com/castorini/daam.
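To make the upscale-and-aggregate idea concrete, here is a minimal PyTorch sketch, not the authors' exact implementation: the input shape, the bilinear interpolation mode, and the max-normalization are simplifying assumptions made for illustration.

```python
# Illustrative DAAM-style aggregation. Assumes cross-attention maps have
# already been captured per (layer, timestep) during denoising; the shape
# convention and interpolation mode below are assumptions, not the paper's spec.
import torch
import torch.nn.functional as F

def aggregate_word_heat_map(cross_attn_maps, token_idx, out_size=512):
    """Aggregate cross-attention into a per-word attribution map.

    cross_attn_maps: list of tensors of shape (heads, h, w, num_tokens),
        one per (layer, timestep) pair captured during denoising.
    token_idx: index of the word's token in the prompt.
    out_size: side length of the generated image in pixels.
    """
    heat = torch.zeros(out_size, out_size)
    for attn in cross_attn_maps:
        # Select this token's attention scores: (heads, h, w).
        word_attn = attn[..., token_idx]
        # Upscale each head's low-resolution map to image resolution.
        upscaled = F.interpolate(
            word_attn.unsqueeze(1),          # (heads, 1, h, w)
            size=(out_size, out_size),
            mode='bilinear',
            align_corners=False,
        ).squeeze(1)                          # (heads, out_size, out_size)
        # Sum over heads, accumulating across layers and timesteps.
        heat += upscaled.sum(dim=0)
    # Normalize to [0, 1] for visualization.
    return heat / heat.max()
```

The released code at the repository linked above packages this bookkeeping behind a tracing interface for Stable Diffusion pipelines, so the per-layer, per-timestep attention capture shown here as a given input is handled automatically.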