Probing Large Language Model Hidden States for Adverse Drug Reaction Knowledge
Published in Artificial Intelligence in Medicine (AIME 2025), 2025
Large language models (LLMs) integrate knowledge from diverse sources into a single set of internal weights. However, these representations are difficult to interpret, complicating our understanding of the models’ learning capabilities. Sparse autoencoders (SAEs) linearize LLM embeddings, creating monosemantic features that both provide insight into the model’s comprehension and simplify downstream machine learning tasks. These features are especially important in biomedical applications where explainability is critical. Here, we evaluate the use of Gemma Scope SAEs to identify how LLMs store known facts involving adverse drug reactions (ADRs).
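
To make the probing setup concrete, here is a minimal sketch of how an SAE in the Gemma Scope style maps an LLM hidden state to sparse, monosemantic feature activations. The dimensions, JumpReLU threshold, and weights below are illustrative placeholders, not the actual Gemma Scope parameters or the paper's exact pipeline.

```python
import torch

# Hypothetical dimensions: Gemma-style residual-stream width and a 16k-feature SAE.
d_model, d_sae = 2304, 16384

# Placeholder encoder parameters (real Gemma Scope SAEs ship trained weights).
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)
theta = torch.full((d_sae,), 0.1)  # per-feature JumpReLU threshold

def encode(x: torch.Tensor) -> torch.Tensor:
    """Map a hidden state x of shape (d_model,) to sparse features of shape (d_sae,)."""
    pre = x @ W_enc + b_enc
    return pre * (pre > theta)  # JumpReLU: zero out sub-threshold activations

hidden = torch.randn(d_model)  # stand-in for an LLM hidden state at some layer
features = encode(hidden)
print(f"{(features != 0).sum().item()} of {d_sae} features active")
```

The resulting sparse feature vector is what makes downstream probing tractable: individual nonzero entries can be inspected for interpretable meaning, and the vector itself can feed a simple classifier for tasks such as detecting stored ADR facts.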
Recommended citation: Berkowitz, J. et al. (2025). Probing Large Language Model Hidden States for Adverse Drug Reaction Knowledge. In: Bellazzi, R., Juarez Herrero, J.M., Sacchi, L., Zupan, B. (eds) Artificial Intelligence in Medicine. AIME 2025. Lecture Notes in Computer Science, vol 15734. Springer, Cham. https://doi.org/10.1007/978-3-031-95838-0_6
