Understanding How CodeLLMs (Mis)Predict Types with Activation Steering

Francesca Lucchetti and Arjun Guha, 2025

LLMs are in widespread use by software engineers for programming tasks. However, research shows that they are not robust to semantically irrelevant program features: small changes, such as renaming variables, can significantly degrade performance on many programming tasks. In this paper, we focus on the type prediction task: given a partially typed program, can a model predict a missing type annotation such that the resulting program is closer to being fully typed? We first show that models easily go wrong on type prediction when the input program is constructed to be out-of-distribution with respect to the model's training data. This is problematic because LLMs ought to be able to generalize to code that is unlike their training data.
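
To illustrate the task (this example is ours, not drawn from the paper's benchmarks), a type prediction query asks a model to fill in a missing annotation in code such as:

def word_lengths(words: list[str]) -> ____:
    return {w: len(w) for w in words}

Here the expected prediction is dict[str, int], and renaming words to an arbitrary identifier should not change it.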

We then give evidence that models do in fact learn a robust internal mechanism for type prediction in their hidden layers. However, this mechanism often fails to activate on out-of-distribution programs. We show that we can correct this failure using activation steering. Furthermore, we show that this mechanism is shared across two programming languages, Python and TypeScript, and that steering is more effective than in-context learning (ICL). Our extensive empirical evaluation shows that our results hold for five models, including both LLMs pretrained on code and instruction-tuned LLMs.
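
For readers unfamiliar with the technique: activation steering adds a direction vector to a model's hidden states at inference time. The following is a minimal sketch, not the authors' implementation. It assumes a PyTorch causal LM (e.g., a Hugging Face model) whose decoder layers return a tuple with the hidden states first, and hypothetical tensors acts_pos / acts_neg holding activations collected at one layer from prompts where the model succeeds and fails.

import torch

def steering_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    # Mean difference between activations at one layer; shape (hidden_dim,).
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)

def add_steering_hook(layer: torch.nn.Module, vector: torch.Tensor, scale: float = 1.0):
    # Register a forward hook that adds the steering vector to the layer's output.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vector.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Usage (assuming a LLaMA-style module layout): steer layer k during generation, then remove.
# handle = add_steering_hook(model.model.layers[k], steering_vector(acts_pos, acts_neg))
# ... run generation with the steered model ...
# handle.remove()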

PDF available on arXiv

@inproceedings{lucchetti:steering-type-prediction,
title = "Understanding How {CodeLLMs} (Mis)Predict Types with Activation Steering",
booktitle = "Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP)",
author = "Francesca Lucchetti and Arjun Guha",
year = 2025
}