Concept activation vectors have been shown to be effective for safety concepts, efficiently and effectively steering a considerable number of open-source large language models (LLMs) into responding compliantly to malicious instructions. In this blog, we explore the capability boundaries of concept activation vectors in steering various LLM behaviors through more extensive experiments. Our experiments show that this technique can transfer text style at low cost, but it is powerless when dealing with short factual knowledge.
As a classic interpretation method, concept activation vectors (CAVs) describe the distribution of activations in neural networks with simple linear classifiers. Some work has successfully transferred this idea to LLMs, using CAVs to steer model behavior. However, previous research on the ability of CAVs to steer LLM behavior has not been systematic, being limited to safety concepts.
Behavior steering is a technique for changing the behavior of LLMs at inference time, such as altering the preferences expressed in their output or the degree of relevance to a given topic. This method requires no pre-training or fine-tuning; it only adjusts the model's embeddings or attention at inference time according to interpretable features. It is therefore a promising downstream application of interpretation techniques.
In this blog, the behavior steering we discuss is concept-driven: we start from a behavior aligned with human ideas and then try to extract the corresponding features from the LLM, rather than first extracting relevant neural activity from the LLM and then summarizing how it manifests as behavior in human eyes. Consequently, a single human idea corresponds to a single concept to steer.
For single-behavior steering, we use the pipeline outlined in prior work:
Data Collection: Gather two sets of instructions aimed at carrying out two different tasks, labeling one set as positive and the other as negative. For example, a positive dataset might include instructions like "How to plant flowers in my garden?", while the negative dataset might include instructions like "Comment planter des fleurs dans mon jardin?". These two datasets can then be used to extract and utilize the "French" concept.
LLM Selection: Choose a target LLM, such as LLaMA-3-8B, known for its strong capabilities. Collect the final-token embeddings from each layer during inference for the instructions in the positive and negative datasets respectively.
Classifier Training: Train a linear classifier on these embeddings with their corresponding labels for each layer. That is to say, we obtain \(N\) classifiers for steering a single behavior, where \(N\) is the total number of transformer layers of the target LLM (see the sketch after this list).
Text Generation with Perturbation: Use the trained classifier parameters to perturb a typical text generation process of the target LLM. Due to the transferability disclosed by prior work, classifiers trained on instruction embeddings can be applied to perturb every newly generated token.
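As a minimal sketch of the classifier-training step, the snippet below fits one logistic-regression CAV per layer. The array names and shapes are illustrative assumptions, not the blog's actual code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cavs(emb_pos, emb_neg):
    """emb_pos[l] / emb_neg[l]: final-token embeddings at layer l,
    shape (num_instructions, hidden_dim). Returns one (w, b) per layer."""
    cavs = []
    for pos, neg in zip(emb_pos, emb_neg):
        X = np.concatenate([pos, neg])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        cavs.append((clf.coef_[0], clf.intercept_[0]))
    return cavs
```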
Assuming classifiers have been trained for each layer, we perform perturbations sequentially during the generation of each new token, as sketched below.
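The following is a minimal sketch of how such a perturbation could be applied with forward hooks on a Hugging Face Llama model. The minimal-shift rule (move the hidden state along the classifier normal until the classifier outputs a target probability \(P_0\), the `p0` below) is our reading of the method; the hook details and names are assumptions.

```python
import torch

def perturb(h, w, b, p0=0.05):
    """Minimally shift activations h (batch, hidden_dim) along the classifier
    normal w so that sigmoid(h @ w + b) hits the target probability p0.
    With positives labeled 1, a p0 near 0 pushes generation toward the
    negative (target-behavior) side of the concept boundary."""
    target = torch.logit(torch.tensor(p0, dtype=h.dtype, device=h.device))
    current = h @ w + b                        # classifier logit per example
    alpha = (target - current) / w.norm() ** 2
    return h + alpha.unsqueeze(-1) * w

def make_hook(w, b, p0):
    # assumes the HF Llama decoder layer returns a tuple whose first
    # element is the hidden states of shape (batch, seq_len, hidden_dim)
    def hook(module, args, output):
        hidden = output[0]
        hidden[:, -1, :] = perturb(hidden[:, -1, :], w, b, p0)
        return output
    return hook

# attach one hook per decoder layer, using that layer's trained CAV;
# `model` and `cavs` are assumed from the training sketch above
for layer, (w, b) in zip(model.model.layers, cavs):
    w_t = torch.tensor(w, dtype=model.dtype, device=model.device)
    layer.register_forward_hook(make_hook(w_t, float(b), p0=0.05))
```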
We use the benign instructions provided in the representation engineering work.
In this blog, the positive instructions are the unmodified versions, while the negative instructions are modified to carry the characteristics of the steering goal. This setting is consistent with the example above.
To construct the negative dataset, that is, to modify the negative instructions into a form exhibiting the target behavior, we summarize three methods:
Complete Replacement. For fundamentally different tasks, the original instruction can be directly replaced with a completely different instruction that exhibits the target behavior. For example, in safety tasks, the instructions in the negative dataset are all collected anew.
Prefix and Suffix Addition. For tasks like style transfer, a string describing the requirement can be added as a prefix or suffix to the original instruction. For instance, adding "Answer in Python code please." is a processing method suitable for the "Code" concept.
To avoid potential misguidance caused by a single format of the additional requirement string, this modification has two random elements: randomly using a prefix or a suffix, and randomly selecting one string from several candidates to add to the original instruction.
Instruction Transfer. For tasks like language or grammar, the original instruction can be directly transferred into the target form. For example, for the "French" concept, the original instruction can be translated into French (a sketch follows this list).
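For concreteness, here is a minimal sketch of how PSA and IT negative instructions could be produced. The candidate strings and the `translate` callable are hypothetical placeholders, not the blog's actual tooling.

```python
import random

# hypothetical requirement strings for the "Code" concept
CODE_REQS = ["Answer in Python code please.", "Reply with Python code only."]

def psa(instruction, requirements=CODE_REQS):
    """Prefix and Suffix Addition with its two random elements:
    a random candidate string, placed randomly as prefix or suffix."""
    req = random.choice(requirements)
    if random.random() < 0.5:
        return f"{req} {instruction}"
    return f"{instruction} {req}"

def it(instruction, translate, lang="fr"):
    """Instruction Transfer: `translate` is any translation-API callable,
    e.g. translating into French for the "French" concept."""
    return translate(instruction, target_lang=lang)
```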
For example (CR - Complete Replacement; PSA - Prefix and Suffix Addition; IT - Instruction Transfer):
The above three operations can be seen as operational primitives; nesting them yields a method for constructing datasets for multi-behavior steering.
After building the datasets, we apply the CAV perturbation to the text generation process, achieving behavior steering.
Before training classifiers, we apply PCA dimensionality reduction to evaluate the separability of the instruction embeddings. We fit PCA on the dataset and observe good linear separability. We want to clarify that the visual impression given by PCA is not completely consistent with the actual test accuracy: in cases where positive and negative examples appear to overlap in the PCA projection, the classifier's test accuracy may still be very high, even as high as in layers that do not seem to overlap. In the following experiments, whenever such an inconsistency arises, we provide both results for reference.
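A minimal sketch of this separability check, reusing the per-layer embedding arrays assumed in the training sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_separability(pos, neg, layer):
    """Project one layer's positive/negative embeddings to 2-D with PCA."""
    pca = PCA(n_components=2).fit(np.concatenate([pos, neg]))
    p, n = pca.transform(pos), pca.transform(neg)
    plt.scatter(p[:, 0], p[:, 1], s=8, label="positive")
    plt.scatter(n[:, 0], n[:, 1], s=8, label="negative")
    plt.title(f"layer {layer}")
    plt.legend()
    plt.show()
```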
By default, we use Llama-3-8B-Instruct for our experiments. Other LLMs may be involved for some concepts, and we will indicate this clearly.
First, we try the Python concept, whose negative instruction dataset is constructed by PSA (Prefix and Suffix Addition). The test accuracy of the CAV is quite high, above 99% except for layer 0. However, in the PCA results shown below, the early layers seem to have better separability than the later layers. The PCA results can therefore only serve as an auxiliary signal; test accuracy is a better indicator of CAV effectiveness.
After training the Python CAV, we attempt to steer behavior with it, applying it to three types of tasks:
You can try the interactive panel below to compare the outputs of the three tasks before and after using the Python CAV to steer behavior.
Our observations are:
Next, we explore language-related concepts, which we consider a more practical steering use case. The language-concept experiments involve four specific languages: English, French, Chinese (both Simplified and Traditional), and Arabic. The datasets are constructed with two methods: Prefix and Suffix Addition (PSA) and Instruction Transfer (IT). Due to space limitations, we cannot present the results for all pairwise combinations; only the most meaningful and interesting findings are discussed below.
When studying the French concept, we also examine the differences between PSA and IT. When using PSA to induce the French CAV, the instructions in both the positive and negative datasets are written in English. When using IT, the instructions in the positive dataset are in English, while those in the negative dataset are translated into French by a translation API.
We select three different text generation tasks to test the effects of using PSA and IT to induce the French CAV for behavior steering. Try the interactive panel below to view the PCA results and outputs of the two CAVs.
Our observations are:
The difference between Simplified and Traditional Chinese is a very interesting phenomenon. We use IT and translation APIs to construct the positive and negative datasets, with the positive dataset translated into Simplified Chinese and the negative dataset into Traditional Chinese. However, we struggle to train a good CAV on Llama-3-8B-Instruct with such datasets, possibly because this model lacks good Chinese output capabilities. Therefore, we use Llama3.1-8B-Chinese-Chat, a fine-tuned version of Llama3.1-8B-Instruct, together with its original version for this concept.
Our observations are:
With English as the input language, Llama-3-8B-Instruct answers all three tested tasks in Simplified Chinese, so there is no noticeable change after applying CAV perturbations. Llama-3.1-8B-Instruct and Llama-3.1-8B-Chinese-Chat are able to respond in Chinese, making the CAV perturbations effective. The text in the interactive panel above is generated by Llama-3.1-8B-Chinese-Chat.
The CAV test accuracy of Llama-3-8B-Instruct is the lowest, while that of Llama-3.1-8B-Instruct is the highest (with some late layers of Llama-3.1-8B-Chinese-Chat being even higher). In certain middle layers, the difference between Llama-3.1-8B-Instruct and Llama-3.1-8B-Chinese-Chat is even greater than the difference between Llama-3.1-8B-Chinese-Chat and Llama-3-8B-Instruct. The reason for this is currently unclear.
Compared to French and Chinese, Arabic should be less common in Llama-3.1's training data and is also a low-resource language. How effective are CAV extraction and behavior steering for this low-resource language? We again use the PSA and IT methods, along with an Arabic translation API, to build the datasets. Try the interactive panel below to see the steering results.
Our observations are:
Even with MohamedRashad/Arabic-Orpo-Llama-3-8B-Instruct, the above phenomenon remains.
From the results of the two concepts above, CAV seems quite good at modifying the style of the entire generated output. We have attempted more style concepts based on PSA, such as telling jokes, being more creative, childish, or fairy-tale-like. The results show that CAV performs well in steering these concepts. Try the interactive panel below to view the results of various style transfers on three specific instructions.
In addition to these concepts, we also try many others, but fail to demonstrate such effects. These PSA-based CAVs all have quite high test accuracy, yet after multiple attempts with different values of \(P_0\), we could never produce significantly steered responses. These concepts include:
Therefore, this technique seems more suited to long-form text style transformation and powerless for short, knowledge-focused modifications. This implies that CAV may not apply well to knowledge editing tasks, perhaps because the concepts here are too high-level to accurately locate knowledge at the entity level. Nevertheless, some interesting phenomena emerge from these failures:
In addition, we find that CAV behavior steering performs better on longer responses, and that the first few tokens of a response seem to retain their pre-steering characteristics, which indicates that CAV-induced steering needs a few tokens to take effect.
Through the experiments described above, we observe that both settings effectively extract a French concept and achieve forward and reverse behavior steering. This raises an interesting question regarding the uniqueness of the French concept. The success of both PSA and IT induction suggests a relationship akin to a necessary and sufficient condition, implying that the French concept within LLMs may be unique. To explore this further, we extend the experimental settings as follows:
We train CAVs using the following pairs of datasets:
The classifiers trained on these five dataset pairs exhibit good test-set accuracy. To further understand their behavior, we examine the cosine similarity between the parameters of these classifiers:
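A minimal sketch of this comparison, assuming `w_a` and `w_b` hold the per-layer weight vectors of two of the five classifiers (the names are illustrative):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# layer-wise similarity between two CAVs trained on different dataset pairs
sims = [cosine(w_a[l], w_b[l]) for l in range(len(w_a))]
```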
Our observations are:
Based on the methodology of single-behavior steering, there are two approaches to using CAVs for multi-behavior steering. Assume there are three target behaviors A, B, and C.
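As a rough sketch of how several CAV perturbations might be combined at a single layer, the snippet below chains the hypothetical `perturb` helper from the single-behavior sketch above over several concepts; this is an illustrative assumption, not the blog's exact procedure.

```python
def perturb_multi(h, layer_cavs, p0=0.05):
    """Apply perturbations for several target behaviors (e.g. A, B, C)
    in sequence at one layer. layer_cavs: list of (w, b) per behavior."""
    for w, b in layer_cavs:
        h = perturb(h, w, b, p0)
    return h
```

Note that, unless the concept directions are orthogonal, a later perturbation can partially undo an earlier one, since each resets its own classifier probability; this is one reason multi-behavior steering is harder than the single-behavior case.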
There has already been some preliminary research on this topic.
In this blog, we explore the breadth and boundaries of using CAVs for LLM behavior steering. For steering with CAVs, PSA and IT are good ways to construct datasets, allowing the corresponding CAV to be extracted easily. CAV-based steering is best suited to tasks that transfer the overall style of a text, including the language it is written in, and it can negate some brief system-prompt requirements, but it cannot handle more complex cognitive demands. Research on CAV steering can further promote the exploration of explainable AI and low-cost text style transfer.