Neural networks are trained to perform a specific task, whether that be image classification, object detection, text generation or time-series prediction. In each case, they are supplied with data and trained to perform the intended task using a loss function. The loss function tells the network what counts as good and bad behaviour on the data it is provided. For instance, suppose a neural network is trained to classify hand-written digits. When a hand-written two is supplied to the model, the loss is low (indicating good behaviour) when the network predicts a two and high (indicating bad behaviour) otherwise. Despite this rather narrow training signal, in most cases a network that is good at predicting hand-written digits is observed to have learned a robust concept of the digit two. However, when we run these networks, all we get is a stream of activation vectors, from which it is difficult to identify the concept of the digit two.
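To make this training signal concrete, here is a minimal sketch (PyTorch, with hypothetical logit values) of how a cross-entropy loss scores a digit classifier's prediction on a hand-written two:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits produced by a digit classifier for an image of a two.
predicts_two = torch.tensor([[0.1, 0.2, 4.0, 0.3, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1]])
predicts_seven = torch.tensor([[0.1, 0.2, 0.3, 0.3, 0.1, 0.2, 0.1, 4.0, 0.2, 0.1]])
label = torch.tensor([2])  # the image actually shows a two

# Low loss when the network predicts a two, high loss otherwise.
print(F.cross_entropy(predicts_two, label).item())    # small value: good behaviour
print(F.cross_entropy(predicts_seven, label).item())  # large value: bad behaviour
```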
A concept activation vector attempts to identify a concept with a vector. It takes a collection of data representing a concept and a collection of data not representing the concept, and trains a linear classifier to differentiate the network's activations on the two sets. The trained linear classifier provides a vector, which can be thought of as pointing in the direction of the target concept. Note that under this construction we are supposing that the network delineates concepts linearly. This is quite a strong assumption that does not hold in all cases, particularly for more esoteric concepts. However, networks are observed to linearly encode a large proportion of general concepts.
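A minimal sketch of this construction, assuming we already have the network's activations on concept and non-concept examples as NumPy arrays; the concept activation vector is simply the weight vector of a logistic-regression classifier fit to separate them:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(concept_acts, non_concept_acts):
    """Fit a linear classifier separating concept from non-concept activations
    and return its unit-normalised weight vector as the CAV."""
    X = np.concatenate([concept_acts, non_concept_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(non_concept_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)
```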
We can investigate how well a concept activation vector summarises a concept by studying a neural network trained to classify hand-written digits. More specifically, we first train an autoencoder to reconstruct the images. This way we obtain a representation of the 28x28 images in a lower-dimensional latent space; in this case, we set the latent space to have 16 dimensions. Moreover, in the lower-dimensional latent space, the model is incentivised to linearly encode the concepts, which increases the effectiveness of concept extraction. We then obtain our concept activation vectors by training linear classifiers on the latent-space activations of the different classes of images. Consequently, we obtain 10 concept activation vectors, each one summarising a particular hand-written digit. There are different ways one can assess how well a concept activation vector summarises a concept. A simple approach would be to study the accuracy of the linear classifier.
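A sketch of this setup in PyTorch; the layer sizes are assumptions, only the 28x28 inputs and the 16-dimensional latent space come from the description above:

```python
import torch.nn as nn

LATENT_DIM = 16  # dimensionality of the latent space used in the text

class Autoencoder(nn.Module):
    """Compresses 28x28 digit images into a 16-dimensional latent code and reconstructs them."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, LATENT_DIM),
        )
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 128), nn.ReLU(),
            nn.Linear(128, 28 * 28), nn.Sigmoid(),
            nn.Unflatten(1, (1, 28, 28)),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z
```

Once the autoencoder is trained, each digit's concept activation vector can be obtained by applying the `concept_activation_vector` helper above to the latent codes of that digit versus the latent codes of all other digits.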
For our case, we will explore the influence of the concept activation vector on downstream tasks, which is also the approach taken by the original paper. Namely, we train a neural network to classify the reconstructed images of the autoencoder, and then observe how the concept activation vector influences the prediction of the classifier. In all cases, we see that perturbing the latent space activation of an image by the corresponding concept activation vector leads the classifier to more strongly classify the image with that label. In more detail, we can approximate the directional derivative of the classifier's target logit along the concept activation vector. As noted, this derivative was always found to be positive.
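A sketch of this perturbation test, assuming the autoencoder's decoder and a digit classifier as above, and a `cav` obtained as before (converted to a torch tensor); the directional derivative is approximated with a finite difference along the concept activation vector:

```python
def target_logit(classifier, decoder, z, label):
    """Logit assigned to `label` for the image decoded from latent code `z`."""
    return classifier(decoder(z))[0, label]

def cav_directional_derivative(classifier, decoder, z, cav, label, eps=1e-2):
    """Finite-difference approximation of how the target logit changes
    as the latent code is nudged along the concept activation vector."""
    cav = cav / cav.norm()
    base = target_logit(classifier, decoder, z, label)
    perturbed = target_logit(classifier, decoder, z + eps * cav, label)
    return (perturbed - base) / eps
```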
Instead of perturbing an image’s latent space activation with its corresponding concept activation vector, we can also perturb it with the concept activation vector of a different concept.
We can continue perturbing the latent space activation to the point where the classification model reclassifies the image with the label of the other concept. The amplitude of perturbation required to change the output of the classifier indicates how well the concept activation vector summarises that concept, and how the concept relates to the original label of the image.
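A sketch of this procedure under the same assumptions as before, gradually increasing the perturbation amplitude until the classifier's prediction flips to the target concept:

```python
def reclassification_amplitude(classifier, decoder, z, cav, target_label,
                               step=0.05, max_amplitude=20.0):
    """Return the smallest perturbation amplitude along `cav` at which the
    classifier labels the decoded image as `target_label`, or None if it never flips."""
    cav = cav / cav.norm()
    amplitude = 0.0
    while amplitude <= max_amplitude:
        logits = classifier(decoder(z + amplitude * cav))
        if logits.argmax(dim=1).item() == target_label:
            return amplitude
        amplitude += step
    return None
```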