A closer look at interpretability


The terms explainability and interpretability come up a lot in the machine learning industry, and as a result they are often confused with one another or used interchangeably. However, they refer to distinct and very important concepts in machine learning and its applications, as they deal with the understanding a model's creator has of how that model makes its decisions.

Despite their importance, understanding and applying these concepts often becomes a secondary concern, as deployed models become a means to an end rather than a piece of technology to be investigated and understood. We think understanding these two concepts is key to building a well-defined model and ensuring that the results it produces are accurate and free of bias.

Explainability and Interpretability

To understand the importance of these two concepts, you first need to understand what they refer to and how that is related to the models created in machine learning.

Explainability refers to the degree to which you can explain the mechanics of how the model makes its decisions. Not just "the model sees X as an input, so it will probably output Y," but an explanation at each step of what the model is doing. Generally, explainability is a concern during the design stage of a model, as it relates mainly to your understanding of the math and mechanics behind the code you're running.

Interpretability refers to the cause-and-effect relationship between a model's inputs and outputs. That is to say, given input features [A, B] and outputs [X, Y], how easily can we predict the change in the outputs [X, Y] when we change the input features [A, B]? You can have a highly interpretable model without any knowledge of how the model works, so explainability and interpretability don't imply each other. In the case of a neural network, interpretability could relate to how data activates different neurons and convolutional filters, or to visualizations of the convolutional filters themselves. This is generally done once the model is trained, as a way to better understand how the model has learned to make decisions.
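To make this input-output view concrete, here is a minimal sketch in Python. The `black_box` function is a hypothetical stand-in for any trained model: we only get to call it, never inspect it, yet we can still probe how each output moves when one input feature is nudged.

```python
import numpy as np

# A "black box" model: a hypothetical stand-in for any trained model.
# We only call it; we never look inside.
def black_box(features):
    a, b = features
    return np.array([2.0 * a + 0.1 * b, 0.1 * a - 3.0 * b])

base = np.array([1.0, 1.0])
base_out = black_box(base)

# Perturb each input feature in turn and watch how the outputs move.
for i, name in enumerate(["A", "B"]):
    probe = base.copy()
    probe[i] += 1.0
    delta = black_box(probe) - base_out
    print(f"bumping {name} by 1 shifts the outputs by {delta}")
```

If small, systematic perturbations like these produce predictable output shifts, the model is interpretable in this sense, regardless of whether we can explain its internals.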

The Why and How of Improving Interpretability

Once you've created a model and it has been trained and tweaked to reach the accuracy you were looking for, it's tempting to deploy it and forget about it. However, understanding each piece of your model can be crucial to making sure it's performing optimally. "Optimal" could refer to resource utilization or, in some cases, the correctness of results. For less critical models, such as one that tells you the ingredients used to make a certain food, the way the model reaches its conclusion isn't as important as the end result. However, in cases where the process by which the model produces its answers is as important as the answers themselves, such as selecting qualified applicants for job postings, understanding how your model produces its results is key to avoiding a model with a hidden bias against certain groups.

While interpretability can be a difficult problem to tackle, and bringing yourself to do it after you've finished the task of training a model can be hard, it's important to have a deep understanding of the model you're deploying. To ease the burden, I'm going to show how easy it can be to gain a deeper understanding of how your model works using ImageNet, VGG16, and a library called keras-vis.

How it works

When you feed an image to a convolutional neural network, the filters - the small matrices that are passed over the image and return how well their pattern matches each section of it - are 'activated' to varying degrees depending on the image. What keras-vis does is find an image that maximizes the activation of a given filter. Check out their GitHub for a more in-depth explanation of what they are doing, but that should be enough to understand what I show below.
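To see why this works, here's a toy sketch of the underlying idea - activation maximization by gradient ascent - in plain NumPy rather than keras-vis itself. For a single filter whose activation is a dot product with the image, the gradient of the activation with respect to the image is just the filter, so gradient ascent drives a random image toward the filter's own pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

# A single 3x3 "vertical edge" filter, like one an early conv layer might learn.
filt = np.array([[-1., 0., 1.],
                 [-1., 0., 1.],
                 [-1., 0., 1.]])

# Start from random noise and run gradient ascent on the filter's
# activation, keeping the image norm fixed so it can't grow without bound.
img = rng.normal(size=(3, 3))
for _ in range(200):
    # activation = sum(filt * img), so d(activation)/d(img) = filt
    img += 0.1 * filt
    img *= 3.0 / np.linalg.norm(img)  # project back onto a fixed norm

# The optimized "image" ends up aligned with the filter's own pattern.
cosine = np.sum(img * filt) / (np.linalg.norm(img) * np.linalg.norm(filt))
print(f"cosine similarity with the filter: {cosine:.3f}")  # close to 1.0
```

keras-vis does the same thing at scale: it backpropagates through the whole network to get the gradient of a chosen filter's (or output's) activation with respect to the input image, plus some regularization to keep the result looking like a natural image.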

The Results

We can see that in different layers, the model is looking for very different patterns

Looking at the visualizations above, we can see that in the earlier blocks, Block 2 for example, the model is mostly looking for very basic patterns and textures in the image. As you progress through the layers, the patterns get increasingly complex and abstract, to the point that they're almost starting to form recognizable shapes. However, there is nothing recognizable just yet. To see something more recognizable, we need to visualize an image that maximizes the activation of one of the outputs. For this experiment I chose category 263 of the ImageNet challenge, corresponding to a corgi. Visualizing the maximization of that output yields:

Compare what the network thinks of when you say corgi to the image of a real corgi

Looking at the image on the left, we can see some shapes that resemble the characteristic ears of the corgi, as well as some areas that look like noses and eyes.

While the results from this very simple method are not the clearest, they will definitely help you get a better understanding of what each category is looking for. If you want to dig deeper into interpretability and review some more sophisticated methods of understanding how a convolutional model works, I think there is no better read than the blog post The Building Blocks of Interpretability.

Interpretability at IMRSV Data Labs

At IMRSV, a lot of our work deals with processing text, a field called Natural Language Processing (NLP). Despite their success across many different disciplines, deep neural network models are extremely hard to interpret in NLP: given an already trained deep neural network and a set of test inputs, how can we gain insight into how those inputs interact with the different layers of the network? Furthermore, can we characterize a given deep neural network based on how it behaves given different inputs?

We are investigating novel factorization-based approaches for understanding how different deep neural networks operate. More specifically, we are working on identifying patterns that link higher-level characteristics of the input data with how well the resulting model is trained. This work will also help us investigate different high-level recognizable patterns and how they traverse the hidden layers of the network.
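As a rough sketch of the factorization idea (a toy illustration, not our internal method): stack the hidden-layer activations for a batch of inputs into a non-negative matrix, then factor it into a small number of recurring patterns. Below, a minimal non-negative matrix factorization using Lee & Seung multiplicative updates recovers structure from a synthetic activation matrix built from two known patterns.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "activation matrix": rows = inputs, cols = hidden units.
# Built from 2 underlying patterns (plus a little noise) so the
# low-rank structure is actually there to be recovered.
patterns = np.array([[1., 0., 2., 0., 1.],
                     [0., 3., 0., 1., 0.]])
weights = rng.uniform(size=(20, 2))
A = weights @ patterns + 0.01 * rng.uniform(size=(20, 5))

# Rank-2 NMF via Lee & Seung multiplicative updates: A ≈ W @ H, W, H >= 0.
k = 2
W = rng.uniform(0.1, 1.0, size=(20, k))
H = rng.uniform(0.1, 1.0, size=(k, 5))
for _ in range(500):
    H *= (W.T @ A) / (W.T @ W @ H + 1e-9)
    W *= (A @ H.T) / (W @ H @ H.T + 1e-9)

# Rows of H are the recovered "activation patterns"; rows of W say
# how strongly each input expresses each pattern.
err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print(f"relative reconstruction error: {err:.4f}")
```

On real networks the matrix would hold actual hidden-layer activations rather than synthetic data, but the principle is the same: a small set of non-negative factors can summarize what a layer responds to across many inputs.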


We value having not only a solution that performs well, but one that we can explain and understand. This ensures that when we produce a model there are no hidden biases or nasty surprises, and you can be sure it will perform properly when deployed. I hope you've learned a bit about how we at IMRSV look deeper into the models we make, and that you'll give some of these methods a try next time you're building a model!