August 23, 2017 · 4 minute read

Interpretable Models are the Key to Increased Adoptability of Machine Learning

Machine learning has helped drive many technological advancements. Machine learning models fall into two broad classes, each with its own pros and cons. Malaikannan Sankarasubbu discusses how interpretable models can increase the adoption of machine learning across a wider base of industries and communities.

Machine Learning and Deep Learning are responsible for many of the technology advancements of the last couple of years. Machine Learning models can be broadly classified into:

  1. Interpretable models, and
  2. Black box models

Highly regulated industries like Pharmaceuticals, Life Sciences, Insurance, and Banking need models that are interpretable. Traditionally, when an interpretable model is required, Data Scientists have relied on Linear Regression, Logistic Regression, or Decision Trees. These models work really well for data with a linear decision boundary, that is, data where a straight line (or hyperplane) can separate the two classes. The model's coefficients make it easy to explain which features played what role, and how large a role, in a decision made by the model, as the short sketch below illustrates.
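As a quick, hypothetical illustration of that last point (the feature names and synthetic data below are made up for the example), a logistic regression's coefficients can be read directly as the direction and strength of each feature's influence:

```python
# Illustrative sketch: fit a logistic regression on a small synthetic
# dataset and read the decision rule off its coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                # two hypothetical features: income, amount
y = (X[:, 0] - X[:, 1] > 0).astype(int)      # target with a linear decision boundary

clf = LogisticRegression().fit(X, y)

# Each coefficient shows how strongly (and in which direction) a feature
# pushes the prediction toward the positive class.
for name, coef in zip(["income", "amount"], clf.coef_[0]):
    print(f"{name}: {coef:+.2f}")
```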

When a dataset has more of a nonlinear decision boundary, Data Scientists tend to rely on black box approaches like Random Forest, Gradient Boosting, Extreme Gradient Boosting, Deep Learning, etc.

But there is always a tradeoff between the accuracy and the interpretability of models, and for regulated industries and their stakeholders this tradeoff is not acceptable.

Data science is decision science. Data scientists need to convince stakeholders why a certain decision has to be made and why a model should be put into production.

However, mistakes made by a black box model break the trust of key stakeholders, and the most obvious question they then ask is: why should I trust your model?

There are several approaches to explaining the decisions made by black box models.

Let’s take a loan-default dataset, run it through a Random Forest classifier with 100 trees, and try to interpret why it makes certain decisions during inference.
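A minimal sketch of that setup might look like the following. The file name and column names here are assumptions for illustration; only the features the article later refers to (income, amount, grade, age) are used.

```python
# Sketch (hypothetical file and column names): train the 100-tree
# random forest on a loan-default dataset.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("loan_default.csv")            # hypothetical path
X = df[["income", "amount", "grade", "age"]]    # features referenced in the article
y = df["default"]                               # assumed label: 1 = defaulted, 0 = repaid

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```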

Sample data, with features such as income, amount borrowed, grade, and age plus a default label, can be seen below:



Variable Feature Importance

Variable (feature) importance can be used to explain which features a model relies on, based on the data it was trained with. For the loan-default dataset, the variable importance plot can be seen below. But variable importance does not explain why the model made a particular decision for a particular record during inference.
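With scikit-learn, the impurity-based importances are exposed directly on the trained forest; a short sketch, continuing the training example above:

```python
# Sketch: plot the trained forest's impurity-based feature importances
# (reuses the rf model and feature columns from the training sketch).
import pandas as pd
import matplotlib.pyplot as plt

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
importances.plot(kind="barh", title="Variable importance")
plt.tight_layout()
plt.show()
```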



ELI5

I firmly believe that if you cannot explain a concept to a 5-year-old, then you are not explaining it well. This is one reason the Python open source community is awesome: there is an eli5 package for explaining black box machine learning algorithms. Let’s take a record from the dataset where the person actually defaulted but the random forest model predicted that the person would not default.
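A sketch of how that single-record explanation might be produced with eli5, continuing the example above (the specific record index is an assumption):

```python
# Sketch: explain one prediction from the random forest with eli5.
import eli5

row = X_test.iloc[0].values                      # hypothetical record that actually defaulted
explanation = eli5.explain_prediction(
    rf, row, feature_names=list(X_test.columns)
)
print(eli5.format_as_text(explanation))          # or eli5.show_prediction(...) in a notebook
```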



You can see why the random forest model inferred that the person would not default: the income is high and the borrowed amount is low.

But is there an even better approach to interpreting why a black box model makes the inferences it does?

LIME (Local Interpretable Model-Agnostic Explanations)

LIME is designed to be model agnostic, which means it can work with most black box Machine Learning algorithms. To learn the behavior of a machine learning model, the input data can be perturbed and we can observe how the inference changes. LIME generates a dataset of slightly modified input values that vary only a little from the record we are running inference on.

LIME generates an explanation by approximating the black box machine learning model, locally, with an interpretable model such as a logistic or linear regression with readable coefficients. It works on the principle that it is easier to approximate a black box model with a simple model locally than to approximate the model globally. Again, because the Python open source community is awesome, there is a lime package for us to use.
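A sketch of applying LIME's tabular explainer to the same record, continuing the example above (the class names and the number of features shown are assumptions):

```python
# Sketch: explain the same record with LIME, using the forest's
# predict_proba as the black box function.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=list(X_train.columns),
    class_names=["no default", "default"],       # assumed label encoding
    discretize_continuous=True,
)

exp = explainer.explain_instance(
    X_test.iloc[0].values, rf.predict_proba, num_features=4
)
for feature, weight in exp.as_list():
    print(f"{feature}: {weight:+.3f}")
```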



The figure explains why the random forest model inferred incorrectly on this particular record: income greater than 76,850.00, grade = 1.00, amount > 12,000, and age between 23 and 26. Even a human could have made a mistake on this particular record.

LIME explains fairly well why a decision was made. The IPython notebook and data used for the random forest model can be found here.

Conclusion

We have only scratched the surface of the kinds of problems that can be solved using machine learning. Interpretable and explainable machine learning models will improve the adoption of machine learning.

Malaikannan Sankarasubbu will be chairing the Artificial Intelligence Summit in San Francisco on September 18-19. Stop by and meet him!

Saama can put you on the fast track to clinical trial process innovation.