Mining Unstructured Texts for Insights using Convolutional Neural Networks
Pharmaceutical and healthcare domains deal with a tremendous amount of unstructured text, which can be mined effectively using a CNN-for-NLP approach. Malaikannan Sankarasubbu talks about these advanced techniques, which have many applications, such as building better cohorts for clinical trials or establishing better inclusion/exclusion criteria, among other uses.
Natural Language Processing, or NLP, has improved a lot in the past couple of years, and we see many applications being built with NLP that weren't possible before. Virtual assistants, smart replies in email, text summarization, sentiment analysis, PHI scrubbing, and machine translation are some examples of NLP applications. The healthcare industry generates a lot of unstructured text, such as doctor notes, and there are many key insights that can be derived if NLP can be applied effectively to these unstructured texts.
Text is sequential information: to understand or predict the next word in a sentence, the previous words have to be considered. Traditionally, when deep learning has been applied to Natural Language Processing, it has been a flavor of Recurrent Neural Networks (RNNs). Recurrent Neural Networks have a memory (they use loops) to remember what has come before. Deep learning algorithms haven't changed much in the past 20 years; what has changed is the computation power available through GPUs. RNNs, however, do not use the full power of GPUs.
Convolutional Neural Networks (CNNs) are typically used for computer vision, and most of the recent breakthroughs in computer vision are due to CNNs. CNNs use the full power of GPUs for computation. Researchers started applying CNNs to NLP around 2014, and Facebook AI Research recently released a paper on CNNs for machine translation. CNNs are good at learning location invariance and compositionality.
To understand CNNs, we first have to learn about the convolution and pooling operations.
What is Convolution?
Let’s look at how convolution works for images and transfer that knowledge to text. A sliding window, called a kernel, with a typical size of 3×3, 5×5, 7×7, or even larger, is slid over the image. The example shown here uses a 3×3 filter. At each position, the image patch and the kernel are multiplied element-wise and the values are summed up to produce a single output value. Each kernel or filter composes a local patch of lower-level features into a higher-level representation.
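The slide-multiply-sum operation can be sketched in a few lines of numpy; the 5×5 image and all-ones kernel below are arbitrary placeholders chosen only to show the shapes involved:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid (no-padding) 2D convolution: slide the kernel over the
    image, multiply element-wise, and sum each patch to one value."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 5x5 "image" and a 3x3 kernel give a 3x3 feature map.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))
feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (3, 3)
```

Each output cell covers one local patch of the input, which is how a filter composes low-level features into a higher-level one.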
What is Pooling?
In the above picture, an 8 is an 8 no matter which corner of the picture it appears in. The pooling operation helps a CNN achieve this location invariance. There are different types of pooling, such as max and average pooling.
In the max pooling example shown here, a 2×2 window is moved over the grid, and only the maximum value within each 2×2 block is extracted to create a smaller matrix. In Natural Language Processing, max pooling over time tends to be the most commonly used variant.
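A minimal numpy sketch of non-overlapping 2×2 max pooling (the input matrix is an arbitrary example):

```python
import numpy as np

def max_pool(matrix, size=2):
    """Non-overlapping max pooling: keep the maximum of each
    size x size block, shrinking the matrix by that factor."""
    h, w = matrix.shape
    trimmed = matrix[:h - h % size, :w - w % size]
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

m = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [9, 2, 1, 0],
              [3, 4, 5, 6]])
print(max_pool(m))
# [[6 8]
#  [9 6]]
```

Note how each output value is the most salient (largest) activation in its block, regardless of where in the block it occurred, which is the source of the location invariance mentioned above.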
Convolutional Neural Networks can be expressed as a combination of Convolution Function, Activation Function, Pooling Function, and Fully Connected Layers.
CNN for NLP
In Natural Language Processing, the input is sentences or documents, represented as a matrix. Each row in the matrix is a word or, in the case of a language like Chinese, a character. A row can be an embedding representation, like Word2vec, GloVe, or Sense2vec, or a one-hot encoding. In the example here, a 7-word sentence with 5-dimensional embeddings gives a 7×5 matrix.
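Building such a sentence matrix can be sketched as follows; the sentence and the random vectors are placeholders standing in for real pretrained Word2vec or GloVe embeddings:

```python
import numpy as np

# Hypothetical 7-word sentence; each word maps to a 5-dimensional
# embedding (random here, as a stand-in for Word2vec/GloVe vectors).
sentence = "i like this movie very much !".split()
rng = np.random.default_rng(0)
vocab = {word: rng.standard_normal(5) for word in sentence}

# Stack one embedding row per word: a 7x5 input matrix for the CNN.
sentence_matrix = np.stack([vocab[w] for w in sentence])
print(sentence_matrix.shape)  # (7, 5)
```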
The width of the kernel filter is usually the same as the width of the input matrix; in the above example, the filter width will always be 5. Sliding window heights typically range from 2 to 5 words, that is, bigrams, trigrams, 4-grams, or 5-grams.
The images below show how the convolution operation works for 5-grams, 4-grams, and trigrams.
As explained before, the convolution operation at each filter position gives one value through element-wise multiplication and addition. Sliding a 5×5 filter over the 7×5 matrix produces 3 values. We are going to apply two 5×5 filters over the data.
A 4×5 filter produces 4 values. We are going to apply two 4×5 filters over the data.
A 3×5 filter produces 5 values. We are going to apply two 3×5 filters over the data.
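The three filter sizes above can be sketched in numpy; the sentence matrix and filter weights are random placeholders, since only the shapes matter here:

```python
import numpy as np

def ngram_convolve(sent_matrix, filt):
    """Slide a full-width n-gram filter down the sentence matrix;
    each position yields one value (element-wise multiply and sum)."""
    n = filt.shape[0]
    positions = sent_matrix.shape[0] - n + 1
    return np.array([np.sum(sent_matrix[i:i+n] * filt)
                     for i in range(positions)])

rng = np.random.default_rng(0)
sent = rng.standard_normal((7, 5))   # 7 words, 5-dim embeddings

for n in (5, 4, 3):                  # 5-gram, 4-gram, trigram filters
    filters = [rng.standard_normal((n, 5)) for _ in range(2)]
    outputs = [ngram_convolve(sent, f) for f in filters]
    print(n, [o.shape[0] for o in outputs])
# 5 [3, 3]
# 4 [4, 4]
# 3 [5, 5]
```

As the filter height shrinks, the number of positions it can occupy (and hence the length of its feature map) grows: 7 − n + 1 values for an n-gram filter over a 7-word sentence.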
Maxpool Over Time and Softmax
We have finished the convolution operation, so we can now apply the next step in a convolutional neural network: pooling. In NLP, max pooling over time is the commonly used approach. The outputs from max pooling over time are concatenated to produce a sentence representation. For classification tasks, a softmax layer is added on top of the concatenated output.
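Continuing the sketch, the pooling-concatenation-softmax pipeline looks like this; the feature-map values and the dense-layer weights are random placeholders for an assumed two-class sentiment task:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
# Feature maps from the six filters above: lengths 3, 3, 4, 4, 5, 5.
feature_maps = [rng.standard_normal(n) for n in (3, 3, 4, 4, 5, 5)]

# Max pooling over time keeps one value per feature map...
pooled = np.array([fm.max() for fm in feature_maps])
# ...and concatenation yields a fixed-size sentence vector.
print(pooled.shape)  # (6,)

# A dense layer plus softmax turns it into class probabilities
# (weights are random placeholders for a 2-class classifier).
W = rng.standard_normal((2, 6))
probs = softmax(W @ pooled)
print(probs.sum())  # sums to 1
```

Pooling over time is what makes variable-length sentences workable: whatever the feature-map lengths, the pooled sentence vector always has one entry per filter.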
CNN NLP Design choices
There are some design choices that have to be made when using CNN for NLP:
- To Pad or Not to Pad: Zero padding keeps the output the same size as the input, which is useful if the network shouldn’t lose border information or shrink the representation too rapidly.
- Stride Size: Different stride sizes can be used; I typically stick to a stride size of 1 for NLP.
- Pooling: The pooling operation forces the CNN to choose the most salient feature in each vector, through which higher-level representations are learned. Max pooling over time is the preferred choice.
- Channels: In an image, the channels are typically RGB; in NLP, you can have a separate channel for each type of embedding, like Word2vec, GloVe, or Sense2vec.
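The padding and stride choices above interact through a standard output-length formula; a small sketch, using the 7-word sentence and trigram filter from the running example:

```python
def conv_length(seq_len, filt_len, stride=1, pad=0):
    """Output length of a 1D convolution over a (zero-padded) sequence."""
    return (seq_len + 2 * pad - filt_len) // stride + 1

# 7-word sentence, trigram (height-3) filter:
print(conv_length(7, 3))             # 5  (narrow convolution, no padding)
print(conv_length(7, 3, pad=1))      # 7  (zero padding keeps the length)
print(conv_length(7, 3, stride=2))   # 3  (a larger stride shrinks the output)
```

This makes the trade-off concrete: padding preserves information at the sentence boundaries, while a stride above 1 discards positions, which is why stride 1 is the common choice for text.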
Example code for CNN-based sentiment analysis can be found in this link.
Pharmaceutical and healthcare domains deal with a tremendous amount of unstructured text, which can be mined very effectively using a CNN-for-NLP approach. These advanced techniques can be used for many different applications, such as building better cohorts for clinical trials, better inclusion/exclusion criteria, and safety signal detection.
We are just scratching the surface of what these applications can do; the next couple of years are going to be very interesting.
I am at the Artificial Intelligence Innovation Summit in San Francisco next week. Stop by and see me. I’ll be the guy chairing the event. If you haven’t registered yet, use code C932SAAMA for a 15% discount on me!