Article Blog General Life Sciences September 6, 2017 4 minute read

Value of Biological Data Integration

Mining Disease-Disease Associations using Disease-Gene Analytics

The key to developing a treatment for any disease lies in understanding the gene mechanism. There has been tremendous progress in elucidating the molecular mechanisms of diseases. Rajeev Gangal discusses the significance of mining disease-disease associations using disease-gene analytics. Such associations can be used to find ways of repurposing drugs, among other applications.

Medical professionals have long been aware of the multi-layered biological complexity, and the complex interactions with the environment, that underlie most human diseases. Most of these diseases are multi-genic, and lifestyle and environment significantly influence their etiology. The progression of any ailment usually involves various genes and their components. The key to developing a treatment for any disease lies in understanding the gene mechanism.

There has been tremendous progress in elucidating the molecular mechanisms of diseases, and PubMed is full of references to these studies. Several databases have been curated that relate genes, genetic variations, and pathways to diseases. Data integration and analytics are key to discovering important disease associations and to validating them. Conceptually, this “disease clustering” is a way to reveal common mechanisms that may underlie seemingly disparate conditions.

Let’s use one such resource, the DisGeNET database, to mine associations between diseases on the basis of the number of genes they share.

Disease data mining
Figure 1: DisGeNET Table with All Curated Gene-Disease Associations

We now use the power of SQL to group the data by unique disease name and collect the genes for each disease, regardless of the Source. The result is a set of genes associated with each disease.

Figure 2: Diseases and Corresponding Gene Set
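
The grouping step described above can be sketched in a few lines. This is only an illustration, not the workflow used for this article: the table name `gda`, the column names, and the toy rows are all assumptions made for the example, and SQLite is used simply because it ships with Python.

```python
import sqlite3

# Toy rows standing in for the curated gene-disease table. The column
# names and values are illustrative assumptions, not real DisGeNET records.
rows = [
    ("Disease A", "GENE1", "SRC1"),
    ("Disease A", "GENE2", "SRC2"),
    ("Disease A", "GENE1", "SRC2"),  # same gene reported by a second source
    ("Disease B", "GENE2", "SRC1"),
    ("Disease B", "GENE3", "SRC1"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gda (disease_name TEXT, gene_symbol TEXT, source TEXT)")
conn.executemany("INSERT INTO gda VALUES (?, ?, ?)", rows)

# Group by disease name; DISTINCT collapses a gene reported by several sources.
query = """
    SELECT disease_name, GROUP_CONCAT(DISTINCT gene_symbol)
    FROM gda
    GROUP BY disease_name
"""
disease_genes = {name: set(genes.split(",")) for name, genes in conn.execute(query)}
conn.close()

print(disease_genes)
```

Holding each gene list as a Python `set` makes the intersection and union operations needed later straightforward.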

The rest of the workflow involves looping over each disease and comparing its gene set with the gene sets of all remaining diseases. For each pair we compute the Jaccard coefficient (the size of the set intersection divided by the size of the set union) as a measure of “similarity” between the two diseases.
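
As a minimal sketch of this comparison, assuming the gene sets are held in a dictionary keyed by disease name (the function names and the toy data below are hypothetical):

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient: |intersection| / |union| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(disease_genes):
    """Score every unordered pair of diseases exactly once."""
    return [
        (d1, d2, jaccard(disease_genes[d1], disease_genes[d2]))
        for d1, d2 in combinations(sorted(disease_genes), 2)
    ]


# Toy gene sets, invented for illustration only.
toy = {
    "Disease A": {"G1", "G2", "G3"},
    "Disease B": {"G2", "G3", "G4"},
    "Disease C": {"G5"},
}

for d1, d2, score in pairwise_similarity(toy):
    print(d1, d2, score)
```

Here Disease A and Disease B share two of four distinct genes, giving a coefficient of 0.5, while Disease C shares nothing with either and scores 0. The number of pairs grows quadratically with the number of diseases, which is why the full run produces millions of pairs.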

It’s important to remember that some diseases are associated with few genes and some with numerous. Similarly, some genes are associated with few diseases and some with many. The present analysis does not take these considerations into account.

At the end of the iterative process, we are left with a few million disease-disease pairs, many with zero or near-zero similarity. Since we do not have any means to test the significance of minuscule associations, we filter them out and retain only pairs with Jaccard similarity >= 0.3.
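
The cut-off itself is a one-line filter. The sketch below assumes the scored pairs are stored as `(disease_1, disease_2, similarity)` tuples; the names and scores are made up for illustration.

```python
MIN_JACCARD = 0.3  # threshold stated in the text


def filter_pairs(pairs, threshold=MIN_JACCARD):
    """Keep only pairs whose similarity meets the threshold."""
    return [pair for pair in pairs if pair[2] >= threshold]


# Invented scores, for illustration only.
scored = [
    ("Disease A", "Disease B", 0.5),
    ("Disease A", "Disease C", 0.0),
    ("Disease B", "Disease D", 0.3),
    ("Disease C", "Disease D", 0.02),
]

print(filter_pairs(scored))
```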

At this stage, many of the results contain closely similar diseases that may represent subtypes or just the same diseases written differently. String similarity algorithms come to our rescue!

Computing n-gram similarity between disease names helps us decide on an appropriate threshold for filtering out pairs whose names are highly similar.

Figure 3: N-grams String Similarity, to Filter Out Almost Similar Diseases
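
One way to sketch this name filter is with character trigrams compared by a Jaccard score. The article does not say which n-gram measure was used, so the choice of trigrams, the 0.5 cut-off, and the disease names below are all assumptions made for the example.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lower-cased string."""
    s = text.lower().strip()
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two names."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def drop_similar_names(pairs, max_name_similarity=0.5):
    """Discard pairs whose two names look like variants of each other."""
    return [
        (d1, d2, score)
        for d1, d2, score in pairs
        if ngram_similarity(d1, d2) < max_name_similarity
    ]


# Hypothetical names and scores, for illustration only.
pairs = [
    ("Example syndrome", "Example syndrome type 2", 0.9),
    ("Example syndrome", "Other disorder", 0.4),
]

print(drop_similar_names(pairs))
```

Inspecting the name-similarity scores for a sample of pairs before fixing the cut-off is a reasonable way to choose it, since too low a value would also discard genuinely distinct diseases that happen to share a word.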

Once all trivial disease associations are filtered out, a few thousand high-similarity associations are left. Most may already be known, and there may still be some false positives as well as conceptually similar pairs, such as non-verbal and absent speech, but there may also be some quite remarkable results. Let’s look at a few of these:

  • West Nile fever and Type 1 diabetes
  • ACTH deficiency and delayed menstruation
  • Intermittent abdominal pain and hyperinsulinemia
  • Combined hyperlipidemia and decreased adiponectin level/increased abdominal fat
  • Acute fatty liver and cholesterol gallstones
  • Weakness of long finger extensor/facial muscles and reduced vital capacity
  • Vitamin K deficiency and increased serum bile concentration
  • Vitamin B6 deficiencies and amyloid neuropathies

PubMed references provide evidence for quite a few of the above disease associations. In cases where increased or decreased levels of biomolecules are indicated (e.g. ACTH or vitamin levels), they may point to biomarkers or drug targets for the associated diseases.

If we have drugs that work against one pathology, it is possible that the associated one may be amenable to the same treatment. This is a step towards repurposing drugs. However, more data related to pathways, variations, expression, etc. will need to be integrated to achieve this.

All of this analysis can be carried out in your favorite computational environment, be it RStudio, Python, or visual workflow tools. A shout-out to the DisGeNET group (Piñero et al., 2016; Piñero et al., 2015), without whose integrated data this case study would not have been possible.

Feel free to contact me if you want to see the entire list of disease associations and do subscribe for more insightful articles.
