Close Icon
Article Big Data Blog CPG General Healthcare High Tech Insurance Life Sciences July 12, 2017 6 minute read

Big Data Challenges: Solving for Data Quality

Data harmonization is essential for generating actionable and accurate business insights. With accurate data, an organization can see significant impact on the bottom line. Sushil Jadhav describes his experience while troubleshooting a data accuracy issue for a client.

Big Data and analytics solution providers attempt to solve a problem that arises from having too much of a very desired thing – data. So much data is flowing in everyday from various media that it is already impossible to even measure how much data each organization processes everyday.

Take a look at these numbers:

Research says that using as little as 10% of available data effectively can lead to significant increase in net income for a Fortune 1000 organization. Therefore, every organization today is on board or already implementing big data solutions. However, working with Big Data invites a huge set of complex challenges. The biggest problem, of course, is turning tons of bytes of data into something meaningful. Even after that becomes possible, ensuring that the data is sourced correctly, is secure, and accurate, is critical.

When we say that we need to leverage data for insights, we are essentially saying that we need to have well-integrated data collected from all available sources and accessible to the data analytics engines.

One of the biggest challenges is ensuring data quality and accuracy, which is easier said than done. Let’s look at some of the biggest issues that come in the way:

  • Integrating data: Typically, for an organization, data is collected from so many different sources, that it is easy to lose track of the effectiveness of the integration process. Most of the issues related to skewed insights can be traced back to the way data was collected, stored, verified, and used. When working with data-intensive and sensitive industries like Life Sciences, this error can prove fatal.
  • Dealing with the complexity of data: The data that’s flowing in is becoming more complex by the day. The systems need to account for a lot more parameters than they were doing earlier. Raw data flows in from different sources – directly from consumers, salespeople, operations, and other users within the organization. Then there are so many technologies that work across different channels – mobile, web, and offline – which further adds to the complexity.
  • Ensuring data security: The next big challenge is the security of all the available data. Creating technologies like the cloud to keep data at one place and offer access to it from anywhere has made life easy for consumers and organizations, but has also contributed to creating security nightmares.

All of these broad level challenges can be tackled by ensuring that we take all the small steps necessary while integrating data correctly. And the first step on this journey to meaningful business insights is accuracy of data.

For people who love all things data, here is an example of the way a small step can lead to large dividends. We started with the goal of working on ways to stitch together disparate data sources, and ended up finding a way of data harmonization that blew our clients away.

Consider this scenario:

A team has data in two environments and they want to ensure that the new data matches well with the existing data. They define multiple ‘rounds’ of matching, where each round would attempt to find matches based on different criteria. They end up working on these rounds for a while and trying to resolve the mismatches with further rounds and a massive data clean up effort. Still, they are unable to get the data to match perfectly. Sound familiar?

Now be honest, haven’t you ever had a data accuracy issue that led to endless hours of data cleaning? When were you last in a situation where you had to build complex logic to compare data between two systems because names are misspelled or data is missing in your matching columns?

There are many popular ways to compare and correct data, and almost all of them need to work with Microsoft Excel spreadsheets, as it is the most popular software for storing and integrating data, especially in organization with legacy software.

We believe that almost anything can be done better, provided there is a good enough reason to do so. We had been using the VLOOKUP function successfully, but when we had a client with a particularly large set of data – most of which was collected from so many different channels and handled by multiple hands – we knew we had to come up with something special.

And we did. We found Fuzzy Logic, a Microsoft Excel Add-in.

Fuzzy matching or Fuzzy lookup is a process that fills gaps in many standard data cleaning or filtering techniques. Matching two strings of text or numbers that are exactly the same, is easy through VLOOKUP. But what if you have two strings that vary slightly, say ABC Ltd and ABC PLC? This is where fuzzy matching shines. It matches strings of varying degrees of similarities and in cases that are more complex than that example.

The result of a fuzzy match will include some data that is not correct, but the add-on will show you the degree of similarity that the match has returned. You can even set a similarity threshold.

Here is an example to further illustrate the importance of Fuzzy Logic. Let’s use two tables – SSN and Compensation – for the basis of our fuzzy lookup. Now we want to match the ‘Name’ field in the SSN table with the ‘Name’ field in the Compensation table, and create a results list that shows employees’ names, Social Security numbers, and compensation.

Big Data
Figure 1: Tables for Fuzzy Lookup

The challenge we face is that not all of the names are spelled and arranged the same way; therefore, if we attempt to use a VLOOKUP function to complete this task, we will not be successful. However, by using the Fuzzy Lookup Add-in, we can match the data in these two tables. To do so, click Fuzzy Lookup to open the Fuzzy Lookup task pane on the right side of the window as shown in Figure 2.


Upon doing so, Excel automatically reads and inserts the names of the tables into the Fuzzy Lookup task pane.

Excel also automatically analyzes the columns in the tables and joins the tables if a column in one table has the same column header as a column in the other table. Of course, you can edit both the table and column selections, if necessary. If the tool does not work flawlessly. We can modify the Similarity Threshold setting and see if reducing the confidence level slightly increases the number of matches.

This was an example of how large impacts can be achieved by bringing attention of detail to every step of data management. Find out more about our approach to data and the benefits we deliver with our Fluid Analytics Solution.

Saama can put you on the fast track to clinical trial process innovation.