While most Big Data work is geared towards social media and stream analytics, the traditional enterprise data warehouse (EDW) can also leverage the power of Big Data. The concept of Big Data is not new; banks have been doing it for a while using mainframe-size computers. The reason it’s being talked about so much now is that, for the first time, cheap and massive computing power and even cheaper memory have put mainframe-size power in the hands of every organization, right at the time when organizations have been struggling to justify the ROI of processing exponentially growing data volumes.
Big Data is not a performance engine; that is, it is not a traditional database that simply runs queries faster. It will also not replace traditional reporting strategies. What it can do is batch process millions or billions of records, both unstructured and structured, much faster and more cheaply. Big Data analytics has also made it possible to merge all analysis into one platform. As a direct result, data analysis has become more accurate, well-rounded, reliable and focused on a specific business capability or advantage.
Before investing money in commodity hardware and calling in consultants to wave the Big Data magic wand, companies should do a lot of soul-searching, because once you set the wheels in motion, it is likely to take up a lot of your organization’s focus. To decide where you are on the Big Data spectrum, it is important to look at the 4 V’s of your data (Volume, Velocity, Variety and Variability), as shown in the info-graphic below.
A key question to ask is whether you have enough data volume at the source to justify the use of Big Data processing (average data set > 300 GB). If you don’t, you should consider investing in building a traditional enterprise data warehouse and fine-tuning your reporting metrics. If you do, you should move on to the next question: how you want to process this amount of data.
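The volume check above can be written down as a tiny rule of thumb. This is only a sketch: the 300 GB threshold comes from the paragraph above, while the function name, the input list of data set sizes and the returned labels are made up for illustration.

```python
# Threshold from the article: average data set larger than 300 GB.
BIG_DATA_THRESHOLD_GB = 300


def recommend_platform(dataset_sizes_gb):
    """Return a rough recommendation based on average source data set size.

    `dataset_sizes_gb` is a hypothetical input: one size in GB per source data set.
    """
    if not dataset_sizes_gb:
        raise ValueError("need at least one data set size")
    average_gb = sum(dataset_sizes_gb) / len(dataset_sizes_gb)
    if average_gb > BIG_DATA_THRESHOLD_GB:
        return "big data processing"
    return "traditional EDW"


if __name__ == "__main__":
    print(recommend_platform([120, 80, 250]))   # average is 150 GB
    print(recommend_platform([900, 450, 600]))  # average is 650 GB
```

Volume is only one of the 4 V’s, so a real assessment would weigh velocity, variety and variability as well.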
One of the key technologies being widely adopted by large enterprises for Big Data processing is Hadoop. While this technology provides the processing power, the algorithms to make sense of the data will still need to be developed in-house. The most frequent application for Hadoop is to support the “Transform” in traditional ETL (Extract, Transform, Load), where data arriving in a myriad of unstructured, semi-structured, and structured formats is transformed and loaded into terabyte-scale analytical data marts where predictive modelers and other data scientists can work their magic.
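To make the “Transform” step more concrete, here is a minimal sketch in plain Python of the map-and-reduce shape such a job usually takes. It does not use any Hadoop API; the record layouts (JSON lines and two-column CSV lines), the field names customer_id and amount, and the function names are all assumptions made for illustration. On a cluster, logic like this would typically be split into a mapper and a reducer, for example as scripts run through Hadoop Streaming.

```python
import csv
import io
import json
from collections import defaultdict


def parse_record(line):
    """Map step: normalize one raw line into (customer_id, amount), or None."""
    line = line.strip()
    if not line:
        return None
    if line.startswith("{"):
        # Semi-structured input (assumed): one JSON object per line.
        try:
            obj = json.loads(line)
            return str(obj["customer_id"]), float(obj["amount"])
        except (ValueError, KeyError, TypeError):
            return None
    # Structured input (assumed): CSV in the form "customer_id,amount".
    row = next(csv.reader(io.StringIO(line)), [])
    if len(row) != 2:
        return None
    try:
        return row[0].strip(), float(row[1])
    except ValueError:
        return None


def transform(lines):
    """Reduce step: total amount per customer, ready to load into a data mart."""
    totals = defaultdict(float)
    for line in lines:
        record = parse_record(line)
        if record is not None:  # unparseable lines are skipped
            customer_id, amount = record
            totals[customer_id] += amount
    return dict(totals)


if __name__ == "__main__":
    raw = [
        '{"customer_id": "c1", "amount": 10.5}',
        "c2,4.0",
        "c1,2.5",
        "not a valid record",
    ]
    print(transform(raw))
```

The point of the sketch is the division of labour: the platform supplies the distributed processing, while the parsing and aggregation rules are the in-house part.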
Hadoop and traditional EDW technologies can co-exist in the same ecosystem, as shown below. Each has its own strengths, and when combined they provide a potent mix for your analytical needs, as we have seen in a few large companies.
Traditional EDWs built on relational, columnar, and other approaches for storing, manipulating, and managing data will continue to exist. All of your investments in pre-Hadoop EDWs, data marts, operational data stores and the like are reasonably safe from obsolescence.
The reality here is that the EDW is evolving into a virtualized cloud ecosystem in which all of these database architectures can and will coexist in a pluggable “Big Data” storage layer alongside HDFS, HBase (Hadoop’s column-oriented database), Cassandra (a sibling Apache project that supports peer-to-peer persistence for complex event processing and other real-time applications), Neo4j (a graph database), and other “NoSQL” platforms.
Beginning a Big Data implementation really boils down to one basic question: do you have the use cases for it? We will post a few sample use cases that are being adopted by large enterprises in our next posting. Stay tuned.