On Big Data and In-Memory Data Clouds

Data growth curve:  Terabytes -> Petabytes -> Exabytes -> Zettabytes -> Yottabytes -> Brontobytes -> Geopbytes.  It is getting more interesting.

Consider this:

  • Online firms, including Facebook, Visa and Zynga, use Big Data technologies like Hadoop to analyze massive amounts of business transaction data and application data.
  • Wall Street investment banks, hedge funds, and algorithmic and low-latency traders are leveraging data appliances such as EMC Greenplum hardware with Hadoop software to do advanced analytics in a “massively scalable” architecture.
  • Retailers use HP Vertica or Cloudera to analyze massive amounts of data simply, quickly and reliably, resulting in “just-in-time” business intelligence.
  • New public and private “data cloud” software startups capable of handling petascale problems are emerging to create a new category – Cloudera, Northscale, Splunk, Palantir, Factual, Datameer, Aster Data, TellApart.

Why are some companies in retail, insurance, financial services and healthcare racing to position themselves in Big Data and data clouds while others don’t seem to care?

A new class of business problems that were hard to solve before is now being targeted: modeling true risk, customer churn analysis, flexible supply chains, loyalty pricing, recommendation engines, ad targeting, precision targeting, PoS transaction analysis, threat analysis, trade surveillance, search quality fine tuning, and mashups such as location + ad targeting.

To address these problems an elastic/adaptive infrastructure for data warehousing and analytics is required. As a result,  a new BI and Analytics framework is emerging to support public and private cloud deployments.

The excitement is that Big Data capabilities fundamentally change the core premise of BI and analytics: end users (and even machines) can perform ad-hoc analysis and reporting tasks over large and continuously growing amounts of structured and unstructured information, such as log files, sensor data, streaming data, sales transactions, emails, research data and images, collectively known as ‘big data.’

Technology Innovation around Big Data

At the technology level, a tremendous amount of innovation is already taking place around Big Data, next-generation data warehousing, low-latency OLTP, NoSQL, and in-memory, columnar, or cloud databases.

So if you have not heard of these tools – Hadoop, NoSQL, MongoDB, Cassandra, HBase, Columnar databases, Data Appliances – then it’s time for a quick primer.

NoSQL stands for Not Only SQL. Many NoSQL databases do not use the popular SQL (Structured Query Language) to create tables and insert, delete or update data. Many NoSQL deployments handle data that is hard to fit into a relational database, such as sparse data, text, and other forms of unstructured content. Unstructured content includes social media and social networks; Internet text and documents; call detail records; photography and video archives; and web logs. Industry-specific unstructured data includes RFID; large-scale eCommerce catalogs; sensor networks; astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific research; military surveillance; and medical records.
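To make the “sparse data” point concrete, here is a minimal sketch in plain Python. It does not use any real NoSQL product's API, and the record fields are made up for illustration. It shows what happens when schema-free records are forced into one fixed-column table of the kind a relational design would use.

```python
# Hypothetical records: each one carries only the fields it actually has.
documents = [
    {"id": 1, "type": "web_log", "url": "/checkout", "status": 200},
    {"id": 2, "type": "sensor", "device": "rfid-17", "reading": 4.2},
    {"id": 3, "type": "email", "subject": "Order shipped", "body": "Your order is on its way"},
]


def to_fixed_table(docs):
    """Force schema-free documents into a single table with one column per field."""
    columns = sorted({key for doc in docs for key in doc})
    rows = [[doc.get(col) for col in columns] for doc in docs]
    return columns, rows


def null_fraction(rows):
    """Share of table cells that are empty (None)."""
    cells = [cell for row in rows for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


columns, rows = to_fixed_table(documents)
sparsity = null_fraction(rows)
print(columns)
print(f"{sparsity:.0%} of the cells in the fixed table are empty")
```

With only three record types, half of the table is already empty. A document or key-value model stores each record as it is, which is one reason such data tends to end up in NoSQL systems.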

Cassandra was developed by Facebook and open sourced in 2008. It is influenced by the Google BigTable model, but also uses concepts from Amazon’s Dynamo distributed key-value store. Eventually, Cassandra became an Apache project. It falls under the NoSQL category of databases and is used by Facebook, Digg and Twitter.

HBase – An open-source, column-oriented NoSQL database modeled on Google’s BigTable system that offers row-level access. HBase is an Apache project and part of the Hadoop ecosystem. See this presentation on how Facebook uses HBase in production.

Hadoop – Relational database systems are good at data retrieval and queries but are generally not built to ingest new data at very high rates. Hadoop and other tools get around this and allow data ingestion at incredibly fast rates. Hadoop, built initially by Doug Cutting while he was at Yahoo, first became prominent in unstructured data management and cloud computing. Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing. But Hadoop requires additional tools such as Pig or Hive to write SQL-like queries to retrieve the data. Hadoop is an Apache open source project.
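The idea of breaking a workload into blocks and combining partial results can be sketched in a few lines of plain Python. This is only an illustration of the map-and-reduce pattern on a single machine, not Hadoop's actual API. The log lines, function names and block size are assumptions made up for the example.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(records, block_size):
    """Cut the input into fixed-size blocks, as a cluster would before distributing them."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_block(block):
    """Map step: count words within one block, independently of all other blocks."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


# Hypothetical input data.
log_lines = [
    "error disk full",
    "warning disk slow",
    "error network down",
    "info disk ok",
]

blocks = split_into_blocks(log_lines, block_size=2)
# On a real cluster each block would be processed on a different node in parallel.
partials = [map_block(block) for block in blocks]
totals = reduce(reduce_counts, partials, Counter())
print(totals.most_common(3))
```

Because each block is mapped without looking at any other block, adding machines lets more blocks be processed at the same time. That is the property the paragraph above describes.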

Columnar databases. Examples include SAP/Sybase IQ, HP/Vertica, and ParAccel. For analytic queries that read a few columns across many rows, columnar stores are typically far more efficient than row-oriented databases.
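A rough sketch of why the layout matters, in plain Python. The table contents and function names are hypothetical, and counting “values read” is a simplification of real disk I/O. The same three orders are stored once row by row and once column by column, and the same total is computed from each.

```python
# Hypothetical sales table stored row by row.
rows = [
    {"order_id": 1, "store": "NYC", "product": "shoes", "amount": 120.0},
    {"order_id": 2, "store": "SFO", "product": "hat", "amount": 25.0},
    {"order_id": 3, "store": "NYC", "product": "coat", "amount": 200.0},
]


def to_columns(row_store):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [row[name] for row in row_store] for name in row_store[0]}


def total_from_rows(row_store, field):
    """Row layout: every full row is read to get at a single field."""
    total, values_read = 0.0, 0
    for row in row_store:
        values_read += len(row)
        total += row[field]
    return total, values_read


def total_from_columns(column_store, field):
    """Column layout: only the requested column is read."""
    column = column_store[field]
    return sum(column), len(column)


columns = to_columns(rows)
row_total, row_reads = total_from_rows(rows, "amount")
col_total, col_reads = total_from_columns(columns, "amount")
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts give the same answer, but the column layout touches a quarter of the values here. The gap widens as tables get wider, and a column of similar values also tends to compress well.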


Data Appliances – Purpose-built solutions like Teradata, IBM/Netezza, EMC/Greenplum, SAP HANA (High-Performance Analytic Appliance) and Oracle Exadata are forming a new category. A number of vendors are going down the path of appliance and quasi-appliance offerings, which combine some preconfiguration of hardware and software, cloud-supporting deployments, and reference configurations. A leading example is the Oracle Exadata Database Machine, Oracle’s fast-selling appliance that bundles its database and hardware for optimized performance. Exadata deployments mostly involve replacing existing data warehousing solutions to get much better performance through compression and by dropping overhead such as old indexes and partitions. SAP HANA is SAP’s equivalent of Exadata and debuted at Sapphire 2011. Data appliances are one of the fastest growing categories in Big Data.


MongoDB is an open source database, combining scalability, performance and ease of use, with traditional relational database features such as dynamic queries and indexes. It has become the leading NoSQL database choice, with downloads exceeding 100,000 per month. Thousands of customers including Fortune 500 enterprises and leading Web 2.0 companies are developing large-scale applications and performing real-time “Big Data” analytics with MongoDB.  For more information, visit  www.mongodb.org or www.10gen.com. 10gen develops MongoDB, and offers production support, training, and consulting for the database.

There are many new database directions appearing on the landscape today. These include nonschematic DBMS (“NoSQL”), cloud databases, highly distributed databases, small-footprint DBMS, and in-memory databases (IMDB). The business applications of these are driven by high performance, low latency and efficiency in deployment. All of these are driven by the premise that insight into data requires more than tabular analysis.

The Rebirth of Computational and Management Science – Data Scientist

As data growth outpaces our ability to absorb or even process it, new techniques will emerge. New roles, such as Data Scientists, are emerging in corporations to handle the range of activities listed below. This is the rebirth of Management Science as a field. Amazing how things come back into style.

E-tailing – E-Commerce – Online Retailing

  • Recommendation engines: increase average order size by recommending complementary products based on predictive analysis for cross-selling.
  • Cross-channel analytics: sales attribution, average order value, lifetime value (e.g., how many in-store purchases resulted from a particular recommendation, advertisement or promotion).
  • Event analytics: what series of steps (golden path) led to a desired outcome (e.g., purchase, registration).

Retail/Consumer Products

  • Merchandising and market basket analysis.
  • Campaign management and customer loyalty programs.
  • Supply-chain management and analytics.
  • Event- and behavior-based targeting.
  • Market and consumer segmentations.

Financial Services

  • Compliance and regulatory reporting.
  • Risk analysis and management.
  • Fraud detection and security analytics.
  • CRM and customer loyalty programs.
  • Credit risk, scoring and analysis.
  • High-speed arbitrage trading.
  • Trade surveillance.
  • Abnormal trading pattern analysis.

Web & Digital Media Services

  • Large-scale clickstream analytics.
  • Ad targeting, analysis, forecasting and optimization.
  • Abuse and click-fraud prevention.
  • Social graph analysis and profile segmentation.
  • Campaign management and loyalty programs.

Government

  • Fraud detection and cybersecurity.
  • Compliance and regulatory analysis.
  • Energy consumption and carbon footprint management.

New Applications

  • Sentiment analytics.
  • Mashups – mobile user location + precision targeting.

Health & Life Sciences

  • Health insurance fraud detection.
  • Campaign and sales program optimization.
  • Brand management.
  • Patient care quality and program analysis.
  • Supply-chain management.
  • Drug discovery and development analysis.

Telecommunications

  • Revenue assurance and price optimization.
  • Customer churn prevention.
  • Campaign management and customer loyalty.
  • Call Detail Record (CDR) analysis.
  • Network performance and optimization.
  • Mobile user location analysis.
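
Several of the use cases listed above, such as recommendation engines and market basket analysis, start from the same simple step: counting which items show up together. The following is a minimal sketch of that step in plain Python. The baskets and function names are invented for illustration, and production systems use far more refined models on far larger data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
]


def pair_counts(transactions):
    """Count how often each pair of items appears in the same basket."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def recommend(item, counts, top_n=2):
    """Suggest the items most often bought together with the given item."""
    scores = Counter()
    for (first, second), n in counts.items():
        if first == item:
            scores[second] += n
        elif second == item:
            scores[first] += n
    return [other for other, _ in scores.most_common(top_n)]


counts = pair_counts(baskets)
print(recommend("bread", counts, top_n=1))
print(recommend("jam", counts))
```

At Big Data scale the counting itself becomes the hard part, which is where the distributed tools described earlier come in.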

Big Data Startups and Established Companies to Watch

Cloudera, Northscale, Splunk, Palantir, Factual, Kognitio, Datameer, Aster Data, TellApart, ParAccel

EMC Greenplum, HP Vertica, IBM/Netezza, Microsoft, Oracle Exadata, SAP HANA, Teradata

All these firms are going after two distinct opportunities:

–  Big Data in the Public Cloud

– Big Data in the Private Cloud

As I speak to customers, it is becoming clearer to me that there is going to be a growing push towards an elastic/adaptive infrastructure for data warehousing and analytics. With increasing focus on mobility and faster decision making…the business is going to push for this faster than IT can react.

It seems like the IT roadmap is going to divide into a Compute Cloud and Data Clouds. The Compute Cloud (Private/Public/Hybrid) is coming from the virtualization/resource side, and the Data Cloud (in-memory, data appliances) is coming from the mobility and decision-making side.

——————–

Also check out these articles for more coverage:

Big data’s potential for businesses:   Financial Times

Hadoop World – 2010 – Conference Presentations

The Structure Big Data conference: GigaOM conferences

The Vendor Landscape of BI and Analytics – list of Big Data vendors

This article was originally published in Business Analytics 3.0 by Shirish Netke and Ravi Kalakota.

About Saama Executives

Saama Executives is an exclusive group of thinkers, leaders, mentors and innovators within the company. The members of this group come together from time to time to pen their thoughts on the topics that matter most to the industry. Over the last few years, the group has written some brilliant pieces on the insurance, life sciences, healthcare and CPG industries.



Anant Gadhvi says:

Gartner recently announced its 2011 list of the top 10 strategic technologies, and cloud services are at the top. This matches what IDC predicted in late 2010: that cloud services would mature in 2011. It is time for a revolutionary change with cloud analytics, using new tools in the public cloud as well as the private cloud.

Business analytics research offers a thorough evaluation of end-user spending, trends and priorities for analytics and data warehousing solutions, as well as the anticipated rate of adoption of new and existing technologies by industry and company size, especially for the cloud. Most companies have been using the advantages of cloud analytics to select, implement and use these solutions. For an organization to develop and progress, more refined analysis of data is needed. Today, in a diverse environment, all suppliers need to adapt to the varying requirements that customers or end users set, from both a technical and a strategic viewpoint.

Business people should involve IT to guarantee continuing support, and IT should work with its business audience, considering cloud analytics especially for non-technical users and stakeholders outside the firewall.

Shajan Padmanabhan says:

This is an important read… following you! Thanks for the blog.

Greg Milbank says:

SHIRISH AND RAVI,

I am a new member, please pardon my delayed response. Your comments are all valid but do not take into consideration how to integrate widely disparate data that resides in existing data stores whose performance meets the business needs of companies today. Even for those companies that make the commitment to a giant leap of faith in Big Data, being able to do so in a risk-free manner will be key to forging the new frontier. The question should then be asked: how can an enterprise migrate from its current generation of IT to Big Data platforms in a practical manner? At Revelytix, we are addressing this challenge with semantic integration technologies. This approach affords the customer the option to migrate to Big Data by the drink, filling capability gaps that justify costs and return rapid ROIs. The technical approach can be executed without disruption to existing data quality or metadata management best practices. See the Semantic Software Architecture and Emergent Analytics white papers on our web site.

Jeremy says:

Interesting stuff! 🙂

Jim McGovern says:

Very nice succinct write up – thanks!


