What is a "Hadoop"? - Explaining Big Data to the C-Suite
Keep hearing about Big Data and Hadoop? Having a hard time understanding what is behind the curtain?
Hadoop is an emerging framework for Web 2.0 and enterprise businesses that are dealing with a data deluge: they must store, process and analyze large amounts of data as part of their business requirements.
The continuous challenge in Web 2.0 is how to improve site relevance and performance, understand user behavior, and gain predictive insight to influence decisions. This is a never-ending arms race as each firm tries to become the portal of choice in a fast-changing world. Consumer-oriented industries (Travel, Retail, Financial Services, Digital Media, Search, etc.) all face similar real-time information dynamics.
Take, for instance, the competitive world of travel: airline, hotel, car rental, vacation rental, etc. Every site has to improve at analytics and machine learning because the contextual data changes by the second: inventory, pricing, customer comments, peer recommendations, political/economic hotspots, natural disasters like earthquakes, etc. Without a sophisticated real-time analytics playbook, sites can become less relevant very quickly.
Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. This will be a huge shift in how IT apps are engineered.
Hadoop Quick Overview
Traditional relational databases and data warehouse products excel at OLAP and OLTP workloads over structured data. These form the underpinnings of most IT applications.
It is becoming increasingly difficult for classic techniques to support the wide range of use cases and workloads that power the next wave of digital business.
Hadoop is designed to solve a different problem: the fast, reliable analysis of structured, unstructured and complex data. Hadoop and related software are designed for the three V’s: (1) Volume: commodity hardware and open source software lower cost and increase capacity; (2) Velocity: data ingest speed is aided by append-only and schema-on-read design; and (3) Variety: multiple tools are available to structure, process, and access data.
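To make the "append-only and schema-on-read" idea concrete, here is a minimal Python sketch. It is not Hadoop code: the in-memory `raw_log`, the JSON record layout, and the `ingest` and `parse_event` helpers are all hypothetical, chosen only to show that data can be stored raw on arrival and given structure later, when someone reads it.

```python
import json

# Raw events are appended as-is; no schema is enforced at write time.
raw_log = []

def ingest(line):
    """Append a raw line without validating or parsing it (append-only)."""
    raw_log.append(line)

def parse_event(line):
    """Apply a (hypothetical) schema at read time; return None if it does not fit."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or "user" not in record:
        return None
    return {"user": str(record["user"]), "price": float(record.get("price", 0.0))}

# Ingest is fast because nothing is checked on the way in.
ingest('{"user": "alice", "price": "19.99"}')
ingest('free-text comment that fits no schema')
ingest('{"user": "bob"}')

# Structure is imposed only when the data is read for analysis.
events = [e for e in map(parse_event, raw_log) if e is not None]
```

The trade-off in this sketch is that malformed input is kept in storage and only skipped at read time, so a different schema could still be applied to the same raw data later.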
As a result, many IT Engineering teams are deploying the Hadoop ecosystem alongside their legacy IT applications, which allows them to combine old data and new data sets in powerful new ways. It also allows them to offload analysis from the data warehouse.
Technically, Hadoop consists of two elements: reliable data storage using the Hadoop Distributed File System (HDFS) and a high-performance parallel/distributed data processing framework called MapReduce. For more, see our primer on Big Data, Hadoop and in-memory Analytics.
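For readers who want a feel for the MapReduce model, the sketch below counts words in plain Python. It runs in a single process and only imitates the three stages (map, shuffle, reduce) that the framework distributes across a cluster; the function names and sample sentences are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big insight", "data at scale"])
```

Because each map call looks at one record and each reduce call looks at one key, the work can in principle be spread over many servers, which is the point of the model.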
Hadoop runs on a collection/cluster of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop is self-healing. It can deliver data — and can run large-scale, high-performance processing jobs — in spite of system changes or failures.
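The "self-healing" behavior rests largely on keeping several copies of each data block on different servers. The toy Python model below illustrates the idea under simplifying assumptions: the node and block names are made up, placement is a plain round-robin, and `place_blocks` and `heal` are hypothetical helpers. Real HDFS placement and recovery are considerably more involved.

```python
REPLICATION = 3  # HDFS's default replication factor; everything else is illustrative

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {nodes[(i + r) % len(nodes)] for r in range(replication)}
    return placement

def heal(placement, live_nodes, replication=REPLICATION):
    """Drop replicas on dead nodes, then copy each block to other live nodes
    until it is back at the target replica count (if enough nodes remain)."""
    live = sorted(live_nodes)
    healed = {}
    for block, holders in placement.items():
        survivors = holders & set(live)
        if not survivors:
            healed[block] = set()  # every replica was lost; nothing to copy from
            continue
        candidates = [n for n in live if n not in survivors]
        missing = max(0, min(replication, len(live)) - len(survivors))
        healed[block] = survivors | set(candidates[:missing])
    return healed

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(["blk_a", "blk_b", "blk_c"], nodes)

# node2 fails: its replicas are re-created on the remaining servers.
after_failure = heal(placement, live_nodes=set(nodes) - {"node2"})
```

In this model, losing one server never loses data, because every block still has at least two surviving copies from which a third can be made.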
The Hadoop Stack
It’s important to differentiate Hadoop from the Hadoop stack. Firms like Cloudera sell a set of capabilities around Hadoop called Cloudera’s Distribution for Hadoop (CDH).
This is a set of projects and management tools designed to lower the cost and complexity of administration and production support services; this includes 24/7 problem resolution support, consultative support, and support for certified integrations.
The introduction of the Hadoop stack is changing business intelligence (reporting/analytics/data mining), a field that has been dominated by very expensive relational databases and data warehouse appliance products.
What is Hadoop good for?
Searching, log processing, recommendation systems, data warehousing, video and image analysis, and archiving seem to be the initial uses. One prominent space where Hadoop is playing a big role is data-driven websites. The four primary areas are:
1) To aggregate “data exhaust”: messages, posts, blog entries, photos, video clips, maps, web graph…
2) To give data context: friends networks, social graphs, recommendations, collaborative filtering…
3) To keep apps running: web logs, system logs, system metrics, database query logs…
4) To deliver novel mashup services: mobile location data, clickstream data, SKUs, pricing…
Let’s look at a few real-world examples from LinkedIn, CBS Interactive, Explorys, Orbitz and Foursquare. Walt Disney, Wal-Mart, General Electric, Nokia, and Bank of America are also applying Hadoop to a variety of tasks including marketing, advertising, and sentiment and risk analysis. IBM also used the software in building its Watson computer, which competed against champions of the TV game show Jeopardy!
Hadoop @ LinkedIn
LinkedIn is a massive data hoard whose value is connections. It currently computes more than 100 billion personalized recommendations every week, powering an ever-growing assortment of products, including Jobs You May Be Interested In, Groups You May Like, News Relevance, and Ad Targeting. LinkedIn leverages Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn’s base of 100 million members. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services.
CBS Interactive - Leveraging Hadoop
CBS Interactive is using Hadoop as its web analytics platform, processing one billion web log events daily (up from 250 million events per day) from hundreds of web site properties.
Who is CBS Interactive? They are the online division for CBS, the broadcast network. They are a top 10 global web property and the largest premium online content network. Some of the brands include: CNET, Last.fm, TV.com, CBS Sports, 60 Minutes, to name a few.
CBS Interactive migrated processing from a proprietary platform to Hadoop to crunch web metrics. The goal was to achieve more robustness, fault tolerance and scalability, and a significant reduction in processing time to meet its SLA (a reduction of over six hours so far). To enable this, they built an Extraction, Transformation and Loading (ETL) framework called Lumberjack, based on Python and Hadoop Streaming.
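Hadoop Streaming lets a job's mapper and reducer be ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output, with the framework sorting by key in between. The sketch below imitates that flow locally in Python. It is not Lumberjack's code: the "site path" log format, the sample lines, and the `mapper` and `reducer` functions are invented for illustration.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'site<TAB>1' for each web log line (hypothetical 'site path' format)."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"

def reducer(sorted_lines):
    """Sum the counts per key; input must already be sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for site, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{site}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the streaming pipeline: map, sort by key, reduce.
log_lines = [
    "cnet.com /reviews/phone",
    "tv.com /shows/listing",
    "cnet.com /news/story",
]
page_views = list(reducer(sorted(mapper(log_lines))))
```

In a real streaming job the two functions would live in separate scripts that loop over `sys.stdin`, and the sort step would be performed by Hadoop rather than by `sorted`.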
Explorys and Cleveland Clinic
Explorys, founded in 2009 in partnership with the Cleveland Clinic, is one of the largest clinical repositories in the United States with 10 million lives under contract. The Explorys healthcare platform is based upon a massively parallel computing model that enables subscribers to search and analyze patient populations, treatment protocols, and clinical outcomes. With billions of clinical and operational events already curated, Explorys helps healthcare leaders leverage analytics for breakthrough discovery and the improvement of medicine. HBase and Hadoop are at the center of Explorys. Already ingesting billions of anonymized clinical records, Explorys provides a powerful and HIPAA-compliant platform for accelerating discovery.
Hadoop @ Orbitz
Travel (air, hotel, car rentals) is an incredibly competitive space. Take the challenge of hotel ranking. In 2011, Orbitz.com generated ~1.5 million air searches and ~1 million hotel searches a day. All this activity generates massive amounts of data: over 500 GB/day of log data. The challenge was that the existing data infrastructure was expensive and difficult to use for storing and processing this data.
Orbitz needed an infrastructure that provides (1) long-term storage of large data sets; (2) open access for developers and business analysts; (3) ad-hoc querying of data and rapid deployment of reporting applications. They moved to Hadoop and Hive to provide reliable and scalable storage and processing of data on inexpensive commodity hardware. Hive is an open-source data warehousing solution built on top of Hadoop which allows easy data summarization, ad-hoc querying and analysis of large datasets stored in Hadoop. Hive simplifies Hadoop data analysis: users can write SQL rather than low-level custom code. High-level queries are compiled into Hadoop MapReduce jobs.
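To see roughly what "compiled into MapReduce jobs" means, consider a simple aggregate query. The Python sketch below shows, under simplifying assumptions, how a GROUP BY with COUNT(*) breaks down into a map step and a reduce step. The `hotel_searches` table, its columns, and the sample rows are hypothetical, and this is not how Hive itself is implemented or invoked.

```python
from collections import defaultdict

# Hypothetical HiveQL that an analyst might write:
QUERY = "SELECT city, COUNT(*) FROM hotel_searches GROUP BY city"

def map_row(row):
    """Map step: emit the GROUP BY column as the key, with a count of 1."""
    return row["city"], 1

def reduce_group(values):
    """Reduce step: COUNT(*) is the sum of the 1s emitted for one key."""
    return sum(values)

def run_group_by_count(rows):
    """Imitate the job: map every row, shuffle by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in map(map_row, rows):
        groups[key].append(value)
    return {key: reduce_group(vals) for key, vals in groups.items()}

# Made-up rows standing in for a table stored in Hadoop.
hotel_searches = [
    {"city": "Chicago", "nights": 2},
    {"city": "Boston", "nights": 1},
    {"city": "Chicago", "nights": 3},
]
result = run_group_by_count(hotel_searches)
```

The analyst only writes the one-line query; the equivalent of the functions above is what the tool generates and runs across the cluster.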
Hadoop @ Foursquare
Foursquare is a mobile + location + social networking startup aimed at letting your friends in almost every country know where you are and figuring out where they are.
As a platform, Foursquare is now aware of 25+ million venues worldwide, each of which can be described by unique signals about who is coming to these places, when, and for how long. To reward and incent users, Foursquare allows frequent users to collect points, prize “badges,” and eventually coupons for check-ins.
Foursquare is built on enabling better mobile + location + social networking by applying machine learning algorithms to the collective movement patterns of millions of people. The ultimate goal is to build new services which help people better explore and connect with places.
Foursquare engineering employs a variety of machine learning algorithms to distill check-in signals into useful data for the app and platform. Foursquare is powered by a social recommendation engine and real-time suggestions based on a person’s social graph.
Matthew Rathbone of Foursquare engineering describes the data analytics challenge as follows:
“With over 500 million check-ins last year and growing, we log a lot of data. We use that data to do a lot of interesting analysis, from finding the most popular local bars in any city, to recommending people you might know. However, until recently, our data was only stored in production databases and log files. Most of the time this was fine, but whenever someone non-technical wanted to do some ad-hoc data exploration, it required them knowing Scala and being able to query against production databases.
This has become a larger problem as of late, as many of our business development managers, venue specialists, and upper management eggheads need access to the data in order to inform some important decisions. For example, which venues are fakes or duplicates (so we can delete them), what areas of the country are drawn to which kinds of venues (so we can help them promote themselves), and what are the demographics of our users in Belgium (so we can surface useful information)?”
To enable easy access to data, Foursquare engineering decided to use Apache Hadoop and Apache Hive in combination with a custom data server (built in Ruby), all running in Amazon EC2. The data server is built using Rails, MongoDB, Redis, and Resque and communicates with Hive using the Ruby Thrift client.
Hadoop and Big Data Vendor Landscape…
Now that we have dispensed with examples and you are eager to get started, which vendors do you engage with, and for what? How do you make sense of the landscape?
Jeff Kelly @ Wikibon presents a nice Big Data market segmentation landscape graphic that I found quite interesting.
Hadoop usage is growing as more analysts, programmers and – increasingly – processes “use” data. Data growth drives performance challenges, load time challenges and hardware cost optimization.
Hadoop-based analytic complexity grows as data mining, predictive modeling and advanced statistics become the norm. Usage growth is driving the need for more analytical sophistication.
Hadoop’s framework brings a new set of challenges related to the compute infrastructure and underlying network architectures. As Hadoop graduates from pilots to a mission-critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in enterprise RDBMS becomes imperative.
Finally, adoption of Hadoop in the enterprise will not be an easy journey, and the hardest steps are often the first. Then, they get harder! Weaning the IT organizations off traditional DB and EDW models to use a new approach can be compared to moving the moon out of its orbit with a spatula… but it can be done.
2012 may be the year Hadoop crosses into mainstream IT.
Sources and References
1) Apache Hadoop = HDFS + MapReduce
- Hadoop Distributed File System (HDFS) for storing large datasets
- MapReduce, the parallel programming model that Google introduced for large-scale data processing
2) Related components often deployed with Hadoop – HBase, Hive, Pig, Oozie, Flume and Sqoop. These components form the core Hadoop Stack.
- HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable architecture. HBase is designed to scale to billions of rows and millions of columns while keeping write and read performance predictable.
- Hive is a data warehouse infrastructure built on top of Apache Hadoop
- Pig, a high-level query language for large-scale data processing
- ZooKeeper, a toolkit of coordination primitives for building distributed systems
3) The Hadoop ecosystem is evolving constantly, which makes enterprise IT adoption tricky, since enterprise IT tends to prefer stable, proven models with a long maintenance tail.
4) Foursquare and Hadoop case study writeup by Matthew Rathbone of the Foursquare Engineering team: http://engineering.foursquare.com/2011/02/28/how-we-found-the-rudest-cities-in-the-world-analytics-foursquare/
5) Presentation, Big Data at Foursquare: http://engineering.foursquare.com/2011/03/24/big-data-foursquare-slides-from-our-recent-talk/
Component || Description
HDFS || Hadoop Distributed File System
MapReduce || Parallel data-processing framework
Hadoop Common || A set of utilities that support the Hadoop subprojects
HBase || Hadoop database for random read/write access
Hive || SQL-like queries and tables on large datasets
Pig || Data flow language and compiler
Oozie || Workflow for interdependent Hadoop jobs
Sqoop || Integration of databases and data warehouses with Hadoop
Flume || Configurable streaming data collection
ZooKeeper || Coordination service for distributed applications
Hue || User interface framework and software development kit (SDK) for visual Hadoop applications