How to Design a Big Data Architecture in 6 Easy Steps


Designing a Big Data architecture is a complex task, considering the volume, variety, and velocity of data today. Add to that the pace of technology innovation and the number of competing products on the market, and this is no trivial challenge for a Big Data architect.

In this two-part series, I will describe the components of a Big Data architecture, the myths around them, and how they are handled differently today.

Let’s get the plans out!


Analyze the Business Problem

Look at the business problem objectively and identify whether it is truly a Big Data problem. Sheer volume or cost alone may not be the deciding factor; multiple criteria, such as velocity, variety, challenges with the current system, and the time taken for processing, should be considered as well.

Some Common Use Cases:

  • Data Archival/Data Offload – Despite the cumbersome process and long SLAs for retrieving data from tape, tape remains the most commonly used backup medium, because cost limits how much active data can be maintained in current systems. Hadoop, by contrast, facilitates storing huge amounts of data spanning years (active data) at a very low cost.
  • Process Offload – Offload jobs that consume expensive MIPS cycles or consume extensive CPU cycles on the current systems.
  • Data Lake Implementation– Data lakes help in storing and processing massive amounts of data.
  • Unstructured Data Processing – Big Data technologies provide capabilities to store and process any amount of unstructured data natively. RDBMSs can also store unstructured data as a BLOB or CLOB, but they don't provide processing capabilities natively.
  • Data Warehouse Modernization – Integrate the capabilities of Big Data and your data warehouse to increase operational efficiency.
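As a rough illustration of weighing multiple criteria rather than volume alone, the qualification step above can be sketched as a simple checklist. The threshold values and field names here are hypothetical placeholders, not recommendations; tune them to your own organization.

```python
# Illustrative "is this a Big Data problem?" checklist. Thresholds and
# keys are assumptions for the sketch, not industry standards.

CRITERIA = {
    "volume_tb": lambda v: v > 10,           # volume: total data size
    "daily_ingest_gb": lambda v: v > 100,    # velocity: daily arrival rate
    "source_types": lambda v: v > 3,         # variety: structured, logs, ...
    "batch_window_exceeded": lambda v: v,    # current system struggling
}

def looks_like_big_data(profile, min_hits=2):
    """Flag the problem as Big Data when enough criteria are met."""
    hits = sum(1 for key, test in CRITERIA.items()
               if test(profile.get(key, 0)))
    return hits >= min_hits

profile = {"volume_tb": 40, "daily_ingest_gb": 20,
           "source_types": 5, "batch_window_exceeded": False}
print(looks_like_big_data(profile))  # True: volume and variety both hit
```

A real assessment would of course weigh these factors qualitatively; the point is simply that no single criterion decides the question.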

Vendor Selection

Vendor selection for the Hadoop distribution is often driven by the client, based on their preferences, the vendor's market share, or existing partnerships. The major Hadoop distribution vendors are Cloudera, Hortonworks, MapR, and IBM BigInsights, with Cloudera and Hortonworks being the most prominent.

Deployment Strategy

Deployment strategy determines whether the solution will be on-premises, cloud-based, or a mix of both. Each has its own pros and cons.

  • An on-premises solution tends to be more secure (at least in the customer's mind). Banking, insurance, and healthcare customers have typically preferred this model, as data doesn't leave the premises. However, hardware procurement and maintenance cost considerably more money, effort, and time.
  • A cloud-based solution is a more cost-effective, pay-as-you-go model that provides a lot of flexibility in terms of scalability and eliminates procurement and maintenance overhead.
  • A hybrid deployment strategy offers the best of both worlds and can be planned so that PII data is retained on-premises while the rest lives in the cloud.

Capacity Planning

Capacity planning plays a pivotal role in hardware and infrastructure sizing. Important factors to be considered are:

  • Data volume for one-time historical load
  • Daily data ingestion volume
  • Retention period of data
  • HDFS Replication factor based on criticality of data
  • Time period for which the cluster is sized (typically 6 months to 1 year), after which the cluster is scaled horizontally based on requirements
  • Multi datacenter deployment
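The first four factors above combine into a back-of-the-envelope storage estimate. The following sketch shows one way to do the arithmetic; the overhead factor and the example figures are illustrative assumptions, not recommendations.

```python
# Rough HDFS storage sizing from the capacity-planning inputs above.

def raw_storage_tb(historical_tb, daily_ingest_tb, retention_days,
                   replication_factor=3, overhead=1.25):
    """Estimate raw storage needed on the cluster.

    overhead adds headroom for temporary and intermediate data
    (a 20-30% margin is a common rule of thumb).
    """
    logical = historical_tb + daily_ingest_tb * retention_days
    return logical * replication_factor * overhead

# Example: 50 TB one-time historical load, 0.1 TB/day ingestion,
# 1-year retention, default HDFS replication factor of 3.
print(round(raw_storage_tb(50, 0.1, 365), 1))  # 324.4 (TB raw)
```

Note how replication multiplies the requirement: 86.5 TB of logical data becomes well over 300 TB of raw disk, which is why the replication factor is tied to the criticality of the data.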

Infrastructure sizing

Infrastructure sizing is based on our capacity planning and determines the type of hardware required, such as the number of machines, CPU, and memory. It also involves deciding the number of clusters/environments required.

Important factors to be considered:

  • Type of processing: memory-intensive or I/O-intensive
  • Type of disk
  • Number of disks per machine
  • Memory size
  • HDD size
  • Number of CPUs and cores
  • Data retained and stored in each environment (e.g., Dev may hold 30% of Prod's data)
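Given a raw storage target from capacity planning, the disk-related factors above translate into a node count. This is a minimal sketch under assumed per-node specs (12 disks of 4 TB each, with headroom reserved for the OS and logs); real sizing would also balance CPU and memory against the processing profile.

```python
import math

def datanodes_needed(total_raw_tb, disks_per_node=12, disk_tb=4,
                     usable_fraction=0.85):
    """Estimate the DataNode count for a raw storage target.

    usable_fraction leaves headroom for the OS, logs, and
    non-HDFS disk usage (the 0.85 value is an assumption).
    """
    per_node_tb = disks_per_node * disk_tb * usable_fraction
    return math.ceil(total_raw_tb / per_node_tb)

# Example: ~324 TB raw requirement at 12 x 4 TB disks per node.
print(datanodes_needed(324))  # 8 nodes
```

The same arithmetic, scaled down by the data-retention percentage per environment, gives the size of Dev and QA clusters.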

Backup and Disaster Recovery Planning

Backup and disaster recovery is a very important part of planning, and involves the following considerations:

  • The criticality of data stored
  • RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements
  • Active-Active or Active-Passive Disaster recovery
  • Multi datacenter deployment
  • Backup Interval (can be different for different types of data)
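The relationship between the backup interval and the RPO can be made concrete with a small check: in the worst case, a failure occurs just before the next backup, so the interval (plus any replication lag to the DR site) bounds the data you can lose. A sketch, with illustrative numbers:

```python
def max_data_loss_hours(backup_interval_h, replication_lag_h=0.0):
    """Worst-case data loss: failure just before the next backup,
    plus any replication lag to the disaster-recovery site."""
    return backup_interval_h + replication_lag_h

def meets_rpo(backup_interval_h, rpo_h, replication_lag_h=0.0):
    """True when the worst-case loss stays within the RPO."""
    return max_data_loss_hours(backup_interval_h, replication_lag_h) <= rpo_h

# Critical data with a 1-hour RPO cannot rely on nightly backups;
# it needs near-continuous replication (e.g., an Active-Active setup).
print(meets_rpo(24, 1))   # False
print(meets_rpo(0.5, 1))  # True
```

This is why the backup interval can (and should) differ per data type: applying the tightest RPO to everything wastes storage and bandwidth.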

Click here to read part 2 of this series, in which we take a deep dive into the logical layers involved in architecting the Big Data Solution.

Learn how Saama’s Fluid Analytics™ Hybrid Solution accelerates your big data business outcomes.


About Shreya Pal

Shreya brings over 14 years of experience across data management, analytics, and consulting for various clients in the US and UK markets. She has extensive experience designing complex solutions, architecting large-scale distributed systems, and building services on cloud technologies, combining strong business acumen with technical depth in the Big Data and analytics space.

