The Data Post · May 31, 2016 · 3 minute read

How to Design a Big Data Architecture in 6 Easy Steps

Designing a Big Data architecture is a complex task, considering the volume, variety and velocity of data today. Add to that the speed of technology innovations and competitive products in the market, and this is no trivial challenge for a Big Data Architect.

In this two-part series, I will describe the components of a Big Data architecture, the myths around them, and how they are handled differently today.

Let’s get the plans out!


Analyze the Business Problem

Look at the business problem objectively and identify whether it is truly a Big Data problem. Sheer volume or cost may not be the deciding factor; multiple criteria such as velocity, variety, challenges with the current system, and the time taken for processing should be considered as well.

Some Common Use Cases:

  • Data Archival / Data Offload – Despite the cumbersome process and long SLAs for retrieving data from tape, it remains the most commonly used backup method, because cost limits how much active data can be kept in current systems. Hadoop, by contrast, can store huge amounts of active data spanning years at very low cost.
  • Process Offload – Offload jobs that consume expensive MIPS cycles or extensive CPU cycles on the current systems.
  • Data Lake Implementation – Data lakes help in storing and processing massive amounts of data.
  • Unstructured Data Processing – Big Data technologies can store and process any amount of unstructured data natively. An RDBMS can also store unstructured data as a BLOB or CLOB, but provides no native processing capabilities for it.
  • Data Warehouse Modernization – Integrate the capabilities of Big Data and your data warehouse to increase operational efficiency.

Vendor Selection

Vendor selection for the Hadoop distribution is most often driven by the client, based on personal bias, the vendor's market share, or existing partnerships. The vendors of Hadoop distributions include Cloudera, Hortonworks, MapR, and IBM BigInsights, with Cloudera and Hortonworks being the most prominent.

Deployment Strategy

The deployment strategy determines whether the solution will be on-premise, cloud-based, or a mix of both. Each has its own pros and cons.

  • An on-premise solution tends to be more secure (at least in the customer's mind). Banking, insurance, and healthcare customers have typically preferred this model, as data never leaves their premises. However, hardware procurement and maintenance cost considerably more money, effort, and time.
  • A cloud-based solution is a more cost-effective, pay-as-you-go model that provides a lot of flexibility in terms of scalability and eliminates procurement and maintenance overhead.
  • A mixed deployment strategy offers the best of both worlds and can be planned to retain PII data on premises and the rest in the cloud.

Capacity Planning

Capacity planning plays a pivotal role in hardware and infrastructure sizing. Important factors to be considered are:

  • Data volume for one-time historical load
  • Daily data ingestion volume
  • Retention period of data
  • HDFS Replication factor based on criticality of data
  • Time period for which the cluster is sized (typically 6 months to 1 year), after which the cluster is scaled horizontally based on requirements
  • Multi-datacenter deployment
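
The factors above can be combined into a back-of-the-envelope storage estimate. The sketch below is illustrative only; the 25% temp-space overhead and all input figures are assumptions, not recommendations:

```python
def required_hdfs_capacity_tb(
    historical_tb: float,         # one-time historical load
    daily_ingest_tb: float,       # daily data ingestion volume
    retention_days: int,          # retention period of data
    replication: int = 3,         # HDFS replication factor
    temp_overhead: float = 0.25,  # assumed scratch space for intermediate output
) -> float:
    """Rough raw-disk estimate for an initial cluster sizing."""
    logical_tb = historical_tb + daily_ingest_tb * retention_days
    raw_tb = logical_tb * replication
    return raw_tb / (1 - temp_overhead)

# Example: 10 TB history, 0.1 TB/day retained for 1 year, replication factor 3
print(required_hdfs_capacity_tb(10, 0.1, 365))  # -> 186.0 TB raw
```

Note that the replication factor alone triples the logical footprint, which is why it is listed as a first-class sizing input.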

Infrastructure sizing

Infrastructure sizing is based on the capacity plan and determines the type of hardware required, such as the number of machines, CPU, and memory. It also involves deciding the number of clusters/environments required.

Important factors to be considered:

  • Type of processing – memory or I/O intensive
  • Type of disk
  • Number of disks per machine
  • Memory size
  • HDD size
  • Number of CPUs and cores
  • Data retained and stored in each environment (e.g., dev may hold 30% of prod data)

Backup and Disaster Recovery Planning

Backup and disaster recovery is a very important part of planning, and involves the following considerations:

  • The criticality of the data stored
  • RPO (Recovery Point Objective) and RTO (Recovery Time Objective) requirements
  • Active-active or active-passive disaster recovery
  • Multi-datacenter deployment
  • Backup interval (can be different for different types of data)
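
To make the RPO consideration concrete: the worst-case data loss is roughly the time since the last successful backup, so each dataset's backup interval must fit within its RPO. A minimal sketch, with hypothetical dataset names and figures:

```python
def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    # Worst-case loss ~= time since the last successful backup,
    # so the backup interval must not exceed the RPO.
    return backup_interval_hours <= rpo_hours

# Hypothetical per-dataset plan: (backup interval, RPO), both in hours
plan = {
    "transactions": (4, 4),    # critical data: 4-hour RPO
    "clickstream": (24, 48),   # less critical: daily backup is enough
    "archive": (168, 24),      # weekly backup misses a 24-hour RPO
}
for name, (interval, rpo) in plan.items():
    print(name, "OK" if meets_rpo(interval, rpo) else "VIOLATES RPO")
```

This is why the backup interval is allowed to differ per data type: backing up everything at the most critical dataset's cadence wastes cluster and network capacity.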

Click here to read part 2 of this series, in which we take a deep dive into the logical layers involved in architecting the Big Data Solution.

Learn how Saama’s Fluid Analytics™ Hybrid Solution accelerates your big data business outcomes.
