This article follows part 1 of the series, posted on May 31, 2016.
In part 1, we looked at the various activities involved in planning a Big Data architecture. This article covers each of the logical layers in architecting a Big Data solution.
The picture below depicts the logical layers involved.
Get to the Source!
Source profiling is one of the most important steps in deciding the architecture. It involves identifying the different source systems and categorizing them based on their nature and type.
Points to be considered while profiling the data sources:
- Identify the internal and external source systems
- Make a high-level estimate of the amount of data ingested from each source
- Identify the mechanism used to get data – push or pull
- Determine the type of data source – database, file, web service, stream, etc.
- Determine the type of data – structured, semi-structured or unstructured
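To make this checklist concrete, here is a minimal Python sketch of a source inventory built from the profiling answers above; all source names, volumes, and attributes are hypothetical placeholders, not taken from the article.

```python
# Hypothetical source inventory capturing the outcome of source profiling.
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str             # source system identifier
    origin: str           # "internal" or "external"
    daily_volume_gb: int  # high-level estimate of ingested data
    mechanism: str        # "push" or "pull"
    source_type: str      # database, file, web service, stream, ...
    data_format: str      # structured, semi-structured, unstructured

sources = [
    DataSource("crm_db", "internal", 50, "pull", "database", "structured"),
    DataSource("clickstream", "external", 500, "push", "stream", "semi-structured"),
    DataSource("support_emails", "internal", 5, "pull", "file", "unstructured"),
]

# Segregate sources by mode of ingestion (used later in the ingestion layer).
batch_sources = [s for s in sources if s.source_type != "stream"]
realtime_sources = [s for s in sources if s.source_type == "stream"]
```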
Ingestion Strategy and Acquisition
Data ingestion in the Hadoop world means ELT (Extract, Load and Transform), as opposed to the ETL (Extract, Transform and Load) used in traditional warehouses: data is landed in its raw form first and transformed later in the processing layer.
Points to be considered:
- Determine the frequency at which data would be ingested from each source
- Is there a need to change the semantics of the data (append, replace, etc.)?
- Is there any data validation or transformation required before ingestion (Pre-processing)?
- Segregate the data sources based on the mode of ingestion – batch or real-time
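As a rough illustration of the ELT-style batch ingestion described above, here is a minimal PySpark sketch; the paths, column names, and load semantics are assumptions for illustration, not prescriptions.

```python
# Minimal sketch of a batch, ELT-style ingestion job using PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-ingestion").getOrCreate()

# Extract + Load: land the raw file in the data lake (HDFS) largely as-is,
# deferring heavier transformations to the processing layer (ELT, not ETL).
raw = spark.read.option("header", "true").csv("hdfs:///landing/orders/2016-05-31/")

# Light pre-processing/validation before ingestion: drop rows missing the key.
validated = raw.dropna(subset=["order_id"])

# Load semantics: "append" adds to existing data, "overwrite" replaces it.
validated.write.mode("append").parquet("hdfs:///raw/orders/")
```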
Storage
The storage layer should be able to store large amounts of data of any type and should be able to scale on demand. The number of IOPS (input/output operations per second) it can provide should also be considered. The Hadoop Distributed File System (HDFS) is the most commonly used storage framework in the Big Data world; others are NoSQL data stores such as MongoDB, HBase, and Cassandra. One of the salient features of Hadoop storage is its capability to scale, self-manage, and self-heal.
There are two kinds of analytical requirements that storage can support:
- Synchronous – Data is analyzed in real-time or near real-time, so the storage should be optimized for low latency.
- Asynchronous – Data is captured, recorded and analyzed in batch.
Things to consider while planning storage methodology:
- Type of data (Historical or Incremental)
- Format of data (structured, semi-structured, or unstructured)
- Compression requirements
- Frequency of incoming data
- Query pattern on the data
- Consumers of the data
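To show how several of these considerations (format, compression, query pattern) translate into practice, here is a small PySpark sketch; the dataset, partition column, and compression codec are assumptions for illustration.

```python
# Minimal sketch of storage layout decisions expressed with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-layout").getOrCreate()
events = spark.read.parquet("hdfs:///raw/events/")

# Columnar format plus compression suits asynchronous, analytical query patterns;
# partitioning by date matches a query pattern that filters on event_date.
(events.write
    .mode("overwrite")
    .option("compression", "snappy")
    .partitionBy("event_date")
    .parquet("hdfs:///warehouse/events/"))
```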
And Now We Process
Not only has the amount of data being stored increased multifold, but so has the processing required on it.
Earlier, frequently accessed data was stored in dynamic RAM; now, due to the sheer volume, it is stored on multiple disks across a number of machines connected via the network. Instead of bringing the data to the processing, in the new approach the processing is taken closer to the data, which significantly reduces network I/O. The processing methodology is driven by business requirements and can be categorized into batch, real-time, or hybrid based on the SLA.
- Batch Processing – Batch processing means collecting the input for a specified interval of time and running transformations on it in a scheduled way. A historical data load is a typical batch operation.
Technology Used: MapReduce, Hive, Pig
- Real-time Processing – Real-time processing involves running transformations as and when data is acquired.
Technology Used: Impala, Spark, Spark SQL, Tez, Apache Drill
- Hybrid Processing – It is a combination of both batch and real-time processing needs; the best-known example is the Lambda architecture.
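The sketch below contrasts batch and real-time processing of the same metric with PySpark; the paths, schema, and Kafka topic are illustrative assumptions (the streaming part also assumes the Spark Kafka connector package is available).

```python
# Minimal sketch contrasting batch and real-time processing in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("processing-layer").getOrCreate()

# Batch: collect input over an interval and transform it in a scheduled run.
orders = spark.read.parquet("hdfs:///raw/orders/")
daily_revenue = (orders.groupBy("order_date")
                 .agg(F.sum("amount").alias("revenue")))
daily_revenue.write.mode("overwrite").parquet("hdfs:///marts/daily_revenue/")

# Real-time: apply transformations as data is acquired, here with Spark
# Structured Streaming reading from a hypothetical Kafka topic.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .selectExpr("CAST(value AS STRING) AS payload"))

# A running count stands in for the real aggregation, just to show the
# continuous, low-latency nature of the streaming path.
(stream.groupBy().count()
    .writeStream
    .outputMode("complete")
    .format("console")
    .start())
```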
The Last Mile – Consumption
This layer consumes the output provided by the processing layer. Different users such as administrators, business users, vendors, and partners can consume the data in different formats. The output of the analysis can be consumed by a recommendation engine, or business processes can be triggered based on the analysis.
Different forms of data consumption are:
- Export Datasets – There can be requirements for third-party dataset generation. Datasets can be generated using Hive export or directly from HDFS.
- Reporting and visualization – Different reporting and visualization tools can connect to Hadoop using JDBC/ODBC connectivity to Hive.
- Data Exploration – Data scientists can build models and perform deep exploration in a sandbox environment. The sandbox can be a separate cluster (the recommended approach) or a separate schema within the same cluster that contains a subset of the actual data.
- Ad hoc Querying – Ad hoc or interactive querying can be supported using Hive, Impala, or Spark SQL, as in the sketch after this list.
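Here is a minimal sketch of ad hoc querying through Spark SQL against tables registered in the Hive metastore; the database, table, and column names are assumptions for illustration.

```python
# Minimal sketch of ad hoc / interactive querying with Spark SQL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("adhoc-query")
         .enableHiveSupport()   # read tables defined in the Hive metastore
         .getOrCreate())

top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM warehouse.orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
top_customers.show()
```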
And finally, the key things to remember while designing a Big Data architecture are:
- Dynamics of the use case: There are a number of scenarios, as illustrated in this article, that need to be considered while designing the architecture – the form and frequency of the data, the type of data, and the type of processing and analytics required.
- Myriad of technologies: The proliferation of tools in the market has led to a lot of confusion around what to use and when; there are multiple technologies offering similar features and claiming to be better than the others.
Learn how Saama’s Fluid Analytics℠ Hybrid Solution accelerates your big data business outcomes.