In simple terms, if data management was about ingesting, enriching, and storing data so that it was harmonized, cleansed, and held in a way that analytics could run efficiently and securely on top of it, then why is "Modern Data Management" not simply doing the same thing on a new platform? After all, one could argue that serving the same requirements through open source technologies reduces the Total Cost of Ownership (TCO).
Given the cost of the skills and a still-maturing platform, that argument only holds true at very large volumes. But the platform also brings inherently new capabilities that enable something new, and these need to be understood to appreciate "Modern Data Management".
The key to understanding and appreciating the capabilities of the modern platform is to think of the disjointed, error- and failure-prone environments from which we want to aggregate data (edge sources such as geolocation, sensors, fitness devices, machine logs, and social feeds), as opposed to enterprise data.
In short, it is a shift in thinking: away from structured internal data being moved on a regular schedule and served to a few users, and toward disparate systems of varying capability generating data asynchronously, with that data collected and made available to many.
But first the important question: What do you need for such a system to succeed?
- Manage failure: Any node, whether an edge node generating the data or a cluster node processing it, can go down. The idea is to create services that are as autonomous as possible, able to keep running on the instructions they were given without requiring the whole ecosystem to be available all the time.
- Buffering: Key bottlenecks form in current systems when data does not arrive as expected. For example, during a holiday-season peak, a sensor gone bad can generate data in unpredictable bursts. Buffering gives the system time to process the data while still passing it on securely, which is key in such disparate systems.
- Disintegration of perishable data: Each data element has a finite life. Some are stored for years, while others lose their meaning within moments. A modern system allows each node along the data corridor to decide whether data has lost its meaning and let it "disintegrate" (be deleted), or in some cases be de-prioritized, so that other, more relevant data can reach the destination faster.
- Prioritization: Not all data is created equal. In disparate systems there can be many connections that are low on bandwidth or resources (CPU, memory), and prioritization becomes essential for deciding what is processed and moved, and when.
- Security: Like any other system, this one needs to be secured for a very large, multi-role user base.
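To make the buffering, expiration, and prioritization ideas above concrete, here is a minimal sketch of a flow buffer in Python. It is illustrative only: the class name `FlowBuffer`, its capacity limit, and its TTL semantics are my assumptions for this example, not how any particular product (such as Apache NiFi) implements its queues.

```python
import heapq
import time

class FlowBuffer:
    """Toy sketch combining three ideas from the list above:
    buffering (bounded capacity), disintegration of perishable
    data (per-record TTL), and prioritization (lower number =
    more urgent). Hypothetical design, not a real NiFi API."""

    def __init__(self, capacity=1000):
        self.capacity = capacity  # buffering: absorb bursts up to a limit
        self._heap = []           # min-heap ordered by (priority, arrival)
        self._counter = 0         # tie-breaker keeps insertion order stable

    def offer(self, payload, priority=5, ttl_seconds=60.0):
        """Accept a record if there is room; reject (back-pressure) otherwise."""
        if len(self._heap) >= self.capacity:
            return False  # caller must retry, re-route, or drop
        expires_at = time.monotonic() + ttl_seconds
        heapq.heappush(self._heap, (priority, self._counter, expires_at, payload))
        self._counter += 1
        return True

    def poll(self):
        """Return the most urgent unexpired record, silently letting
        expired records 'disintegrate' instead of forwarding them."""
        while self._heap:
            _, _, expires_at, payload = heapq.heappop(self._heap)
            if time.monotonic() < expires_at:
                return payload
            # expired: perishable data is dropped here
        return None

buf = FlowBuffer(capacity=3)
buf.offer("sensor-reading", priority=5, ttl_seconds=30)
buf.offer("stale-heartbeat", priority=1, ttl_seconds=-1)  # already past its life
buf.offer("urgent-alert", priority=2, ttl_seconds=30)
print(buf.poll())  # the stale heartbeat disintegrates; prints "urgent-alert"
```

In a real data-flow platform these policies live in the connections between processing nodes and are configured, not hand-coded, but the trade-offs are the same: bounded queues create back-pressure, TTLs free the corridor of dead data, and priorities decide what moves first over a constrained link.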
For the right type of scenario and at the right scale, a Modern Data Management system is a valid solution. However, we need to think beyond TCO to see what new use cases these features of the modern system make possible.
Learn more about Saama’s Hybrid Model Fluid Analytics here.
Source: In addition to personal experience and viewpoint, one of the sources for this blog is the discussion of Apache NiFi concepts in Hortonworks blogs and videos.