Data Ingestion, Processing and Architecture Layers For Big Data and IoT
A wide variety of data is coming from various sources in different formats, such as sensors, logs, and structured data from an RDBMS. In the past few years, the generation of new data has drastically increased: more applications are being built, and they are generating more data at a faster rate.
Earlier, data storage was costly, and there was no technology that could process the data efficiently. Now storage costs have become much cheaper, and the technology to transform Big Data is a reality.
What is Big Data Technology?
According to Dr. Kirk Borne, Principal Data Scientist, Big Data is everything, quantified and tracked. Let's pick that apart -
10 Vs of Big Data
This architecture helps in designing a data pipeline around the requirements of either a batch processing system or a stream processing system. It consists of six layers which ensure a secure flow of data.
Data Ingestion Layer
This layer is the first step for data coming from variable sources to start its journey. Data here is prioritized and categorized, which makes it flow smoothly in the further layers.
That's why we should ingest data properly, since successful business decision making depends on it. It's rightly said that "if the start goes well, half of the work is already done."
We can also say that data ingestion means taking data coming from multiple sources and putting it somewhere it can be accessed. It is the beginning of the data pipeline, where data is obtained or imported for immediate use.
Data can be streamed in real time or ingested in batches. When data is ingested in real time, each item is ingested immediately as soon as it arrives. When data is ingested in batches, data items are ingested in chunks at periodic intervals of time. Ingestion is the process of bringing data into the data processing system.
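As a rough illustration of the difference, here is a minimal plain-Python sketch of per-record (real-time) versus chunked (batch) ingestion; the source and sink are stand-ins for real systems such as message queues or databases, and all names are illustrative assumptions.

```python
# Minimal sketch: real-time vs. batch ingestion (illustrative only).
import time
from typing import Callable, Iterable, List


def ingest_realtime(source: Iterable[dict], sink: Callable[[List[dict]], None]) -> None:
    """Real-time: each record is handed to the sink as soon as it arrives."""
    for record in source:
        sink([record])


def ingest_batches(source: Iterable[dict], sink: Callable[[List[dict]], None],
                   batch_size: int = 100, interval_seconds: float = 60.0) -> None:
    """Batch: records are collected into chunks and ingested periodically."""
    batch: List[dict] = []
    deadline = time.monotonic() + interval_seconds
    for record in source:
        batch.append(record)
        if len(batch) >= batch_size or time.monotonic() >= deadline:
            sink(batch)
            batch = []
            deadline = time.monotonic() + interval_seconds
    if batch:
        sink(batch)  # flush whatever is left when the source ends


if __name__ == "__main__":
    events = ({"n": i} for i in range(250))
    ingest_batches(events, sink=print, batch_size=100, interval_seconds=5.0)
```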
When numerous Big Data sources exist in different formats, the biggest challenge for the business is to ingest data at a reasonable speed and process it further efficiently so that data can be prioritized and improve business decisions. Typical challenges and requirements include:
Modern data sources and consuming applications evolve rapidly.
The data produced changes without notice, independently of the consuming applications.
Data semantics change over time as the same data powers new use cases.
Detection and capture of changed data - this task is difficult, not only because of the semi-structured or unstructured nature of the data but also due to the low latency needed by the individual business scenarios that require this determination.
Ability to handle and upgrade to new data sources, technologies, and applications.
Assurance that consuming applications are working with correct, consistent, and trustworthy data.
Rapid consumption of data.
Capacity and reliability - the system needs to scale with the incoming data and should also be fault tolerant.
Data volume - though storing all incoming data is preferable, there are some cases in which only aggregated data is stored.
Protocol Buffers, for example, use specially generated source code to easily write and read structured data to and from a variety of data streams, using a variety of languages.
Apache Avro
A more recent data serialization format that combines some of the best features of those previously listed. Avro data is self-describing and uses a JSON schema description. This schema is included with the data itself, and the format natively supports compression. It may well become a de facto standard for data serialization.
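As a hedged illustration, the sketch below serializes and reads back a few records with the fastavro library; the schema, field names, and codec choice are illustrative assumptions.

```python
# Minimal Avro sketch using the fastavro package (illustrative only).
from io import BytesIO

from fastavro import parse_schema, reader, writer

# The schema travels with the data, which is what makes Avro self-describing.
schema = parse_schema({
    "name": "SensorReading",
    "type": "record",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "temperature", "type": "float"},
        {"name": "ts", "type": "long"},
    ],
})

records = [{"device_id": "dev-42", "temperature": 21.5, "ts": 1700000000}]

buf = BytesIO()
writer(buf, schema, records, codec="deflate")  # built-in compression codec

buf.seek(0)
for rec in reader(buf):  # the reader recovers the schema from the container
    print(rec)
```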
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data.
It uses a simple, extensible data model that allows for online analytic applications.
Stream Data - Ingest streaming data from multiple sources into Hadoop for
storage and analysis.
Insulate System - Buffer the storage platform from transient spikes, when the rate of incoming data exceeds the rate at which it can be written to the destination.
Scale Horizontally - To ingest new data streams and additional volume as
needed.
Elastic Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your "stash," i.e., Elasticsearch.
It easily ingests data from your logs, metrics, web applications, data stores, and various AWS services, and does so in a continuous, streaming fashion. It can ingest data of all shapes, sizes, and sources.
In this layer, the focus is on transporting data from the ingestion layer to the rest of the data pipeline. Here we use a messaging system that acts as a mediator between all the programs that can send and receive messages.
Here the tool used is Apache Kafka. It's a new approach in message-oriented
middleware.
A data pipeline helps in bringing data into your system. It means taking unstructured data from where it originates into a system where it can be stored and analyzed for making business decisions.
Data Integration
Data Organization
Organizing data means arranging it in a usable structure; this arrangement also takes place in the data pipeline.
Data Refining
It's also one of the processes where we enhance, clean, and improve the raw data.
Data Analytics
After the raw data has been refined, the data pipeline provides processed data on which we can apply further operations and make accurate business decisions.
The primary reason a data pipeline is needed is that it's tough to monitor data migration and manage data errors otherwise. Other reasons are given below -
Kafka works in combination with Apache Storm, Apache HBase and Apache Spark
for real-time analysis and rendering of streaming data.
Building Real-Time streaming Data Pipelines that reliably get data between
systems or applications
Building Real-Time streaming applications that transform or react to the
streams of data.
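As a rough sketch of such a pipeline, the example below publishes and consumes events with the kafka-python client; the broker address, topic name, and record fields are illustrative assumptions.

```python
# Minimal Kafka producer/consumer sketch with kafka-python (illustrative only).
import json

from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# The ingestion layer publishes events to a topic.
producer.send("sensor-events", {"device_id": "dev-42", "temperature": 21.5})
producer.flush()

# A downstream processing component subscribes to the same topic.
consumer = KafkaConsumer(
    "sensor-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=10000,  # stop iterating after 10s with no messages
)
for message in consumer:
    print(message.value)  # each event flows on to the processing layer
```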
In the previous layer, we gathered the data from different sources and made it available to go through the rest of the pipeline.
In this layer, our task is to work the magic on the data: now that the data is ready, we process it and route it to different destinations. The focus of this main layer is the data pipeline's processing system; in other words, the data collected by the previous layer is processed here.
Apache Sqoop works with relational databases such as Teradata, Netezza, Oracle,
MySQL, Postgres, and HSQLDB.
Apache Storm is a system for processing streaming data in real time. It adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning, and continuous monitoring of operations.
Fast – It can process one million 100 byte messages per second per node.
Scalable – It can do parallel calculations that run across a cluster of machines.
Fault-tolerant – When workers die, Storm will automatically restart them. If a
node dies, the worker will be restarted on another node.
Reliable – Storm guarantees that each unit of data (tuple) will be processed at
least once or exactly once. Messages are only replayed when there are failures.
Easy to operate – It ships with standard configurations that are suitable for production on day one. Once deployed, Storm is easy to operate.
Hybrid Processing System - This combines batch and real-time processing capabilities. The tools used for this type of processing are Apache Spark and Apache Flink.
Apache Spark is a fast, in-memory data processing engine with elegant and expressive
development APIs to allow data workers to efficiently execute streaming, machine
learning or SQL workloads that require fast iterative access to data sets.
With Spark running on Apache Hadoop YARN, developers everywhere can now
create applications to exploit Spark’s power, derive insights, and enrich their data
science workloads within a single, shared data set in Hadoop.
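As a hedged illustration of such a workload, here is a minimal PySpark sketch; the input path, column names, and aggregation are illustrative assumptions.

```python
# Minimal PySpark batch job sketch (illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("processing-layer-example").getOrCreate()

# Read raw events that the ingestion layer landed as JSON files.
events = spark.read.json("/data/raw/sensor-events/")

# A simple aggregation: average temperature per device.
summary = (
    events.groupBy("device_id")
          .agg(F.avg("temperature").alias("avg_temperature"))
)

summary.write.mode("overwrite").parquet("/data/processed/device-summary/")
spark.stop()
```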
Apache Flink is stateful and fault-tolerant and can seamlessly recover from failures while maintaining exactly-once application state.
It performs at large scale, running on thousands of nodes with excellent throughput and latency characteristics.
It provides a streaming dataflow execution engine, APIs, and domain-specific libraries for batch, streaming, machine learning, and graph processing.
But with new strategic big data enterprise applications, you should no longer assume that your persistence layer must be relational.
We need different databases to handle the different varieties of data, but using different databases creates overhead. That's why a new concept has been introduced in the database world: Polyglot Persistence.
It takes advantage of the strengths of different databases. Here, various types of data are arranged in a variety of ways. In short, it means picking the right tool for the right use case.
It’s the same idea behind Polyglot Programming, which is the idea that applications
should be written in a mix of languages to take advantage of the fact that different
languages are suitable for tackling different problems.
Advantages of Polyglot Persistence -
Faster response times - Here we leverage all the features of the databases in one app, which makes your app's response times very fast.
Helps your app to scale well - Your app scales exceptionally well with the data.
All the NoSQL databases scale well when you model databases correctly for
the data that you want to store.
A rich experience - You get a rich experience when you harness the power of multiple databases at the same time. For example, if you want to search for products in an e-commerce app, you use Elasticsearch, which returns results based on relevance, which MongoDB cannot do (see the sketch below).
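The sketch below illustrates that e-commerce scenario with the pymongo and elasticsearch (8.x-style) Python clients; the hosts, index, collection, and document fields are illustrative assumptions.

```python
# Minimal polyglot persistence sketch: MongoDB as system of record,
# Elasticsearch for relevance-ranked search (illustrative only).
from elasticsearch import Elasticsearch
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
es = Elasticsearch("http://localhost:9200")

product = {"sku": "SKU-001", "name": "Trail Running Shoes", "price": 89.99}

# MongoDB stores the product catalogue.
mongo.shop.products.insert_one(dict(product))

# Elasticsearch holds a searchable copy for full-text queries.
es.index(index="products", id=product["sku"], document=product)

# Full-text search goes to Elasticsearch, not MongoDB.
hits = es.search(index="products", query={"match": {"name": "running shoes"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["name"])
```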
HDFS is a Java-based file system that provides scalable and reliable data
storage, and it was designed to span large clusters of commodity servers.
HDFS holds a huge amount of data and provides easier access.
To store such massive data, the files are stored on multiple machines. These
files are stored redundantly to rescue the system from possible data losses in
case of failure.
HDFS also makes applications available for parallel processing. HDFS is built
to support applications with large data sets, including individual files that reach
into the terabytes.
It uses a master/slave architecture, with each cluster consisting of a single
NameNode that manages file system operations and supporting DataNodes that
manage data storage on individual compute nodes.
When HDFS takes in data, it breaks the information down into separate pieces
and distributes them to different nodes in a cluster, allowing for parallel
processing.
The file system also copies each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack. HDFS and YARN form the data management layer of Apache Hadoop.
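For a hedged illustration, the sketch below writes to and reads from HDFS over WebHDFS using the `hdfs` Python package (HdfsCLI); the NameNode URL, user, and paths are illustrative assumptions.

```python
# Minimal HDFS read/write sketch via WebHDFS (illustrative only).
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hdfs")

# Upload a local file; HDFS splits it into blocks and replicates them
# across DataNodes behind the scenes.
client.upload("/data/raw/events.json", "events.json", overwrite=True)

# Read it back.
with client.read("/data/raw/events.json") as reader:
    print(reader.read()[:200])

# List the directory to confirm the file landed.
print(client.list("/data/raw"))
```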
Features of HDFS
As we know, a good storage solution must provide elasticity in both capacity and performance without affecting active operations.
Scale-out storage systems based on GlusterFS are suitable for unstructured data such
as documents, images, audio and video files, and log files. GlusterFS is a scalable
network filesystem.
Using this, we can create large, distributed storage solutions for media streaming, data
analysis, and other data- and bandwidth-intensive tasks.
Cloud Computing
Streaming Media
Content Delivery
Amazon Simple Storage Service (Amazon S3) is object storage with a simple
web service interface to store and retrieve any amount of data from anywhere
on the internet.
It is designed to deliver 99.999999999% durability, and scale past trillions of
objects worldwide.
Customers use S3 as primary storage for cloud-native applications; as a bulk
repository, or "data lake," for analytics; as a target for backup & recovery and
disaster recovery; and with serverless computing.
It's simple to move large volumes of data into or out of S3 with Amazon's cloud
data migration options.
Once data is stored on Amazon S3, it can be automatically tiered into lower
cost, longer-term cloud storage classes like S3 Standard - Infrequent Access
and Amazon Glacier for archiving.
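As a hedged sketch, the example below uploads and downloads an object with boto3; the bucket name, keys, and filenames are illustrative assumptions, and credentials are assumed to come from the environment.

```python
# Minimal Amazon S3 sketch with boto3 (illustrative only).
import boto3

s3 = boto3.client("s3")

# Land a processed file in the "data lake" bucket.
s3.upload_file("device-summary.parquet", "my-data-lake",
               "processed/device-summary.parquet")

# Retrieve it later for analytics or backup.
s3.download_file("my-data-lake", "processed/device-summary.parquet",
                 "device-summary-copy.parquet")

# Lifecycle tiering to Infrequent Access or Glacier is normally configured
# on the bucket itself rather than in application code.
```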
This is the layer where active analytic processing takes place. It is a field where interactive queries are necessary, and it's a zone traditionally dominated by expert SQL developers. Before Hadoop, storage was insufficient, which made the analytics process long.
At first, data went through a lengthy ETL process to get a new data source ready to be stored, and only then was it put into a database or data warehouse. Now, data analytics has become an essential step that solves the problems of computing such large amounts of data.
Increase revenue
Decrease costs
Increase productivity
Data analysts use Hive to query, summarize, explore and analyze that data, then turn it
into actionable business insight.
It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).
Spark SQL includes a cost-based optimizer, columnar storage, and code generation to
make queries fast.
At the same time, it scales to thousands of nodes and multi-hour queries using the
Spark engine, which provides full mid-query fault tolerance.
Spark SQL is a Spark module for structured data processing. Some of the Functions
performed by Spark SQL are -
The interfaces provided by Spark SQL provide Spark with more information
about the structure of both the data and the computation being performed.
Internally, Spark SQL uses this extra information to perform additional
optimizations.
One use of Spark SQL is to execute SQL queries.
Spark SQL can also be used to read data from an existing Hive installation.
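As a hedged illustration of executing SQL queries with Spark SQL, here is a minimal PySpark sketch; the view name, columns, and file path are illustrative assumptions.

```python
# Minimal Spark SQL sketch (illustrative only).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Register structured data as a temporary view so it can be queried with SQL.
events = spark.read.parquet("/data/processed/device-summary/")
events.createOrReplaceTempView("device_summary")

# Execute a SQL query; Spark SQL's optimizer plans the physical execution.
hottest = spark.sql("""
    SELECT device_id, avg_temperature
    FROM device_summary
    ORDER BY avg_temperature DESC
    LIMIT 10
""")
hottest.show()

spark.stop()
```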
Amazon Redshift
We can also create additional databases as needed by running a SQL command. Most importantly, we can scale it from a few hundred gigabytes of data to a petabyte or more.
It enables you to use your data to acquire new insights for your business and
customers. The Amazon Redshift service manages all of the work of setting up,
operating and scaling a data warehouse.
These tasks include provisioning capacity, monitoring and backing up the cluster, and applying patches and upgrades to the Amazon Redshift engine.
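Because Redshift speaks the PostgreSQL wire protocol, a hedged sketch of running such a SQL command from Python with psycopg2 looks like the following; the endpoint, credentials, and database names are illustrative assumptions.

```python
# Minimal Amazon Redshift sketch via psycopg2 (illustrative only).
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",  # placeholder: supply real credentials securely
)
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction

with conn.cursor() as cur:
    # Create an additional database with a SQL command, as described above.
    cur.execute("CREATE DATABASE analytics")

    # Ordinary analytic queries work the same way.
    cur.execute("SELECT current_database(), version()")
    print(cur.fetchone())

conn.close()
```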
Presto is an open source distributed SQL query engine for running interactive analytic
queries against data sources of all sizes ranging from gigabytes to petabytes.
Presto was designed and written for interactive analytics; it approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.
Presto Capabilities
Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte of data per day.
Leading internet companies including Airbnb and Dropbox are using Presto.
So, a Data Warehouse is a centralized repository that stores data from multiple
information sources and transforms them into a standard, multidimensional data
model for efficient querying and analysis.
Technology is just that – a means to store and manage large amounts of data. A data warehouse is a way of organizing data so that there is corporate credibility and integrity.
When someone takes data from a data warehouse, that person knows that other people
are using the same data for other purposes. There is a basis for reconcilability of data
when there is a data warehouse.
With a data lake, incoming data goes into the lake in a raw form, or in whatever form the data source provides, and it is stored and organized there in that raw form. There are no assumptions about the schema of the data; each data source can use whatever schema it likes.
It's up to the consumers of that information to make sense of that data for their
purposes. The idea is to have a single store for all of the raw data that anyone in an
organization might need to analyze.
Commonly people use Hadoop to work on the data in the lake, but the concept is
broader than just Hadoop.
Custom dashboards are useful for creating unique overviews that present data differently. For example, you can -
Show the web and mobile application information, server information, custom
metric data, and plugin metric data all on a single custom dashboard.
Create dashboards that present charts and tables with a uniform size and
arrangement on a grid.
Select existing New Relic charts for your dashboard, or create your charts and
tables.
Real-time dashboards save, share, and communicate insights. They help users generate questions by revealing the depth, range, and content of their data stores.
Tableau is one of the richest data visualization tools available in the market, with drag-and-drop functionality.
Tableau allows users to design Charts, Maps, Tabular, Matrix reports, Stories
and Dashboards without any technical knowledge.
Tableau helps anyone quickly analyze, visualize and share information.
Whether it’s structured or unstructured, petabytes or terabytes, millions or
billions of rows, you can turn big data into big ideas.
It connects directly to local and cloud data sources, or imports data for fast in-memory performance.
Make sense of big data with easy-to-understand visuals and interactive web
dashboards.
An intelligent agent is software that assists people and acts on their behalf. Intelligent agents work by allowing people to delegate work that they could have done to the agent software.
Agents can perform repetitive tasks, remember things you forgot, intelligently
summarize complex data, learn from you and even make recommendations to
you.
An intelligent agent can help you find and filter information when you are
looking at corporate data or surfing the Internet and don't know where the right
information is.
It could also customize information to your preferences, thus saving you the time of handling it as more and more new information arrives each day on the Internet.
An agent could also sense changes in its environment and respond to these changes.
An agent continues to work even when the user is gone, which means that an
agent could run on a server, but in some cases, an agent runs on the user
systems.
Recommendation Systems
Angular.JS Framework
Understanding React.JS
React is a JavaScript library used for building user interfaces; it focuses on the UI and is not a framework.
It offers one-way reactive data flow (no two-way data binding) and a Virtual DOM. React is a front-end library developed by Facebook.
It is used for handling the view layer for web and mobile apps. ReactJS allows us to create reusable UI components.
It is currently one of the most popular JavaScript libraries, and it has a strong
foundation and large community behind it.
Useful Features of React
JSX − JSX is a JavaScript syntax extension. It isn't necessary to use JSX in React development, but it is recommended.
Components − React is all about components. You need to think of everything
as a component. This will help you to maintain the code when working on
larger scale projects.
Unidirectional data flow and Flux − React implements one-way data flow
which makes it easy to reason about your app. Flux is a pattern that helps to
keep your data unidirectional.
Security is a primary concern in any such system. It should be implemented at all layers of the lake, starting from ingestion, through storage, analytics, and discovery, all the way to consumption. Securing the data pipeline involves a few steps:
Authentication will verify user’s identity and ensure they are who they say they are.
Using the Kerberos protocol provides a reliable mechanism for authentication.
Access Control
It is the next step in securing data: defining which datasets can be consulted by which users or services. Access control restricts users and services to accessing only the data they have permission for, rather than all of the data.
Encryption and data masking are required to ensure secure access to sensitive data.
Sensitive data in the cluster should be secured at rest as well as in motion. We need to
use proper Data Protection techniques which will protect data in the cluster from
unauthorized visibility.
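As a hedged illustration of one such data protection technique, the sketch below masks a sensitive field with a keyed hash before the record leaves the secure zone; the field names, key handling, and choice of HMAC-SHA-256 are illustrative assumptions, not a complete protection scheme.

```python
# Minimal data masking sketch with the standard library (illustrative only).
import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-store-in-a-vault"  # assumption: managed externally


def mask(value: str) -> str:
    """Replace a sensitive value with a keyed, irreversible digest."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


record = {"device_id": "dev-42", "owner_email": "user@example.com", "temp": 21.5}

protected = dict(record)
protected["owner_email"] = mask(record["owner_email"])

print(protected)  # safe to hand to downstream, lower-trust layers
```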
Another aspect of the data security requirements is auditing data access by users. Auditing can detect logon and access attempts as well as administrative changes.
Data in enterprise systems is like food – it has to be kept fresh, and it needs nourishment. Otherwise, it goes bad and doesn't help you in making strategic and operational decisions. Just as consuming spoiled food could make you sick, using "spoiled" data may be bad for your organization's health.
There may be plenty of data, but it has to be reliable and consumable to be valuable.
While most of the focus in enterprises is often about how to store and analyze large
amounts of data, it is also essential to keep this data fresh and flavorful.
So how can we do this? The solution is to monitor, audit, test, manage, and control the data. Continuous monitoring of data is an important part of the governance mechanisms.
Apache Flume is useful for processing log data. Apache Storm is desirable for operations monitoring, and Apache Spark for streaming data, graph processing, and machine learning. Monitoring can happen in the data storage layer. Data monitoring includes the following steps:
These are the techniques for identifying the quality of data and tracking the lifecycle of the data through its various phases. In these systems, it is important to capture the metadata at every layer of the stack so that it can be used for verification and profiling. Tools used here include Talend, Hive, and Pig.
Data Quality
Data is considered to be of high quality if it meets business needs and satisfies its intended use, so that it helps in making business decisions successfully. Understanding the dimensions of greatest interest and implementing methods to achieve them is therefore important.
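As a hedged illustration, the sketch below runs a few simple quality checks with pandas; the file, column names, and thresholds are illustrative assumptions.

```python
# Minimal data quality check sketch with pandas (illustrative only).
import pandas as pd

df = pd.read_csv("device-readings.csv")

report = {
    "row_count": len(df),
    "null_temperature": int(df["temperature"].isna().sum()),
    "duplicate_rows": int(df.duplicated().sum()),
    # Domain rule: readings outside a plausible range are suspect.
    "out_of_range": int(((df["temperature"] < -50) | (df["temperature"] > 80)).sum()),
}

print(report)

# Fail this pipeline stage if quality thresholds are violated.
assert report["null_temperature"] == 0, "null temperatures found"
assert report["duplicate_rows"] == 0, "duplicate rows found"
```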
Data Cleansing
Policies have to be in place to make sure the loopholes for data loss are taken care of.
Identification of such data loss needs careful monitoring and quality assessment
processes.