BDA Unit - 4
Big data integration and processing -
Big Data Processing is the collection of methodologies or frameworks enabling access to enormous
amounts of information and extracting meaningful insights. Initially, Big Data Processing involves data
acquisition and data cleaning. Once you have gathered quality data, you can use it further for
Statistical Analysis or for building Machine Learning models for predictions.
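As a minimal sketch of the acquisition and cleaning step, the following Python fragment (assuming pandas is installed; the file raw_orders.csv and its columns are hypothetical) drops duplicate rows and handles missing values before any analysis:

    import pandas as pd

    # Acquisition: load raw data (raw_orders.csv is a hypothetical example file)
    raw = pd.read_csv("raw_orders.csv")

    # Cleaning: remove exact duplicates and handle missing values
    clean = raw.drop_duplicates()
    clean = clean.dropna(subset=["order_id"])      # drop rows missing the key field
    clean["amount"] = clean["amount"].fillna(0.0)  # fill missing amounts with 0

    # The cleaned data can now feed statistical analysis or model building
    print(clean.describe())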
Data Analytics tools and methods for Big Data Processing enable firms to visualize huge datasets and
create dashboards for gaining an overview of the entire business operations. Business Intelligence (BI)
Analytics answers fundamental questions about business growth and strategy. BI tools support predictions and
what-if analyses on the transformed data, helping stakeholders understand the deep patterns in the data
and the correlations between attributes.
The Machine Learning phase of Big Data Processing is primarily concerned with the creation of models
that can learn and evolve in response to new input. Learning algorithms allow large amounts of data
to be analyzed more quickly.
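One way to picture a model that evolves with new input is incremental (online) learning. The sketch below is only an illustration using scikit-learn's SGDClassifier and partial_fit; the tiny feature arrays are made-up placeholders, not data from the text:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # First batch of (placeholder) training data
    X_first = np.array([[0.1, 1.2], [0.9, 0.3], [0.4, 0.8], [1.1, 0.2]])
    y_first = np.array([0, 1, 0, 1])

    model = SGDClassifier()
    model.partial_fit(X_first, y_first, classes=np.array([0, 1]))

    # Later, new data arrives; the model updates without retraining from scratch
    X_new = np.array([[0.2, 1.0], [1.0, 0.1]])
    y_new = np.array([0, 1])
    model.partial_fit(X_new, y_new)

    print(model.predict(np.array([[0.3, 0.9]])))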
Database queries -
A query enables the fetching of data from a database so it can be displayed on a monitor and/or used
within an application.
A database query is either an action query or a select query. A select query is one that
retrieves data from a database. An action query asks for additional operations on data, such as insertion,
updating, deleting or other forms of data manipulation.
This doesn't mean that users just type in random requests. For a database to understand a request, it
must receive a query written in a predefined code. That code is a query language.
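A small sketch using Python's built-in sqlite3 module (the customers table and its columns are invented purely for illustration) shows both kinds of query: action queries that insert and update rows, and a select query that retrieves them:

    import sqlite3

    conn = sqlite3.connect(":memory:")   # throwaway in-memory database
    cur = conn.cursor()
    cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

    # Action queries: insert and update data
    cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Asha", "Pune"))
    cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Mumbai", "Asha"))

    # Select query: retrieve data to display or use within an application
    cur.execute("SELECT id, name, city FROM customers")
    for row in cur.fetchall():
        print(row)

    conn.close()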
The topic of data integration has been around forever. Before we used technology to manage data,
we integrated data in manual ways. At the time, we needed to integrate simple data structures such
as customer data with purchase data. As the industry progressed, we have gone from managing flat
data files and manual integrations to using applications, databases and data warehouses that
automate the integration of data. The early data sources were few compared with today, when
information technology supports almost everything that we do. Data is everywhere and captured
in many formats. Managing data today is not a small task but a much bigger job, and it grows
every year.
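To make the customer-data-with-purchase-data example above concrete, here is a minimal pandas sketch (the two data frames are invented sample data) that integrates the two sources on a shared customer id:

    import pandas as pd

    # Two separate sources: customer records and purchase records (sample data)
    customers = pd.DataFrame({"customer_id": [1, 2, 3],
                              "name": ["Asha", "Ben", "Carla"]})
    purchases = pd.DataFrame({"customer_id": [1, 1, 3],
                              "amount": [250.0, 99.5, 40.0]})

    # Integration: join the two sources on the shared key
    integrated = customers.merge(purchases, on="customer_id", how="left")
    print(integrated)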
Traditionally, businesses have collected data to analyze and help inform decisions after the fact.
From Uber to Shell to Amazon, the use of operational analytics has become widespread among
companies because it focuses on the “right now.” This refers to data that is collected and aggregated
from existing business operations and then analyzed and fed back into operations instantly, so that
intelligent decisions can be made on the spot rather than later on.
There are many business operations that require intelligent decisions to be made immediately.
Supply chain management, inventory management, customer service, and marketing are just a few
examples of where operational analytics can make a substantial impact.
Traditional analytical systems have many benefits, but their weakness is the speed at
which the insights gathered from crunching the data can be implemented back into the
business. To more effectively streamline operations, modern businesses need real-time
data that can be processed and put into action instantaneously. Moving past the
limitations of traditional data collection and analysis through continuous intelligence
means that:
----Instead of only relying on weekly, quarterly, or annual reports to make improvements to your business, you’re operationalizing your data to take immediate actions day-to-day.
----You’re able to react to customer behavior in real-time.
----You can identify and improve inefficiencies as they’re happening.
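As a minimal sketch of operationalizing data day-to-day, the following Python fragment (standard library only; the stream of order values is invented) keeps a rolling window of recent events and reacts the moment a threshold is crossed, rather than waiting for a periodic report:

    from collections import deque

    WINDOW = 5           # number of recent events to keep
    THRESHOLD = 200.0    # act when the rolling average exceeds this value

    recent = deque(maxlen=WINDOW)

    def handle_event(order_value):
        """Fold a new event into the rolling window and act on it immediately."""
        recent.append(order_value)
        rolling_avg = sum(recent) / len(recent)
        if rolling_avg > THRESHOLD:
            print(f"Act now: rolling average {rolling_avg:.1f} exceeds {THRESHOLD}")

    # Invented stream of incoming order values
    for value in [120.0, 180.0, 260.0, 310.0, 290.0, 90.0]:
        handle_event(value)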
Data aggregation
Data aggregation is any process whereby data is gathered and expressed in a summary form.
When data is aggregated, atomic data rows -- typically gathered from multiple sources --
are replaced with totals or summary statistics. Groups of observed aggregates are
replaced with summary statistics based on those observations. Aggregate data is
typically found in a data warehouse, as it can provide answers to analytical questions
and also dramatically reduce the time to query large sets of data.
Data aggregation is often used to provide statistical analysis for groups of people and to
create useful summary data for business analysis. Aggregation is often done on a large
scale, through software tools known as data aggregators. Data aggregators typically
include features for collecting, processing and presenting aggregate data.
Data aggregation can enable analysts to access and examine large amounts of data in a
reasonable time frame. A row of aggregate data can represent hundreds, thousands or even
more atomic data records. When the data is aggregated, it can be queried quickly instead of
requiring all of the processing cycles to access each underlying atomic data row and
aggregate it in real time when it is queried or accessed.
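A small illustration of that trade-off, using invented sales rows and pandas, computes the summary once so that later queries read a handful of aggregate rows instead of re-scanning every atomic record:

    import pandas as pd

    # Atomic rows, as they might be gathered from multiple sources (sample data)
    atomic = pd.DataFrame({
        "region": ["north", "north", "south", "south", "south"],
        "sales":  [100.0, 150.0, 80.0, 120.0, 60.0],
    })

    # Aggregate once: each summary row replaces many atomic rows
    summary = atomic.groupby("region")["sales"].sum()

    # Later queries hit the small summary table, not the atomic data
    print(summary.loc["south"])   # 260.0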
As the amount of data stored by organizations continues to expand, the most important and
frequently accessed data can benefit from aggregation, making that data feasible to access efficiently.
Data aggregators summarize data from multiple sources. They provide capabilities for multiple
aggregate measurements, such as sum, average and counting.
Data aggregators work by combining atomic data from multiple sources, processing the data
for new insights and presenting the aggregate data in a summary view. Furthermore, data
aggregators usually provide the ability to track data lineage and can trace back to the
underlying atomic data that was aggregated.
Collection. First, data aggregation tools may extract data from multiple sources, storing it in
large databases as atomic data. The data may be extracted from internet of things (IoT)
sources, such as the following:
----social media communications;
----news headlines; and
----personal data and browsing history from IoT devices.
Processing. Once the data is extracted, it is processed. The data aggregator will identify the
atomic data that is to be aggregated. The data aggregator may apply predictive analytics,
artificial intelligence (AI) or machine learning algorithms to the collected data for new insights.
The aggregator then applies the specified statistical functions to aggregate the data.
Presentation. Users can present the aggregated data in a summarized format that itself
provides new data. The statistical results are comprehensive and high quality.
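The collection, processing and presentation steps can be sketched end to end in a few lines of Python (the two source frames are invented, and sum, average and count stand in for whatever statistical functions an aggregator is configured to apply):

    import pandas as pd

    # Collection: atomic data pulled from two hypothetical sources
    source_a = pd.DataFrame({"channel": ["web", "web", "store"], "visits": [10, 30, 5]})
    source_b = pd.DataFrame({"channel": ["store", "web"], "visits": [8, 12]})
    collected = pd.concat([source_a, source_b], ignore_index=True)

    # Processing: apply the specified statistical functions per group
    aggregated = collected.groupby("channel")["visits"].agg(["sum", "mean", "count"])

    # Presentation: the summary view itself is new, higher-level data
    print(aggregated)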
Data aggregation can be helpful for many disciplines, such as finance and business strategy
decisions, product planning, product and service pricing, operations optimization and
marketing strategy creation. Users may be data analysts, data scientists, data warehouse
administrators and subject matter experts.
To understand big data workflows, you have to understand what a process is and
how it relates to the workflow in data-intensive environments. Processes tend to be designed as
high-level, end-to-end structures useful for decision making and normalizing how things get done in a
company or organization.
In contrast, workflows are task-oriented and often require more specific data than processes.
Processes are composed of one or more workflows relevant to the overall objective of the process.
In many ways, big data workflows are similar to standard workflows. In fact, in any workflow, data
is necessary in the various phases to accomplish the tasks. Consider the workflow in a healthcare
situation.
One elementary workflow is the process of “drawing blood.” Drawing blood is a necessary task
required to complete the overall diagnostic process. If something happens and blood has not been
drawn, or the data from that blood test has been lost, it will have a direct impact on the veracity, or
truthfulness, of the overall activity.
What happens when you introduce a workflow that depends on a big data source? Although you
might be able to use existing workflows, you cannot assume that a process or workflow will work
correctly just by substituting a big data source for a standard source. This may not work because
standard data-processing methods lack the processing approaches and performance needed to handle
the complexity of big data.
The healthcare example focuses on the need to conduct an analysis after the blood is drawn from the
patient. In the standard data workflow, the blood is typed and then certain chemical tests are
performed based on the requirements of the healthcare practitioner.
It is unlikely that this workflow understands the testing required for identifying specific biomarkers
or genetic mutations. If you supplied big data sources for biomarkers and mutations, the workflow
would fail. It is not big data aware and will need to be modified or rewritten to support big data.
The best practice for understanding workflows and the effect of big data is to do the following:
----Ensure that you have the processing speed and storage access to support your workflow.
----Modify the existing workflow to accommodate big data, or create a new big data workflow.
END