BDA Unit - 4


Big data analytics

Unit - 4
Big data integration and processing

What is Big Data Processing?

Big Data Processing is the collection of methodologies and frameworks that enable access to enormous
amounts of information and the extraction of meaningful insights from it. Initially, Big Data Processing
involves data acquisition and data cleaning. Once quality data has been gathered, it can be used for
statistical analysis or for building machine learning models that make predictions.

5 Stages of Big Data Processing


Data Extraction
Data Transformation
Data Loading
Data Visualization/BI Analytics
Machine Learning Application

Stage 1: Data Extraction


This initial step of Big Data Processing consists of collecting information from diverse sources such as
enterprise applications, web pages, sensors, marketing tools, transactional records, etc. Data processing
professionals extract information from many structured and unstructured data streams. For instance,
when building a data warehouse, extraction entails merging information from multiple sources and then
verifying it by removing incorrect data. Because future decisions are based on the outcomes, the data
collected during this phase of Big Data Processing must be labeled and accurate. This stage establishes
a quantitative baseline as well as a goal for improvement.
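
As a small illustration (not part of the original notes), the sketch below extracts records from two
stand-in sources, a CSV export and a JSON feed, tags each record with its origin, and performs a basic
verification pass. The field names and sources are assumptions made for this example; a real pipeline
would pull from enterprise applications, sensors, APIs, and so on.

    import csv
    import json
    from io import StringIO

    # Inline stand-ins for real sources (a CSV export and a JSON feed);
    # in practice these would be files, APIs, sensor streams, and so on.
    csv_export = StringIO("customer_id,amount\n1,120.0\n2,340.5\n,80.25\n")
    json_feed  = StringIO('[{"customer_id": 3, "page": "/pricing"}]')

    # Extract records and tag each one with the source it came from.
    records = [dict(row, source="transactions") for row in csv.DictReader(csv_export)]
    records += [dict(row, source="web_events") for row in json.load(json_feed)]

    # Basic verification step: drop records missing a required field.
    clean = [r for r in records if r.get("customer_id")]
    print(f"extracted {len(records)} records, kept {len(clean)} after cleaning")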

Stage 2: Data Transformation


The data transformation phase of Big Data Processing involves changing or modifying data into the
formats required for building different insights and visualizations. Common transformation techniques
include aggregation, normalization, feature selection, binning, clustering, and concept hierarchy
generation. Using these techniques, developers transform unstructured data into structured data, and
structured data into a user-understandable format. Business and analytical operations become more
efficient as a result of the transformation, and firms can make better data-driven choices.
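
To make these techniques concrete, here is a minimal sketch, assuming pandas and a toy dataset (both
are choices made for this example, not something the notes prescribe), that applies normalization,
binning into a simple concept hierarchy, and aggregation.

    import pandas as pd

    # Toy data standing in for extracted records (all values are made up).
    df = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "age":         [23, 35, 47, 62],
        "spend":       [120.0, 340.5, 80.25, 560.0],
    })

    # Normalization: rescale spend to the 0-1 range.
    df["spend_norm"] = (df["spend"] - df["spend"].min()) / (df["spend"].max() - df["spend"].min())

    # Binning: bucket ages into coarse groups (a simple concept hierarchy).
    df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120], labels=["young", "middle", "senior"])

    # Aggregation: average spend per age group.
    summary = df.groupby("age_group", observed=True)["spend"].mean()
    print(summary)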

Stage 3: Data Loading


In the load stage of Big Data Processing, the converted data is transported to a centralized database
system. To make the process more efficient, it is common to defer index building and constraint checks
until after the bulk load and rebuild them afterwards. With Big Data ETL tools, loading becomes
automated, well-defined, consistent, and either batch-driven or real-time.
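
A minimal sketch of the load step, assuming SQLite and pandas purely for illustration: the table is
bulk-loaded first and the index is created afterwards, mirroring the advice above about deferring index
maintenance during the load. The table and file names are invented for the example.

    import sqlite3
    import pandas as pd

    # Transformed data to be loaded (values are illustrative).
    df = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "age_group":   ["young", "middle", "senior"],
        "spend":       [120.0, 340.5, 80.25],
    })

    conn = sqlite3.connect("warehouse.db")

    # Bulk-load the table first, then build the index afterwards:
    # rebuilding an index after the load is usually cheaper than
    # maintaining it row by row during the load.
    df.to_sql("customer_spend", conn, if_exists="replace", index=False)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_customer ON customer_spend(customer_id)")
    conn.commit()
    conn.close()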

Stage 4: Data Visualization/BI Analytics

Data analytics tools and methods for Big Data Processing enable firms to visualize huge datasets and
create dashboards that give an overview of the entire business operation. Business Intelligence (BI)
analytics answers fundamental questions about business growth and strategy. BI tools run predictions
and what-if analyses on the transformed data, helping stakeholders understand the deeper patterns in
the data and the correlations between attributes.
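
As an illustrative sketch only (the notes do not prescribe any particular tool), the snippet below
plots a toy sales table to produce the kind of simple chart a BI dashboard would display; the figures
are invented for the example.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Illustrative monthly revenue figures.
    sales = pd.DataFrame({
        "month":   ["Jan", "Feb", "Mar", "Apr"],
        "revenue": [12000, 15000, 11000, 18000],
    })

    # A simple dashboard-style view: revenue per month.
    sales.plot(kind="bar", x="month", y="revenue", legend=False, title="Revenue by month")
    plt.ylabel("Revenue")
    plt.tight_layout()
    plt.savefig("revenue_by_month.png")  # a BI tool would render this interactively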

Stage 5: Machine Learning Application

The machine learning phase of Big Data Processing is primarily concerned with creating models that can
learn and evolve in response to new input. Learning algorithms make it possible to analyze large
amounts of data far more quickly.
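
A hedged sketch of this stage, assuming scikit-learn and a toy dataset: a classifier is trained on
transformed features and can then score new records as they arrive. The features and labels here are
invented purely for illustration.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Toy feature matrix (e.g. normalized spend, age) and churn labels.
    X = [[0.1, 23], [0.8, 35], [0.3, 47], [0.9, 62], [0.2, 31], [0.7, 55]]
    y = [0, 1, 0, 1, 0, 1]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=0, stratify=y
    )

    model = LogisticRegression()
    model.fit(X_train, y_train)

    # The fitted model can now score new, incoming records as they arrive.
    print("held-out accuracy:", model.score(X_test, y_test))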

What Does Data Retrieval Mean?


In databases, data retrieval is the process of identifying and extracting data from a
database, based on a query provided by the user or application.

It enables data to be fetched from a database so that it can be displayed on a monitor and/or used
within an application.

Techopedia Explains Data Retrieval


Data retrieval typically requires writing and executing data retrieval or extraction commands
or queries against a database. Based on the query provided, the database looks for and retrieves
the data requested. Applications and software generally use various queries to retrieve data
in different formats. In addition to simple or small pieces of data, data retrieval can also
include retrieving large amounts of data, usually in the form of reports.

What is a query?

A query is a question or a request for information expressed in a formal manner. In computer science,
a query is essentially the same thing; the only difference is that the answer or retrieved information
comes from a database.

What is a database query?

A database query is either a select query or an action query. A select query is one that retrieves
data from a database. An action query asks for additional operations on the data, such as insertion,
updating, deletion, or other forms of data manipulation.

This doesn't mean that users just type in random requests. For a database to understand a request, it
must receive a query written in a predefined code. That code is a query language.
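
The difference between select and action queries can be seen in a small sketch that runs SQL against
SQLite from Python; the table and column names are assumptions made for this example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")

    # Action queries: they change data rather than return it.
    conn.execute("INSERT INTO customers VALUES (1, 'Asha', 'IN')")
    conn.execute("UPDATE customers SET country = 'India' WHERE id = 1")

    # Select query: it retrieves data for display or further use.
    for row in conn.execute("SELECT id, name, country FROM customers"):
        print(row)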

Understanding Big Data Integration

By Traci Curran
The topic of data integration has been around forever. Before we used technology to manage data,
we integrated data in manual ways. At the time, we needed to integrate simple data structures, such
as customer data with purchase data. As the industry progressed, we have gone from managing flat
data files and manual integrations to using applications, databases, and data warehouses that
automate the integration of data. The early data sources were few compared with today, when
information technology supports almost everything that we do. Data is everywhere and captured
in many formats. Managing data today is not a small task but a much bigger job, one that grows
exponentially every year.

What is Big Data Integration?


Data integration is now a practice in all organizations. Data needs to be protected, governed,
transformed, usable, and agile. Data supports everything that we do personally, and it supports
organizations’ ability to deliver products and services to us.

Big data integration is the practice of using people, processes, suppliers, and technologies
collaboratively to retrieve, reconcile, and make better use of data from disparate sources for decision
support. Big data has the following characteristics: volume, velocity, veracity, variability, value,
and visualization.

Volume – Differentiates big data from traditional structured data managed by relational database
systems; the number of data sources is much higher than in the conventional approach to managing
data inputs.
Velocity – The rate at which data is generated; data arrives from many sources, in various formats
and unformatted structures.
Veracity – The reliability of the data; not all data has value, and data quality is a constant challenge.
Variability – Data is inconsistent and has to be managed across various sources.
Value – Data has to have value to be worth processing; not all data does.
Visualization – Data has to be made meaningful and understandable to its consumer.

Integration of big data needs to support every service in your organization. Your organization should
run as a high-performing team, sharing data, information, and knowledge to support your customers’
service and product decisions.

Big Data Integration Process


Big data integration and processing are crucial for all the data that is collected. Data has to have
value that supports the end result for which it is used. With so much data being collected from so many
sources, many companies rely on big data scientists, analysts, and engineers to use algorithms and other
methods to derive value from the data received and processed.

The processing of big data has to comply with organizational governance standards. It should reduce the
risk attached to decisions made with the data, help enable organizational growth, reduce or contain
cost, and improve operational efficiency and decision support.

The basic process is:
Extract data from various sources
Store data in an appropriate fashion
Transform and integrate data with analytics
Orchestrate and use/load data

Orchestrating and loading data into applications in an automated manner is critical for success, as the
sketch below illustrates. Technology that does not allow ease of use will be cumbersome and hamper the
organization’s ability to be effective with big data.
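
A compact, illustrative sketch of the basic process above, assuming pandas and SQLite as stand-ins for
real integration tooling; the sources and column names are invented for the example.

    import sqlite3
    import pandas as pd

    # Extract: two disparate (made-up) sources describing the same customers.
    crm    = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ben"]})
    orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [50.0, 75.0, 20.0]})

    # Transform and integrate: reconcile the sources on a shared key.
    spend = orders.groupby("customer_id", as_index=False)["amount"].sum()
    integrated = crm.merge(spend, on="customer_id", how="left")

    # Orchestrate and load: push the integrated view where downstream apps can use it.
    with sqlite3.connect("analytics.db") as conn:
        integrated.to_sql("customer_spend", conn, if_exists="replace", index=False)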

Challenges of Big Data Integration


Data is constantly changing. Trends have to be managed and assessed for integrity to make sure the data
being received is timely and valuable for decision making within the organization. This is not easy; in
fact, integrating big data can often be the biggest challenge of all. Other big data integration
challenges include:
Using appropriate data sources to create a single source.
Consistently using and improving analytics to deliver valuable data, even as data sources increase
and change.
Creating and maintaining valuable data warehouses and data lakes from the collected data.
Improving business intelligence.

One of the biggest challenges, besides those listed, is enabling people to use the technology.
Organizations should look for technology that provides ease of use for users across the organization,
but they also need to make sure they choose data management platforms that are robust enough to meet
complex use cases. Products and technologies that are not easy to use will not be used effectively and
efficiently to support business outcomes.
Big data processing pipelines: covered in Unit 2.

What is Operational Analytics?


Whether you're trying to improve the customer experience or better manage your inventory, data helps
your business make well-informed decisions. The more nimble you are when making those decisions, the
better your business operates. Operational analytics, commonly referred to as operational intelligence,
is the practice of using data in real time to make instant decisions in business operations.

Why is operational analytics important?

Traditionally, businesses have collected data to analyze and to help inform decisions after the fact.
From Uber to Shell to Amazon, the use of operational analytics has become widespread among companies
because it focuses on the "right now": data collected and aggregated from existing business operations
is analyzed and fed back into those operations instantly, so that intelligent decisions can be made on
the spot rather than later on.
There are many business operations that require intelligent decisions to be made immediately.
Supply chain management, inventory management, customer service, and marketing are just a few
examples of where operational analytics can make a substantial impact.

What are the benefits of operational analytics?

Traditional analytical systems have many benefits, but their weakness is the speed at which the
insights gathered from crunching the data can be fed back into the business. To streamline operations
more effectively, modern businesses need real-time data that can be processed and put into action
instantaneously. Moving past the limitations of traditional data collection and analysis through
continuous intelligence means that:

Instead of relying only on weekly, quarterly, or annual reports to make improvements to your business,
you operationalize your data to take immediate action day to day.
You are able to react to customer behavior in real time.
You can identify and improve inefficiencies as they are happening.

Data aggregation
Data aggregation is any process whereby data is gathered and expressed in a summary form.
When data is aggregated, atomic data rows -- typically gathered from multiple sources --
are replaced with totals or summary statistics. Groups of observed aggregates are
replaced with summary statistics based on those observations. Aggregate data is
typically found in a data warehouse, as it can provide answers to analytical questions
and also dramatically reduce the time to query large sets of data.

Data aggregation is often used to provide statistical analysis for groups of people and to
create useful summary data for business analysis. Aggregation is often done on a large
scale, through software tools known as data aggregators. Data aggregators typically
include features for collecting, processing and presenting aggregate data.

Data aggregation can enable analysts to access and examine large amounts of data in a
reasonable time frame. A row of aggregate data can represent hundreds, thousands or even
more atomic data records. When the data is aggregated, it can be queried quickly instead of
requiring all of the processing cycles to access each underlying atomic data row and
aggregate it in real time when it is queried or accessed.

As the amount of data stored by organizations continues to expand, the most important and
frequently accessed data can benefit from aggregation, making it feasible to access efficiently.

What does data aggregation do?

Data aggregators summarize data from multiple sources. They provide capabilities for multiple
aggregate measurements, such as sum, average, and count.

Examples of aggregate data include the following:


Voter turnout by state or county. Individual voter records are not presented, just the vote totals by
candidate for the specific region.
Average age of customer by product. Each individual customer is not identified, but for each product,
the average age of the customer is saved.
Number of customers by country. Instead of examining each customer, a count of the customers in each
country is presented.
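
The "number of customers by country" example can be reproduced with a short sketch, assuming pandas
for the group-and-summarize step; the records and values below are invented for illustration.

    import pandas as pd

    # Atomic customer records (illustrative values only).
    customers = pd.DataFrame({
        "customer_id": [1, 2, 3, 4, 5],
        "country":     ["IN", "IN", "US", "US", "UK"],
        "age":         [25, 34, 41, 29, 52],
    })

    # Aggregate: number of customers and average age per country;
    # individual rows are replaced by one summary row per group.
    summary = customers.groupby("country").agg(
        customer_count=("customer_id", "count"),
        average_age=("age", "mean"),
    )
    print(summary)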

How do data aggregators work?

Data aggregators work by combining atomic data from multiple sources, processing the data
for new insights and presenting the aggregate data in a summary view. Furthermore, data
aggregators usually provide the ability to track data lineage and can trace back to the
underlying atomic data that was aggregated.

Collection. First, data aggregation tools may extract data from multiple sources, storing it in
large databases as atomic data. The data may be extracted from sources such as the following:
social media communications
news headlines
personal data and browsing history from internet of things (IoT) devices

Processing. Once the data is extracted, it is processed. The data aggregator will identify the
atomic data that is to be aggregated. The data aggregator may apply predictive analytics,
artificial intelligence (AI) or machine learning algorithms to the collected data for new insights.
The aggregator then applies the specified statistical functions to aggregate the data.

Presentation. Users can then present the aggregated data in a summarized format that itself provides
new information; the resulting statistics are comprehensive and of high quality.

Uses for data aggregation

Data aggregation can be helpful for many disciplines, such as finance and business strategy
decisions, product planning, product and service pricing, operations optimization and
marketing strategy creation. Users may be data analysts, data scientists, data warehouse
administrators and subject matter experts.

High-level operations, tools, and systems: material not found

To understand big data workflows, you have to understand what a process is and how it relates to
workflows in data-intensive environments. Processes tend to be designed as high-level, end-to-end
structures useful for decision making and for normalizing how things get done in a company or
organization.

In contrast, workflows are task-oriented and often require more specific data than processes.
Processes are comprised of one or more workflows relevant to the overall objective of the process.

In many ways, big data workflows are similar to standard workflows. In fact, in any workflow, data
is necessary in the various phases to accomplish the tasks. Consider the workflow in a healthcare
situation.

One elementary workflow is the process of “drawing blood.” Drawing blood is a necessary task
required to complete the overall diagnostic process. If something happens and blood has not been
drawn, or the data from that blood test has been lost, there will be a direct impact on the veracity,
or truthfulness, of the overall activity.

What happens when you introduce a workflow that depends on a big data source? Although you might be
able to use existing workflows, you cannot assume that a process or workflow will work correctly just
by substituting a big data source for a standard source. It may not work because standard
data-processing methods do not have the processing approaches or the performance needed to handle the
complexity of big data.

The healthcare example focuses on the need to conduct an analysis after the blood is drawn from the
patient. In the standard data workflow, the blood is typed and then certain chemical tests are
performed based on the requirements of the healthcare practitioner.

It is unlikely that this workflow understands the testing required for identifying specific biomarkers
or genetic mutations. If you supplied big data sources for biomarkers and mutations, the workflow
would fail. It is not big data aware and will need to be modified or rewritten to support big data.

The best practice for understanding workflows and the effect of big data is to do the following (a
small illustrative sketch follows the list):

Identify the big data sources you need to use.

Map the big data types to your workflow data types.

Ensure that you have the processing speed and storage access to support your workflow.

Select the data store best suited to the data types.

Modify the existing workflow to accommodate big data, or create a new big data workflow.
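
The following is a minimal sketch, under assumed names and source types, of what making a workflow
"big data aware" might look like: each source type is mapped to its own handler, so a new big data
source (here, a hypothetical biomarker stream) can be added without rewriting the rest of the workflow.

    # All names here are hypothetical and exist only for illustration.

    def handle_structured(record):
        # Standard workflow data passes through with a type tag.
        return {"kind": "structured", **record}

    def handle_sensor_stream(record):
        # A big data source might need extra handling, e.g. windowing
        # or downsampling high-velocity readings.
        return {"kind": "sensor", "value": record.get("reading")}

    HANDLERS = {
        "lab_results": handle_structured,    # existing workflow data type
        "biomarkers":  handle_sensor_stream, # newly added big data source
    }

    def run_workflow(source_type, records):
        handler = HANDLERS.get(source_type)
        if handler is None:
            raise ValueError(f"workflow is not aware of source type: {source_type}")
        return [handler(r) for r in records]

    print(run_workflow("biomarkers", [{"reading": 0.42}, {"reading": 0.57}]))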

END
