UNIT I BIG DATA Extra Content


UNIT I UNDERSTANDING BIG DATA

Prerequisites of Big Data:

Big Data analytics requires an aptitude for mathematics, particularly statistics.

To work in Big Data analytics, it helps to know Hadoop, SQL, R, Python, and other
programming languages.
Big Data analytics is driving job growth: it needs people with knowledge of and
experience in data mining and machine learning methods, data visualization tools,
and data warehousing. Big Data fields include the following:
Healthcare Big Data Analytics:
Big Data analytics is entrenched in the healthcare industry, from alerting about drug
interactions to modeling emergency department use.
Big Data analytics capabilities fall into three major categories: descriptive, predictive,
and prescriptive.
Prescriptive analytics is the future of healthcare Big Data, and the Internet of Things is
helping to make it a reality.
Big Data for HR:
Big Data helps companies control their bottom line and avoid costly employee
turnover.
Big Data is useful for making hiring decisions and determining new hires’
compensation rates.
Implementing Big Data in HR involves creating a dedicated analytics team, gamifying the
hiring process, and motivating current employees.
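The descriptive, predictive, and prescriptive categories named under healthcare analytics can be illustrated on a toy series. The sketch below is only an illustration: the visit counts, the `visits_per_nurse` ratio, and all function names are hypothetical, and a straight-line fit stands in for whatever predictive model a real team would choose.

```python
import math
from statistics import mean

# Hypothetical daily emergency-department visit counts (illustrative only).
visits = [100, 104, 108, 112, 116, 120, 124]

def describe(series):
    """Descriptive: summarize what has already happened."""
    return {"mean": mean(series), "min": min(series), "max": max(series)}

def predict_next(series):
    """Predictive: fit a least-squares line and extrapolate one step ahead."""
    n = len(series)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(series)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n

def prescribe(forecast, visits_per_nurse=20):
    """Prescriptive: turn the forecast into a recommended action."""
    nurses = math.ceil(forecast / visits_per_nurse)
    return f"schedule {nurses} nurses"

summary = describe(visits)
forecast = predict_next(visits)
action = prescribe(forecast)
```

Each function answers a different question: what happened, what is likely to happen next, and what to do about it.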
The skill areas to develop are programming languages, analytical tools,
statistics, Hadoop technologies, etc.
Programming languages to learn include:
• Java
• Python
• Scala
Mathematics and Statistics
You should have a good understanding of mathematics, and of statistics in particular. You
should be able to solve a problem by applying a mathematical formula or set of
formulas. Sound mathematics is necessary for predictive analysis and machine learning
rule sets.
Learn the Hadoop Big Data platform, HBase, Hive, Storm, Spark, Pig, R,
Elasticsearch, machine learning frameworks, and other technologies to master the
fast-growing Big Data field.
To summarize, from a Data Engineer's perspective, Big Data spans four fields: Operations,
Data Engineering, Data Analytics, and Big Data Business.
Operations: understanding how Big Data platforms work, including Hadoop,
NoSQL systems, and concepts such as the CAP theorem.
Data Engineering: data processing, including an understanding of data processing
engines, shared-nothing architectures, etc.
Analytics: statistics, application of machine learning, etc.
Big Data Business: understanding use cases such as data monetization, predictive
maintenance, churn, etc.
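As a rough illustration of the shared-nothing idea mentioned under Data Engineering, the sketch below hash-partitions records across workers. Each worker aggregates only its own partition, and only the small partial results are merged. The worker count, record layout, and function names are hypothetical; real engines add networking, scheduling, and fault tolerance on top of this.

```python
import zlib
from collections import Counter

def partition(records, num_workers):
    """Assign each (key, value) record to exactly one worker by hashing its key."""
    parts = [[] for _ in range(num_workers)]
    for key, value in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(key.encode("utf-8")) % num_workers
        parts[index].append((key, value))
    return parts

def local_sum(part):
    """Each worker aggregates its own partition without touching shared state."""
    totals = Counter()
    for key, value in part:
        totals[key] += value
    return totals

def merge(partials):
    """Combine the per-worker partial sums into one result."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged

# Hypothetical records: (use case, count).
records = [("churn", 1), ("sales", 5), ("churn", 2), ("sales", 1), ("iot", 7)]
parts = partition(records, num_workers=3)
result = merge(local_sum(p) for p in parts)
```

Because a key always hashes to the same partition, no two workers ever need to coordinate on the same key.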
INTRODUCTION
WHAT EXACTLY IS BIG DATA?
The definition of big data is data that contains greater variety, arriving in increasing
volumes and with more velocity. This is also known as the three “Vs.”
Put simply, big data is larger, more complex data sets, especially from new data sources.
These data sets are so voluminous that traditional data processing software just can’t
manage them. But these massive volumes of data can be used to address business
problems you wouldn’t have been able to tackle before.
What is Big Data?
Big data refers to extremely large and diverse collections of structured, unstructured,
and semi-structured data that continue to grow exponentially over time. These
datasets are so large and complex in volume, velocity, and variety that traditional
data management systems cannot store, process, and analyze them.
The amount and availability of data is growing rapidly, spurred on by digital
technology advancements, such as connectivity, mobility, the Internet of Things (IoT),
and artificial intelligence (AI). As data continues to expand and proliferate, new big
data tools are emerging to help companies collect, process, and analyze data at the
speed needed to gain the most value from it.
Big data describes large and diverse datasets that are huge in volume and also rapidly
grow in size over time. Big data is used in machine learning, predictive modeling, and
other advanced analytics to solve business problems and make informed decisions.
Big data examples
Data can be a company’s most valuable asset. Using big data to reveal insights can
help you understand the areas that affect your business—from market conditions and
customer purchasing behaviors to your business processes.
These are just a few ways organizations are using big data to become more data-
driven so they can adapt better to the needs and expectations of their customers and
the world around them.
The Vs of big data
Big data definitions vary slightly, but big data is nearly always described in terms of
volume, velocity, and variety. These characteristics are often referred to as
the “3 Vs of big data” and were first described in 2001 by analyst Doug Laney of META
Group, which Gartner later acquired.

Volume
As its name suggests, the most common characteristic associated with big data is its
high volume. This describes the enormous amount of data that is available for
collection and produced from a variety of sources and devices on a continuous basis.

Velocity
Big data velocity refers to the speed at which data is generated. Today, data is often
produced in real time or near real time, and therefore, it must also be processed,
accessed, and analyzed at the same rate to have any meaningful impact.
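To make velocity concrete, the sketch below measures the arrival rate of events over a sliding time window, which is one simple way to keep analysis in step with incoming data. The class name, the 10-second window, and the timestamps are hypothetical.

```python
from collections import deque

class RateMeter:
    """Tracks how many events arrived in the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.times = deque()

    def record(self, t):
        """Register an event at time t (seconds, non-decreasing); return events per second."""
        self.times.append(t)
        # Keep only events inside the half-open window (t - window, t].
        while self.times and self.times[0] <= t - self.window:
            self.times.popleft()
        return len(self.times) / self.window

meter = RateMeter(window=10)
rates = [meter.record(t) for t in [0, 1, 2, 3, 12, 13]]
```

Old events are dropped as new ones arrive, so memory use depends on the window size and arrival rate, not on the total number of events seen.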

Variety
Data is heterogeneous, meaning it can come from many different sources and can be
structured, unstructured, or semi-structured. More traditional structured data (such as
data in spreadsheets or relational databases) is now supplemented by unstructured
text, images, audio, video files, or semi-structured formats like sensor data that can’t
be organized in a fixed data schema.
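The difference between the three kinds of data can be seen in how much structure a program can rely on. In this sketch the sample records are invented for illustration: CSV rows all share one set of fields, JSON documents each carry their own fields, and free text has no fields at all.

```python
import csv
import io
import json

# Hypothetical samples of each kind of data.
structured = "id,name,age\n1,Asha,34\n2,Ravi,29\n"
semi_structured = '[{"id": 1, "tags": ["a"]}, {"id": 2, "location": {"lat": 13.08}}]'
unstructured = "Patient reports mild headache since Monday."

# Structured: every row follows the same fixed schema.
rows = list(csv.DictReader(io.StringIO(structured)))
row_shapes = {tuple(row.keys()) for row in rows}

# Semi-structured: each document is self-describing and may differ from the next.
docs = json.loads(semi_structured)
doc_shapes = {tuple(sorted(doc.keys())) for doc in docs}

# Unstructured: no fields; structure has to be extracted, here by naive tokenizing.
tokens = unstructured.split()
```

Here `row_shapes` holds a single shape while `doc_shapes` holds two, which is why semi-structured data resists a fixed schema.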

In addition to these three original Vs, three others are often mentioned in relation
to harnessing the power of big data: veracity, variability, and value.
• Veracity: Big data can be messy, noisy, and error-prone, which makes it
difficult to control the quality and accuracy of the data. Large datasets can be
unwieldy and confusing, while smaller datasets could present an incomplete
picture. The higher the veracity of the data, the more trustworthy it is.
• Variability: The meaning of collected data is constantly changing, which can
lead to inconsistency over time. These shifts include not only changes in
context and interpretation but also changes in data collection methods, driven by the
information that companies want to capture and analyze.
• Value: It’s essential to determine the business value of the data you collect.
Big data must contain the right data and then be effectively analyzed in order
to yield insights that can help drive decision-making.
How does big data work?
The central concept of big data is that the more visibility you have into anything, the
more effectively you can gain insights to make better decisions, uncover growth
opportunities, and improve your business model.
Making big data work requires three main actions:
• Integration: Big data collects terabytes, and sometimes even petabytes, of raw
data from many sources that must be received, processed, and transformed
into the format that business users and analysts need to start analyzing it.
• Management: Big data needs big storage, whether in the cloud, on-premises,
or both. Data must be stored in whatever form is required. It also needs to
be processed and made available in real time. Increasingly, companies are
turning to cloud solutions to take advantage of elastic compute and
scalability.
• Analysis: The final step is analyzing and acting on big data; otherwise, the
investment won’t be worth it. Beyond exploring the data itself, it’s also critical
to communicate and share insights across the business in a way that everyone
can understand. This includes using tools to create data visualizations like
charts, graphs, and dashboards.
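The three actions above can be sketched as a tiny in-memory pipeline. This is only an illustration under stated assumptions: the two input sources, the record fields (`store`, `day`, `amount`), and the function names are hypothetical, and a Python dictionary stands in for real storage.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical raw inputs from two sources with different shapes.
CSV_LINES = ["store_a,2024-01-01,120.5", "store_b,2024-01-01,80.0"]
JSON_LINES = ['{"store": "store_a", "day": "2024-01-02", "amount": 99.5}']

def integrate(csv_lines, json_lines):
    """Integration: convert every source into one common record format."""
    records = []
    for line in csv_lines:
        store, day, amount = line.split(",")
        records.append({"store": store, "day": day, "amount": float(amount)})
    for line in json_lines:
        records.append(json.loads(line))
    return records

def manage(records):
    """Management: store records so they can be retrieved by key (here, by store)."""
    storage = defaultdict(list)
    for record in records:
        storage[record["store"]].append(record)
    return storage

def analyze(storage):
    """Analysis: derive a summary that could feed a chart or dashboard."""
    return {
        store: mean(record["amount"] for record in records)
        for store, records in storage.items()
    }

summary = analyze(manage(integrate(CSV_LINES, JSON_LINES)))
```

Keeping the stages separate means each can be replaced on its own, for example swapping the in-memory dictionary for cloud storage without touching the analysis step.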
Big data benefits
Improved decision-making
Big data is the key element to becoming a data-driven organization. When you can
manage and analyze your big data, you can discover patterns and unlock insights that
improve and drive better operational and strategic decisions.
Increased agility and innovation
Big data allows you to collect and process real-time data points and analyze them to
adapt quickly and gain a competitive advantage. These insights can guide and
accelerate the planning, production, and launch of new products, features, and
updates.
Better customer experiences
Combining and analyzing structured data sources together with unstructured ones
provides you with more useful insights for consumer understanding, personalization,
and ways to optimize experience to better meet consumer needs and expectations.
Continuous intelligence
Big data allows you to integrate automated, real-time data streaming with advanced
data analytics to continuously collect data, find new insights, and discover new
opportunities for growth and value.
More efficient operations
Using big data analytics tools and capabilities allows you to process data faster and
generate insights that can help you determine areas where you can reduce costs, save
time, and increase your overall efficiency.
Improved risk management
Analyzing vast amounts of data helps companies evaluate risk better—making it easier
to identify and monitor all potential threats and report insights that lead to more
robust control and mitigation strategies.
Challenges of implementing big data analytics
While big data has many advantages, it does present some challenges that
organizations must be ready to tackle when collecting, managing, and taking action on
such an enormous amount of data.
The most commonly reported big data challenges include:
• Lack of data talent and skills. Data scientists, data analysts, and data engineers
are in short supply and are some of the most highly sought after (and highly
paid) professionals in the IT industry. Lack of big data skills and experience
with advanced data tools is one of the primary barriers to realizing value from
big data environments.
• Speed of data growth. Big data, by nature, is always rapidly changing and
increasing. Without a solid infrastructure in place that can handle your
processing, storage, network, and security needs, it can become extremely
difficult to manage.
• Problems with data quality. Data quality directly impacts the quality of
decision-making, data analytics, and planning strategies. Raw data is messy
and can be difficult to curate. Having big data doesn’t guarantee results unless
the data is accurate, relevant, and properly organized for analysis. This can
slow down reporting, but if not addressed, you can end up with misleading
results and worthless insights.
• Compliance violations. Big data contains a lot of sensitive data and
information, making it a tricky task to continuously ensure data processing and
storage meet data privacy and regulatory requirements, such as data
localization and data residency laws.
• Integration complexity. Most companies work with data siloed across various
systems and applications across the organization. Integrating disparate data
sources and making data accessible for business users is complex, but vital, if
you hope to realize any value from your big data.
• Security concerns. Big data contains valuable business and customer
information, making big data stores high-value targets for attackers. Since
these datasets are varied and complex, it can be harder to implement
comprehensive strategies and policies to protect them.
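The data quality challenge listed above lends itself to a small example. The sketch below counts three common problems in a batch of records: missing values, out-of-range values, and duplicate ids. The record layout, the valid age range, and the function name are hypothetical; real pipelines usually apply many more rules.

```python
def quality_report(records, required, valid_range):
    """Count missing fields, out-of-range values, and duplicate ids."""
    field, low, high = valid_range
    seen_ids = set()
    report = {"missing": 0, "out_of_range": 0, "duplicate": 0}
    for record in records:
        # A required field that is absent or None counts as missing.
        if any(record.get(name) is None for name in required):
            report["missing"] += 1
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            report["out_of_range"] += 1
        if record.get("id") in seen_ids:
            report["duplicate"] += 1
        seen_ids.add(record.get("id"))
    return report

# Hypothetical records with one problem of each kind.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # missing value
    {"id": 3, "age": 212},    # implausible value
    {"id": 1, "age": 34},     # repeated id
]
report = quality_report(records, required=["id", "age"], valid_range=("age", 0, 120))
```

Running checks like these before analysis makes it visible how much of a dataset can actually be trusted.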
How are data-driven businesses performing?
Some organizations remain wary of going all in on big data because of the time, effort,
and commitment it requires to leverage it successfully. In particular, businesses
struggle to rework established processes and facilitate the cultural change needed to
put data at the heart of every decision.
But becoming a data-driven business is worth the work. Recent research shows:
• 58% of companies that make data-based decisions are more likely to beat
revenue targets than those that don't.
• Organizations with advanced insights-driven business capabilities are 2.8x
more likely to report double-digit year-over-year growth.
• Data-driven organizations generate, on average, more than 30% growth per
year.
The enterprises that take steps now and make significant progress toward
implementing big data stand to come out as winners in the future.
Big data strategies and solutions
Developing a solid data strategy starts with understanding what you want to achieve,
identifying specific use cases, and the data you currently have available to use. You
will also need to evaluate what additional data might be needed to meet your
business goals and the new systems or tools you will need to support those.
Unlike traditional data management solutions, big data technologies and tools are
made to help you deal with large and complex datasets to extract value from them.
Tools for big data can help with the volume of the data collected, the speed at which
that data becomes available to an organization for analysis, and the complexity or
varieties of that data.
For example, data lakes ingest, process, and store structured, unstructured, and semi-
structured data at any scale in its native format. Data lakes act as a foundation to run
different types of smart analytics, including visualizations, real-time analytics,
and machine learning.
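A data lake's habit of keeping data in its native format can be sketched with ordinary files. In the example below, every payload is written unchanged and a schema is applied only when a consumer reads it, an approach commonly called schema on read. The directory layout (`raw/<source>/`), the file names, and the function names are assumptions made for illustration, not the layout of any particular product.

```python
import json
import tempfile
from pathlib import Path

def ingest(lake: Path, source: str, name: str, payload: bytes) -> Path:
    """Store the payload unchanged under raw/<source>/; nothing is validated on write."""
    target = lake / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target

def read_json_events(lake: Path, source: str):
    """Schema on read: the data is interpreted only when a consumer asks for it."""
    for path in sorted((lake / "raw" / source).glob("*.json")):
        yield json.loads(path.read_text(encoding="utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    # Semi-structured, unstructured, and binary data all land side by side.
    ingest(lake, "sensors", "e1.json", b'{"device": "t-1", "temp_c": 21.5}')
    ingest(lake, "sensors", "e2.json", b'{"device": "t-2", "temp_c": 19.0}')
    ingest(lake, "notes", "visit.txt", b"free-text clinical note")
    ingest(lake, "images", "scan.bin", bytes([0, 255, 16]))
    temps = [event["temp_c"] for event in read_json_events(lake, "sensors")]
    stored = sorted(
        p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file()
    )
```

Writing is cheap and format-agnostic, while the cost of interpreting the data is deferred to whichever analysis needs it.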
It’s important to keep in mind that there is no one-size-fits-all strategy for big
data. What works for one company may not be the right approach for your
organization’s specific needs.
Here are four key concepts, drawn from Google Cloud’s guidance, for
shaping a winning approach to big data:

Open
Today, organizations need the freedom to build what they want using the tools and
solutions they want. As data sources continue to grow and new technology
innovations become available, big data environments in practice span multiple
interfaces, open source technology stacks, and clouds. They
need to be architected to be both open and adaptable so that companies can build
the solutions and get the data they need to win.

Intelligent
Big data requires capabilities that let organizations apply smart analytics, AI, and ML
to save time and effort, both in delivering insights that improve
business decisions and in managing the overall big data infrastructure. For example,
you should consider automating processes or enabling self-service analytics so that
people can work with data on their own, with minimal support from other teams.

Flexible
Big data analytics needs to support innovation, not hinder it. This requires building a data
foundation that offers on-demand access to compute and storage resources and
unifies data so that it can be easily discovered and accessed. It’s also important to be able
to choose technologies and solutions that can be easily combined and used in tandem to
create data toolsets that fit the workload and use case.

Trusted
For big data to be useful, it must be trusted. That means it’s imperative to build trust
into your data—trust that it’s accurate, relevant, and protected. No matter where data
comes from, it should be secure by default, and your strategy will also need to consider
what security capabilities will be necessary to ensure compliance, redundancy, and
reliability.

CONVERGENCE OF KEY TRENDS


S.No. / Lesson / Reference Link / NPTEL videos / Research

UNIT I UNDERSTANDING BIG DATA
1. Introduction to big data
   Reference Link: https://cloud.google.com/learn/what-is-big-data
   NPTEL videos: https://onlinecourses.nptel.ac.in/noc24_cs130/unit?unit=16&assessment=130
   Research: https://www.researchgate.net/publication/360410918_INTRODUCTION_TO_BIG_DATA_ANALYTICS/link/62749645107cae29198d30ca/download?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InB1YmxpY2F0aW9uIiwicGFnZSI6InB1YmxpY2F0aW9uIn19
2. Convergence of key trends
3. unstructured data
4. industry examples of big data
5. web analytics
6. big data applications
7. big data technologies
8. introduction to Hadoop
10. open source technologies
11. cloud and big data
    mobile business intelligence
    Crowd sourcing analytics
    inter and trans firewall analytics

UNIT II NOSQL DATA MANAGEMENT
    Introduction to NoSQL
    aggregate data models
    key-value and document data models
    relationships – graph databases
    schemaless databases
    materialized views
    distribution models
    master-slave replication – consistency
    Cassandra – Cassandra data model – Cassandra examples – Cassandra clients

UNIT III BASICS OF HADOOP
    Data format
4.  analyzing data with Hadoop
    scaling out
    Hadoop streaming
    Hadoop pipes – design of Hadoop distributed file system (HDFS)
    HDFS concepts
    Java interface
    Hadoop I/O – data integrity
    data flow – compression – serialization
    Avro – file-based data structures
    Cassandra – Hadoop integration

UNIT IV MAP REDUCE APPLICATIONS 6
    MapReduce workflows
    task execution
    unit tests with MRUnit
    test data and local tests
    MapReduce job run – classic Map-reduce
    YARN – failures in classic
    anatomy of – Map-reduce and YARN
    job scheduling
    shuffle and sort
    MapReduce types – input formats – output formats

UNIT V HADOOP RELATED TOOLS
    HBase – data model and implementations – HBase clients – HBase examples – praxis
    Pig – Grunt – Pig data model – Pig Latin – developing and testing Pig Latin scripts
    Hive – data types and file formats – HiveQL data definition – HiveQL data manipulation – HiveQL queries
