UNIT I BIG DATA Extra Content
Volume
As its name suggests, the most common characteristic associated with big data is its
high volume. This describes the enormous amount of data that is available for
collection and produced from a variety of sources and devices on a continuous basis.
Velocity
Big data velocity refers to the speed at which data is generated. Today, data is often
produced in real time or near real time, and therefore, it must also be processed,
accessed, and analyzed at the same rate to have any meaningful impact.
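As a minimal sketch of processing data at the rate it arrives, the Python below updates a rolling average each time a new reading comes in, instead of waiting for the full dataset. The function name `rolling_average`, the window size, and the sample readings are assumptions made for illustration only.

```python
from collections import deque


def rolling_average(readings, window=3):
    """Yield the mean of the most recent `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # older readings fall out automatically
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)


# A small in-memory iterator stands in for a live stream of readings.
stream = iter([10.0, 20.0, 30.0, 40.0])
averages = list(rolling_average(stream))
print(averages)
```

Because the generator holds only the last few readings in memory, the same pattern applies whether the stream has four values or never ends.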
Variety
Data is heterogeneous, meaning it can come from many different sources and can be
structured, unstructured, or semi-structured. More traditional structured data (such as
data in spreadsheets or relational databases) is now supplemented by unstructured
text, images, audio, video files, or semi-structured formats like sensor data that can’t
be organized in a fixed data schema.
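The three kinds of data can be illustrated with a short Python sketch. The sensor readings, field names, and the technician note are invented for this example; only standard-library modules are used.

```python
import csv
import io
import json
import re

# Structured: fixed columns, every row has the same fields.
csv_text = "sensor_id,temperature_c\nA1,21.5\nA2,19.0\n"
structured = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records describe themselves, but fields vary per record.
json_lines = [
    '{"sensor_id": "A1", "temperature_c": 21.5}',
    '{"sensor_id": "A3", "humidity": 40, "tags": ["roof"]}',
]
semi_structured = [json.loads(line) for line in json_lines]
fields_per_record = [sorted(record) for record in semi_structured]

# Unstructured: free text; any structure has to be extracted from it.
note = "Technician reported sensor A2 reading 19.0 C near the loading dock."
match = re.search(r"sensor (\w+) reading ([\d.]+) C", note)
extracted = {"sensor_id": match.group(1), "temperature_c": float(match.group(2))}

print(structured[0])
print(fields_per_record)
print(extracted)
```

Note that the CSV rows share one set of columns, the JSON records do not, and the free-text note needs a pattern match before it yields any fields at all.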
In addition to these three original Vs, three others are often mentioned in relation
to harnessing the power of big data: veracity, variability, and value.
Veracity: Big data can be messy, noisy, and error-prone, which makes it
difficult to control the quality and accuracy of the data. Large datasets can be
unwieldy and confusing, while smaller datasets could present an incomplete
picture. The higher the veracity of the data, the more trustworthy it is.
Variability: The meaning of collected data is constantly changing, which can
lead to inconsistency over time. These shifts include not only changes in
context and interpretation but also data collection methods based on the
information that companies want to capture and analyze.
Value: It’s essential to determine the business value of the data you collect.
Big data must contain the right data and then be effectively analyzed in order
to yield insights that can help drive decision-making.
How does big data work?
The central concept of big data is that the more visibility you have into anything, the
more effectively you can gain insights to make better decisions, uncover growth
opportunities, and improve your business model.
Making big data work requires three main actions:
Integration: Terabytes, and sometimes even petabytes, of raw data are collected
from many sources and must be received, processed, and transformed
into the format that business users and analysts need to start analyzing it.
Management: Big data needs big storage, whether in the cloud, on-premises,
or both. Data must be stored in whatever form is required, and it also needs to
be processed and made available in real time. Increasingly, companies are
turning to cloud solutions to take advantage of on-demand compute and
scalability.
Analysis: The final step is analyzing and acting on big data—otherwise, the
investment won’t be worth it. Beyond exploring the data itself, it’s also critical
to communicate and share insights across the business in a way that everyone
can understand. This includes using tools to create data visualizations like
charts, graphs, and dashboards.
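The three actions above can be sketched at toy scale in Python. This is only an illustration under stated assumptions: the two order sources, their field names, and the `integrate` helper are hypothetical, and an in-memory SQLite database stands in for real big data storage.

```python
import json
import sqlite3

# Integration: raw records arrive from two hypothetical sources in different shapes.
web_orders = [
    '{"order_id": 1, "amount": "19.50", "region": "EU"}',
    '{"order_id": 2, "amount": "5.00", "region": "US"}',
]
store_orders = [(3, 42.5, "EU")]


def integrate(web, store):
    """Transform both sources into one common (order_id, amount, region) format."""
    rows = []
    for line in web:
        record = json.loads(line)
        rows.append((record["order_id"], float(record["amount"]), record["region"]))
    rows.extend(store)
    return rows


# Management: store the integrated data (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", integrate(web_orders, store_orders))

# Analysis: aggregate the stored data into something a business user can read.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"))
print(totals)
```

In practice each stage would be handled by dedicated tooling, but the order of the steps (unify the formats, store, then query) stays the same.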
Big data benefits
Improved decision-making
Big data is the key element to becoming a data-driven organization. When you can
manage and analyze your big data, you can discover patterns and unlock insights that
improve and drive better operational and strategic decisions.
Increased agility and innovation
Big data allows you to collect and process real-time data points and analyze them to
adapt quickly and gain a competitive advantage. These insights can guide and
accelerate the planning, production, and launch of new products, features, and
updates.
Better customer experiences
Combining and analyzing structured data sources together with unstructured ones
provides you with more useful insights for consumer understanding, personalization,
and ways to optimize experience to better meet consumer needs and expectations.
Continuous intelligence
Big data allows you to integrate automated, real-time data streaming with advanced
data analytics to continuously collect data, find new insights, and discover new
opportunities for growth and value.
More efficient operations
Using big data analytics tools and capabilities allows you to process data faster and
generate insights that can help you determine areas where you can reduce costs, save
time, and increase your overall efficiency.
Improved risk management
Analyzing vast amounts of data helps companies evaluate risk better—making it easier
to identify and monitor all potential threats and report insights that lead to more
robust control and mitigation strategies.
Challenges of implementing big data analytics
While big data has many advantages, it does present some challenges that
organizations must be ready to tackle when collecting, managing, and taking action on
such an enormous amount of data.
The most commonly reported big data challenges include:
Lack of data talent and skills. Data scientists, data analysts, and data engineers
are in short supply—and are some of the most highly sought after (and highly
paid) professionals in the IT industry. Lack of big data skills and experience
with advanced data tools is one of the primary barriers to realizing value from
big data environments.
Speed of data growth. Big data, by nature, is always rapidly changing and
increasing. Without a solid infrastructure in place that can handle your
processing, storage, network, and security needs, it can become extremely
difficult to manage.
Problems with data quality. Data quality directly impacts the quality of
decision-making, data analytics, and planning strategies. Raw data is messy
and can be difficult to curate. Having big data doesn’t guarantee results unless
the data is accurate, relevant, and properly organized for analysis. Cleaning
and organizing data can slow down reporting, but if quality problems are not
addressed, you can end up with misleading results and worthless insights.
Compliance violations. Big data contains a lot of sensitive data and
information, making it a tricky task to continuously ensure data processing and
storage meet data privacy and regulatory requirements, such as data
localization and data residency laws.
Integration complexity. Most companies work with data siloed across various
systems and applications across the organization. Integrating disparate data
sources and making data accessible for business users is complex, but vital, if
you hope to realize any value from your big data.
Security concerns. Big data contains valuable business and customer
information, making big data stores high-value targets for attackers. Since
these datasets are varied and complex, it can be harder to implement
comprehensive strategies and policies to protect them.
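The data quality challenge in the list above can be made concrete with a small Python sketch. The customer records, the three rules (missing email, duplicate ID, implausible age), and the `quality_report` function are assumptions chosen for illustration; real pipelines would define rules that fit their own data.

```python
records = [
    {"customer_id": "C1", "email": "a@example.com", "age": 34},
    {"customer_id": "C2", "email": None, "age": 29},
    {"customer_id": "C1", "email": "a@example.com", "age": 34},
    {"customer_id": "C3", "email": "c@example.com", "age": -5},
]


def quality_report(rows):
    """Count rule violations and return the rows that pass every check."""
    seen_ids = set()
    report = {"missing_email": 0, "duplicate_id": 0, "invalid_age": 0}
    clean = []
    for row in rows:
        problems = []
        if not row.get("email"):
            problems.append("missing_email")
        if row["customer_id"] in seen_ids:
            problems.append("duplicate_id")
        if not 0 <= row["age"] <= 120:
            problems.append("invalid_age")
        seen_ids.add(row["customer_id"])
        for problem in problems:
            report[problem] += 1
        if not problems:
            clean.append(row)
    return report, clean


report, clean = quality_report(records)
print(report)
print(len(clean), "of", len(records), "records passed")
```

Only one of the four sample records passes all checks, which shows how quickly unchecked raw data can distort any analysis built on it.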
How are data-driven businesses performing?
Some organizations remain wary of going all in on big data because of the time, effort,
and commitment it requires to leverage it successfully. In particular, businesses
struggle to rework established processes and facilitate the cultural change needed to
put data at the heart of every decision.
But becoming a data-driven business is worth the work. Recent research shows:
58% of companies that make data-based decisions are more likely to beat
revenue targets than those that don't
Organizations with advanced insights-driven business capabilities are 2.8x
more likely to report double-digit year-over-year growth
Data-driven organizations generate, on average, more than 30% growth per
year
The enterprises that take steps now and make significant progress toward
implementing big data stand to come out as winners in the future.
Big data strategies and solutions
Developing a solid data strategy starts with understanding what you want to achieve,
identifying specific use cases, and taking stock of the data you currently have
available. You will also need to evaluate what additional data might be needed to meet
your business goals and the new systems or tools you will need to support those goals.
Unlike traditional data management solutions, big data technologies and tools are
made to help you deal with large and complex datasets to extract value from them.
Tools for big data can help with the volume of the data collected, the speed at which
that data becomes available to an organization for analysis, and the complexity or
varieties of that data.
For example, data lakes ingest, process, and store structured, unstructured, and semi-
structured data at any scale in its native format. Data lakes act as a foundation to run
different types of smart analytics, including visualizations, real-time analytics,
and machine learning.
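The idea of storing data in its native format and applying structure only when it is read can be sketched in Python. This is a toy illustration, not how any particular data lake product works: a temporary directory stands in for the lake, and the file names, fields, and `read_any` helper are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path

# A temporary directory stands in for the data lake.
lake = Path(tempfile.mkdtemp())

# Ingest: each source is stored unchanged, in its native format.
(lake / "clicks.jsonl").write_text(
    '{"user": "u1", "page": "/home"}\n{"user": "u2", "page": "/cart"}\n'
)
(lake / "users.csv").write_text("user,country\nu1,DE\nu2,IN\n")
(lake / "review.txt").write_text("Checkout was slow but support was helpful.")


def read_any(path):
    """Apply structure at read time, based on the file's format."""
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in path.read_text().splitlines()]
    if path.suffix == ".csv":
        with path.open(newline="") as handle:
            return list(csv.DictReader(handle))
    return path.read_text()  # unstructured text is returned as-is


# Analysis: combine semi-structured clicks with structured user data.
clicks = read_any(lake / "clicks.jsonl")
users = {row["user"]: row["country"] for row in read_any(lake / "users.csv")}
pages_by_country = {users[click["user"]]: click["page"] for click in clicks}
stored_files = sorted(path.name for path in lake.iterdir())
print(stored_files)
print(pages_by_country)
```

Nothing is converted at ingestion time, so new analyses can later read the same files in whatever shape they need.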
It’s important to keep in mind that when it comes to big data, there is no
one-size-fits-all strategy. What works for one company may not be the right approach
for your organization’s specific needs.
Here are four key concepts that our Google Cloud customers have taught us about
shaping a winning approach to big data:
Open
Today, organizations need the freedom to build what they want using the tools and
solutions they want. As data sources continue to grow and new technology
innovations become available, the reality of big data is one that contains multiple
interfaces, open source technology stacks, and clouds. Big data environments will
need to be architected to be both open and adaptable so that companies can build
the solutions and get the data they need to win.
Intelligent
Big data requires data capabilities that allow you to leverage smart analytics and
AI and ML technologies, saving time and effort both in delivering insights that improve
business decisions and in managing your overall big data infrastructure. For example,
you should consider automating processes or enabling self-service analytics so that
people can work with data on their own, with minimal support from other teams.
Flexible
Big data analytics needs to support innovation, not hinder it. This requires building a data
foundation that will offer on-demand access to compute and storage resources and
unify data so that it can be easily discovered and accessed. It’s also important to be able
to choose technologies and solutions that can be easily combined and used in tandem to
create the perfect data toolsets that fit the workload and use case.
Trusted
For big data to be useful, it must be trusted. That means it’s imperative to build trust
into your data—trust that it’s accurate, relevant, and protected. No matter where data
comes from, it should be secure by default, and your strategy will also need to consider
what security capabilities will be necessary to ensure compliance, redundancy, and
reliability.
To work in Big Data analytics, it is helpful to have knowledge of Hadoop, SQL, R, Python, and other
programming languages.
Big Data analytics is driving job growth, because it requires people with knowledge and experience of
data mining and machine learning methods, information visualization tools, and data warehousing.
Big Data fields can be any of the following:
UNIT I UNDERSTANDING BIG DATA
1 Introduction to big data
  https://cloud.google.com/learn/what-is-big-data
  https://onlinecourses.nptel.ac.in/noc24_cs130/unit?unit=16&assessment=130
  https://www.researchgate.net/publication/360410918_INTRODUCTION_TO_BIG_DATA_ANALYTICS/link/62749645107cae29198d30ca/download?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InB1YmxpY2F0aW9uIiwicGFnZSI6InB1YmxpY2F0aW9uIn19
2 Convergence of key trends
3 unstructured data
4 industry examples of big data
5 web analytics
7 big data technologies
8 introduction to Hadoop
10 open source technologies
mobile business intelligence
Crowd sourcing analytics
inter and trans firewall analytics
UNIT II NOSQL DATA MANAGEMENT
Introduction to NoSQL
aggregate data models
key-value and document data models
relationships – graph databases
schemaless databases
materialized views
distribution models
master-slave replication – consistency
Cassandra – Cassandra data model – Cassandra examples – Cassandra clients
UNIT III BASICS OF HADOOP
Data format
4 analyzing data with Hadoop
scaling out
Hadoop streaming
Hadoop pipes – design of Hadoop distributed file system (HDFS)
HDFS concepts
Java interface
Hadoop I/O – data integrity
data flow – compression – serialization
Avro – file-based data structures
Cassandra – Hadoop integration
UNIT IV MAP REDUCE APPLICATIONS 6
MapReduce workflows –
task execution –