Introduction to Big Data – Characteristics of Data – Evolution of Big Data – Big Data
Analytics – Classification of Analytics – Top Challenges Facing Big Data – Importance of
Big Data Analytics – Data Analytics Tools. Data Collections: Types of Data Sources -
Sampling - Types of Data Elements - Visual Data Exploration and Exploratory Statistical Analysis - Missing Values - Outlier Detection and Treatment - Standardizing Data -
Categorization - Weights of Evidence Coding - Variable Selection – Segmentation.
INTRODUCTION TO BIG DATA
The "Internet of Things" and its widely ultra-connected nature are leading to a burgeoning
rise in big data. There is no dearth of data for today's enterprise. On the contrary, they are
mired in data and quite deep at that.
That brings us to the following questions:
1. Why is it that we cannot forego big data?
2. How has it come to assume such great importance in running a business?
3. How does it compare with the traditional Business Intelligence (BI) environment?
4. Is it here to replace the traditional, relational database management system and data
warehouse environment or is it likely to complement their existence?
Some of the most common examples of Hadoop implementations are in the social media
space, where Hadoop can manage transactions, give textual updates, and develop social
graphs among millions of users.
Twitter and Facebook generate massive amounts of unstructured data and use Hadoop and
its ecosystem of tools to manage this high volume.
Social media: It represents a tremendous opportunity to leverage social and professional
interactions to derive new insights.
LinkedIn represents a company in which data itself is the product. Early on, LinkedIn founder Reid Hoffman saw the opportunity to create a social network for working professionals.
As of 2014, LinkedIn has more than 250 million user accounts and has added many additional features and data-related products, such as recruiting, job seeker tools, advertising, and InMaps, which show a social graph of a user's professional network.
CHARACTERISTICS OF DATA
1. Composition: The composition of data deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this
data as is for analysis?" or "Does it require cleansing for further enhancement and
enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was this data generated?", "How sensitive is this data?", "What are the events associated with this data?" and so on.
Small data (data as it existed prior to the big data revolution) is about certainty. It is about known data
sources; it is about no major changes to the composition or context of data. Most often we have
answers to queries like why this data was generated, where and when it was generated, exactly how
we would like to use it, what questions will this data be able to answer, and so on. Big data is about
complexity.
Complexity in terms of multiple and unknown datasets, exploding volume, the speed at which the data is being generated and the speed at which it needs to be processed, and the variety of data (internal or external, behavioural or social) that is being generated.
Variety: Data can be structured, semi-structured, or unstructured. Data stored in a database is an example of structured data. HTML data, XML data, email data, and CSV files are examples of semi-structured data. PowerPoint presentations, images, videos, research reports, white papers, the body of an email, etc. are examples of unstructured data.
Velocity: Velocity essentially refers to the speed at which data is being created in real time. We have moved from simple desktop applications like payroll applications to real-time processing applications.
Volume: Volume can be in terabytes, petabytes or even zettabytes. According to the Gartner Glossary, big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight and decision making.
Big data processing is about moving code to data. This makes perfect sense, as the program for distributed processing is tiny (just a few KBs) compared to the data (terabytes or petabytes today, and likely exabytes or zettabytes in the near future).
CLASSIFICATION OF ANALYTICS
Descriptive Analytics is the foundation of business intelligence and data analysis, providing
insights into past performance. This form of analytics involves gathering data from various
sources and compiling it into a format that is easy to understand and interpret. The primary
goal here is to identify what has happened over a specific period. Techniques such as data
aggregation and mining are commonly used to summarize data, helping businesses
understand trends and patterns. For instance, a company may use descriptive analytics to
assess historical sales data to understand which products are performing well.
Diagnostic Analytics takes the insights gained from descriptive analytics a step further. It
focuses on understanding why something happened. This type of analysis often involves
more sophisticated data techniques like drill-down, data discovery, and correlation analysis.
For example, if a company notices a sudden drop in sales for a particular product, diagnostic
analytics can be used to find the cause, such as market trends or changes in customer
preferences. By identifying the factors that influence outcomes, businesses can make more
informed decisions.
Predictive Analytics is about forecasting future events based on historical data. It employs
statistical models and algorithms to identify the likelihood of future outcomes. This type of
analytics is forward-looking, using patterns found in historical and transactional data to
identify risks and opportunities. Predictive models are used in various business operations,
from anticipating customer behaviors to identifying potential risks in investment strategies.
Predictive analytics can, for example, help a retail company forecast future sales based on
seasonal trends, promotions, and other factors.
Prescriptive Analytics is one of the most advanced forms of analytics, which not only
anticipates what will happen and when it will happen but also why it will happen. This
analysis provides advice on possible outcomes and recommends actions to achieve desired
goals. It involves techniques like optimization and simulation, enabling decision-makers to
understand the effect of future decisions before they are made. This approach is particularly
useful in complex scenarios where there are many variables and constraints. For instance, a
logistics company may use prescriptive analytics to optimize routing and distribution
schedules, thereby reducing costs and improving efficiency.
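To make the distinction concrete, the following minimal Python sketch illustrates descriptive analytics as a simple aggregation of past sales and predictive analytics as a linear-regression forecast of next month's sales. The dataset, column names, and figures are hypothetical assumptions for illustration, not part of the original material.

```python
# Illustrative sketch: descriptive vs. predictive analytics on hypothetical monthly sales data.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: monthly revenue per product.
sales = pd.DataFrame({
    "month":   [1, 2, 3, 4, 5, 6] * 2,
    "product": ["A"] * 6 + ["B"] * 6,
    "revenue": [100, 110, 120, 130, 125, 140, 80, 82, 85, 90, 95, 99],
})

# Descriptive analytics: summarize what has happened (aggregation over past periods).
print(sales.groupby("product")["revenue"].agg(["sum", "mean"]))

# Predictive analytics: fit a simple model on historical data to forecast the next month.
for product, grp in sales.groupby("product"):
    model = LinearRegression().fit(grp[["month"]], grp["revenue"])
    forecast = model.predict(pd.DataFrame({"month": [7]}))[0]
    print(f"Product {product}: forecast for month 7 = {forecast:.1f}")
```

Diagnostic and prescriptive analytics would build on the same data, for example by correlating revenue changes with promotions (diagnostic) or by optimizing the promotion budget across products (prescriptive).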
TOP CHALLENGES FACING BIG DATA
Scale: Storage (RDBMS (Relational Database Management System) or NoSQL (Not Only SQL)) is one major concern that needs to be addressed to handle the need for scaling rapidly and elastically. The need of the hour is storage that can withstand the onslaught of the large volume, velocity and variety of big data. Should you scale vertically or should you scale horizontally?
Security: Most of the NoSQL big data platforms have poor security mechanisms (lack of proper authentication and authorization mechanisms) when it comes to safeguarding big data. This is a concern that cannot be ignored, given that big data carries credit card information, personal information and other sensitive data.
Schema: Rigid schemas have no place. We want the technology to be able to fit our big data and not the other way around. The need of the hour is a dynamic schema; static (pre-defined) schemas are obsolete.
Continuous availability: The big question here is how to provide 24/7 support because
almost all RDBMS and NoSQL big data platforms have a certain amount of downtime built
in.
Consistency: Should one opt for consistency or eventual consistency?
Partition tolerance: How to build partition-tolerant systems that can take care of both hardware and software failures?
Data quality: How to maintain data quality (data accuracy, completeness, timeliness, etc.)? Do we have appropriate metadata in place?
Hadoop Ecosystem
Apache Hadoop: An open-source framework that allows for the distributed processing of
large data sets across clusters of computers. It's designed to scale up from single servers to
thousands of machines, each offering local computation and storage.
HDFS (Hadoop Distributed File System): A distributed file system designed to run on
commodity hardware. It provides high throughput access to application data and is suitable
for applications with large data sets.
MapReduce: A programming model and processing technique for distributed computing. It simplifies data processing on large clusters and is key for big data processing (a sketch follows at the end of this list).
YARN (Yet Another Resource Negotiator): Manages and allocates cluster resources,
improving the system's efficiency and scalability.
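To illustrate the MapReduce model, the following minimal Python sketch implements the classic word count in the style of Hadoop Streaming (an assumed setup for illustration): the map phase emits (word, 1) pairs and the reduce phase sums the counts per word. On a real cluster the framework would distribute the map tasks, then shuffle and sort their output before reducing; here both phases are simulated in a single process.

```python
# Illustrative word-count sketch in the MapReduce style (single-process simulation).
from itertools import groupby

def mapper(lines):
    """Map phase: split each input line into words and emit (word, 1) pairs."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: for each word (the key), sum all the emitted counts."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["big data needs big storage", "big data needs fast processing"]
    for word, count in reducer(mapper(text)):
        print(f"{word}\t{count}")
```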
Visualization Tools
Tableau, Power BI: While not exclusively for big data, these tools are capable of connecting
to various big data sources, enabling the visualization and analysis of large datasets in a
more user-friendly manner.
Big Data technologies continue to evolve rapidly, with a growing emphasis on cloud-based
solutions, real-time processing, and advanced machine learning capabilities. These
technologies are crucial in a wide range of industries, including finance, healthcare, retail,
and telecommunications, for driving innovation, understanding customer behavior, and
optimizing operational efficiency.
IMPORTANCE OF BIG DATA ANALYTICS
1. Informed Decision-Making
Big Data Analytics provides deep insights derived from the analysis of large volumes of
data. These insights help businesses make more informed, data-driven decisions. By
analyzing data from various sources, companies can identify trends, patterns, and
correlations that would not be apparent otherwise.
2. Enhanced Customer Experiences
Companies use Big Data Analytics to understand customer behaviors and preferences better.
This understanding enables them to tailor their products, services, and interactions to meet
the specific needs and desires of their customers, resulting in improved customer satisfaction
and loyalty.
3. Operational Efficiency
By analyzing large datasets, organizations can identify inefficiencies and bottlenecks in their
operations. Big Data Analytics can help in optimizing processes, reducing costs, and
improving overall efficiency. For instance, in supply chain management, analytics can help
predict inventory needs, optimize delivery routes, and reduce operational costs.
4. Competitive Advantage
In a highly competitive business environment, leveraging Big Data Analytics can provide a
significant advantage. Companies can use analytics to identify market trends, anticipate
customer needs, and stay ahead of their competitors.
5. Risk Management
Big Data Analytics is crucial in risk assessment and management. By analyzing historical
data, companies can predict potential risks and take proactive measures to mitigate them.
This is particularly important in industries like finance and insurance, where predicting and
managing risk is a core function.
6. Innovation and Product Development
Insights derived from Big Data Analytics can drive innovation and new product
development. Companies can identify gaps in the market, emerging trends, and customer
needs that are not currently met, leading to the development of new and improved products
and services.
7. Predictive Analytics
Beyond understanding current trends, Big Data Analytics allows for predictive modeling.
Businesses can forecast future trends, customer behaviors, and market dynamics, enabling
them to prepare and adapt in advance.
8. Personalization and Targeting
Big Data Analytics enables hyper-personalization in marketing and advertising. Companies
can tailor their messages and offers to individual customers based on their specific behaviors
and preferences, resulting in more effective marketing strategies.
9. Improved Healthcare Outcomes
In healthcare, Big Data Analytics plays a critical role in patient care and medical research.
Analyzing patient data helps in diagnosing diseases early, predicting outbreaks, and
developing new treatments.
10. Social Impact and Public Services
Governments and public service organizations use Big Data Analytics to enhance their
services, from urban planning and transportation management to social welfare programs
and environmental protection.
11. Handling Complex and Diverse Data
Big Data Analytics provides the tools and methodologies to handle the variety, velocity, and
volume of data generated in the modern world, much of which is unstructured and complex.
DATA ANALYTICS TOOLS
Data analytics tools are essential in transforming raw data into meaningful insights for
decision-making and strategic planning. These tools range from software for simple data
visualization to complex predictive modeling. Here's an overview of various types of data
analytics tools:
TYPES OF DATA SOURCES
Each of these data sources has its strengths and limitations, and the choice of data source
largely depends on the research objectives, the nature of the analysis, budget constraints,
and the availability of data. In many cases, a combination of these sources is used to provide
a comprehensive view and to validate findings.
SAMPLING
Sampling is essential for building analytical models to represent future customer
behavior accurately, balancing between robustness and representativeness.
The optimal time window for sampling should consider both the quantity of data and its
recency to ensure representativeness.
Sampling bias, particularly in scenarios like credit scoring, must be addressed carefully
to ensure the sample reflects the target population accurately.
Techniques such as bureau-based inference are used to correct bias, but no method is
perfect.
Stratified sampling ensures proportional representation of subgroups within the sample, crucial for skewed datasets like churn prediction or fraud detection (a sketch follows at the end of this section).
The historical Through-the-Door (TTD) population divides into accepted and rejected applicants.
The accepted subset is used for modeling, causing inherent bias.
Reject inference methods attempt to address this bias.
No method perfectly solves the bias issue.
Bureau-based inference is a common but imperfect solution to infer creditworthiness for rejected applicants.
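As referenced above, the following minimal Python sketch shows stratified sampling using scikit-learn's train_test_split, so that a rare class such as fraud keeps the same proportion in the sample as in the full dataset. The dataset and its 2% fraud rate are hypothetical assumptions for illustration.

```python
# Illustrative sketch of stratified sampling on a hypothetical, highly skewed fraud dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 transactions, of which 2% are fraudulent.
data = pd.DataFrame({
    "amount": range(1000),
    "fraud":  [1 if i % 50 == 0 else 0 for i in range(1000)],
})

# Stratifying on the target keeps the 2% fraud rate in both the sample and the remainder.
sample, rest = train_test_split(
    data, train_size=0.3, stratify=data["fraud"], random_state=42
)

print("Fraud rate, full data:", data["fraud"].mean())
print("Fraud rate, 30% sample:", sample["fraud"].mean())
```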
MISSING VALUES
Missing values may occur if the information is not applicable, such as churn time for non-churners.
Privacy concerns can lead to undisclosed information, like a customer not revealing
income.
Errors during data collection or merging, like typos, can also result in missing data.
Some analytical methods, such as decision trees, can inherently manage missing values.
Other techniques require additional preprocessing to handle missing values effectively.
Imputation: Replace missing values with a statistical measure like mean, median, or
mode, or use regression models to estimate missing values.
Deletion: Remove data points or features with many missing values, assuming the
missing data is random and not informative.
Retention: Treat missing values as a separate category if they hold meaningful
information, such as undisclosed income indicating unemployment.
To address missing data:
Statistically test the relationship between missing data and the target variable.
If related, categorize missing data as a separate class.
If unrelated, choose to impute or delete based on the dataset size.
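The following minimal Python sketch walks through the three options described above: imputation with a statistical measure, deletion of records with missing values, and retention of the missing values as a separate category. The customer dataset and its column names are hypothetical assumptions for illustration.

```python
# Illustrative sketch of the three missing-value strategies on a hypothetical customer dataset.
import numpy as np
import pandas as pd

customers = pd.DataFrame({
    "age":    [25, 40, np.nan, 35, 50],
    "income": [30000, np.nan, 45000, np.nan, 60000],
    "churn":  ["no", "yes", "no", "yes", "no"],
})

# 1. Imputation: replace missing ages with the median age.
imputed = customers.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# 2. Deletion: drop rows where income is missing (assumes the gaps are random, not informative).
deleted = customers.dropna(subset=["income"])

# 3. Retention: keep "undisclosed" as its own category, since it may itself carry information.
retained = customers.copy()
retained["income_band"] = pd.cut(retained["income"], bins=[0, 40000, np.inf],
                                 labels=["low", "high"])
retained["income_band"] = (retained["income_band"]
                           .cat.add_categories("undisclosed")
                           .fillna("undisclosed"))

print(imputed, deleted, retained, sep="\n\n")
```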
STANDARDIZING DATA
Standardization adjusts variables to a similar scale, enhancing comparability in models.
Min/max standardization rescales data to a defined range, like 0 to 1.
Z-score standardization transforms data based on the mean and standard deviation.
Decimal scaling normalizes data by dividing by a power of 10, based on the largest
value.
Standardization is crucial for regression analyses but not required for decision trees.
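The following minimal Python sketch applies the three standardization methods listed above to a small, hypothetical income column (the values are assumptions for illustration).

```python
# Illustrative sketch of min/max, z-score, and decimal-scaling standardization.
import numpy as np

income = np.array([20000.0, 35000.0, 50000.0, 80000.0, 120000.0])

# Min/max standardization: rescale to the range [0, 1].
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score standardization: subtract the mean and divide by the standard deviation.
z_score = (income - income.mean()) / income.std()

# Decimal scaling: divide by a power of 10 chosen from the largest absolute value.
power = int(np.ceil(np.log10(np.abs(income).max())))
decimal_scaled = income / (10 ** power)

print(min_max, z_score, decimal_scaled, sep="\n")
```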
CATEGORIZATION
Categorization in data analysis involves the process of converting continuous variables
into categorical ones. This method is often applied to simplify the data, enhance
understanding, and prepare for specific types of analysis that require categorical input.
Examples:
Age Categorization: Instead of using exact ages, individuals might be grouped into
'Youth' (0-18), 'Young Adults' (19-35), 'Middle-aged' (36-50), and 'Senior' (51+). This
simplifies analysis and is particularly useful in surveys or studies focusing on
demographic impacts.
Income Bands: Rather than precise income figures, you could categorize income into
'Low' (e.g., under $20,000), 'Medium' ($20,000 to $100,000), and 'High' (over $100,000) to
study economic behaviors or preferences across different economic groups.
Educational Levels: Education years can be binned into 'No High School' (0-11 years),
'High School Graduate' (12 years), 'Some College' (13-15 years), and 'College Degree or
higher' (16+ years) for studies relating to educational attainment's impact on job
opportunities or earning potential.
Credit Score Ranges: Credit scores might be categorized into ranges like 'Poor', 'Fair',
'Good', 'Very Good', and 'Excellent', which is often done in the financial industry to
assess creditworthiness.
Categorization aids in reducing the noise in the data, making it easier for certain algorithms
to detect patterns. It is also beneficial when visualizing data, as it can reveal trends that
might not be apparent when dealing with continuous variables. However, one must be
cautious as overly coarse categorization can lead to loss of information and potential
insights.
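The following minimal Python sketch uses pandas to bin a continuous age variable into the age groups described above; the sample ages and the exact bin edges are assumptions for illustration.

```python
# Illustrative sketch: categorizing a continuous age variable into the groups described above.
import pandas as pd

ages = pd.Series([12, 17, 24, 33, 41, 49, 55, 68], name="age")

age_group = pd.cut(
    ages,
    bins=[0, 18, 35, 50, 120],                                  # upper edge of each category
    labels=["Youth", "Young Adults", "Middle-aged", "Senior"],  # matching category names
)
print(pd.concat([ages, age_group.rename("age_group")], axis=1))
```

The same pd.cut approach (or pd.qcut for equal-frequency bins) applies to the income, education, and credit score examples.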
WEIGHTS OF EVIDENCE CODING
Weights of evidence (WoE) coding recodes the categories of a variable using, for each category, the natural log of the ratio of the proportion of non-events (e.g., non-defaulters) to the proportion of events (e.g., defaulters) falling in that category.
Let's consider a simple dataset with two variables: Credit History (Good, Bad) and Loan Default (Yes, No). Here's how WoE could be calculated:
Data count: tally the number of defaulters and non-defaulters for each credit history category.
Calculating proportions: compute each category's share of all non-defaulters and of all defaulters.
WoE calculation: take the natural log of the ratio of the two shares for each category.
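The counts for this example did not survive in this text, so the following minimal Python sketch uses hypothetical counts (an assumption for illustration) to walk through the three steps: counting defaulters and non-defaulters per credit history category, converting the counts to proportions, and taking the natural log of their ratio to obtain the WoE.

```python
# Illustrative WoE sketch with hypothetical counts for Credit History vs. Loan Default.
import numpy as np
import pandas as pd

# 1. Data count (hypothetical): non-defaulters ("No") and defaulters ("Yes") per category.
counts = pd.DataFrame(
    {"non_default": [500, 100], "default": [50, 80]},
    index=["Good", "Bad"],
)

# 2. Calculating proportions: each category's share of all non-defaulters and of all defaulters.
dist_non_default = counts["non_default"] / counts["non_default"].sum()
dist_default = counts["default"] / counts["default"].sum()

# 3. WoE calculation: natural log of (share of non-defaulters / share of defaulters).
counts["WoE"] = np.log(dist_non_default / dist_default)
print(counts)
```

A positive WoE (here for "Good") indicates a category with relatively more non-defaulters, while a negative WoE (here for "Bad") indicates relatively more defaulters.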
VARIABLE SELECTION
Analytical models often begin with many variables; however, typically only a few
significantly predict the target variable.
Credit scoring models, for example, may use 10-15 key variables in a scorecard.
Filters are a practical method for variable selection, evaluating the univariate correlation
of each variable with the target.
Filters facilitate a rapid assessment to determine which variables to retain for deeper
analysis.
A variety of filter measures for variable selection are available in statistical literature.
The Pearson correlation ρP between a variable X and the target Y is calculated as follows:
ρP = Σi (Xi − X̄)(Yi − Ȳ) / sqrt( Σi (Xi − X̄)² · Σi (Yi − Ȳ)² )
The Pearson correlation coefficient, which ranges between -1 and +1, can be utilized as a
criterion. Variables can be chosen based on whether the Pearson correlation is significantly
different from 0, as indicated by the p-value. A common practice might be to select variables
with an absolute Pearson correlation |ρP| greater than a certain threshold, such as 0.50,
indicating a moderate to strong correlation with the target variable.
Imagine a dataset containing information about bank customers, which includes variables
like Age, Income, Credit Score, and Loan Default (Yes/No). To select relevant variables for
predicting Loan Default, we could use the Pearson correlation coefficient:
Selection Criterion: retain variables with an absolute Pearson correlation with Loan Default greater than 0.50.
Variables Selected:
Credit Score (|ρP| = 0.60) is selected as it has a strong correlation with Loan Default.
Age and Income are not selected due to weaker correlations.
This process helps in identifying the most influential variables for the prediction of Loan
Default.
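The following minimal Python sketch applies such a filter on a hypothetical dataset (the data, column names, and simulated relationship are assumptions for illustration): it computes the Pearson correlation of each candidate variable with Loan Default and retains those whose absolute correlation exceeds the 0.50 threshold.

```python
# Illustrative sketch of a Pearson-correlation filter for variable selection (hypothetical data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
credit_score = rng.normal(650, 50, n)
age = rng.normal(40, 10, n)
income = rng.normal(50000, 15000, n)
# Hypothetical target: default is driven mainly by credit score, plus noise.
default = (credit_score + rng.normal(0, 40, n) < 630).astype(int)

data = pd.DataFrame({"Age": age, "Income": income, "CreditScore": credit_score,
                     "LoanDefault": default})

# Univariate filter: absolute Pearson correlation of each candidate variable with the target.
correlations = data.drop(columns="LoanDefault").corrwith(data["LoanDefault"]).abs()
selected = correlations[correlations > 0.50].index.tolist()

print(correlations.round(2))
print("Selected variables:", selected)
```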