Unit I INTRODUCTION TO BIG DATA

Introduction to Big Data – Characteristics of Data – Evolution of Big Data – Big Data
Analytics – Classification of Analytics – Top Challenges Facing Big Data – Importance of
Big Data Analytics – Data Analytics Tools. Data Collections: Types of Data Sources -
Sampling - Types of Data Elements - Visual Data Exploration and Exploratory Statistical
Analysis - Missing Values - Outlier Detection and Treatment - Standardizing Data -
Categorization - Weights of Evidence Coding - Variable Selection – Segmentation.
INTRODUCTION TO BIG DATA
The "Internet of Things" and its widely ultra-connected nature are leading to a burgeoning
rise in big data. There is no dearth of data for today's enterprise. On the contrary, they are
mired in data and quite deep at that.
That brings us to the following questions:
1. Why is it that we cannot forego big data?
2. How has it come to assume such enormous importance in running business?
3. How does it compare with the traditional Business Intelligence (BI) environment?
4. Is it here to replace the traditional, relational database management system and data
warehouse environment or is it likely to complement their existence?

Data is widely available. What is scarce is the ability to draw valuable insight.

Some examples of Big Data:


• The following are examples of Big Data Analytics in different areas such as retail, IT
infrastructure, and social media.
• Retail: As mentioned earlier, Big Data presents many opportunities to improve sales and
marketing analytics.
• An example of this is the U.S. retailer Target. After analyzing consumer purchasing
behavior, Target's statisticians determined that the retailer made a great deal of money from
three main life-event situations.
• Marriage, when people tend to buy many new products
• Divorce, when people buy new products and change their spending habits
• Pregnancy, when people have many new things to buy and an urgency to buy them.
This analysis helped Target manage its inventory, knowing that there would be demand for
specific products and that it would likely vary by month over the coming nine- to ten-month
cycles.
• IT infrastructure: The MapReduce paradigm is an ideal technical framework for many Big
Data projects, which rely on large data sets with unconventional data structures.
• One of the main benefits of Hadoop is that it employs a distributed file system, meaning it
can use a distributed cluster of servers and commodity hardware to process large amounts
of data.

Some of the most common examples of Hadoop implementations are in the social media
space, where Hadoop can manage transactions, give textual updates, and develop social
graphs among millions of users.
Twitter and Facebook generate massive amounts of unstructured data and use Hadoop and
its ecosystem of tools to manage this high volume.
Social media: It represents a tremendous opportunity to leverage social and professional
interactions to derive new insights.
LinkedIn represents a company in which data itself is the product. Early on, LinkedIn
founder Reid Hoffman saw the opportunity to create a social network for working
professionals.
As of 2014, LinkedIn had more than 250 million user accounts and had added many
additional features and data-related products, such as recruiting, job seeker tools,
advertising, and InMaps, which show a social graph of a user's professional network.

CHARACTERISTICS OF DATA
1. Composition: The composition of data deals with the structure of data, that is, the
sources of data, the granularity, the types, and the nature of data as to whether it is static
or real-time streaming
2. Condition: The condition of data deals with the state of data, that is, "Can one use this
data as is for analysis?" or "Does it require cleansing for further enhancement and
enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was
this data generated?", "How sensitive is this data?", "What are the events associated with this
data?" and so on.

[Figure: The three aspects of data: Composition, Condition, and Context.]

Small data (data as it existed prior to the big data revolution) is about certainty. It is about known data
sources; it is about no major changes to the composition or context of data. Most often we have
answers to queries like why this data was generated, where and when it was generated, exactly how
we would like to use it, what questions will this data be able to answer, and so on. Big data is about
complexity.

Complexity in terms of multiple and unknown datasets, in terms of exploding volume, in terms of
the speed at which the data is being generated and the speed at which it needs to be processed, and
in terms of the variety of data (internal or external, behavioural or social) that is being generated.

EVOLUTION OF BIG DATA


The 1970s and before was the era of mainframes. The data was essentially primitive and structured.
Relational databases evolved in the 1980s and 1990s; this was the era of data-intensive applications.
The World Wide Web (WWW) and the Internet of Things (IoT) have led to an
onslaught of structured, unstructured, and multimedia data.
DEFINITION OF BIG DATA
 Big data consists of high-velocity and high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision making.
 Big data refers to datasets whose size is typically beyond the storage capacity of, and too
complex for, traditional database software tools.
 Big data is anything beyond the human and technical infrastructure needed to support
storage, processing and analysis.
 It is data that is big in volume, velocity and variety.

Variety: Data can be structured, semi-structured or unstructured. Data stored in a relational
database is an example of structured data. HTML data, XML data, email data and CSV
files are examples of semi-structured data. PowerPoint presentations, images, videos,
research reports, white papers, the body of an email, etc. are examples of unstructured data.
Velocity: Velocity essentially refers to the speed at which data is being created in real time.
We have moved from simple desktop applications, such as payroll applications, to real-time
processing applications.
Volume: Volume can be in terabytes, petabytes or zettabytes. According to the Gartner glossary,
big data is high-volume, high-velocity and/or high-variety information assets that demand cost-
effective, innovative forms of information processing that enable enhanced insight and
decision making.

CHALLENGES WITH BIG DATA


Data volume: Data today is growing at an exponential rate, and this high tide of data will
continue to rise. The key questions are:
"Will all this data be useful for analysis?",
"Do we work with all this data or a subset of it?",
"How will we separate the knowledge from the noise?", etc.
Storage: Cloud computing is the answer to managing infrastructure for big data as far as
cost-efficiency, elasticity and easy upgrading/downgrading are concerned. However, this further
complicates the decision to host big data solutions outside the enterprise.
Data retention: How long should one retain this data? Some data may be required for long-term
decisions, but some data may quickly become irrelevant and obsolete.
Skilled professionals: In order to develop, manage and run those applications that generate
insights, organizations need professionals who possess a high-level proficiency in data
sciences.
Other challenges: Other challenges of big data are with respect to capture, storage, search,
analysis, transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond the storage capacity
of traditional database software tools. There is no explicit definition of how big a dataset
should be for it to be considered big data. Data visualization (computer graphics) is becoming
popular as a separate discipline, and there are very few data visualization experts.

INTRODUCTION TO BIG DATA ANALYTICS


Big Data Analytics is...
1. Technology-enabled analytics: Quite a few data analytics and visualization tools are
available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics,
Statistica, World Programming Systems (WPS), etc., to help process and analyze your big
data.
2. About gaining a meaningful, deeper, and richer insight into your business to steer it in the
right direction: understanding the customer's demographics to cross-sell and up-sell to
them, better leveraging the services of your vendors and suppliers, etc.
3. About a competitive edge over your competitors by enabling you with findings that allow
quicker and better decision-making.
4. A tight handshake between three communities: IT, business users, and data scientists.
5. Working with datasets whose volume and variety exceed the current storage and
processing capabilities and infrastructure of your enterprise.

6. About moving code to data. This makes perfect sense as the program for distributed
processing is tiny (just a few KBs) compared to the data (terabytes or petabytes today and
likely to be exabytes or zettabytes in the near future).

CLASSIFICATION OF ANALYTICS

Descriptive Analytics is the foundation of business intelligence and data analysis, providing
insights into past performance. This form of analytics involves gathering data from various
sources and compiling it into a format that is easy to understand and interpret. The primary
goal here is to identify what has happened over a specific period. Techniques such as data
aggregation and mining are commonly used to summarize data, helping businesses
understand trends and patterns. For instance, a company may use descriptive analytics to
assess historical sales data to understand which products are performing well.

Diagnostic Analytics takes the insights gained from descriptive analytics a step further. It
focuses on understanding why something happened. This type of analysis often involves
more sophisticated data techniques like drill-down, data discovery, and correlation analysis.
For example, if a company notices a sudden drop in sales for a particular product, diagnostic
analytics can be used to find the cause, such as market trends or changes in customer
preferences. By identifying the factors that influence outcomes, businesses can make more
informed decisions.

Predictive Analytics is about forecasting future events based on historical data. It employs
statistical models and algorithms to identify the likelihood of future outcomes. This type of
analytics is forward-looking, using patterns found in historical and transactional data to
identify risks and opportunities. Predictive models are used in various business operations,
from anticipating customer behaviors to identifying potential risks in investment strategies.
Predictive analytics can, for example, help a retail company forecast future sales based on
seasonal trends, promotions, and other factors.

Prescriptive Analytics is one of the most advanced forms of analytics, which not only
anticipates what will happen and when it will happen but also why it will happen. This
analysis provides advice on possible outcomes and recommends actions to achieve desired
goals. It involves techniques like optimization and simulation, enabling decision-makers to
understand the effect of future decisions before they are made. This approach is particularly
useful in complex scenarios where there are many variables and constraints. For instance, a
logistics company may use prescriptive analytics to optimize routing and distribution
schedules, thereby reducing costs and improving efficiency.

CHALLENGES OF BIG DATA


There are seven main challenges of big data: scale, security, schema, continuous
availability, consistency, partition tolerance and data quality.

Scale: Storage (RDBMS (Relational Database Management System) or NoSQL (Not only
SQL)) is one major concern that needs to be addressed to handle the need for scaling rapidly
and elastically. The need of the hour is a storage system that can best withstand the onslaught
of the large volume, velocity and variety of big data. Should you scale vertically or should you
scale horizontally?

Security: Most of the NoSQL big data platforms have poor security mechanisms (lack of
proper authentication and authorization mechanisms) when it comes to safeguarding big
data. This is a gap that cannot be ignored, given that big data carries credit card information,
personal information and other sensitive data.

Schema: Rigid schemas have no place. We want the technology to be able to fit our big data
and not the other way around. The need of the hour is a dynamic schema; static (pre-defined)
schemas are obsolete.

Continuous availability: The big question here is how to provide 24/7 support because
almost all RDBMS and NoSQL big data platforms have a certain amount of downtime built
in.

Consistency: Should one opt for consistency or eventual consistency?

Partition tolerance: How to build partition-tolerant systems that can take care of both
hardware and software failures?

Data quality: How to maintain data quality- data accuracy, completeness, timeliness, etc.?
Do we have appropriate metadata in place?

IMPORTANCE OF BIG DATA


Let us study the various approaches to analysis of data and what they lead to.
Reactive - Business Intelligence: What does Business Intelligence (BI) help us with? It allows
the businesses to make faster and better decisions by providing the right information to the
right person at the right time in the right format. It is about analysis of the past or historical
data and then
displaying the findings of the analysis or reports in the form of enterprise dashboards, alerts,
notifications, etc. It has support for both pre-specified reports as well as ad hoc querying.
Reactive - Big Data Analytics: Here the analysis is done on huge datasets but the approach
is still reactive as it is still based on static data.
Proactive - Analytics: This is to support futuristic decision making by the use of data mining,
predictive modelling, text mining, and statistical analysis. This analysis is not done on big data,
as it still uses traditional database management practices, and it therefore has severe
limitations on storage capacity and processing capability.
Proactive - Big Data Analytics: This is sifting through terabytes, petabytes, and exabytes of
information to filter out the relevant data to analyze. This also includes high-performance
analytics to gain rapid insights from big data and the ability to solve complex problems
using more data.

BIG DATA TECHNOLOGIES


Big Data technologies refer to the tools and methodologies employed to handle and analyze
extremely large data sets, which are too complex and voluminous to be processed by
traditional data-processing software. The aim of these technologies is to extract meaningful
insights, identify patterns, and make decisions based on the analysis of big data. Here's an
overview of some key Big Data technologies:

Hadoop Ecosystem
Apache Hadoop: An open-source framework that allows for the distributed processing of
large data sets across clusters of computers. It's designed to scale up from single servers to
thousands of machines, each offering local computation and storage.
HDFS (Hadoop Distributed File System): A distributed file system designed to run on
commodity hardware. It provides high throughput access to application data and is suitable
for applications with large data sets.
MapReduce: A programming model and processing technique for distributed computing. It
simplifies data processing on large clusters and is key for big data processing.
YARN (Yet Another Resource Negotiator): Manages and allocates cluster resources,
improving the system's efficiency and scalability.
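
The MapReduce model listed above can be illustrated without any Hadoop installation. The
following is a minimal, hypothetical pure-Python sketch of the map, shuffle, and reduce phases
for a word count; a real Hadoop job would distribute these phases across a cluster rather than
run them in one process.

from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework does between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    for word, counts in grouped:
        yield word, sum(counts)

docs = ["big data needs distributed processing",
        "hadoop supports distributed processing of big data"]
print(dict(reduce_phase(shuffle(map_phase(docs)))))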

Data Storage and Management


NoSQL Databases (e.g., MongoDB, Cassandra): These are non-relational databases
designed for large-scale data storage and for massively-parallel data processing across a
large number of commodity servers.
Data Warehouses (e.g., Amazon Redshift, Google BigQuery): These are central repositories
of integrated data from one or more disparate sources, optimized for analysis and querying
of large datasets.

Data Processing and Analytics


Apache Spark: An open-source, distributed computing system that provides an interface for
programming entire clusters with implicit data parallelism and fault tolerance. It's known
for its ability to process large data sets quickly.
Apache Flink: A framework and distributed processing engine for stateful computations
over unbounded and bounded data streams. Flink has been designed to run in all common
cluster environments and perform computations at in-memory speed.
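
As a small illustration of Apache Spark's programming interface, the sketch below assumes the
pyspark package is installed and uses a tiny, made-up DataFrame of customer transactions; on a
cluster the same code would run distributed across the worker nodes.

from pyspark.sql import SparkSession

# Start a local Spark session (on a cluster this would connect to the cluster manager)
spark = SparkSession.builder.appName("BigDataExample").getOrCreate()

# A tiny, hypothetical dataset of customer transactions
df = spark.createDataFrame(
    [("alice", 120.0), ("bob", 340.5), ("carol", 89.9), ("bob", 55.0)],
    ["customer", "amount"],
)

# Declarative, distributable operations: filter large transactions and count them per customer
df.filter(df.amount > 100).groupBy("customer").count().show()

spark.stop()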

Real-time Data Processing


Apache Storm: A system for processing streaming data in real-time. It's commonly used for
real-time analytics, online machine learning, continuous computation, and more.
Apache Kafka: A distributed streaming platform that's used to build real-time data
pipelines and streaming apps. It is horizontally scalable, fault-tolerant, and wicked fast.

Machine Learning and Advanced Analytics


TensorFlow, PyTorch: Open-source libraries for numerical computation and machine
learning. They enable advanced analytics solutions, like deep learning, on big data.
Apache Mahout, MLlib (Spark's Machine Learning Library): These provide machine
learning algorithms optimized for distributed computing.

Cloud-Based Big Data Solutions


AWS, Azure, Google Cloud: Major cloud service providers offer comprehensive big data
services, covering data storage, processing, analytics, and machine learning capabilities,
often with managed services that simplify the operation of big data infrastructure.

Integration and ETL Tools


Apache NiFi, Talend: These tools are used for data integration, data transformation, and
ETL (Extract, Transform, Load) processes, which are crucial for transforming and moving
large volumes of data into and out of big data systems.

Visualization Tools
Tableau, Power BI: While not exclusively for big data, these tools are capable of connecting
to various big data sources, enabling the visualization and analysis of large datasets in a
more user-friendly manner.

Big Data technologies continue to evolve rapidly, with a growing emphasis on cloud-based
solutions, real-time processing, and advanced machine learning capabilities. These
technologies are crucial in a wide range of industries, including finance, healthcare, retail,
and telecommunications, for driving innovation, understanding customer behavior, and
optimizing operational efficiency.

IMPORTANCE OF BIG DATA ANALYTICS


Big Data Analytics plays a crucial role in modern business and technology landscapes. Its
importance is multifaceted and touches on several key areas:

1. Informed Decision-Making
Big Data Analytics provides deep insights derived from the analysis of large volumes of
data. These insights help businesses make more informed, data-driven decisions. By
analyzing data from various sources, companies can identify trends, patterns, and
correlations that would not be apparent otherwise.
2. Enhanced Customer Experiences
Companies use Big Data Analytics to understand customer behaviors and preferences better.
This understanding enables them to tailor their products, services, and interactions to meet
the specific needs and desires of their customers, resulting in improved customer satisfaction
and loyalty.
3. Operational Efficiency
By analyzing large datasets, organizations can identify inefficiencies and bottlenecks in their
operations. Big Data Analytics can help in optimizing processes, reducing costs, and
improving overall efficiency. For instance, in supply chain management, analytics can help
predict inventory needs, optimize delivery routes, and reduce operational costs.
4. Competitive Advantage
In a highly competitive business environment, leveraging Big Data Analytics can provide a
significant advantage. Companies can use analytics to identify market trends, anticipate
customer needs, and stay ahead of their competitors.
5. Risk Management
Big Data Analytics is crucial in risk assessment and management. By analyzing historical
data, companies can predict potential risks and take proactive measures to mitigate them.
This is particularly important in industries like finance and insurance, where predicting and
managing risk is a core function.
6. Innovation and Product Development
Insights derived from Big Data Analytics can drive innovation and new product
development. Companies can identify gaps in the market, emerging trends, and customer
needs that are not currently met, leading to the development of new and improved products
and services.
7. Predictive Analytics
Beyond understanding current trends, Big Data Analytics allows for predictive modeling.
Businesses can forecast future trends, customer behaviors, and market dynamics, enabling
them to prepare and adapt in advance.
8. Personalization and Targeting
Big Data Analytics enables hyper-personalization in marketing and advertising. Companies
can tailor their messages and offers to individual customers based on their specific behaviors
and preferences, resulting in more effective marketing strategies.
9. Improved Healthcare Outcomes
In healthcare, Big Data Analytics plays a critical role in patient care and medical research.
Analyzing patient data helps in diagnosing diseases early, predicting outbreaks, and
developing new treatments.
10. Social Impact and Public Services
Governments and public service organizations use Big Data Analytics to enhance their
services, from urban planning and transportation management to social welfare programs
and environmental protection.
11. Handling Complex and Diverse Data
Big Data Analytics provides the tools and methodologies to handle the variety, velocity, and
volume of data generated in the modern world, much of which is unstructured and complex.
DATA ANALYTICS TOOLS

Data analytics tools are essential in transforming raw data into meaningful insights for
decision-making and strategic planning. These tools range from software for simple data
visualization to complex predictive modeling. Here's an overview of various types of data
analytics tools:

1. Data Visualization Tools


Tableau: Known for its powerful data visualization capabilities, Tableau allows users to
create interactive and shareable dashboards.
Microsoft Power BI: A robust business analytics tool from Microsoft, offering
comprehensive business intelligence and data visualization.
QlikView/Qlik Sense: Offers user-driven business intelligence and data visualization,
known for its associative data modeling.

2. Statistical Analysis Tools


R and RStudio: An open-source programming language and software environment for
statistical computing and graphics, widely used among statisticians and data miners.
SAS (Statistical Analysis System): A software suite developed for advanced analytics,
multivariate analysis, business intelligence, data management, and predictive analytics.
SPSS (Statistical Package for the Social Sciences): A software package used for interactive,
or batched, statistical analysis, popular in social sciences and market research.

3. Database Management Tools


SQL (Structured Query Language): The standard language for relational database
management, essential for managing and querying structured data.
MySQL, PostgreSQL: Popular open-source relational database management systems,
known for reliability and robustness.

4. Big Data Processing Tools


Apache Hadoop: An open-source framework that allows for the distributed processing of
large data sets across clusters of computers.
Apache Spark: Known for its speed and ease of use, Spark offers sophisticated analytics
capabilities and is particularly good at processing real-time data.

5. Data Warehousing Tools


Amazon Redshift: A fast, scalable data warehouse service from Amazon Web Services.
Google BigQuery: A serverless data warehouse that enables super-fast SQL queries using
the processing power of Google's infrastructure.

6. Data Preparation Tools


Alteryx: Offers data blending, advanced analytics, and data preparation capabilities.
Talend: Known for its data integration and transformation capabilities, particularly in cloud
and big data environments.

7. Business Intelligence (BI) Tools


SAP BusinessObjects: Provides a suite of front-end applications that allow business users to
view, sort, and analyze business intelligence data.
Oracle BI: A comprehensive suite of enterprise BI products, delivering a full range of
capabilities, including interactive dashboards.

8. Predictive Analytics and Machine Learning Tools


Python with libraries like Pandas, NumPy, SciPy, Scikit-learn: Python is a versatile
programming language, and with these libraries, it becomes a powerful tool for data
analysis and machine learning.
TensorFlow, PyTorch: Open-source libraries for machine learning and artificial intelligence,
particularly strong in deep learning applications.
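
As a small, hypothetical illustration of this Python stack, the sketch below generates synthetic
data with NumPy and fits a scikit-learn logistic regression model; the data, split ratio, and
metric are illustrative choices rather than a prescribed workflow.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: two numeric predictors and a binary target derived from them
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test set, fit the model, and evaluate it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))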

9. ETL (Extract, Transform, Load) Tools


Informatica PowerCenter: A widely used ETL tool, known for its ability to enable lean
integration, data governance, and data integration.
Apache NiFi: An integrated data logistics platform for automating the movement of data
between disparate systems.

10. Data Mining Tools


RapidMiner: Provides an integrated environment for data preparation, machine learning,
deep learning, text mining, and predictive analytics.
WEKA (Waikato Environment for Knowledge Analysis): A collection of machine learning
algorithms for data mining tasks, particularly useful in academic and research settings.

DATA COLLECTIONS: TYPES OF DATA SOURCES


Data collection is a critical aspect of any data analysis or research project. The type and
quality of data collected can significantly impact the insights and conclusions drawn. Data
sources can be broadly categorized into various types based on their origin, nature, and
method of collection. Here's an overview of the types of data sources:
1. Primary Data Sources
Surveys and Questionnaires: Directly collecting data from subjects through structured
questionnaires or surveys.
Interviews: Gathering information through direct, one-on-one interactions.
Observations: Data collected through observation of subjects in a natural or controlled
environment.
Experiments: Data obtained from experimental settings where variables are manipulated to
observe outcomes.
2. Secondary Data Sources
Published Data: Data available in published sources such as books, magazines, research
papers, and reports.
Government Publications: Data released by government agencies, like census data,
economic indicators, and demographic information.
Online Databases and Repositories: Access to databases that collate information from
multiple sources, such as academic journals, industry reports, and historical data archives.
Commercial Data Sources: Data provided by commercial data services which may include
market research reports, customer databases, and sales records.
3. Tertiary Data Sources
Reference Materials: Encyclopedias, bibliographies, directories, which provide summaries
or compilations of secondary data.
Indexes and Abstracts: Tools that help locate relevant primary and secondary data.
4. Quantitative Data Sources
Statistical Data: Numerical data that can be analyzed using statistical methods. This
includes sales figures, financial records, and performance metrics.
Sensor Data: Data generated by sensors, such as IoT devices, which can include anything
from temperature readings to movement data.
5. Qualitative Data Sources
Interview Transcripts: Detailed records of qualitative interviews.
Focus Groups: Discussions and interactions with a selected group of people to gather more
in-depth insights on specific topics.
Case Studies: Detailed examination of specific instances or cases, valuable for in-depth
analysis.
6. Big Data Sources
Social Media Data: Data generated from social media platforms, including user posts, likes,
shares, and comments.
Web Analytics: Data collected from web sources, including website traffic, user behavior,
and interactions.
Transactional Data: Data generated from transactions, such as purchase history, usage
patterns, and customer interactions.
7. Real-Time Data Sources
Streaming Data: Data that is generated continuously by sources such as sensors, logs, or
transactions.
Live Feeds: Data streams that are updated in real-time, such as stock market feeds, weather
data, or live social media updates.
8. Geospatial Data Sources
Satellite Imagery: Images and data collected via satellites, used for environmental,
geographical, and urban planning studies.
GIS Data: Geographic Information System data, including maps, terrain data, and location-
based data.
9. Administrative Data Sources
Records and Registers: Data maintained by organizations in the regular course of business,
like employee records, customer databases, and administrative records.
Electronic Health Records: Patient data maintained by healthcare providers, containing
medical histories, diagnoses, treatment plans, and outcomes.
10. Publicly Available Data Sources
Open Data Portals: Government or organization-provided portals where datasets are made
freely available to the public.
Public Records: Documents or pieces of information that are not considered confidential,
like property records, court records, and business licenses.

Each of these data sources has its strengths and limitations, and the choice of data source
largely depends on the research objectives, the nature of the analysis, budget constraints,
and the availability of data. In many cases, a combination of these sources is used to provide
a comprehensive view and to validate findings.

SAMPLING
 Sampling is essential for building analytical models to represent future customer
behavior accurately, balancing between robustness and representativeness.
 The optimal time window for sampling should consider both the quantity of data and its
recency to ensure representativeness.
 Sampling bias, particularly in scenarios like credit scoring, must be addressed carefully
to ensure the sample reflects the target population accurately.
 Techniques such as bureau-based inference are used to correct bias, but no method is
perfect.
 Stratified sampling ensures proportional representation of subgroups within the sample,
which is crucial for skewed datasets such as churn prediction or fraud detection (see the
sketch after this list).
 The historical Through-the-Door (TTD) population divides into accepted and rejected
applicants.
 The accepted subset is used for modeling, causing an inherent bias.
 Reject inference methods attempt to address this bias.
 No method perfectly solves the bias issue.
 Bureau-based inference is a common but imperfect solution to infer creditworthiness for
rejected applicants.
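
The following is a sketch of stratified sampling under assumed data (a made-up churn table with
a 5 percent churn rate). It uses pandas together with scikit-learn's train_test_split and its
stratify argument so that both the modeling and holdout samples keep the same churn proportion.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, heavily skewed churn dataset: 5% churners, 95% non-churners
df = pd.DataFrame({
    "customer_id": range(1000),
    "churn": [1] * 50 + [0] * 950,
})

# Stratified split: the churn rate is preserved in both samples
train, holdout = train_test_split(df, test_size=0.3, stratify=df["churn"], random_state=1)
print(train["churn"].mean(), holdout["churn"].mean())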

TYPES OF DATA ELEMENTS


It is important to appropriately consider the different types of data elements at the start of
the analysis. The following types of data elements can be considered:
■ Continuous: These are data elements that are defined on an interval that can be limited or
unlimited. Examples include income, sales, RFM (recency, frequency, monetary).
■ Categorical:
Nominal: These are data elements that can only take on a limited set of values with no
meaningful ordering in between. Examples include marital status, profession, purpose of
loan.
Ordinal: These are data elements that can only take on a limited set of values with a
meaningful ordering in between. Examples include credit rating; age coded as young,
middle aged, and old.
Binary: These are data elements that can only take on two values. Examples include
gender, employment status.
Appropriately distinguishing between these different data elements is of key importance when
importing the data into an analytics tool at the start of the analysis. For example, if marital
status were to be incorrectly specified as a continuous data element, then the software would
calculate its mean, standard deviation, and so on, which is obviously meaningless. The sketch
below shows how these types can be declared explicitly.
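
A minimal pandas sketch, with made-up column names and values, of declaring continuous,
nominal, ordinal, and binary elements so that the tool treats each one appropriately:

import pandas as pd

# Hypothetical applicant data illustrating the different element types
df = pd.DataFrame({
    "income": [42000.0, 55000.0, 61000.0],              # continuous
    "marital_status": ["single", "married", "single"],  # nominal
    "credit_rating": ["AA", "B", "A"],                   # ordinal
    "employed": [True, False, True],                     # binary
})

# Declare categorical columns explicitly so they are not treated as numbers or plain text
df["marital_status"] = df["marital_status"].astype("category")
df["credit_rating"] = pd.Categorical(df["credit_rating"],
                                     categories=["B", "A", "AA"], ordered=True)
print(df.dtypes)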

VISUAL DATA EXPLORATION AND EXPLORATORY STATISTICAL ANALYSIS


 Visual data exploration is an informal method to familiarize oneself with the dataset.
 It helps to identify initial patterns, trends, and anomalies in the data.
 Pie charts are particularly useful for comparing proportional distributions among
categories.
 Bar charts are effective for showing the frequency of data points in different categories.
 Histograms facilitate the understanding of the distribution, spread, and skewness of the
data.
 Scatter plots are essential for spotting relationships and correlations between two
variables.
 OLAP-based analysis assists in investigating complex data from multiple perspectives.
 Statistical measurements provide deeper insights, like comparing averages or variations
within target groups.
 Differences in statistical measures across categories can signal meaningful patterns that
merit further analysis.
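
The plots mentioned above can be produced with standard Python tooling. The sketch below uses
synthetic income and age data to draw a histogram (for distribution, spread, and skewness) and a
scatter plot (for the relationship between two variables); the variables and sample sizes are
made up for illustration.

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: skewed income values and roughly normal ages
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.4, size=500)
age = rng.normal(40, 12, size=500)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(income, bins=30)            # distribution, spread, and skewness
axes[0].set_title("Income histogram")
axes[1].scatter(age, income, alpha=0.5)  # relationship between two variables
axes[1].set_title("Age vs. income")
plt.tight_layout()
plt.show()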

MISSING VALUES
 Missing values may occur when the information is not applicable, such as churn time for
non-churners.
 Privacy concerns can lead to undisclosed information, like a customer not revealing
income.
 Errors during data collection or merging, like typos, can also result in missing data.
 Some analytical methods, such as decision trees, can inherently manage missing values.
 Other techniques require additional preprocessing to handle missing values effectively.

 Imputation: Replace missing values with a statistical measure like mean, median, or
mode, or use regression models to estimate missing values.
 Deletion: Remove data points or features with many missing values, assuming the
missing data is random and not informative.
 Retention: Treat missing values as a separate category if they hold meaningful
information, such as undisclosed income indicating unemployment.
To address missing data:
 Statistically test the relationship between missing data and the target variable.
 If related, categorize missing data as a separate class.
 If unrelated, choose to impute or delete based on the dataset size.
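
A small pandas sketch of the three strategies, using a made-up table with missing income and age
values; the median imputation, the missing-value flag, and the deletion rule are all illustrative
choices rather than fixed recommendations.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52000, np.nan, 61000, np.nan, 47000],
    "age": [34, 45, np.nan, 29, 51],
    "churn": [0, 1, 0, 1, 0],
})

# Imputation: replace missing income with the median income
df["income_imputed"] = df["income"].fillna(df["income"].median())

# Retention: keep a flag so "missing" can act as a separate category if it is informative
df["income_missing"] = df["income"].isna().astype(int)

# Deletion: drop rows where income or age is missing (assumes the missingness is uninformative)
df_clean = df.dropna(subset=["income", "age"])
print(df_clean)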

OUTLIER DETECTION AND TREATMENT


 Outliers are extreme values that differ significantly from other observations.
 There are two types: valid outliers (e.g., exceptionally high salary) and invalid outliers
(e.g., impossible age).
 Univariate outliers stand out in one dimension, while multivariate outliers are extreme
across multiple dimensions.
 Detection often involves checking for minimum and maximum values.
 Treatment for outliers may vary depending on their nature and impact on the data.
 Histograms help detect outliers by showing the distribution of data, where outliers often
appear as isolated bars.
 Box plots are useful for visualizing outliers, representing key quartiles and showing data
beyond 1.5 times the interquartile range as potential outliers.
 Z-scores calculate how far an observation deviates from the mean in terms of standard
deviations.
 A common rule is to label data as outliers if the z-score is greater than 3.
 These methods typically focus on detecting univariate outliers.
 Multivariate outliers can be identified by fitting a regression model and observing
deviations from the fitted line, or by methods such as clustering or the Mahalanobis distance.
 These techniques, while useful, are often omitted in modeling due to minimal impact on
performance.
 Some models, like decision trees, neural networks, and SVMs, are less affected by
outliers.
 Outliers representing invalid data may be treated as missing values.
 For valid outliers, capping or winsorizing is used, where data beyond set limits is
trimmed.
 The limits for truncation are often based on z-scores or the interquartile range (IQR).
 Sigmoid transformations can also be applied for capping to constrain values within a
specific range.
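
The following is a sketch of the z-score and IQR rules and of capping, applied to a synthetic
income column with two injected extreme values; the thresholds simply follow the rules of thumb
mentioned above.

import numpy as np
import pandas as pd

# Synthetic income data with two injected extreme values
rng = np.random.default_rng(7)
income = pd.Series(np.append(rng.normal(50000, 8000, 200), [250000, 3000000]))

# Z-score rule: flag observations more than 3 standard deviations from the mean
z = (income - income.mean()) / income.std()
outliers_z = income[z.abs() > 3]

# IQR rule: flag values beyond 1.5 times the interquartile range from the quartiles
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Treatment by capping (winsorizing): truncate valid outliers to the IQR limits
income_capped = income.clip(lower=lower, upper=upper)
print(len(outliers_z), round(income_capped.max()))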

STANDARDIZING DATA
 Standardization adjusts variables to a similar scale, enhancing comparability in models.
 Min/max standardization rescales data to a defined range, like 0 to 1.
 Z-score standardization transforms data based on the mean and standard deviation.
 Decimal scaling normalizes data by dividing by a power of 10, based on the largest
value.
 Standardization is crucial for regression analyses but not required for decision trees.
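
The three rescaling methods can be written in a few lines of NumPy; the income values below are
made up for illustration.

import numpy as np

income = np.array([25000.0, 48000.0, 61000.0, 150000.0])

# Min/max standardization: rescale to the range [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score standardization: subtract the mean, divide by the standard deviation
z_score = (income - income.mean()) / income.std()

# Decimal scaling: divide by a power of 10 chosen from the largest absolute value
power = np.ceil(np.log10(np.abs(income).max()))
decimal_scaled = income / (10 ** power)

print(min_max, z_score, decimal_scaled, sep="\n")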

CATEGORIZATION
 Categorization in data analysis is the process of converting continuous variables into
categorical ones. This method is often applied to simplify the data, enhance
understanding, and prepare for specific types of analysis that require categorical input.
Examples:
 Age Categorization: Instead of using exact ages, individuals might be grouped into
'Youth' (0-18), 'Young Adults' (19-35), 'Middle-aged' (36-50), and 'Senior' (51+). This
simplifies analysis and is particularly useful in surveys or studies focusing on
demographic impacts.

 Income Bands: Rather than precise income figures, you could categorize income into
'Low' (e.g., under $20,000), 'Medium' ($20,000 to $100,000), and 'High' (over $100,000) to
study economic behaviors or preferences across different economic groups.

 Educational Levels: Education years can be binned into 'No High School' (0-11 years),
'High School Graduate' (12 years), 'Some College' (13-15 years), and 'College Degree or
higher' (16+ years) for studies relating to educational attainment's impact on job
opportunities or earning potential.

 Credit Score Ranges: Credit scores might be categorized into ranges like 'Poor', 'Fair',
'Good', 'Very Good', and 'Excellent', which is often done in the financial industry to
assess creditworthiness.
Categorization aids in reducing the noise in the data, making it easier for certain algorithms
to detect patterns. It is also beneficial when visualizing data, as it can reveal trends that
might not be apparent when dealing with continuous variables. However, one must be
cautious as overly coarse categorization can lead to loss of information and potential
insights.
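
A short pandas sketch of the age categorization described above; the bin edges mirror the example
groupings and the ages themselves are made up.

import pandas as pd

ages = pd.Series([5, 17, 24, 37, 49, 62, 71])

# Bin exact ages into the categories used in the example above
age_groups = pd.cut(
    ages,
    bins=[0, 18, 35, 50, 120],
    labels=["Youth", "Young Adults", "Middle-aged", "Senior"],
)
print(age_groups.value_counts())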

WEIGHTS OF EVIDENCE CODING

Weights of Evidence (WoE) coding is a statistical technique used primarily in the
development of predictive models, particularly in the financial industry for risk modeling. It
involves converting categorical variables into a continuous scale, representing the predictive
power of each category in relation to the target variable. The WoE value is calculated using
the natural logarithm of the distribution of good outcomes versus bad outcomes within each
category. This transformation allows for a linear relationship with the log odds of the
dependent variable, which is useful in logistic regression. It helps to assess the importance of
categories and can also handle missing values effectively.

Let's consider a simple dataset with two variables: Credit History (Good, Bad) and Loan
Default (Yes, No). Here's how WoE could be calculated:

Data Count:

Good Credit History, No Default: 70
Good Credit History, Yes Default: 30
Bad Credit History, No Default: 20
Bad Credit History, Yes Default: 80

Calculating Proportions:

Proportion of No Default with Good Credit: 70 / (70 + 30) = 0.7
Proportion of Default with Good Credit: 30 / (70 + 30) = 0.3
Proportion of No Default with Bad Credit: 20 / (20 + 80) = 0.2
Proportion of Default with Bad Credit: 80 / (20 + 80) = 0.8

WoE Calculation:

WoE for Good Credit: ln(0.7 / 0.3) ≈ 0.85
WoE for Bad Credit: ln(0.2 / 0.8) ≈ -1.39
The WoE values will show the strength of the prediction for each category of the Credit
History variable in relation to Loan Default. Higher WoE values indicate stronger prediction
power for non-default, while lower (or negative) values indicate stronger prediction for
default.
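
The calculation above can be reproduced in a few lines of Python. This sketch follows the
simplified, within-category version of WoE used in this example (the log of the non-default
proportion over the default proportion in each category).

import math

# Counts from the worked example above
counts = {
    "Good": {"no_default": 70, "default": 30},
    "Bad":  {"no_default": 20, "default": 80},
}

# WoE per category: ln(proportion of non-defaults / proportion of defaults) within the category
for credit_history, c in counts.items():
    total = c["no_default"] + c["default"]
    p_no_default = c["no_default"] / total
    p_default = c["default"] / total
    woe = math.log(p_no_default / p_default)
    print(f"WoE({credit_history}) = ln({p_no_default:.1f} / {p_default:.1f}) = {woe:.2f}")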

VARIABLE SELECTION:
 Analytical models often begin with many variables; however, typically only a few
significantly predict the target variable.
 Credit scoring models, for example, may use 10-15 key variables in a scorecard.
 Filters are a practical method for variable selection, evaluating the univariate correlation
of each variable with the target.
 Filters facilitate a rapid assessment to determine which variables to retain for deeper
analysis.
 A variety of filter measures for variable selection are available in statistical literature.
The Pearson correlation ρP between a variable X and the target Y is calculated as follows:

ρP = Σi (Xi − X̄)(Yi − Ȳ) / ( √Σi (Xi − X̄)² · √Σi (Yi − Ȳ)² )

The Pearson correlation coefficient, which ranges between -1 and +1, can be utilized as a
criterion. Variables can be chosen based on whether the Pearson correlation is significantly
different from 0, as indicated by the p-value. A common practice might be to select variables
with an absolute Pearson correlation |ρP| greater than a certain threshold, such as 0.50,
indicating a moderate to strong correlation with the target variable.

Imagine a dataset containing information about bank customers, which includes variables
like Age, Income, Credit Score, and Loan Default (Yes/No). To select relevant variables for
predicting Loan Default, we could use the Pearson correlation coefficient:

Calculate Pearson Correlation:


 Correlation of Age with Loan Default: -0.10 (weak negative correlation)
 Correlation of Income with Loan Default: -0.45 (moderate negative correlation)
 Correlation of Credit Score with Loan Default: -0.60 (strong negative correlation)

Selection Criterion:

 Choose variables with |ρP| > 0.50.

Variables Selected:

 Credit Score (|ρP| = 0.60) is selected as it has a strong correlation with Loan Default.
 Age and Income are not selected due to weaker correlations.

This process helps in identifying the most influential variables for the prediction of Loan
Default.
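
The filter described above is straightforward to apply with pandas. The sketch below builds a
synthetic bank dataset (the variable names and the 0.50 cut-off mirror the example; the data
itself is made up) and keeps only the variables whose absolute Pearson correlation with the
target exceeds the threshold.

import numpy as np
import pandas as pd

# Synthetic bank customer data; default is driven mostly by credit score
rng = np.random.default_rng(3)
n = 500
credit_score = rng.normal(650, 60, n)
income = rng.normal(50000, 12000, n)
age = rng.normal(40, 10, n)
default = ((700 - credit_score) + rng.normal(0, 40, n) > 60).astype(int)

df = pd.DataFrame({"age": age, "income": income,
                   "credit_score": credit_score, "default": default})

# Filter: keep variables whose absolute Pearson correlation with the target exceeds 0.50
correlations = df.drop(columns="default").corrwith(df["default"])
selected = correlations[correlations.abs() > 0.50].index.tolist()
print(correlations.round(2))
print("Selected:", selected)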
