The 8 Basic Statistics Concepts For Data Science - KDnuggets


9/17/23, 10:31 AM The 8 Basic Statistics Concepts for Data Science - KDnuggets

The 8 Basic Statistics Concepts for Data Science
Understanding the fundamentals of statistics is a core capability for becoming a Data Scientist. Review
these essential ideas that will be pervasive in your work and raise your expertise in the field.
By Shirley Chen, Data Analyst @ Outdoorsy on April 21, 2022 in Data Science


Statistics is a form of mathematical analysis that uses quantified models and
representations for a given set of experimental data or real-life studies. The main
advantage of statistics is that information is presented in an easy way. Recently, I reviewed
all the statistics materials and organized the 8 basic statistics concepts for becoming a data
scientist!

Understand the Type of Analytics
Probability
Central Tendency
Variability
Relationship Between Variables
Probability Distribution
Hypothesis Testing and Statistical Significance
Regression

https://www.kdnuggets.com/2020/06/8-basic-statistics-concepts.html

Understand the Type of Analytics

Descriptive Analytics tells us what happened in the past and helps a business understand
how it is performing by providing context to help stakeholders interpret information.

Diagnostic Analytics takes descriptive data a step further and helps you understand why
something happened in the past.

Predictive Analytics predicts what is most likely to happen in the future and provides
companies with actionable insights based on the information.

Prescriptive Analytics provides recommendations regarding actions that will take
advantage of the predictions and guide the possible actions toward a solution.

Probability

Probability is the measure of the likelihood that an event will occur in a random
experiment.

Complement: P(A) + P(A’) = 1

Intersection: P(A∩B) = P(A)P(B|A), which reduces to P(A)P(B) when A and B are independent

Union: P(A∪B) = P(A) + P(B) − P(A∩B)

Intersection and Union.

Conditional Probability: P(A|B) is the probability that event A occurs given that event B
has occurred. P(A|B)=P(A∩B)/P(B), when P(B)>0.

Independent Events: Two events are independent if the occurrence of one does not affect
the probability of occurrence of the other. P(A∩B)=P(A)P(B); where P(A) != 0 and P(B) != 0,
this is equivalent to P(A|B)=P(A) and P(B|A)=P(B).

Mutually Exclusive Events: Two events are mutually exclusive if they cannot both occur at
the same time. P(A∩B)=0 and P(A∪B)=P(A)+P(B).
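
The rules above can be checked by brute-force enumeration. The following is a minimal sketch, assuming a made-up experiment of rolling two fair dice; the events A (first die is even) and B (second die shows a six) are illustrative choices, not taken from the article.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over outcomes."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def A(o):
    """Hypothetical event A: the first die shows an even number."""
    return o[0] % 2 == 0

def B(o):
    """Hypothetical event B: the second die shows a six."""
    return o[1] == 6

p_a, p_b = prob(A), prob(B)
p_not_a = prob(lambda o: not A(o))
p_a_and_b = prob(lambda o: A(o) and B(o))
p_a_or_b = prob(lambda o: A(o) or B(o))
p_a_given_b = p_a_and_b / p_b  # conditional probability, valid because P(B) > 0

print("Complement rule holds:", p_a + p_not_a == 1)
print("Union rule holds:", p_a_or_b == p_a + p_b - p_a_and_b)
print("P(A|B) =", p_a_given_b)
print("A and B independent:", p_a_and_b == p_a * p_b)
```

Using `Fraction` keeps the arithmetic exact, so the equalities can be tested directly instead of with a floating-point tolerance.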
Bayes’ Theorem describes the probability of an event based on prior knowledge of
conditions that might be related to the event.

Bayes’ Theorem.
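
In its standard form the theorem reads P(A|B) = P(B|A)P(A) / P(B). As a rough illustration, the sketch below applies it to a hypothetical screening test; the prevalence, sensitivity, and false positive rate are invented numbers, and `bayes_posterior` is a helper name introduced only for this example.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' theorem."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 10% false positive rate.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.10)
print(f"P(condition | positive test) = {posterior:.3f}")
```

Even with a fairly accurate test, the posterior stays below 10% here because the prior is so small, which is the kind of effect the theorem makes explicit.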
Central Tendency

Mean: The average of the dataset.

Median: The middle value of an ordered dataset.

Mode: The most frequent value in the dataset. If the data have multiple values that
occurred the most frequently, we have a multimodal distribution.

Skewness: A measure of the asymmetry of a distribution.

Kurtosis: A measure of whether the data are heavy-tailed or light-tailed relative to a normal
distribution.

Skewness.

Kurtosis.
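
The measures above can be computed with Python's standard `statistics` module. This is a small sketch on a made-up sample; skewness and excess kurtosis are calculated with the simple population (moment-based) formulas, which is one of several conventions in use.

```python
import statistics

data = [2, 3, 3, 4, 5, 5, 5, 7, 9, 14]  # made-up sample with a long right tail

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)  # every most-frequent value, as a list

# Population (moment-based) skewness and excess kurtosis.
n = len(data)
sd = statistics.pstdev(data)
skewness = sum((x - mean) ** 3 for x in data) / (n * sd ** 3)
excess_kurtosis = sum((x - mean) ** 4 for x in data) / (n * sd ** 4) - 3

print("mean:", mean, "median:", median, "modes:", modes)
print("skewness:", round(skewness, 3), "excess kurtosis:", round(excess_kurtosis, 3))
```

The mean lands above the median and the skewness is positive, both signs that the single large value pulls the distribution to the right.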

Variability

Range: The difference between the highest and lowest value in the dataset.

Percentiles, Quartiles and Interquartile Range (IQR)

Percentiles: A measure that indicates the value below which a given percentage of
observations in a group of observations falls.

Quartiles: Values that divide the ordered data points into four more or less equal
parts, or quarters.

Interquartile Range (IQR): A measure of statistical dispersion and variability based
on dividing a data set into quartiles. IQR = Q3 − Q1

Percentiles, Quartiles and Interquartile Range (IQR).
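
A short sketch of these quantities using `statistics.quantiles` from the standard library. The data are made up, and the `"inclusive"` interpolation method is an assumption; other tools may use slightly different quantile conventions and return slightly different values.

```python
import statistics

data = [1, 3, 4, 6, 7, 8, 10, 12, 15, 18, 21]  # made-up, already sorted for readability

value_range = max(data) - min(data)

# n=4 returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# n=100 returns 99 cut points; index 89 is the 90th percentile.
p90 = statistics.quantiles(data, n=100, method="inclusive")[89]

print("range:", value_range)
print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
print("90th percentile:", p90)
```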

Variance: The average squared difference of the values from the mean, which measures how
spread out a set of data is relative to the mean.
Standard Deviation: The square root of the variance, which expresses the typical distance
between each data point and the mean in the same units as the data.

Population and Sample Variance and Standard Deviation.

Standard Error (SE): An estimate of the standard deviation of the sampling distribution.

Population and Sample Standard Error.
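
The population and sample versions differ only in the divisor (N versus n − 1). A minimal sketch on made-up data follows; the standard error shown is the usual standard error of the mean, s / √n.

```python
import math
import statistics

sample = [4.0, 7.0, 6.0, 5.0, 8.0]  # made-up measurements

pop_var = statistics.pvariance(sample)    # divides by N
sample_var = statistics.variance(sample)  # divides by n - 1
pop_sd = statistics.pstdev(sample)
sample_sd = statistics.stdev(sample)

# Standard error of the mean, estimated from the sample standard deviation.
standard_error = sample_sd / math.sqrt(len(sample))

print("population variance:", pop_var, "sample variance:", sample_var)
print("population sd:", round(pop_sd, 4), "sample sd:", round(sample_sd, 4))
print("standard error of the mean:", round(standard_error, 4))
```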
Relationship Between Variables

Causality: Relationship between two events where one event is affected by the other.

Covariance: A quantitative measure of the joint variability between two variables.

Correlation: A measure of the strength and direction of the linear relationship between two
variables, ranging from -1 to 1; it is the normalized version of covariance.

Covariance and Correlation.
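
To make the link between the two measures concrete, here is a small sketch that computes the sample covariance by hand and then normalizes it into the Pearson correlation. The `hours` and `scores` data are invented for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]         # hypothetical hours studied
scores = [52, 55, 61, 64, 68]   # hypothetical exam scores

cov = covariance(hours, scores)
r = correlation(hours, scores)
print("covariance:", cov)
print("correlation:", round(r, 4))
```

Covariance depends on the units of both variables, while the correlation is unit-free, which is why it is easier to compare across datasets.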

Probability Distributions
Probability Distribution Functions
Probability Mass Function (PMF): A function that gives the probability that a discrete
random variable is exactly equal to some value.

Probability Density Function (PDF): A function for continuous data where the value at any
given sample can be interpreted as providing a relative likelihood that the value of the
random variable would equal that sample.

Cumulative Distribution Function (CDF): A function that gives the probability that a random
variable is less than or equal to a certain value.

Comparison between PMF, PDF, and CDF.

Continuous Probability Distribution


Uniform Distribution: Also called a rectangular distribution, it is a probability distribution
where all outcomes are equally likely.

Normal/Gaussian Distribution: The curve of the distribution is bell-shaped and
symmetrical. It is related to the Central Limit Theorem, which states that the sampling
distribution of the sample means approaches a normal distribution as the sample size gets larger.

Exponential Distribution: A probability distribution of the time between the events in
a Poisson point process.

Chi-Square Distribution: The distribution of the sum of squared standard normal deviates.
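
The Central Limit Theorem mentioned above can be illustrated with a quick simulation. This is only a sketch: the sample size of 50, the 2000 repetitions, and the fixed seed are arbitrary choices made for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """Mean of n draws from the uniform distribution on [0, 1)."""
    return statistics.fmean(random.random() for _ in range(n))

n = 50
means = [sample_mean(n) for _ in range(2000)]

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the sample means should
# center on 0.5 with a standard deviation close to sqrt(1 / (12 * n)).
observed_center = statistics.fmean(means)
observed_spread = statistics.stdev(means)
expected_spread = (1 / (12 * n)) ** 0.5

print("center of sample means:", round(observed_center, 4))
print("spread of sample means:", round(observed_spread, 4))
print("spread predicted by theory:", round(expected_spread, 4))
```

Although each individual draw is uniform, the sample means cluster tightly and symmetrically around 0.5, and a histogram of `means` would look approximately bell-shaped.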
Discrete Probability Distribution
Bernoulli Distribution: The distribution of a random variable for a single trial with
only 2 possible outcomes, namely 1 (success) with probability p, and 0 (failure) with
probability (1-p).

Binomial Distribution: The distribution of the number of successes in a sequence
of n independent experiments, each with only 2 possible outcomes, namely 1 (success)
with probability p, and 0 (failure) with probability (1-p).

Poisson Distribution: The distribution that expresses the probability of a given number of
events k occurring in a fixed interval of time if these events occur with a known constant
average rate λ and independently of the time since the last event.
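
The probability mass functions of these discrete distributions are short enough to write out directly. The sketch below uses the standard textbook formulas; the helper names and the example parameters are illustrative assumptions.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for the number of successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for events arriving at a constant average rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def cdf(pmf, k, *params):
    """P(X <= k): the CDF of a discrete variable is a running sum of its PMF."""
    return sum(pmf(i, *params) for i in range(k + 1))

# A Bernoulli trial is the n = 1 special case of the binomial distribution.
p_success = binomial_pmf(1, 1, 0.3)
p_three_heads = binomial_pmf(3, 10, 0.5)   # exactly 3 heads in 10 fair coin flips
p_at_most_two = cdf(poisson_pmf, 2, 4.0)   # at most 2 events when the average is 4

print("Bernoulli success:", p_success)
print("3 heads in 10 flips:", p_three_heads)
print("at most 2 Poisson events:", round(p_at_most_two, 4))
```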

Hypothesis Testing and Statistical Significance
Null and Alternative Hypothesis


Null Hypothesis: A general statement that there is no relationship between two measured
phenomena or no association among groups. Alternative Hypothesis: The statement contrary to
the null hypothesis.

In statistical hypothesis testing, a type I error is the rejection of a true null hypothesis,
while a type II error is the non-rejection of a false null hypothesis.

Interpretation
P-value: The probability of the test statistic being at least as extreme as the one observed,
given that the null hypothesis is true. When p-value > α, we fail to reject the null hypothesis;
when p-value ≤ α, we reject the null hypothesis and conclude that the result is statistically
significant.

Critical Value: A point on the scale of the test statistic beyond which we reject the null
hypothesis. It is derived from the distribution of the test statistic, which is specific to the
type of test, and from the significance level α, which defines the sensitivity of the test.

Significance Level and Rejection Region: The rejection region depends on
the significance level. The significance level is denoted by α and is the probability of
rejecting the null hypothesis when it is true.

Z-Test

A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution. It tests the mean of a
distribution in which we already know the population variance. Therefore, many statistical
tests can be conveniently performed as approximate Z-tests if the sample size is large or
the population variance is known.
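
As a rough sketch of how the p-value, critical value, and significance level fit together, the following one-sample, two-sided Z-test uses `statistics.NormalDist` from the standard library. The function name and the input numbers are hypothetical.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, pop_mean, pop_sd, n, alpha=0.05):
    """Two-sided Z-test of H0: the population mean equals pop_mean."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return z, p_value, critical, p_value <= alpha

# Hypothetical data: 100 observations averaging 103 against a claimed mean of 100,
# with a known population standard deviation of 15.
z, p_value, critical, reject = one_sample_z_test(
    sample_mean=103.0, pop_mean=100.0, pop_sd=15.0, n=100
)
print("z =", round(z, 3), "p-value =", round(p_value, 4))
print("critical value =", round(critical, 3), "reject H0:", reject)
```

Comparing |z| with the critical value and comparing the p-value with α are two views of the same decision, so they agree here.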
T-Test

A T-test is the statistical test used when the population variance is unknown and the sample
size is small (n < 30).

A paired sample means that we collect data twice from the same group, person, item, or
thing. Independent samples imply that the two samples come from two separate, unrelated
groups.
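
For the paired case, the test reduces to a one-sample T-test on the differences. The sketch below computes the t statistic by hand on invented before/after measurements. Because the standard library has no t distribution, the critical value 2.262 (two-sided, α = 0.05, 9 degrees of freedom) is taken from a standard t table rather than computed.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """t statistic and degrees of freedom for H0: the mean paired difference is zero."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical scores for the same 10 people, measured twice.
before = [72, 75, 68, 80, 77, 74, 69, 73, 78, 71]
after = [75, 77, 70, 84, 78, 77, 70, 76, 80, 74]
t_stat, df = paired_t_statistic(before, after)

# Two-sided critical value for alpha = 0.05 and df = 9, from a t table.
T_CRITICAL_DF9 = 2.262
reject_null = abs(t_stat) > T_CRITICAL_DF9

print("t =", round(t_stat, 3), "df =", df, "reject H0:", reject_null)
```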

ANOVA (Analysis of Variance)


ANOVA is a way to find out if experimental results are significant. One-way
ANOVA compares the means of two or more independent groups using only one independent
variable. Two-way ANOVA is the extension of one-way ANOVA using two independent
variables to calculate the main effect and interaction effect.

ANOVA Table.
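
The core entries of a one-way ANOVA table (sums of squares, degrees of freedom, and the F statistic) can be sketched in a few lines. The three groups below are made-up data, and `one_way_anova_f` is a helper name introduced for this example.

```python
def one_way_anova_f(*groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of the values around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k

f_stat, df_between, df_within = one_way_anova_f([4, 5, 6], [6, 7, 8], [8, 9, 10])
print("F =", f_stat, "df =", (df_between, df_within))
```

A large F means the group means differ by more than the within-group noise would suggest; the value is then compared with an F distribution on (k − 1, n − k) degrees of freedom.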
Chi-Square Test

Chi-Square Test Formula.

The Chi-Square Test checks whether the observed frequencies in a discrete set of data points
are consistent with the frequencies expected under a model. The Goodness of Fit Test determines
whether a sample matches the population by fitting one categorical variable to a distribution.
The Chi-Square Test for Independence compares two categorical variables to see if there is a
relationship between them.
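
A minimal goodness-of-fit sketch, using the usual statistic χ² = Σ (O − E)² / E. The die-roll counts are invented, and the critical value 11.070 (α = 0.05, 5 degrees of freedom) is read from a standard chi-square table rather than computed.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up counts from 120 rolls of a die; a fair die expects 20 per face.
observed = [18, 22, 16, 25, 19, 20]
expected = [20] * 6
chi2 = chi_square_statistic(observed, expected)

# Critical value for alpha = 0.05 with 6 - 1 = 5 degrees of freedom, from a table.
CHI2_CRITICAL_DF5 = 11.070
reject_fair_die = chi2 > CHI2_CRITICAL_DF5

print("chi-square =", chi2, "reject fairness:", reject_fair_die)
```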

Regression

Linear Regression
Assumptions of Linear Regression

Linear Relationship
Multivariate Normality
No or Little Multicollinearity
No or Little Autocorrelation
Homoscedasticity

Linear Regression is a linear approach to modeling the relationship between a dependent
variable and one independent variable. An independent variable is a variable that is
controlled in a scientific experiment to test the effects on the dependent variable.
A dependent variable is a variable being measured in a scientific experiment.
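
For a single independent variable, the ordinary least squares fit has a closed form. The sketch below fits y = intercept + slope · x to made-up data and also computes R-squared, the share of variation in y explained by the line; the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variation in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]               # hypothetical independent variable
ys = [2.1, 3.9, 6.2, 7.8, 10.1]    # hypothetical dependent variable

intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print("intercept:", round(intercept, 3), "slope:", round(slope, 3))
print("R-squared:", round(r2, 4))
```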

Linear Regression Formula.

Multiple Linear Regression is a linear approach to modeling the relationship between a
dependent variable and two or more independent variables.
Multiple Linear Regression Formula.

Steps for Running the Linear Regression


Step 1: Understand the model description, causality, and directionality

Step 2: Check the data, categorical data, missing data, and outliers

An outlier is a data point that differs significantly from other observations. We can detect
outliers with the standard deviation method or the interquartile range (IQR) method.
A dummy variable takes only the value 0 or 1 to indicate the absence or presence of a
category for categorical variables.

Step 3: Simple Analysis. Check the relationship between the dependent variable and each
independent variable, and between pairs of independent variables

Use scatter plots to check the correlation

Multicollinearity occurs when two or more independent variables are highly
correlated. We can use the Variance Inflation Factor (VIF) to measure it: as a rule of thumb,
VIF > 5 suggests that a variable is highly correlated with the others, and VIF > 10 indicates
serious multicollinearity among the variables.
An Interaction Term implies that the slope for one variable changes depending on the value
of another variable.

Step 4: Multiple Linear Regression. Check the model and the correct variables

Step 5: Residual Analysis

Check that the residuals are approximately normally distributed.

Homoscedasticity describes a situation in which the variance of the error term is the same
across all values of the independent variables, which means that the residuals have a
constant spread along the regression line.

Step 6: Interpretation of Regression Output

R-Squared is a statistical measure of fit that indicates how much variation of a
dependent variable is explained by the independent variables. A higher R-Squared value
represents smaller differences between the observed data and fitted values.
P-value
Regression Equation
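
Two of the data checks from the steps above, the IQR method for outliers (Step 2) and the VIF for multicollinearity (Step 3), can be sketched as follows. The 1.5 × IQR fence is a common convention rather than something the article specifies, and the VIF shortcut 1 / (1 − r²) applies only when there are exactly two predictors; all data and function names here are made up.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is a common convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    return 1 / (1 - pearson_r(x1, x2) ** 2)

prices = [10, 12, 11, 13, 12, 11, 10, 45]   # hypothetical data with one extreme value
outliers = iqr_outliers(prices)

x1 = [1, 2, 3, 4, 5]   # hypothetical predictor 1
x2 = [2, 4, 5, 4, 5]   # hypothetical predictor 2
vif = vif_two_predictors(x1, x2)

print("outliers:", outliers)
print("VIF:", round(vif, 3))
```

With more than two predictors, each VIF comes from regressing one predictor on all the others, which is usually left to a statistics library.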

Shirley Chen is a Data Analyst at Outdoorsy.

Original. Reposted with permission.