Unit 2
Descriptive Statistics
Classification of Data
Classification means arranging the mass (large amount) of data into different classes or groups on the
basis of their similarities (similar features).
Example:
• Collecting the data regarding the number of students admitted to a university in a year, the
students can be classified on the basis of gender.
• In this case, all male students will be put in one class and all female students will be put in
another class.
• The students can also be classified on the basis of age, marks, marital status, height, etc.
• Classifying the data in this way helps to facilitate comparison.
Classification of Data:
1. Qualitative data
2. Quantitative data
3. Discrete Data
4. Continuous Data
5. Chronological Data
6. Geographical Data
1. Simple Classification
2. Manifold Classification
Qualitative data
Classification of data according to a qualitative characteristic such as sex, honesty, intelligence,
marital status, etc.
Quantitative data
Discrete Data
Continuous data
Chronological Data
When the data are classified or arranged by their time of occurrence, such as years, months,
weeks, days, etc.
Geographical Data
When the data are classified by geographical regions or locations, such as states, provinces, cities,
countries, etc.
Levels of Measurement
The level of measurement determines which statistical calculations are meaningful.
The four levels of measurement are: nominal, ordinal, interval, and ratio.
Ordinal: Data at the ordinal level of measurement can be arranged in order, but differences between
data entries are not meaningful.
Interval: Data at the interval level of measurement are quantitative. A zero entry simply represents
a position on a scale; the entry is not an inherent zero.
Ratio: Data at the ratio level of measurement are similar to the interval level, but a zero entry is
meaningful.
A ratio of two data values can be formed, so one data value can be meaningfully expressed as a
multiple of another.
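The difference between the interval and ratio levels can be illustrated with a small sketch. The temperature values below are hypothetical and are not part of these notes; Celsius is used as an interval scale (its zero is only a position on the scale) and Kelvin as a ratio scale (its zero is inherent).

```python
# Interval vs. ratio: a ratio of two values is only meaningful
# when the scale has an inherent zero.
celsius = [10.0, 20.0]                      # interval level (hypothetical readings)
kelvin = [c + 273.15 for c in celsius]      # ratio level: same readings in kelvin

ratio_celsius = celsius[1] / celsius[0]     # arithmetically 2.0, but not meaningful
ratio_kelvin = kelvin[1] / kelvin[0]        # meaningful ratio on a true-zero scale

print(f"Celsius ratio: {ratio_celsius:.3f}")
print(f"Kelvin ratio:  {ratio_kelvin:.3f}")
```

The two ratios disagree, which suggests why "20 °C is twice as hot as 10 °C" is not a valid statement, while differences (10 degrees in both scales) remain meaningful.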
Data in raw form are usually not easy to use for decision making, so some type of organization is
needed:
• Table
• Graph
Tabulation is a systematic & logical presentation of numeric data in rows and columns, to
facilitate comparison and statistical analysis.
(Or)
The method of placing organized data into a tabular form is called tabulation.
• A chart is a graphical representation of data, in which "the data is represented by symbols,
such as bars in a bar chart, lines in a line chart, or slices in a pie chart".
Frequency Distribution
A frequency distribution is a tabular presentation that generally organizes data into classes
and shows the number of observations (frequencies) falling into each of these classes.
Class: A class is one of the categories into which qualitative data can be classified.
Class Frequency: Class frequency is the number of observations in the data set that fall into a
particular class.
For example, a class frequency could be the number of students who scored between 90% and 100%, the
number of married people, or the number of workers in manufacturing.
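As a minimal sketch of how class frequencies are obtained from raw observations, the example below counts how many observations fall into each class. The marital-status observations are hypothetical and only serve as an illustration.

```python
from collections import Counter

# Hypothetical raw observations (not from the notes): marital status of 10 people.
observations = ["married", "single", "married", "single", "married",
                "widowed", "single", "married", "married", "single"]

# Counter maps each class to its class frequency.
class_frequency = Counter(observations)

# Print the frequency distribution, most frequent class first.
for cls, freq in class_frequency.most_common():
    print(f"{cls:<10}{freq}")
```

The class frequencies always add up to the number of observations, which is a quick check on any frequency table.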
(1) Uni-variate frequency distribution: A frequency distribution with one variable is called a
uni-variate frequency distribution.
(2) Bi-variate frequency distribution: A frequency distribution with two variables is called a
bi-variate frequency distribution.
(3) Multi-variate frequency distribution: A frequency distribution with more than two
variables is called a multi-variate frequency distribution.
Graphical methods
Graphical methods for describing Qualitative variables
1. Bar Chart
2. Pareto diagram
3. Pie chart
Bar Chart:
Bar charts and Pie charts are often used for qualitative (category) data
Height of bar or size of pie slice shows the frequency or percentage for each category
A graphical representation of information in the form of bars.
Bars of equal width are drawn to represent different categories, with the length of each
bar being proportional to the number or frequency of occurrence of each category.
Ex:
Type of Aphasia    Frequency
Anomic             10
Broca’s            5
Conduction         7
Total              22
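The aphasia example can be turned into a rough text bar chart. This is only a sketch: each `#` stands for one observation, and the percentage of the total is shown next to each bar. The variable names are illustrative.

```python
# Frequencies taken from the aphasia example.
frequency = {"Anomic": 10, "Broca's": 5, "Conduction": 7}
total = sum(frequency.values())

bar_lines = []
for category, count in frequency.items():
    percent = 100 * count / total
    # Bar length is proportional to the frequency of the category.
    bar_lines.append(f"{category:<12}{'#' * count} {count} ({percent:.1f}%)")

print("\n".join(bar_lines))
```

Because every bar uses the same unit (one character per observation), the bar lengths can be compared directly, which is the point of a bar chart.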
Pareto Diagram
A bar chart, where categories are shown in descending order of frequency
A cumulative polygon is often shown in the same graph
Used to separate the “vital few” from the “trivial many”
The purpose is to highlight the most important among a (typically large) set of factors.
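A Pareto ordering can be sketched by sorting the categories in descending order of frequency and accumulating the percentages. The example reuses the aphasia frequencies from the bar chart example; the names `ordered` and `pareto_rows` are illustrative.

```python
# Frequencies from the aphasia example.
frequency = {"Anomic": 10, "Broca's": 5, "Conduction": 7}
total = sum(frequency.values())

# Sort categories in descending order of frequency.
ordered = sorted(frequency.items(), key=lambda item: item[1], reverse=True)

# Build (category, frequency, cumulative percentage) rows.
pareto_rows = []
running = 0
for category, count in ordered:
    running += count
    pareto_rows.append((category, count, round(100 * running / total, 1)))

for row in pareto_rows:
    print(row)
```

The cumulative percentage column is what the cumulative polygon on a Pareto diagram would trace; it always ends at 100.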
Pie Chart
A graph that displays data in a circular format.
The categories of the qualitative variable are represented by the slices of a pie.
Histogram
Bars of the appropriate heights are used to represent the number of observations within each
class
Ex:
Class                  Frequency
30 but less than 40    5
40 but less than 50    4
50 but less than 60    2
[Figure: histogram of the class frequencies; horizontal axis from 0 to 60 in steps of 10]
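The class frequencies behind a histogram can be computed as in the sketch below. It assumes that the 20 values listed in the ogive section of these notes are the raw data for this example. That is an assumption, although the counts it gives for the classes 30 to 40, 40 to 50 and 50 to 60 (5, 4 and 2) agree with the example. A class width of 10 starting at 10 is also assumed.

```python
# Raw data as listed in the ogive section of these notes (assumed to be
# the same data set as the histogram example).
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

class_width = 10                              # assumed class width
lower_limits = range(10, 60, class_width)     # 10, 20, 30, 40, 50

histogram = {}
for lower in lower_limits:
    upper = lower + class_width
    label = f"{lower} but less than {upper}"
    # Count observations with lower <= x < upper.
    histogram[label] = sum(1 for x in data if lower <= x < upper)

# Text rendering: one '#' per observation in the class.
for label, count in histogram.items():
    print(f"{label:<22}{'#' * count} {count}")
```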
Ogive graph
• An ogive, sometimes called a cumulative frequency polygon, is a type of frequency polygon that
shows cumulative frequencies.
• In other words, the cumulative percentages are added on the graph from left to right.
• An ogive graph plots cumulative frequency on the y-axis and class boundaries along the x-axis.
It is very similar to a histogram, only instead of rectangles, an ogive has a single point
marking where the top right of the rectangle would be.
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: ogive; vertical axis: Cumulative Percentage, 0 to 100; horizontal axis: 10 to 60]
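The points of an ogive for the data above can be computed as in the following sketch. Classes of width 10 starting at 10 are assumed, matching the axis range of the figure; `ogive_points` is an illustrative name.

```python
from itertools import accumulate

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

# Upper class boundaries (assumed classes: 10-20, 20-30, ..., 50-60).
upper_boundaries = [20, 30, 40, 50, 60]
frequencies = [sum(1 for x in data if ub - 10 <= x < ub) for ub in upper_boundaries]

# Running totals, then converted to percentages of all observations.
cumulative = list(accumulate(frequencies))
cumulative_percent = [100 * c / len(data) for c in cumulative]

# The ogive starts at 0% at the lowest boundary and ends at 100%.
ogive_points = [(10, 0.0)] + list(zip(upper_boundaries, cumulative_percent))
for boundary, percent in ogive_points:
    print(f"{boundary:>3}  {percent:5.1f}%")
```

Each point sits at the upper boundary of its class, which is where the top right corner of the corresponding histogram rectangle would be on a cumulative scale.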
Scatter Diagrams
Scatter diagrams are used for paired observations taken from two numerical variables.
One variable is measured on the vertical axis and the other variable is measured on the
horizontal axis.
Volume per day    Cost per day
23 125
26 140
29 146
33 160
38 167
42 170
50 188
55 195
60 200
[Figure: scatter diagram, "Cost per Day vs. Production Volume"; vertical axis: Cost per Day]
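The paired observations from the volume and cost example can be prepared for a scatter diagram as in this sketch. Each pair becomes one point, with volume on the horizontal axis and cost on the vertical axis. The check at the end is only a simple way to confirm the upward pattern for this data set; it is not a general measure of association.

```python
# Paired observations from the volume/cost example.
volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]            # volume per day
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]     # cost per day

# Each (x, y) pair is one point on the scatter diagram.
points = sorted(zip(volume, cost))

# Does cost rise every time volume rises (points ordered by volume)?
rises_with_volume = all(c2 > c1 for (_, c1), (_, c2) in zip(points, points[1:]))

for x, y in points:
    print(f"volume={x:>2}  cost={y}")
print("Cost increases with volume:", rises_with_volume)
```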
A positively skewed distribution (skewed to the right) has a tail that extends to the right, in the
direction of positive values.
A negatively skewed distribution (skewed to the left) has a tail that extends to the left, in the
direction of negative values.
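The direction of skew can also be checked numerically. The sketch below uses the moment coefficient of skewness, which is one common measure and is an assumption here, since these notes only describe skewness by the direction of the tail. The data values are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data: one large value creates a tail to the right.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]
# Mirror image: the tail now extends to the left.
left_tailed = [-x for x in right_tailed]

print("right-tailed:", skewness(right_tailed))
print("left-tailed: ", skewness(left_tailed))
```

A positive result corresponds to a tail on the right and a negative result to a tail on the left; a symmetric data set gives a value of zero.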
Methods for detecting outliers are:
Z-score
Box plot
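A z-score check can be sketched as follows. The z-score of a value is its distance from the mean measured in standard deviations. The data set and the cut-off of 2.5 are assumptions made for illustration; commonly quoted cut-offs range from 2 to 3, and these notes do not specify one.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (x - mean) / sample standard deviation for every value."""
    mu = mean(values)
    sigma = stdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical data with one unusually large value.
data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 40]
threshold = 2.5   # assumed cut-off, not taken from the notes

outliers = [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]
print("Flagged as outliers:", outliers)
```

Note that the mean and standard deviation are themselves pulled by extreme values, so in small samples a z-score rule can miss outliers that a quartile-based rule would catch.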
Box plot
The box plot is a graph representing information about certain percentiles for a data set and
can be used to identify outliers.
2. Construct a rectangular box whose left (or lower) edge is at the lower quartile and whose right
(or upper) edge is at the upper quartile (the box width = IQR). Draw a vertical line segment
inside the box at the location of the median.
3. Extend horizontal line segments from each end of the box to the smallest and largest
observations in the data set. (These lines are called whiskers.)
The IQR is Q3 – Q1 and measures the spread in the middle 50% of the data.
The IQR is also called the midspread because it covers the middle 50% of the data.
The IQR is a measure of variability that is not influenced by outliers or extreme values.
Measures like Q1, Q3, and IQR that are not influenced by outliers are called resistant measures.
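The quantities used to draw a box plot can be computed as in the sketch below. Two points are assumptions rather than content from these notes: the quartiles use the default ("exclusive") method of Python's `statistics.quantiles`, and other textbooks may use a slightly different quartile convention; outliers are flagged with the common 1.5 × IQR rule. The data are hypothetical.

```python
from statistics import quantiles

# Hypothetical data with one unusually large value.
data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 40]

# Quartiles: the three cut points that split the data into four parts.
q1, median, q3 = quantiles(data, n=4)
iqr = q3 - q1                      # spread of the middle 50% of the data

# Assumed rule: values beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]  outliers: {outliers}")
```

Replacing 40 with an even larger value leaves Q1, Q3 and the IQR unchanged, which illustrates why they are called resistant measures.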
Bivariate Distribution
Data for two variables (usually two types of related data).
Deals with two variables that can change and are compared to find relationships.
If one variable is influencing another variable, then you will have bivariate data that has an
independent and a dependent variable.
Scatter plots
Regression analysis
2. Joint frequencies
3. Marginal frequencies
4. Conditional frequencies
[Table: two-way frequency table with the rows male, female, and total; the cell values are not
reproduced here]
The joint relative frequency is the ratio of the frequency in a particular category to the total
number of data values.
The purple cells on the above table are all joint frequency numbers.
The marginal frequency numbers are the numbers on the edges of a table.
The numbers in the column on the very right and on the row on the very bottom are the
marginal frequency numbers.
The marginal relative frequency is the ratio of the sum of the joint frequencies in a row or
column to the total number of data values.
On the above table, the marginal frequency numbers are in the green cells
A conditional relative frequency restricts attention to a single row or column of the table.
This is a similar setup to conditional probability, where the limitation, or condition, is preceded
by the word "given".
For example, the percentage of people who selected software as a career, given that those people
are female, in the above table.
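The three kinds of relative frequency can be sketched with a small two-way table. The counts and the second career category ("teaching") below are hypothetical, since the values of the table referred to above are not available; only the row labels and the "software" category come from the notes.

```python
# Hypothetical two-way table: rows are gender, columns are career choice.
table = {
    "male":   {"software": 30, "teaching": 20},
    "female": {"software": 25, "teaching": 25},
}
careers = ["software", "teaching"]
total = sum(sum(row.values()) for row in table.values())

# Joint relative frequency: one cell divided by the grand total.
joint_relative = {(g, c): n / total
                  for g, row in table.items() for c, n in row.items()}

# Marginal relative frequency: a row or column total divided by the grand total.
row_marginal = {g: sum(row.values()) / total for g, row in table.items()}
col_marginal = {c: sum(table[g][c] for g in table) / total for c in careers}

# Conditional relative frequency: software, given female (divide by the row total).
software_given_female = table["female"]["software"] / sum(table["female"].values())

print("Joint (female, software):", joint_relative[("female", "software")])
print("Marginal (female):", row_marginal["female"])
print("Marginal (software):", col_marginal["software"])
print("Software given female:", software_given_female)
```

The only difference between the joint and the conditional figure for the same cell is the denominator: the grand total for the joint relative frequency, the row (or column) total for the conditional one.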