Unit 2


UNIT-2

Descriptive Statistics

Classification of Data
Classification means arranging the mass (large amount) of data into different classes or groups on the
basis of their similarities (similar features).

Example:

• Collecting the data regarding the number of students admitted to a university in a year, the
students can be classified on the basis of gender.

• In this case, all male students will be put in one class and all female students will be put in
another class.

• The students can also be classified on the basis of age, marks, marital status, height, etc.

• It helps in presenting the mass of data in a concise and simple form.

• It facilitates comparison.

• It pinpoints the most significant features of the data.

• It provides a basis for tabulation and analysis of data.

• It is a process of presenting raw data in a systematic manner, enabling us to draw meaningful
conclusions.

Classification of Data:

On the basis of nature of Variables:

1. Qualitative data

2. Quantitative data

3. Discrete Data

4. Continuous Data

5. Chronological Data

6. Geographical Data

On the basis of nature of Attributes:

1. Simple Classification

2. Manifold Classification

Qualitative data

Classification of data according to qualitative characteristics such as sex, honesty, intelligence,
marital status, etc.

Quantitative data

Classification of data according to quantitative characteristics such as age, weight, height,
marks, etc.

Discrete Data

Classification of data which takes only distinct, countable numerical values.

Ex: Number of children in a family, shoe size

Continuous Data

Classification of data which can take any numerical value within a certain range.

Chronological Data

When the data are classified or arranged by their times of occurrence, such as years, months,
weeks, days, etc.

Geographical Data

When the data are classified by geographical regions or locations, like states, provinces, cities,
countries, etc.

Level of Measurement
The level of measurement determines which statistical calculations are meaningful.

The four levels of measurement are: nominal, ordinal, interval, and ratio.

Nominal: Data at the nominal level of measurement are qualitative only.

• Categorized using names, labels, or qualities.

• No mathematical computations can be made at this level.

Ordinal: Data at the ordinal level of measurement are qualitative or quantitative.

Arranged in order, but differences between data entries are not meaningful.

Interval: Data at the interval level of measurement are quantitative. A zero entry simply represents
a position on a scale; the entry is not an inherent zero.

Arranged in order, the differences between data entries can be calculated.

Ratio: Data at the ratio level of measurement are similar to the interval level, but a zero entry is
meaningful.

A ratio of two data values can be formed, so that one data value can be meaningfully expressed as a
multiple of another.

Tabular & Graphical Presentation of Data

• Data in raw form are usually not easy to use for decision making.

• Some type of organization is needed:

  • Table

  • Graph

• The type of graph to use depends on the variable being summarized.

Categorical Variables        Numerical Variables

• Frequency distribution     • Line chart
• Bar chart                  • Frequency distribution
• Pie chart                  • Histogram and Ogive
• Pareto diagram             • Scatter plot

Tabulation & Graphical representation of data

Tabulation is a systematic & logical presentation of numeric data in rows and columns, to
facilitate comparison and statistical analysis.

(Or)

The method of placing organized data into a tabular form is called tabulation.

• A chart is a graphical representation of data, in which "the data is represented by symbols,
such as bars in a bar chart, lines in a line chart, or slices in a pie chart".

Frequency Distribution
A frequency distribution is a tabular presentation that generally organizes data into classes
and shows the number of observations (frequencies) falling into each of these classes.

Or

Frequency distribution in statistics provides the information of the number of occurrences
(frequency) of distinct values distributed within a given period of time or interval, in a list,
table, or graphical representation.

Class: A class is one of the categories into which qualitative data can be classified

Class Frequency: Class frequency is the number of observations in the data set that fall into a
particular class
(or)

Class frequency refers to the number of observations in each class

For example, a class frequency would be the number of students who scored between 90% and 100%, the
number of married people, or the number of workers working in manufacturing.

Categories of frequency distribution

(1) uni-variate frequency distribution: The frequency distribution with one variable is called a
uni-variate frequency distribution.

(2) bi-variate frequency distribution: The frequency distribution with two variables is called a
bi-variate frequency distribution.

(3) multi-variate frequency distribution: The frequency distribution with more than two
variables is called a multi-variate frequency distribution.

Graphical methods
Graphical methods for describing Qualitative variables

1. Bar Chart
2. Pareto diagram
3. Pie chart
Bar Chart:
• Bar charts and Pie charts are often used for qualitative (category) data.
• Height of bar or size of pie slice shows the frequency or percentage for each category.
• A graphical representation of information in the form of bars.
• Bars of equal width are drawn to represent different categories, with the length of each
bar being proportional to the number or frequency of occurrence of each category.
Ex:

Type of Aphasia    Frequency
Anomic             10
Broca’s            5
Conduction         7
Total              22

Pareto Diagram
• A bar chart, where categories are shown in descending order of frequency.
• A cumulative polygon is often shown in the same graph.
• Used to separate the “vital few” from the “trivial many”.
• The purpose is to highlight the most important among a (typically large) set of factors.

Step 1: Sort by defect cause, in descending order

Step 2: Determine % in each category

Step 3: Show results graphically
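
The three steps above can be sketched in Python. This is only an illustration: it reuses the
aphasia counts from the bar chart example instead of real defect-cause data, and the function
name pareto_table is hypothetical.

```python
def pareto_table(counts):
    """Return (category, count, percent, cumulative percent) rows, largest count first."""
    total = sum(counts.values())
    # Step 1: sort the categories in descending order of frequency.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for category, count in ordered:
        running += count
        # Step 2: percentage in each category, plus the cumulative percentage
        # that a Pareto diagram usually overlays as a cumulative polygon.
        rows.append((category, count, 100 * count / total, 100 * running / total))
    return rows


# Counts reused from the bar chart example (an assumption made for illustration).
aphasia_counts = {"Anomic": 10, "Broca's": 5, "Conduction": 7}
rows = pareto_table(aphasia_counts)

# Step 3: show the results (a text table here; a plotting library could draw the bars).
for category, count, pct, cum_pct in rows:
    print(f"{category:<12}{count:>4}{pct:>8.1f}%{cum_pct:>8.1f}%")
```

The cumulative percentage column is what separates the "vital few" categories from the rest.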

Pie Chart
• A graph that displays data in a circular format.

• The categories of the qualitative variable are represented by the slices of a pie.

• Each slice of a pie represents a portion or percentage of the total.

Histogram

• A graph of the data in a frequency distribution is called a histogram.

• The interval endpoints are shown on the horizontal axis.

• The vertical axis is either frequency, relative frequency, or percentage.

• Bars of the appropriate heights are used to represent the number of observations within each
class.

Ex:

Histogram: Daily High Temperature

Class                  Frequency
10 but less than 20    3
20 but less than 30    6
30 but less than 40    5
40 but less than 50    4
50 but less than 60    2

[Figure: histogram of daily high temperature; horizontal axis from 0 to 60 in class intervals
of 10, vertical axis: Frequency]

Ogive graph
• An ogive, sometimes called a cumulative frequency polygon, is a type of frequency polygon that
shows cumulative frequencies.

• In other words, the cumulative percents are added on the graph from left to right.

• An ogive graph plots cumulative frequency on the y-axis and class boundaries along the x-axis.
It’s very similar to a histogram, only instead of rectangles, an ogive has a single point
marking where the top right of the rectangle would be.

The Cumulative Frequency Distribution

Data in ordered array:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class    Frequency    Percentage    Cumulative frequency    Cumulative percentage
10-20    3            15            3                       15
20-30    6            30            9                       45
30-40    5            25            14                      70
40-50    4            20            18                      90
50-60    2            10            20                      100
Total    20           100
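
A minimal Python sketch of how the table above can be built from the ordered array. The
function name frequency_table and its parameters are hypothetical, and each class is assumed
to mean "lower limit but less than upper limit", as in the histogram example.

```python
def frequency_table(data, lower, upper, width):
    """Rows of (class, frequency, percent, cumulative frequency, cumulative percent).

    Assumes every value lies in [lower, upper); values outside are not counted
    in any class.
    """
    n = len(data)
    rows = []
    cumulative = 0
    for start in range(lower, upper, width):
        end = start + width
        freq = sum(1 for x in data if start <= x < end)
        cumulative += freq
        rows.append((f"{start}-{end}", freq, 100 * freq / n,
                     cumulative, 100 * cumulative / n))
    return rows


temperatures = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
                32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
table = frequency_table(temperatures, lower=10, upper=60, width=10)
for row in table:
    print(row)

# Points for an ogive: (upper class boundary, cumulative percentage),
# starting from 0% at the lowest boundary.
ogive_points = [(10, 0.0)] + [(10 + 10 * (i + 1), row[4])
                              for i, row in enumerate(table)]
```

The ogive_points list pairs each upper class boundary with its cumulative percentage, which
is exactly what the ogive plots.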
[Figure: Ogive: Daily High Temperature; horizontal axis: class boundaries from 10 to 60,
vertical axis: Cumulative Percentage from 0 to 100]

Scatter Diagrams
• Scatter Diagrams are used for paired observations taken from two numerical variables.
• One variable is measured on the vertical axis and the other variable is measured on the
horizontal axis.

Volume per day    Cost per day
23                125
26                140
29                146
33                160
38                167
42                170
50                188
55                195
60                200

[Figure: scatter diagram, Cost per Day vs. Production Volume; horizontal axis: Volume per Day
(0 to 70), vertical axis: Cost per Day (0 to 250)]

Measures of shape: the shape of a distribution is either symmetric or skewed.

The shape of the distribution is said to be symmetric if the observations are balanced, or
evenly distributed, about the center.
The shape of the distribution is said to be skewed if the observations are not symmetrically
distributed around the center.

A positively skewed distribution (skewed to the right) has a tail that extends to the right in the
direction of positive values.

A negatively skewed distribution (skewed to the left) has a tail that extends to the left in the
direction of negative values.
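
The notes do not give a formula for skewness. As one possible sketch, the moment coefficient
of skewness (third central moment divided by the 1.5 power of the second) can be used: its
sign indicates the direction of the tail. The function names and the three small data sets
below are hypothetical.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    if m2 == 0:
        return 0.0  # all values identical: no spread, treated as symmetric
    return m3 / m2 ** 1.5


def describe_shape(data, tolerance=1e-9):
    """Label a data set by the sign of its skewness coefficient."""
    g = skewness(data)
    if g > tolerance:
        return "positively skewed"
    if g < -tolerance:
        return "negatively skewed"
    return "symmetric"


# Hypothetical data sets chosen only to illustrate the three cases.
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]   # one large value pulls the tail right
left_tail = [-x for x in right_tail]     # mirror image: tail to the left
balanced = [1, 2, 3, 4, 5]               # evenly distributed about the center

for name, values in [("right_tail", right_tail),
                     ("left_tail", left_tail),
                     ("balanced", balanced)]:
    print(name, describe_shape(values))
```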

Methods for Determining Outliers


An outlier is a measurement that is unusually large or small relative to the other values.

The methods are:

• Z-score

• Box plot
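
The Z-score method is not spelled out in these notes. A minimal sketch, assuming the common
rule of thumb that a value is an outlier when its z-score (distance from the mean in standard
deviations) exceeds 3 in magnitude. The threshold, the function name, and the sample readings
are assumptions.

```python
import statistics


def zscore_outliers(data, threshold=3.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(data)
    # Population standard deviation; statistics.stdev() is the sample version.
    sd = statistics.pstdev(data)
    if sd == 0:
        return []  # no spread, so no value is unusually large or small
    return [x for x in data if abs((x - mean) / sd) > threshold]


# Hypothetical measurements: nineteen readings of 10 and one reading of 100.
readings = [10] * 19 + [100]
print(zscore_outliers(readings))
```

With very small samples no value can reach a z-score of 3, so a lower threshold is sometimes
chosen in that case.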

Box plot
• The box plot is a graph representing information about certain percentiles for a data set and
can be used to identify outliers. It:

  • plots the five-number summary (minimum, lower quartile, median, upper quartile, maximum).

  • shows the spread of the data.

  • detects outliers.

Box plot Construction

To construct a Box plot


1. Draw a horizontal scale.

2. Construct a rectangular box whose left (or lower) edge is at the lower quartile and whose right
(or upper) edge is at the upper quartile (the box width = IQR). Draw a vertical line segment
inside the box at the location of the median.

3. Extend horizontal line segments from each end of the box to the smallest and largest
observations in the data set. (These lines are called whiskers.)

Interquartile Range (IQR)

• The IQR is Q3 – Q1 and measures the spread in the middle 50% of the data.

• The IQR is also called the midspread because it covers the middle 50% of the data.

• The IQR is a measure of variability that is not influenced by outliers or extreme values.

• Measures like Q1, Q3, and IQR that are not influenced by outliers are called resistant measures.
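
A short Python sketch of the five-number summary and the IQR, using the daily high temperature
data from the cumulative frequency example. Two assumptions are made here: quartiles are
computed with the "inclusive" method of statistics.quantiles (textbooks use several slightly
different conventions), and outliers are flagged with the widely used 1.5 × IQR rule, which
these notes do not state explicitly.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(data)
    # Quartile conventions differ between textbooks; "inclusive" is one common choice.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(data, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3 (k = 1.5 is an assumption)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


temperatures = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
                32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
minimum, q1, median, q3, maximum = five_number_summary(temperatures)
iqr = q3 - q1
print("five-number summary:", minimum, q1, median, q3, maximum)
print("IQR:", iqr)
print("outliers:", iqr_outliers(temperatures))
```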

Bivariate Distribution
• Data for two variables (usually two types of related data).

• Deals with two variables that can change and are compared to find relationships.

• If one variable is influencing another variable, then you will have bivariate data that has an
independent and a dependent variable.

• Scatter plots

• Frequency distribution (two-way table)

• Correlation coefficient

• Regression analysis
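
The correlation coefficient and regression line are only named in these notes. As a hedged
sketch, the Pearson correlation coefficient and a least-squares line can be computed for the
Volume per day and Cost per day data from the Scatter Diagrams section, treating volume as the
independent variable. The function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]        # independent variable
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]  # dependent variable

r = pearson_r(volume, cost)
intercept, slope = least_squares_line(volume, cost)
print(f"r = {r:.3f}, cost = {intercept:.2f} + {slope:.2f} * volume")
```

A value of r close to +1 indicates a strong positive linear relationship between the two
variables.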

Frequency distributions for bivariate data


1. Two way table

2. Joint frequencies

3. Marginal frequencies

4. Conditional frequencies

Two way table

It is a table listing two categorical variables whose values have been paired.

Each set of numbers in a two-way table has a specific name.

          software    teaching    farming    total
male
female
total

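Joint relative frequency

• The middle cells of the table are the joint frequency numbers.

• The joint relative frequency is the ratio of the frequency in a particular category to the total
number of data values.

• In the above table, the joint frequency numbers are the cells where the male and female rows
meet the three career columns (shaded purple in the original table).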

Marginal relative frequency

• The marginal frequency numbers are the numbers on the edges of a table.

• The numbers in the column on the very right and in the row on the very bottom are the
marginal frequency numbers.

• The marginal relative frequency is the ratio of the sum of the joint frequencies in a row or
column to the total number of data values.

• In the above table, the marginal frequency numbers are the cells in the total row and the total
column (shaded green in the original table).

Conditional relative frequency

• The conditional relative frequency is the ratio of a joint relative frequency to the related
marginal relative frequency.

• This is a similar setup to conditional probability, where the limitation, or condition, is
preceded by the word "given".

• For example, the percentage of people who selected software as a career, given that those people
are female, in the above table.
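
The three kinds of relative frequency can be sketched in Python as follows. The counts are
hypothetical, since the values of the original table are not shown, and the career labels and
function names are illustrative only.

```python
# Hypothetical counts for a two-way table of gender by career choice.
counts = {
    "male": {"software": 20, "teaching": 10, "farming": 15},
    "female": {"software": 30, "teaching": 15, "farming": 10},
}

grand_total = sum(sum(row.values()) for row in counts.values())


def joint_relative(row, col):
    """Frequency of one (row, column) cell divided by the total number of data values."""
    return counts[row][col] / grand_total


def marginal_relative_row(row):
    """Sum of the joint frequencies in a row divided by the total number of data values."""
    return sum(counts[row].values()) / grand_total


def marginal_relative_col(col):
    """Sum of the joint frequencies in a column divided by the total number of data values."""
    return sum(row[col] for row in counts.values()) / grand_total


def conditional_relative(col, given_row):
    """Joint relative frequency divided by the related marginal relative frequency."""
    return joint_relative(given_row, col) / marginal_relative_row(given_row)


# Share of people who chose software, given that they are female.
software_given_female = conditional_relative("software", given_row="female")
print(f"{100 * software_given_female:.1f}%")
```

Dividing the joint relative frequency by the marginal relative frequency gives the same result
as dividing the cell count by the row total directly (here 30 out of 55).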
