Basic Statistical Tools in Research and Data Analysis
This article has been corrected. See Indian J Anaesth. 2016 October; 60(10): 790.
Statistics is a branch of science that deals with the collection, organisation and analysis of
data and the drawing of inferences from samples to the whole population.[1] This
requires a proper study design, an appropriate selection of the study sample and the
choice of a suitable statistical test. Adequate knowledge of statistics is necessary
for the proper design of an epidemiological study or a clinical trial. Improper statistical
methods may result in erroneous conclusions, which may lead to unethical practice.[2]
A variable is a characteristic that varies from one individual member of a population to
another.[3] Variables such as height and weight are measured on some type
of scale, convey quantitative information and are called quantitative variables. Sex
and eye colour give qualitative information and are called qualitative variables[3]
[Figure 1].
Figure 1
Classification of variables
Quantitative variables
Quantitative or numerical data are subdivided into discrete and continuous
measurements. Discrete numerical data are recorded as whole numbers such as 0, 1,
2, 3,… (integers), whereas continuous data can assume any value within a range. Observations that
can be counted constitute discrete data and observations that can be measured
constitute continuous data. Examples of discrete data are the number of episodes of
respiratory arrest or the number of re-intubations in an intensive care unit. Similarly,
examples of continuous data are serial serum glucose levels, the partial pressure of
oxygen in arterial blood and the oesophageal temperature.
A hierarchical scale of increasing precision, based on categorical, ordinal, interval and
ratio scales, can be used for observing and recording the data [Figure 1].
Categorical or nominal variables are unordered. The data are merely classified into
categories and cannot be arranged in any particular order. If only two categories exist
(as in gender: male and female), the data are called dichotomous (or binary). The
various causes of re-intubation in an intensive care unit (upper airway
obstruction, impaired clearance of secretions, hypoxemia, hypercapnia, pulmonary
oedema and neurological impairment) are examples of categorical variables.
Ordinal variables have a clear ordering between the categories. However, the ordered
data may not have equal intervals. Examples are the American Society of
Anesthesiologists status or the Richmond agitation-sedation scale.
Interval variables are similar to ordinal variables, except that the intervals between
the values of an interval variable are equally spaced. A good example of an interval
scale is the Fahrenheit degree scale used to measure temperature. With the Fahrenheit
scale, the difference between 70° and 75° is equal to the difference between 80° and
85°: the units of measurement are equal throughout the full range of the scale.
Ratio scales are similar to interval scales, in that equal differences between scale
values have equal quantitative meaning. However, ratio scales also have a true zero
point, which gives them an additional property: ratios between values are meaningful.
For example, the centimetre scale is a ratio scale. There is a true zero point and the value of 0
cm means a complete absence of length. A thyromental distance of 6 cm in an adult
may be twice that of a child in whom it is 3 cm.
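The four scales can be contrasted with a short Python sketch. This is only an illustration under stated assumptions: the list of re-intubation causes is hypothetical, the `ASA` enumeration and the two conversion helpers are names introduced here, and the temperatures of 40° and 80° Fahrenheit are arbitrary. It shows that nominal data can only be counted, ordinal data can be ranked, interval data support differences but not ratios, and ratio data support both.

```python
from collections import Counter
from enum import IntEnum

# Nominal: categories can be counted but not ordered.
# Hypothetical list of recorded causes, for illustration only.
causes = ["hypoxemia", "hypercapnia", "hypoxemia", "pulmonary oedema"]
cause_counts = Counter(causes)

# Ordinal: categories can be ranked, but the steps between them
# are not assumed to be equal in size.
class ASA(IntEnum):
    I = 1
    II = 2
    III = 3
    IV = 4

asa_iii_ranks_above_ii = ASA.III > ASA.II  # ordering is meaningful

# Interval: equal differences are meaningful, ratios are not,
# because the zero point is arbitrary.
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

equal_steps = (75 - 70) == (85 - 80)
ratio_in_fahrenheit = 80 / 40
ratio_in_celsius = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)

# Ratio: a true zero makes ratios independent of the unit used.
def cm_to_inches(cm):
    return cm / 2.54

ratio_in_cm = 6 / 3
ratio_in_inches = cm_to_inches(6) / cm_to_inches(3)

print(cause_counts.most_common(1))
print(asa_iii_ranks_above_ii, equal_steps)
print(ratio_in_fahrenheit, round(ratio_in_celsius, 6))
print(ratio_in_cm, round(ratio_in_inches, 6))
```

The same pair of temperatures gives a ratio of 2 in Fahrenheit but about 6 in Celsius, so "twice as hot" has no fixed meaning on an interval scale, whereas the 6 cm to 3 cm ratio stays at 2 whether the lengths are expressed in centimetres or inches.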
The mean (arithmetic average) is the sum of all observations divided by the number of
observations:
$$\bar{x} = \frac{\sum x}{n}$$
where x = each observation and n = number of observations. Median[6] is defined as
the middle of a distribution in ranked data (with half of the observations in the sample
above and half below the median value), while mode is the most frequently occurring
value in a distribution. Range defines the spread, or variability, of a sample.[7] It is
described by the minimum and maximum values of the variable. If we rank the data
and then group the observations into percentiles, we get better
information on the pattern of spread of the values. In percentiles, we divide the ranked
observations into 100 equal parts. We can then describe the 25th, 50th, 75th or any other
percentile. The median is the 50th percentile. The interquartile range covers
the middle 50% of the observations about the median (25th-75th percentile).
Variance[7] is a measure of how spread out the distribution is. It gives
an indication of how closely individual observations cluster about the mean value.
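As a rough illustration of these measures, the following Python sketch computes them with the standard library `statistics` module (Python 3.8 or later is assumed for `statistics.quantiles`). The observations are hypothetical values chosen for easy arithmetic, not data from any study, and the "inclusive" quartile method is only one of several conventions in use. The variance and SD are computed in both the population form (divide by N) and the sample form (divide by n−1) that the text goes on to define.

```python
import math
import statistics

# Hypothetical observations, for illustration only.
data = [2, 4, 4, 5, 7, 9, 11]
n = len(data)

# Measures of central tendency.
mean = sum(data) / n               # sum of observations / number of observations
median = statistics.median(data)   # middle value of the ranked data
mode = statistics.mode(data)       # most frequently occurring value

# Range: described by the minimum and maximum values.
minimum, maximum = min(data), max(data)

# Quartiles are the 25th, 50th and 75th percentiles.
# method="inclusive" is one convention among several (an assumption here).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                      # width of the middle 50% of observations

# Variance: average squared deviation from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]
population_variance = sum(squared_deviations) / n    # denominator N
sample_variance = sum(squared_deviations) / (n - 1)  # denominator n - 1

# SD: square root of the variance, back in the original units.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

# Cross-check the hand-written formulas against the library functions.
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={minimum} to {maximum}, IQR={q1} to {q3}")
print(f"sample variance={sample_variance:.3f}, sample SD={sample_sd:.3f}")
```

Because the sample form divides by n−1 rather than N, it always gives a slightly larger variance than the population form for the same observations, and the gap narrows as the sample grows.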
The variance of a population is defined by the following formula:
$$\sigma^2 = \frac{\sum (X_i - \bar{X})^2}{N}$$
where $\sigma^2$ is the population variance, $\bar{X}$ is the population mean, $X_i$ is the ith element
from the population and N is the number of elements in the population. The variance
of a sample is defined by a slightly different formula:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
where $s^2$ is the sample variance, $\bar{x}$ is the sample mean, $x_i$ is the ith element from the
sample and n is the number of elements in the sample. The formula for the variance of
a population has the value ‘N’ as the denominator, whereas that of a sample has ‘n−1’.
The expression ‘n−1’ is known as the degrees of freedom and is one less than the number
of observations in the sample. Once the sample mean is fixed, each
observation is free to vary, except the last one, which must be a defined value. The
variance is measured in squared units. To make the interpretation of the data simple
and to retain the basic unit of observation, the square root of the variance is used. The
square root of the variance is the standard deviation (SD).[8] The SD of a population
is defined by the following formula:
$$\sigma = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}}$$
where $\sigma$ is the population SD, $\bar{X}$ is the population mean, $X_i$ is the ith element from the
population and N is the number of elements in the population. The SD of a sample is
defined by a slightly different formula:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
where $s$ is the sample SD, $\bar{x}$ is the sample mean, $x_i$ is the ith element from the sample
and n is the number of elements in the sample. An example for calculation of variation
and SD is illustrated in