Threats To Internal Validity

Threats to internal validity are threats to causal control.

They mean that we do not know for sure what caused the effects that we
observed. Naturally, we like to hope that our interventions (experimental
treatments) or other known and measured independent variables caused the
effects. Unfortunately, this is often not the case. For example, because of
their multidimensionality, confounded variables (which measure more than
one entity) are a threat to internal validity.
BIAS VERSUS RANDOM ERROR
If you have tight control over your experimental treatments (and, of course,
you used randomization), hopefully the only source of variance left in your
dependent variables will be random error.
Random error is just that: It is the random variation that occurs on
measurements across administrations, situations, or time periods. If
random error is VERY large, it can pose a threat to the reliability
(predictability, stability) of our measurements. Many political attitudes, for
example, are highly unstable or volatile.
On the other hand, because it is random, random error does not usually
pose a threat to internal validity.
Bias is systematic error, such as the scale that always weighs you in at five
pounds too light. Bias introduces a constant source of error into
measurements or results. Bias can occur when test items that favor a
particular ethnic, age, or gender group are used. For example, a "culture
exam" that asked respondents to identify songs from the 1950s and the
1960s would discriminate against younger people. Tests of "science
knowledge" often favor younger people because they use the most recent
definitions of science phenomena and thus favor those with a more recent
education. Bias in testing instruments is a threat to internal validity
because it poses an alternative explanation for the results that we found.
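
To make the distinction concrete, here is a minimal simulation sketch (the five-pound figure follows the scale example above; all other numbers are assumed for illustration). Averaging more measurements shrinks random error, but it never removes bias:

```python
import numpy as np

rng = np.random.default_rng(42)
true_weight = 150.0  # pounds

# Random error: zero-mean noise that varies across administrations.
noisy = true_weight + rng.normal(0, 2.0, size=1000)

# Bias: a scale that always weighs you in five pounds too light,
# on top of the same random error.
biased = (true_weight - 5.0) + rng.normal(0, 2.0, size=1000)

print(f"random error only: mean = {noisy.mean():.1f}")   # close to 150
print(f"biased scale:      mean = {biased.mean():.1f}")  # stuck near 145
```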
If we could either control bias experimentally (random assignment controls
much of it by making experimental treatment groups roughly equivalent at
the beginning of a study, thus controlling factors such as self-selection or
regression toward the mean effects) or measure the variables we suspect
cause bias and thus control them statistically, we would at least maximize
internal validity.
Unfortunately, bias is often hidden in the variables you didn't measure, or
in the variables you didn't consider at all, so you discover your mistake
only after all your data are collected.
Confounded variables are a major threat to internal validity.

HERE ARE SOME WELL-KNOWN THREATS TO INTERNAL VALIDITY
Self-selection effects: When subjects can select their own treatments,
we do not know whether the intervention or a pre-existing factor of the
subject caused the outcomes we observed. Random assignment can cure
this problem. The same problem can occur with differential selection, only
in this case, the investigator (rather than the subject) uses human
judgement to assign groups or subjects to treatment. A common variation
on this one is selecting extreme groups (see below).
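
A minimal sketch of random assignment, the cure mentioned above (subject labels and group sizes are hypothetical). The point is that chance, not the subject or the investigator, decides who gets which treatment:

```python
import random

random.seed(7)
subjects = [f"S{i:02d}" for i in range(1, 21)]  # 20 hypothetical subjects
random.shuffle(subjects)                        # chance decides, not judgement

treatment, control = subjects[:10], subjects[10:]
print("treatment:", treatment)
print("control:  ", control)
```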
Experimental mortality. When subjects discontinue the study and this
occurs more in certain conditions than others, we do not know how to
causally interpret the results because we don't know how subjects who
discontinued participation differed from those who completed it. A pretest
questionnaire given to all subjects may help clarify this, but watch out for
pretesting effects (a Solomon Four Group Design can help here).
History: Some kind of event occurred during the study period (such as
the assaults on New York City) and it is reactions to these events that
caused the outcomes we observed. Sometimes this is a medical event
(such as a flu outbreak) and sometimes an actual political or historical
event. Random assignment and a control group help with this problem.
Maturation effects are especially important with children and youth
(such as college freshmen) but could happen at any age. For example,
young children's speech will normally become more complex, no matter
what reading method you use. Some studies have found that most college
students pull out of a depression within six months, even if they receive no
treatment whatsoever. A certain number of people will stop smoking,
whether they receive treatment or not. Again, a randomized control group
helps.
Regression toward the mean effects ("statistical regression") are
especially likely when you study extreme groups. For example, students
scoring at the bottom of a test typically improve their scores at least a little
when they retake the test. Students with nearly perfect scores might miss
an item the second time around. That is, people with extreme scores, or in
extreme groups, will often fall back toward the average or "regress to the
mean" on a second administration of the dependent variable.

Regression toward the mean effects are especially likely to occur among
well-meaning investigators, who want to give a treatment that they believe
is very beneficial to the group that appears to need it the most (the top
scoring group is usually left alone.) When the scores of the worst group
improve after the intervention (and the top group scores a little lower on
the readministration), misguided investigators are even more convinced
that they have found a good treatment (instead of a methodological
artifact.) How to avoid this threat to internal validity? Either avoid extreme
groups, or if you do use them, randomly assign their members to treatment
conditions, INCLUDING A CONTROL GROUP.
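
Regression toward the mean is easy to demonstrate by simulation. In this minimal sketch (all numbers assumed), true ability is stable, each test adds random error, and no treatment is given at all; the bottom scorers still "improve" on retest:

```python
import numpy as np

rng = np.random.default_rng(1)
ability = rng.normal(100, 10, size=1000)         # stable true scores
test1 = ability + rng.normal(0, 10, size=1000)   # first administration
test2 = ability + rng.normal(0, 10, size=1000)   # retest, no treatment

worst = test1 < np.percentile(test1, 10)         # extreme (bottom 10%) group
print(f"bottom group, test 1: {test1[worst].mean():.1f}")
print(f"bottom group, test 2: {test2[worst].mean():.1f}")  # closer to 100
```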
Testing. Just taking a pretest can sensitize people and many people
improve their performance with practice. Almost every classroom teacher
knows that part of a student's performance on assessment tests depends
on familiarity with the format. Solution? A Solomon Four Group Design,
wherein half the subjects do not receive a pretest, is a good way to
control inferences in this case (the layout is sketched below).
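
For reference, the standard Solomon Four Group layout (R = random assignment, O = pretest or posttest observation, X = treatment) is:

    Group 1:  R   O   X   O
    Group 2:  R   O       O
    Group 3:  R       X   O
    Group 4:  R           O

Comparing Groups 3 and 4 with Groups 1 and 2 shows whether the pretest itself changed performance.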

REACTIVITY AND THREATS TO INTERNAL VALIDITY


Reactivity refers to changes in the subjects' behavior simply because they
are being studied.
For example, some people get nervous when a doctor or nurse takes their
blood pressure, and their blood pressure goes up.
Reactivity poses a distinct threat to internal validity because we don't
know what caused the outcome: treatment effects or reactivity. The
experimental laboratory is probably the most reactive because people have
come for an experiment and they know their behavior is being watched.
That is why so many experimenters use deception. They are trying to divert
subject attention so that the "true behavior under study" is not altered.
Demand effects occur when subjects or respondents "follow orders" or
cooperate in ways that they almost never would in their routine daily
lives.

In research several decades ago, Martin Orne found that laboratory
subjects would do virtually anything an experimenter asked them to
do. They would eat bitter crackers for several minutes, they would
throw what they were told was acid at a laboratory assistant, they
would pick up snakes or prepare to eat worms.

Social Desirability effects take several forms.

Subjects may become nervous about being monitored, a reaction known
as evaluation apprehension. When people become anxious, many things
happen. Physiological indicators, such as heart rate or blood pressure,
change. If people are slightly anxious, they may do better on tests,
performance, or assessments. However, if people are very anxious
("flooded"), they will almost certainly do worse.

People may try to "fake good," to appear smarter, more attractive, or
more tolerant than they normally are. Paper and pencil questionnaires
are especially prone to these effects because often the answers are not
checked for their veracity. (And, on online surveys, we may not even
know who anyone really is.)

It is not just individuals who have social desirability effects. A
century ago, the famous writer Lev Tolstoy wrote about "Potemkin
villages." When the Czar went on cross-country trips, government
officials traveled a little ways ahead of him. In cooperation with local
government, they would erect false-fronted buildings (as on a movie
set), and the best looking young men and women of the village would
stand before these fake structures, smiling and throwing flowers.
While most groups or organizations will not go to this extent, they
may "hide" their more embarrassing members, "fudge" or slightly
alter records, and claim your procedures were followed when they
were not.

Most people and groups (who allow you to study them at all) try to
cooperate with researchers. But some try to discover the purpose of the
intervention and thwart it, or "wreck the study." Social Reactance effects
refer to boomerang effects in which individuals or groups "fake bad," or
deliberately deviate from study procedures. This happens more among
college students and others who suspect that their autonomy is being
threatened.
ON REACTIVITY AND INTERNAL VALIDITY. If demand effects are specific to
a particular situation, reactivity problems may also influence generalizing,
or external validity (this is how your Wiersma book treats the term.)
However, I think reactivity introduces an alternative causal explanation for
our results: they occurred, not because of the intervention or treatment,
but because people were so self-conscious that they changed their
behavior. This is internal validity. Reactivity may also statistically interact
with the experimental manipulation. For example, if the treatment somehow
affects self-esteem (say you are told that the stories you tell about the TAT
pictures indicate your leadership ability), reactivity may be a greater
internal validity problem.

MORE ON GENERALIZING: "EXPERIMENTAL" VERSUS "MUNDANE" REALITY
More of a threat to external validity is the issue of the reality of the study
setting. In many cases, such as studies of classrooms or online
environments, the setting of the study is identical to the "everyday reality"
or mundane reality in which most subjects live their lives. High mundane
reality makes it easier to generalize to people's typical settings and it
facilitates external validity. Field studies of all kinds, and ethnographies,
too, take place in typical, as opposed to unusual, settings.
However, laboratory experiments in particular may use unusual settings or
tasks. For example, some sports experiments will have subjects on a
treadmill for hours. In other studies, subjects may be injected with
substances (such as adrenaline) or take pills. Subjects may see specially
constructed movies that are nothing like what they see on TV. Or they may
be called upon to perform tasks (watching a light "move" in a darkened
room) that bear no resemblance to their normal environment. While these settings or
tasks may be engrossing or compelling, thus high in experimental reality,
they do not resemble the settings to which researchers may really want to
generalize.

DID ANYBODY NOTICE? I HOPE YOU USED A MANIPULATION CHECK
YOU are certain that your intervention will make life healthier or enhance
learning. But what if no one pays attention to the treatment or
comprehends its message? Then it will appear that you have no effects at
all, whereas if you had simply used a stronger manipulation, your
hypothesis might have been confirmed.
Anyone doing experimental work needs to have a manipulation check: a
measure of whether subjects even paid attention to the treatment and
understood its message. For example, if you show
different movies to different groups and your topic is filmed aggression,
have a short questionnaire that has subjects rate the violence of the movie.
The group receiving the more aggressive film should rate it as more violent
than those receiving an unaggressive movie. If you are trying a new
reading technique, make sure that students understand the stories they are
exposed to and remember something about them. If you try a new template
in your online learning course, did students even pay attention?
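
Here is a hedged sketch of such a manipulation check for the filmed-aggression example; the rating data are hypothetical. If the group shown the aggressive film does not rate its movie as more violent, the manipulation did not "take":

```python
from scipy import stats

# Violence ratings (1-7 scale) collected right after viewing; made-up data.
aggressive_film = [6, 7, 5, 6, 7, 6, 5, 7]
neutral_film    = [2, 3, 2, 1, 3, 2, 2, 3]

t, p = stats.ttest_ind(aggressive_film, neutral_film)
print(f"t = {t:.2f}, p = {p:.4f}")  # clear difference => manipulation worked
```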

THE HUMAN FACTOR: USING DOUBLE BLIND

When the medical and pharmacy professions test a new medicine, they
don't just use a "sugar pill" placebo.
Subjects in the study do not know if they are taking a new medication, an
old medication, or a sugar pill.
The individuals who pass out the medication and assess the subjects'
health and behavior also do not know whether the person is taking a new
medication, an old medication, or a sugar pill.
Thus both those involved as subjects and those involved with collecting
data are "blind": blind to the purposes of the study, the condition that
subjects are in, and the results expected.
This means that

You may need to deceive subjects about the true purpose of the
study (if you were told the purpose of the study was to measure
leadership qualities in sports, might you try to "shape up?")
Avoid collecting your own data; don't act as your own experimenter
or interviewer. Trade off with another student or apply for a small
University or external grant to hire someone.
Don't tell interviewers or experimenters the true purpose of the study,
and don't tell them (if possible) which subject is in which condition.
You might give each person a "generic overview" of the study ("this
study is about which movies children like"). One way to keep the
bookkeeping blind is sketched after this list.
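
As a minimal sketch of that bookkeeping (file name, codes, and conditions are all hypothetical), each subject gets an opaque code and the code-to-condition key is stored away from everyone who collects data:

```python
import csv
import random

random.seed(11)
conditions = ["new drug", "old drug", "placebo"] * 10  # 30 subjects
random.shuffle(conditions)

# Opaque subject codes; the key linking codes to conditions stays locked away.
key = {f"P{i:03d}": cond for i, cond in enumerate(conditions, start=1)}

with open("blinding_key.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["code", "condition"])
    writer.writerows(key.items())

# Experimenters and subjects see only codes like P001, P002, ...
print(sorted(key)[:3])
```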

Almost no one who collects data "likes deception" but without at least a
little, you may introduce reactivity and bias into your study. Do the
minimum (I prefer "omission" rather than deliberate lies) and be sure to
debrief subjects after their participation in the study is completed. This
means that you tell them the true purpose of the study and any
manipulations pertinent to their role in it. Debriefing is ethically mandatory,
and is especially important if your manipulation involved lies about the
student's performance ("no, you really didn't score in the 5th percentile on
that test, all feedback was bogus") or any other aspect of the "real world."
Susan Carol Losh, September 21, 2001

Effect of Validity and Reliability


The precision with which you measure things also has a major impact on sample size: the
worse your measurements, the more subjects you need to lift the signal (the effect) out of
the noise (the errors in measurement). Precision is expressed as validity and reliability.
Validity represents how well a variable measures what it is supposed to. Validity is
important in descriptive studies: if the validity of the main variables is poor, you may
need thousands rather than hundreds of subjects. Reliability tells you how reproducible
your measures are on a retest, so it impacts experimental studies: the more reliable a
measure, the fewer subjects you need to see a small change in the measure. For example, a
controlled trial with 20 subjects in each group or a crossover with 10 subjects may be
sufficient to characterize even a small effect, if the measure is highly reliable. See the
details on the stats pages.
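
A rough simulation sketch of this point (effect size, noise levels, and group sizes are all assumed): the same 20-per-group trial that detects an effect with a reliable (low-noise) measure mostly misses it with a noisy one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def power(n, sd, effect=5.0, reps=2000, alpha=0.05):
    """Fraction of simulated trials in which a t test detects the effect."""
    hits = 0
    for _ in range(reps):
        a = rng.normal(0, sd, n)
        b = rng.normal(effect, sd, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / reps

print(f"reliable measure (sd = 5),  n = 20/group: power ~ {power(20, 5):.2f}")
print(f"noisy measure    (sd = 15), n = 20/group: power ~ {power(20, 15):.2f}")
```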
Pilot Studies
As a student researcher, you might not have enough time or resources to get a sample of
optimum size. Your study can nevertheless be a pilot for a larger study. Perform a pilot
study to develop, adapt, or check the feasibility of techniques, to determine the reliability
of measures, and/or to calculate how big the final sample needs to be. In the latter case,
the pilot should have the same sampling procedure and techniques as in the larger study.
For experimental designs, a pilot study can consist of the first 10 or so observations of a
larger study. If you get respectable confidence limits, there may be no point in continuing
to a larger sample. Publish and move on to the next project or lab!
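
Checking whether the confidence limits are "respectable" is straightforward once the pilot data are in. A minimal sketch with ten hypothetical observations:

```python
import numpy as np
from scipy import stats

pilot = np.array([3.1, 2.4, 4.0, 3.3, 2.9, 3.8, 2.2, 3.5, 3.0, 2.7])
mean = pilot.mean()
sem = stats.sem(pilot)  # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(pilot) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```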
If you can't test enough subjects to get an acceptably narrow confidence interval, you
should still be able to publish your finding, because your study will set useful bounds on
how big and how small the effect can be. A statistician can also combine your finding
with the findings of similar studies in something called a meta-analysis, which derives a
confidence interval for the effect from several studies. If your study is not published, it
can't contribute to the meta-analysis! Many reviewers and editors do not appreciate this
important point, because they are locked into thinking that only statistically significant
results are publishable.
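
The core of the fixed-effect version of such a meta-analysis is just an inverse-variance weighted mean. A minimal sketch with three hypothetical study results:

```python
import numpy as np

effects = np.array([2.1, 3.0, 1.4])  # effect estimates from three studies
ses     = np.array([1.5, 2.0, 1.2])  # their standard errors

w = 1.0 / ses**2                     # inverse-variance weights
combined = np.sum(w * effects) / np.sum(w)
se = 1.0 / np.sqrt(np.sum(w))        # narrower than any single study's SE
print(f"combined effect = {combined:.2f} "
      f"(95% CI {combined - 1.96*se:.2f} to {combined + 1.96*se:.2f})")
```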
WHAT TO MEASURE
In any study, you measure the characteristics of the subjects, and the independent and
dependent variables defining the research question. For experiments, you can also
measure mechanism variables, which help you explain how the treatment works.
Characteristics of Subjects
You must report sufficient information about your subjects to identify the population
group from which they were drawn. For human subjects, variables such as sex, age,
height, weight, socioeconomic status, and ethnic origin are common, depending on the
focus of the study.
Show the ability of athletic subjects as current or personal-best performance, preferably
expressed as a percent of world-record. For endurance athletes a direct or indirect
estimate of maximum oxygen consumption helps characterize ability in a manner that is
largely independent of the sport.
Dependent and Independent Variables
Usually you have a good idea of the question you want to answer. That question defines
the main variables to measure. For example, if you are interested in enhancing sprint
performance, your dependent variable (or outcome variable) is automatically some
measure of sprint performance. Cast around for the way to measure this dependent
variable with as much precision as possible.
Next, identify all the things that could affect the dependent variable. These things are the
independent variables: training, sex, the treatment in an experimental study, and so on.
For a descriptive study with a wide focus (a "fishing expedition"), your main interest is
estimating the effect of everything that is likely to affect the dependent variable, so you
include as many independent variables as resources allow. For the large sample sizes that
you should use in a descriptive study, including these variables does not lead to
substantial loss of precision in the effect statistics, but beware: the more effects you look
for, the more likely the true value of at least one of them lies outside its confidence
interval (a problem I call cumulative Type 0 error). For a descriptive study with a
narrower focus (e.g., the relationship between training and performance), you still
measure variables likely to be associated with the outcome variable (e.g., age-group, sex,
competitive status), because either you restrict the sample to a particular subgroup
defined by these variables (e.g., veteran male elite athletes) or you include the variables
in the analysis.
For an experimental study, the main independent variable is the one indicating when the
dependent variable is measured (e.g., before, during, and after the treatment). If there is a
control group (as in controlled trials) or control treatment (as in crossovers), the identity
of the group or treatment is another essential independent variable (e.g., Drug A, Drug B,
placebo in a controlled trial; drug-first and placebo-first in a crossover). These variables
obviously have an effect on the dependent variable, so you automatically include them in
any analysis.
Variables such as sex, age, diet, training status, and variables from blood or exercise tests
can also affect the outcome in an experiment. For example, the response of males to the
treatment might be different from that of females. Such variables account for individual
differences in the response to the treatment, so it's important to take them into account.
As for descriptive studies, either you restrict the study to one sex, one age, and so on, or
you sample both sexes, various ages, and so on, then analyze the data with these variables
included as covariates. I favor the latter approach, because it widens the applicability of
your findings, but once again there is the problem of cumulative Type 0 error for the
effect of these covariates. An additional problem with small sample sizes is loss of
precision of the estimate of the effect, if you include more than two or three of these
variables in the analysis.
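
A hedged sketch of the covariate approach (the data frame and column names are hypothetical): the treatment effect is estimated with sex and age included in an ordinary least-squares model, rather than by restricting the sample:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "change":    [4.1, 2.3, 5.0, 1.8, 3.9, 2.6, 4.4, 2.0],  # outcome
    "treatment": [1, 0, 1, 0, 1, 0, 1, 0],                  # 1 = treated
    "sex":       ["M", "F", "M", "F", "F", "M", "F", "M"],
    "age":       [23, 31, 27, 45, 38, 29, 33, 41],
})

# Treatment effect adjusted for the two covariates.
model = smf.ols("change ~ treatment + C(sex) + age", data=df).fit()
print(model.params)
```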
Mechanism Variables
With experiments, the main challenge is to determine the magnitude and confidence
intervals of the treatment effect. But sometimes you want to know the mechanism of the
treatment--how the treatment works or doesn't work. To address this issue, try to find one
or more variables that might connect the treatment to the outcome variable, and measure
these at the same times as the dependent variable. For example, you might want to
determine whether a particular training method enhanced strength by increasing muscle
mass, so you might measure limb girths at the same time as the strength tests. When you
analyze the data, look for associations between change in limb girth and change in
strength. Keep in mind that errors of measurement will tend to obscure the true
association.
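
In code, the limb-girth analysis comes down to correlating the two sets of change scores. A minimal sketch with hypothetical measurements:

```python
import numpy as np
from scipy import stats

# Post minus pre, per subject; all values hypothetical.
girth_change    = np.array([0.5, 1.2, 0.1, 0.9, 1.5, 0.3, 0.8, 1.1])  # cm
strength_change = np.array([3.0, 8.5, 1.0, 6.0, 9.5, 2.5, 5.0, 7.0])  # kg

r, p = stats.pearsonr(girth_change, strength_change)
print(f"r = {r:.2f}, p = {p:.4f}")  # suggestive evidence only, not proof
```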
This kind of approach to mechanisms is effectively a descriptive study on the difference
scores of the variables, so it can provide only suggestive evidence for or against a
particular mechanism. To understand this point, think about the example of the limb
girths and strength: an increase in muscle size does not necessarily cause an increase in
strength--other changes that you haven't measured might have done that. To really nail a
mechanism, you have to devise another experiment aimed at changing the putative
mechanism variable while you control everything else. But that's another research
project. Meanwhile, it is sensible to use your current experiment to find suggestive
evidence of a mechanism, provided it doesn't entail too much extra work or expense. And
if it's research for a PhD, you are expected to measure one or more mechanism variables
and discuss intelligently what the data mean.
Finally, a useful application for mechanism variables: they can define the magnitude of
placebo effects in unblinded experiments. In such experiments, there is always a doubt
that any treatment effect can be partly or wholly a placebo effect. But if you find a
correlation between the change in the dependent variable and change in an objective
mechanism variable--one that cannot be affected by the psychological state of the
subject--then you can say for sure that the treatment effect is not all placebo. And the
stronger the correlation, the smaller the placebo effect. The method works only if there
are individual differences in the response to the treatment, because you can't get a
correlation if every subject has the same change in the dependent variable. (Keep in mind
that some apparent variability in the response between subjects is likely to be random
error in the dependent variable, rather than true individual differences in the response to
the treatment.)
Surprisingly, the objective variable can be almost anything, provided the subject is
unaware of any change in it. In our example of strength training, limb girth is not a good
variable to exclude a placebo effect: subjects may have noticed their muscles get bigger,
so they may have expected to do better in a strength test. In fact, any noticeable changes
could inspire a placebo effect, so any objective variables that correlate with the noticeable
change won't be useful to exclude a placebo effect. Think about it. But if the subjects
noticed nothing other than a change in strength, and you found an association between
change in blood lipids, say, and change in strength, then the change in strength cannot all
be a placebo effect. Unless, of course, changes in blood lipids are related to susceptibility
to suggestion...unlikely, don't you think?

Validity (does your test measure what it's supposed to?)

gold standard (highest: rarely have this in PT research!)
  o equipment calibrated against an accurate standard
internal (cause-effect relationship between the independent and dependent variables)
content (is the sample representative of the population?)
  o some (silly) examples with poor content validity: asking young people
    questions and then generalising to the whole population
face (lowest: does the test seem to measure what it's supposed to?)
  o some silly examples with no face validity: gathering "normal data" from
    subjects who have a disease! using a goniometer to measure velocity
Risks to Validity

Selection (should be randomised for age, sex, etc.)
History (background of subjects should be similar)
Maturation (subjects may change, e.g. fatigue, during the experiment)
Repeated Testing (subjects are affected by the test)
Instrumentation (may affect subjects)
Regression to the Mean (subjects with extreme scores on a first test tend to have
scores closer to the mean on a second test)
Experimental Mortality (subjects who drop out of the experiment)
Selection-Maturation Interaction
Experimenter Bias (always try to blind the experimenter)

Reliability (is your test repeatable?)

Scatter plots
The best way to get an initial feel for the data is to draw a scatter plot and
calculate the correlation coefficient (Pearson's 'r'):

1.0 = perfect direct correlation
0 = no correlation
-1 = perfect inverse correlation

generally, r > 0.8 is regarded as good reliability
the square of the correlation coefficient (r^2) is equal to the proportion of
variation in the dependent variable that is accounted for
Correlation does not imply causation!
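
A minimal sketch of that first look at the data, with hypothetical test-retest scores:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

test1 = np.array([10, 12, 9, 15, 11, 14, 13, 10])
test2 = np.array([11, 12, 10, 14, 12, 15, 12, 9])

r, p = stats.pearsonr(test1, test2)
print(f"r = {r:.2f}, r^2 = {r**2:.2f}")  # r > 0.8 ~ good reliability

plt.scatter(test1, test2)      # the scatter plot itself
plt.xlabel("Test 1 score")
plt.ylabel("Test 2 score")
plt.show()
```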

Significance Tests

't' test
Analysis of Variance (ANOVA)
  o Intraclass correlation coefficient (ICC)
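
A minimal sketch of the first two tests on hypothetical data (the ICC itself is usually derived from the ANOVA mean squares or computed with a dedicated package):

```python
from scipy import stats

# Paired t test: the same subjects measured in two sessions (made-up data).
session1 = [10, 12, 9, 15, 11, 14]
session2 = [11, 12, 10, 14, 12, 15]
print(stats.ttest_rel(session1, session2))

# One-way ANOVA: the same subjects scored by three raters (made-up data).
rater_a = [10, 12, 9, 15]
rater_b = [11, 13, 10, 14]
rater_c = [9, 11, 8, 13]
print(stats.f_oneway(rater_a, rater_b, rater_c))
```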

Types of reliability

Intra-rater
Inter-rater
Intra-session
Inter-session

Statistical Testing: types of error

Type I: did we detect a difference that isn't really there?
  o alpha test (p < 0.05)
Type II: is there really a difference that we didn't detect?
  o beta test (statistical power - difficult to calculate!)
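
Type I error is easy to see by simulation. In this minimal sketch (all values assumed), both samples come from the same population, yet a test at alpha = 0.05 still flags a "difference" about 5% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
false_alarms = 0
for _ in range(2000):
    a = rng.normal(0, 1, 15)
    b = rng.normal(0, 1, 15)   # same population: no real difference
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_alarms += 1
print(f"observed Type I error rate ~ {false_alarms / 2000:.3f}")
```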

Qualitative Tests, or non-parametric statistics, for ordinal data (integers, categories)

Mann-Whitney
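
A minimal sketch of the Mann-Whitney U test on hypothetical ordinal ratings from two groups:

```python
from scipy import stats

group_a = [2, 3, 3, 4, 5, 2, 4]   # e.g., pain scores, made-up data
group_b = [4, 5, 5, 6, 6, 5, 4]

u, p = stats.mannwhitneyu(group_a, group_b)
print(f"U = {u}, p = {p:.4f}")
```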
