Best Practices in Summative Assessment
Kibble JD. Best practices in summative assessment. Adv Physiol Educ 41: 110–119, 2017; doi:10.1152/advan.00116.2016.—The goal of this review is to highlight key elements underpinning excellent high-stakes summative assessment. This guide is primarily aimed at faculty members with the responsibility of assigning student grades and is intended to be a practical tool to help throughout the process of planning, developing, and deploying tests as well as monitoring their effectiveness. After a brief overview of the criteria for high-quality assessment, the guide runs through best practices for aligning assessment with learning outcomes and compares common testing modalities. Next, the guide discusses the kind of validity evidence needed to support defensible grading of student performance. This review concentrates on how to measure the outcome of student learning; other reviews in this series will expand on the related concepts of formative testing and how to leverage testing for learning.

summative assessment; validity; blueprinting; reliability; generalizability

SUMMATIVE ASSESSMENTS are usually applied at the end of a period of instruction to measure the outcome of student learning. They are high stakes for all concerned, most obviously for the learners who are being judged but also in the sense that the data may be used to drive course improvement, to assess teaching effectiveness, and for program-level assessments such as accreditation. At the other end of the spectrum, we define formative assessments as those intended to enrich the learning process by providing nonjudgmental feedback; they are assessments for learning rather than assessments of learning (39). Assessment often falls somewhere between these pure summative and formative poles, for example, when grade incentives are provided for assignments or quizzes during a course. Therefore, there is a continuum of summative to formative assessment depending on the primary intended purpose, although feedback to learners should be a common feature.

Both summative and formative testing have important effects on student learning, and careful attention to the selection and deployment of each is needed. It is an age-old axiom that summative assessment drives learning, since most college-level students will think hard about strategies to maximize performance. On the other hand, we should not underestimate the value of formative assessment, especially given the recent demonstrations of how powerfully the "testing effect" enhances learning and memory compared with other study methods, such as rereading a text (38). Therefore, just as selection of a summative assessment plan must align with the overall course goals, formative assessment should be an integral part of the instructional plan for a course. The present review will focus on the practical steps needed to build robust tools to measure final learning outcomes from the instructor perspective; leveraging assessment for learning will be the topic of another review in this series.

Criteria for Excellent Assessment

One of the most enduring frameworks to define what makes a good assessment is van der Vleuten's notion of assessment utility, which he defined as the product of reliability, validity, feasibility, cost effectiveness, acceptance, and educational impact (44). Reliability refers to the reproducibility of the measurement; validity asks whether there is a coherent body of evidence supporting the use of the assessment results for their stated purpose, i.e., does the test measure what it purports to? Feasibility and cost effectiveness relate to how realistic tests are in the local context, and acceptance refers to whether all the stakeholders have regard for the process and the results. Educational impact relates to whether the assessment motivates students to prepare in ways that have educational benefits. Norcini et al. (35) extended this framework to include equivalence and catalytic effect. Equivalence asks if similar results and decisions will occur when tests are used across cycles of testing or in different institutions, and the idea of catalytic effect asks whether the results and feedback from assessment drive future learning forward. We will draw on these frameworks throughout this review to clarify the purpose of various suggestions in an effort to remain evidence based in an area of education where intuition and tradition often exert powerful influence on instructors.

Learning Outcomes and Assessment Planning

My experience has been that subject matter experts naturally tend to start thinking about the content they should teach in a course, then about how they will teach it, and finally about how to assess student learning. As an example, a few years ago, I wrote a review textbook for medical physiology (27) and, looking back, I did not think much about learning outcomes, relying instead on what seemed to be the implicitly clear content the book would need to include. All I really did was create my own synthesis of well-trodden ground, with some multiple-choice practice questions thrown in for assessment. In contrast, shortly afterward, I joined the planning team in a new medical school where we had to decide how discipline-based learning would be incorporated into an integrated curriculum (26). We were now confronted with student learning outcomes that placed at least equal importance on patient care, critical thinking, team skills, communication, information literacy, and professionalism as they did on knowledge of physiology.
Therefore, our assessment plan needed much more than written tests of medical knowledge; it also included practical assessments, direct faculty observation of students, peer assessment, projects, portfolios, collaborative writing, and group presentations.

The landmark Vision and Change report charting the future of undergraduate biology education makes clear that we should drive our course planning from the intended big-picture learning outcomes (2). By definition, a good learning outcome must be measurable, such that serious thought about summative assessment is needed at the start of the planning process. Learning taxonomies are helpful when developing and matching learning outcomes with assessments. The most commonly used is Bloom's taxonomy (8), which was modified by Anderson and Krathwohl (4) to define six cognitive process dimensions (remembering, understanding, applying, analyzing, evaluating, and creating) that can be applied in four different knowledge dimensions (factual, conceptual, procedural, and metacognitive); an excellent interactive version of this taxonomy with examples is available via the Iowa State University Center for Excellence in Teaching and Learning (9). Crowe et al. (15) have also developed a "Blooming Biology Tool" to assist in aligning learning outcomes and assessment and have presented initial data suggesting improved teaching and learning outcomes.

In medical education, the same mantra of driving curriculum decisions from learning outcomes is an accreditation requirement (29a). Borrowing from medical education, a simple but powerful taxonomy for thinking about assessment is Miller's pyramid (32), which describes levels of learning starting with a knowledge base ("knows"), to basic competence of knowing what should be done ("knows how"), to being able to demonstrate a skill or behavior under standard conditions ("shows"), to actually applying the competencies in a real situation ("does"). A new top layer was recently proposed for Miller's pyramid for individuals in advanced training (e.g., PhD and health professionals) who have formed a true professional identity and consistently display such values (the person not only "does" but also "is") (16). Table 1 shows common assessment methods using Miller's pyramid and provides commentary on some of their advantages and disadvantages (see also Refs. 3, 5b, 17, 29, and 42).

In physiology courses, it is likely that many of our examinations will be of the written type. Although written tests are classified in the "knows" and "knows how" levels, they certainly have the potential to test higher orders of learning. For example, questions that include data or graphical interpretation or that require predictions or decisions address levels of application, analysis, and evaluation. Longer-form written exams, such as essays and projects, can require students to provide a synthesis of multiple sources of information or even demonstrations of creativity that could, for example, be presented orally or as posters and can be collected over time in a portfolio (30). As we move further away from knowledge testing to higher levels where performance observations are needed or collections of work are judged, use of rubrics becomes essential to make clear for students and graders what the standards for accomplishment are (1). A good rubric is a matrix that clearly articulates the levels of achievement, with clear descriptors of performance that meets expectation, exceeds expectation, or falls below expectation. The syllabus should also indicate how scores are applied to the rubric and how scores from different assessments are combined.

In a previous study (28), I used a modified Bloom's taxonomy to annotate a new question bank that was created by nine faculty members for an upper-division undergraduate human physiology course. Despite setting many learning outcomes for the course beyond the knowledge level, about half the questions developed were still found to be at this most basic level and only ~20% reached the application level. While testing some basic knowledge is a good thing, the data indicated that faculty members often defaulted to testing knowledge, perhaps because such items are easier to develop and grade. For me, this experience underlined how important it is to be intentional about matching assessment to learning outcomes and also the need for faculty development and peer review in the test development process. As an aside, we also discovered that faculty members cannot reliably judge the difficulty of individual items they write; we will discuss best practices around test construction and standard setting later.

Validity: Meaningful Interpretation of Test Scores

The historical literature on validity is complex, and a search on the topic is likely to yield articles about multiple types of assessment validity. The 21st century consensus definition according to the Standards for Educational and Psychological Testing (2a) is more straightforward: "Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of the test." For example, the Medical College Admissions Test (MCAT) proposes that test scores are a good predictor of whether students will do well in medical training. This unitary idea of validity is also referred to as construct validity. A construct is some postulated attribute of people, assumed to be reflected in their test scores (14); in the MCAT example, the construct is "educational ability."

The negative decisions we make about our students based on test scores can have enormous emotional and financial consequences for them personally, just as our positive decision to graduate or progress a student has consequences for their future teachers, professions, and clients. Therefore, validity becomes the most fundamental consideration once we start building tests for decision making. Validity does not refer to the assessment instrument itself but rather to the scores produced at a given time, in a given context, and with a given group of students, in relation to how those results will be used to make decisions. The modern idea of validity requires that we establish lines of evidence to argue that we can make good decisions based on our test scores. According to the Standards (2a), there are five general lines of validity evidence, based on the following: test content, response process, internal test structure, relations to other variables, and assessment consequences. In effect, validity is about stating hypotheses about how a test may be used and gathering the evidence to support or refute the hypothesis. For something like the MCAT, huge resources are needed to establish what material should be tested and to develop excellent test items, and equal effort is needed to show that the results indeed predict future outcomes in medical school and beyond. However, even if we are single instructors in a foundation-level course, there are some simple things, discussed below, that we can do to avert major threats to validity.
Validity Evidence Based on Content

As physiology teachers, the most common type of test we use is probably a written assessment, and for this the most fundamental type of validity evidence relates to test content. There should be documentary evidence of a test outline and plan that shows in detail what topics are tested (specifically, how many test items on each topic), how they relate to the learning outcomes, and what cognitive levels are being tested. An external examiner should be able to look at this document and agree that the test is a representative sample of the domain of interest and appropriately addresses the goals of the course. The notion of whether a test is a representative sample of the domain of interest is a basic and critical component of test development. This is sometimes referred to as instructional validity and includes not only a face value judgment of content sampling but also consideration of the extent to which instruction was truly provided for the tested content. Errors in this stage of test development are likely to have a major impact on acceptability of test results, discussed further below.

Documentation of the test outline and plan is known as a test blueprint and can be a simple table with a topic or learning outcome in each row and columns describing other attributes, such as cognitive levels (e.g., knowledge or application) or competencies (e.g., knowledge or attitude), where each cell indicates the number of items devoted to that category. The examples shown in Table 2 are for single tests. The syllabus should indicate the overall assessment plan for a course, describing the different types of assessment and their relative weighting. Similar tables can be created to show how tests relate to learning outcomes. Coderre (10) has provided some excellent tips on making and using blueprints, among the most important of which is sharing them with stakeholders. Feedback from colleagues in the planning stage can avoid one of the most common validity threats known as construct underrepresentation (18). Construct underrepresentation may mean too few items in particular areas, bias of some kind in the item selection, or a mismatch with learning outcomes. While the importance of test statistics is discussed below (see Validity Evidence Based on Internal Structure), the value of expert opinion during test development should not be underestimated, since most decisions about testing end up being value laden in some way. The finished blueprint should also be shared with all the teachers in a team-taught course (likely it will serve as their instruction on what items to write) as well as with students so that the process is transparent. In my experience, such an approach goes a long way to assuring a sense of fairness and broad acceptance of the assessment plan (the other dimension of fairness is difficulty level, which is also discussed at greater length later). Coderre points out that it is not enough to prepare a blueprint; there also needs to be follow-through to prepare items according to the plan and to audit adherence to the blueprint. It is very helpful to create item banks using a commercial software program, which typically allows metadata tags to be applied to items, such as what learning outcomes an item addresses, as well as linkage to performance statistics. Once a blueprint is operational, it can also become apparent that the learning objectives do not properly reflect the true relative importance of concepts, which often emerge as the ones most tested, and this can help inform continuous course improvement.

Table 2. Example test blueprint for a single examination: number of items at each cognitive level across five body systems

Cognitive Level   Body System 1   Body System 2   Body System 3   Body System 4   Body System 5   Item Totals
Remembering             1               3               2               2               2              10
Understanding           6               6               7               7               4              30
Applying                6               8               6               5               5              30
Analyzing               6               5               3               3               3              20
Evaluating              1               3               2               3               1              10
Creating                0               0               0               0               0               0
Item Totals            20              25              20              20              15             100
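One practical way to follow through on a blueprint is to keep it in the same machine-readable form used to tag the item bank, so that adherence can be audited automatically. The short sketch below illustrates the idea; the topic names, cognitive levels, and item tags are invented for the example and are not tied to any particular item-banking product.

```python
from collections import Counter

# Hypothetical blueprint: planned number of items per (topic, cognitive level) cell.
blueprint = {
    ("Cardiovascular", "Applying"): 6,
    ("Cardiovascular", "Understanding"): 6,
    ("Respiratory", "Applying"): 8,
    ("Renal", "Analyzing"): 3,
}

# Hypothetical item bank, with each item tagged by topic and cognitive level.
item_bank = [
    {"id": "Q01", "topic": "Cardiovascular", "level": "Applying"},
    {"id": "Q02", "topic": "Cardiovascular", "level": "Understanding"},
    {"id": "Q03", "topic": "Respiratory", "level": "Applying"},
    # remaining tagged items would be listed here
]

# Count the items actually written for each blueprint cell.
written = Counter((item["topic"], item["level"]) for item in item_bank)

# Audit: flag cells where the bank falls short of (or exceeds) the plan.
for cell, planned in sorted(blueprint.items()):
    actual = written.get(cell, 0)
    if actual != planned:
        print(f"{cell}: planned {planned}, written {actual}")
```

Run against a real bank, a report like this makes shortfalls in particular blueprint cells visible before a test is assembled.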
Validity Evidence Based on Response Process

This category is mostly about the integrity of data throughout the testing process. This might seem trivial, but I can recall two painful examples that make me attentive to this aspect of validity. In my first year as a faculty member, I recall a very distressed student pleading with me to reconsider a grade I had given on an essay. Although initially skeptical, I retrieved her essay from a large stack to discover that the number circled in pen on her script was not the number represented in the grading spreadsheet: a simple clerical error that had caused much unnecessary distress.
I also once published an electronically rescored multiple-choice final exam to over 300 medical students only to discover that the way I had used the program caused some kind of scoring error. Both errors were caught and corrected, but at some cost to all concerned. At my current institution, we have set a policy allowing 1 wk for scores in all courses to be double checked, and we conduct data audits before anything is published to students. However, the reality is that individual faculty members are often doing a lot of manual grading under tight deadlines, and data processing errors are probably common. Response process validity is best achieved with a quality assurance plan, starting with clear testing instructions to students and practice opportunities for examinees regarding test formats (e.g., are they familiar with computer-based testing programs or how to complete test forms?). There should be a documented process for checking final answer keys and any rescoring procedures when items are removed, a protocol for how data move between software systems, such as from spreadsheets to learning management systems, and an audit of manual data entry. In addition to students experiencing smooth test deployment and scoring processes, the syllabus should explain cases where scores are combined to give composite totals, and score reports should allow students to reproduce final grading data; any rubrics or other forms of rating instrument should be available and explained to students before tests are administered.
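As a simple illustration of making composite grades reproducible for students, the sketch below recomputes a course total from published component weights; the component names, weights, and scores are hypothetical rather than taken from any specific assessment plan.

```python
# Hypothetical component weights from a course syllabus (must sum to 1.0).
weights = {"Midterm exam": 0.30, "Final exam": 0.50, "Lab practical": 0.20}

# One student's percentage scores on each component.
scores = {"Midterm exam": 78.0, "Final exam": 84.5, "Lab practical": 90.0}

assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"

# Weighted composite: each component contributes weight * score.
composite = sum(weights[name] * scores[name] for name in weights)

for name in weights:
    print(f"{name}: {scores[name]:.1f}% x {weights[name]:.2f} = {weights[name] * scores[name]:.2f}")
print(f"Composite course score: {composite:.1f}%")
```

Publishing the weights and the calculation in the syllabus lets a student verify a reported composite from their own component scores.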
Validity Evidence Based on Internal Structure

This relates to the reproducibility (reliability) of test scores and to other statistical or psychometric properties (e.g., item difficulty and discrimination index). My experience has been that this is one of the least appreciated areas in high-stakes classroom testing and one of the most critical types of validity evidence for test scores that will be used to assign grades, because it deals with measurement error. The idea of reliability is simply whether test scores are reproducible (41). In classical test theory, it is assumed that the behavior being measured in a person or a group is stable and that an observed test score consists of a true score of the ability or behavior combined with an error score. Sources of combined error include human factors, such as level of fatigue or anxiety at the point of testing, as well as inherent errors within the measurement tool. To explore test-retest reproducibility, it is helpful to think about students taking an imaginary parallel test, such that by comparing the results we could determine if students get the same scores, are ranked in the same position in the class, and receive the same decisions about pass/fail or grades. In classroom testing, we do not usually have the option of double tests but could, for example, randomly split the test scores in half and compare the two data sets. The most commonly used reliability coefficient is Cronbach's α (13), which takes the idea of splitting a test up to its logical limit by subdividing it the most possible times, i.e., by comparing each item to the rest of the test. Cronbach's α provides a test-retest correlation value between 0 and 1 and is referred to as a measure of internal consistency, with a value of 1.0 indicating a perfectly reproducible test. It is often quoted that α > 0.9 is the ideal target for high-stakes tests, with a lower limit of 0.7 being acceptable for classroom tests (19, 36).
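For readers who want to see the calculation behind the coefficient, the sketch below computes Cronbach's α directly from a small scored item-response matrix (1 = correct, 0 = incorrect) using the standard formula α = [k/(k − 1)][1 − (sum of item variances)/(variance of total scores)]; the response matrix is invented for illustration.

```python
from statistics import pvariance

# Hypothetical scored responses: rows = students, columns = items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
]

k = len(responses[0])                      # number of items
totals = [sum(row) for row in responses]   # each student's total score

# Variance of each item column and of the total scores (population variances).
item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]

# Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals)
alpha = (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))
print(f"Cronbach's alpha = {alpha:.2f}")
```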
So why does reliability matter? The value of knowing α is that we can use it to estimate error and explore confidence intervals of scores at different cut levels, using the SE statistic, which is the SD of the error term, calculated as follows:

SE = SD(x)√(1 − α)

Table 3 shows sample data illustrating how the confidence interval for a test score is affected by changing the test reliability from 0.5 to 0.9. Imagine a traditional grading scheme of ABCF, where 90%, 80%, and 70% cut scores are applied, respectively. How should we treat the case of a student with a score of 65%? The data in Table 3 show that if our test reliability is 0.9 or above, then the student's score of 65% falls outside the 95% confidence interval around the 70% cut score, and we would probably feel comfortable awarding a failing grade. However, if our test reliability is 0.8 or lower, the student's score lies within that 95% confidence interval. Would you fail this student? My school currently has an ABCF grading scheme, and most courses unconsciously avoid the conundrum of borderline test failures by including in the overall assessment plan some continuous assessment points or team points that mean, in practice, a score lower than 65% is likely to be needed to fail a course outright. We have also used standard setting methods to aid with decision making (see below). On the other hand, I have encountered faculty members in the past who have absolute faith that a score of 69% represents a true failure, when they have no knowledge of the margin of error in their examinations. We will next introduce some other basic test statistics and consider how to maximize test reliability.

Table 3. Effect of test reliability on the confidence intervals for test scores

Reliability Coefficient (Cronbach α)    SE, %    95% Confidence Interval, %
0.5                                      4.2             ±8.3
0.6                                      3.8             ±7.4
0.7                                      3.3             ±6.4
0.8                                      2.7             ±5.3
0.9                                      1.9             ±3.7

Note: data are derived from a test with a mean score of 80% and SD of 6%.
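The sketch below reproduces the arithmetic behind Table 3: for each reliability value it applies SE = SD(x)√(1 − α) with the table's assumed SD of 6% and multiplies by 1.96 to obtain the half-width of the 95% confidence interval.

```python
from math import sqrt

sd = 6.0  # SD of test scores, % (as in the note to Table 3)

for alpha in (0.5, 0.6, 0.7, 0.8, 0.9):
    se = sd * sqrt(1 - alpha)        # standard error of measurement, %
    ci95 = 1.96 * se                 # half-width of the 95% confidence interval, %
    print(f"alpha = {alpha:.1f}: SE = {se:.1f}%, 95% CI = +/-{ci95:.1f}%")
```

For example, with α = 0.8 the interval around a score of 65% runs from roughly 59.7% to 70.3%, which overlaps the 70% cut score, the borderline situation discussed above.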
Any commercially available testing program will calculate Cronbach's α for a set of test scores as well as provide some other standard item statistics. Most useful are individual item discrimination indexes that allow faculty members to review the performance of test questions. Item discrimination describes the extent to which success on a given item corresponds to success on the whole test. A discrimination index (there are many) is calculated using equal-sized high- and low-scoring groups on the test. The idea is that if generally strong students get an item correct and weak students get it incorrect, then the item is "discriminating." In practice, for each item, the number of successes by the low-scoring group is subtracted from the number of successes by the high-scoring group, and this difference is divided by the size of a group, producing an index ranging from +1 to −1. Traditionally, the top and bottom 27% of the class are used, and, generally speaking, item discrimination values of +0.4 and above are regarded as high and less than +0.2 as low (20). Another approach to discrimination is to calculate the "point-biserial correlation," which is the Pearson correlation between responses to an item and scores on the whole test and, therefore, also takes a value from +1 to −1.
Values of either index that are close to zero or are negative detract from the test reliability, since we assume that all items on the test are cooperating to measure the same attributes, and such potentially faulty items should be carefully reviewed after the test.
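Both statistics described above are easy to compute in a few lines of code; the sketch below does so for a small invented response matrix, using the top and bottom 27% of the class for the discrimination index and a simple Pearson correlation for the point-biserial. Commercial testing programs report the same numbers automatically.

```python
from math import sqrt

# Hypothetical scored responses: rows = students, columns = items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0], [1, 0, 1, 1],
    [1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0],
    [0, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 1],
]

totals = [sum(row) for row in responses]
order = sorted(range(len(responses)), key=lambda s: totals[s], reverse=True)
n_group = max(1, round(0.27 * len(responses)))   # top and bottom 27% of the class
upper, lower = order[:n_group], order[-n_group:]

def discrimination(item):
    """(correct in upper group - correct in lower group) / group size."""
    up = sum(responses[s][item] for s in upper)
    low = sum(responses[s][item] for s in lower)
    return (up - low) / n_group

def point_biserial(item):
    """Pearson correlation between the item column (0/1) and total scores."""
    x = [row[item] for row in responses]
    y = totals
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx = sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return cov / (sx * sy)

for item in range(len(responses[0])):
    print(f"Item {item + 1}: D = {discrimination(item):+.2f}, "
          f"point-biserial = {point_biserial(item):+.2f}")
```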
Item analysis is part science and part judgment; items with a high percentage correct will naturally fail to discriminate, but several such items are likely to be intentionally included on a test to gauge basic mastery, and these should not be eliminated just because they do not discriminate (although ideally there are not too many such items). At my school, we routinely include the opportunity for students to record item challenges using report cards during the test, and these can be a great help when combined with item analysis to understand what went wrong if an item performs poorly: was the wording ambiguous? Does it conflict with what was taught in some way?

Our ability to maximize the reliability of tests comes down to two major factors: 1) having enough test items and 2) having high-quality discriminating items. As a guide, if the average item discrimination is around +0.3, then 50–60 items are enough to produce a reliability of around 0.8; if the average item discrimination is +0.2, then 100 items are needed (24), whereas >100 items usually produces only small additional gains in reliability. Item difficulty around 60–70% correct gives the best potential for high discrimination. However, the reality is that there may be limits to the number of "hard" items you can use depending on the grading traditions of your institution. For example, 50% of students with a grade C in a course would likely make acceptance levels for the assessment very poor, so there are always trade-offs and judgments to be made. Item discrimination levels are also a function of the students; if you teach a course with a wide range of ability levels, this tends to produce high item discrimination, whereas classes such as those with medical students are usually fairly homogeneous in ability levels and there is not much real difference between the top and bottom quartiles in a class.
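One standard way to make the trade-off between test length and reliability concrete, although it is not named in the discussion above, is the Spearman-Brown prophecy formula from classical test theory: if a test has reliability α, a version lengthened by a factor m with comparable items has predicted reliability mα/[1 + (m − 1)α]. The sketch below uses it to show the diminishing returns of adding items, under the assumption that new items behave like the existing ones; the starting values are hypothetical.

```python
def spearman_brown(alpha, m):
    """Predicted reliability when a test is lengthened by factor m (Spearman-Brown)."""
    return (m * alpha) / (1 + (m - 1) * alpha)

base_items, base_alpha = 50, 0.80   # hypothetical starting point

for n_items in (50, 75, 100, 150, 200):
    m = n_items / base_items
    print(f"{n_items:3d} items -> predicted reliability {spearman_brown(base_alpha, m):.2f}")
```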
From the foregoing discussion, we can better appreciate solutions to the two most common threats to validity in faculty-developed tests: namely, construct underrepresentation and construct irrelevant variance (18). Construct underrepresentation is most commonly a problem of too few items in the sample domain but also results from the inclusion of trivial items, maldistribution of test items across topics, or poor reliability; use of a strong blueprint that broadly samples the domain of interest with enough items to generate reliable measurements, together with the development of high-quality items using peer review, should prevent construct underrepresentation problems. Construct irrelevant variance represents noise in the measurement and increases the error term. Construct irrelevant variance has several sources, but most are again related to poorly constructed items that can be caught in peer review; for example, items may be too hard or too easy, contain trivial details, be culturally insensitive, be biased in some way (e.g., long reading time for second-language students), or be off target from learning outcomes. Other examples causing construct irrelevant variance are items that include language cueing test-wise students to the correct answer and guessing from limited option sets. Several studies have shown how peer review can significantly increase item quality and test reliability (31, 33, 46). Another factor that causes construct irrelevant variance is "teaching to the test," which may result in scores that do not accurately reflect what students know or do not know. This is one reason my school does not allow faculty members to give preexam review sessions, which often carry an implicit expectation of clues to the tested content; instead, we invite student questions and provide liberal access to faculty office hours leading up to major tests. Instances of cheating or loss of test security are other examples of construct irrelevant variance in test scores. Downing (18) also notes that indefensible passing scores produce construct irrelevant variance and are a major validity threat, bearing in mind that the whole point of validity is to be on solid ground when making decisions from the test outcomes.

Generalizability theory is an alternative to the basic approach to reliability studies offered by classical test theory. Generalizability theory uses an analysis of variance approach, and a generalizability coefficient is calculated as the ratio of wanted variance to total variance; if the only input variable is ranked student scores, this produces the same answer as Cronbach's α on a 0–1 scale. However, generalizability theory is much more flexible because the investigator can identify intended facets (factors) of variance, such as students, items, or raters, and the analytic approach allows each variance to be examined separately. The statistical tools also allow for a follow-up decision study that allows estimates to be calculated for how each variable can be manipulated to increase reliability. For example, how many additional items or raters would be needed to reach a reliability of 0.9? If the reader is familiar with doing ANOVA calculations, generalizability coefficient calculations are fairly straightforward, and free software tools are available (6).
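For the simplest fully crossed design, in which every student answers every item, the generalizability coefficient can be computed from ordinary ANOVA mean squares. The sketch below estimates the person and residual variance components from a small invented score matrix and then runs a miniature decision study asking how the coefficient would change with a longer test; more complex designs with raters or stations are better handled with the dedicated tools described in Ref. 6.

```python
# Hypothetical scores: rows = students (persons), columns = items; fully crossed p x i design.
scores = [
    [6, 8, 5, 7],
    [5, 4, 7, 5],
    [9, 6, 8, 8],
    [4, 6, 3, 5],
    [7, 5, 6, 8],
    [5, 7, 4, 4],
]

n_p, n_i = len(scores), len(scores[0])
grand = sum(sum(row) for row in scores) / (n_p * n_i)
p_means = [sum(row) / n_i for row in scores]
i_means = [sum(row[i] for row in scores) / n_p for i in range(n_i)]

# Mean squares for persons, items, and the residual (person x item interaction plus error).
ms_p = n_i * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
ms_i = n_p * sum((m - grand) ** 2 for m in i_means) / (n_i - 1)
ss_res = sum((scores[p][i] - p_means[p] - i_means[i] + grand) ** 2
             for p in range(n_p) for i in range(n_i))
ms_res = ss_res / ((n_p - 1) * (n_i - 1))

# Variance components.
var_p = (ms_p - ms_res) / n_i       # wanted (person) variance
var_res = ms_res                    # unwanted residual variance

def g_coefficient(n_items):
    """Relative generalizability coefficient for a hypothetical test of n_items items."""
    return var_p / (var_p + var_res / n_items)

print(f"G with the original {n_i} items = {g_coefficient(n_i):.2f} (matches Cronbach's alpha for these scores)")
for n in (8, 12, 20):   # decision study: how long must the test be to approach 0.9?
    print(f"Projected G with {n} items = {g_coefficient(n):.2f}")
```

For these invented scores, the projection suggests that roughly 20 items of similar quality would be needed to reach a coefficient of 0.9.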
The most advanced approach to reliability is the use of item response theory. Unless the reader has a statistics background or available expertise, this approach is less accessible and probably only worth pursuing if you are conducting larger-scale exams, perhaps with different test forms or multiple campuses. Classical test theory and generalizability theory are both limited by the inability to separate the effects of test difficulty from student ability. Item response theory is based on setting up probability functions in which the probability of correctly answering an item is a function of student ability. This sigmoidal curve will move position and slope according to test item difficulty, discrimination, and guessing effects. Pretest data are needed to execute the mathematical models, making it impractical for most single-instructor courses, but supplemental literature is provided for the interested reader (43).
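The probability function described above is the item characteristic curve. A common three-parameter logistic form is sketched below for orientation (the parameter values are invented): difficulty shifts the curve along the ability scale, discrimination sets its slope, and a pseudo-guessing parameter raises its lower floor.

```python
from math import exp

def p_correct(theta, a, b, c):
    """Three-parameter logistic item characteristic curve.
    theta: student ability; a: discrimination (slope); b: difficulty (location);
    c: pseudo-guessing floor (e.g., ~0.25 for a four-option multiple-choice item)."""
    return c + (1 - c) / (1 + exp(-a * (theta - b)))

# Probability of a correct answer across a range of abilities for one hypothetical item.
for theta in (-2, -1, 0, 1, 2):
    print(f"ability {theta:+d}: P(correct) = {p_correct(theta, a=1.2, b=0.5, c=0.25):.2f}")
```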
Validity Evidence Based on Relationships to Other Variables

It is often the case for physiology teachers that the main purpose of our judgments about student learning is to determine readiness for future learning, whether it is progressing within an undergraduate program, moving on to further training, or to employment. The validity hypothesis (commonly known as predictive validity) is that exam results meaningfully predict that students are ready for this next stage, which is often readily testable by checking on the outcome. For example, in a new medical school, we needed to show that our newly developed internal assessment program would produce meaningful prediction of success on United States Medical Licensing Exam (USMLE) Step 1 (25). This is an example of convergent validity evidence, where we should expect that tools measuring very similar constructs produce similar outcomes.
It is also valuable to include comparisons expected to produce divergent outcomes. For example, correlation with physiology exams in our institution is much lower when comparing outcomes with patient interviewing skills or research project performance (unpublished observations). Similarly, in a recent study (12) in which we developed a novel assessment to focus on clinical reasoning, the degree of correlation with knowledge testing outcomes was significantly less than previous comparisons between tests of medical knowledge. These kinds of observations, when taken together, give confidence that we are able to make valid measurements of the intended construct.

Analysis of the predictive power of test results is necessarily a long-term project, but we can also look to concurrent tests for validity evidence. In the case of a new school, we elected to run a series of progress tests in parallel with the formal curriculum using the National Board of Medical Examiners Comprehensive Basic Science Exam, which was given five times over a 2-yr period. We were able to correlate results of the internal testing program with these external gold-standard tests as a concurrent outcome to provide validity evidence for our newly developed exams (25). In undergraduate physiology courses, similar data could be obtained by comparing student testing outcomes in parallel courses that are measuring similar constructs; curriculum committees or institutional quality improvement offices can usually offer support for such studies.
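Where scores from parallel measures are available, the comparison described here reduces to a correlation analysis. The sketch below uses invented scores for the same group of students: the expectation is a high correlation with a test of a similar construct (convergent evidence) and a much lower correlation with a measure of a different construct (divergent evidence).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical scores for the same 8 students on three measures.
physiology_exam = [62, 71, 85, 58, 90, 77, 69, 81]
external_progress_test = [60, 68, 88, 55, 92, 75, 72, 79]   # similar construct (convergent)
interview_skills = [80, 62, 75, 78, 70, 85, 66, 74]         # different construct (divergent)

print(f"r with external progress test: {pearson(physiology_exam, external_progress_test):.2f}")
print(f"r with interviewing skills:    {pearson(physiology_exam, interview_skills):.2f}")
```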
Validity Evidence Based on Consequences of Testing

Consequences validity evidence is a relatively new domain but is somewhat analogous to van der Vleuten's consideration of educational impact (44). Cook and Lineberry (11) have recently likened high-stakes summative assessments to medical tests in that they both result in important decisions and actions for the subject; the argument that follows is that neither kind of test should be performed unless the need is justified and the benefit clearly exceeds the costs. We are asking the following question: "Does the activity of measuring and the subsequent interpretation and application of scores achieve our desired results with few negative side effects?" (11). Consequences evidence considers impacts that may be beneficial or harmful, intended or unintended (2). For example, in my school, we have recently changed the definition of a C grade from "conditional" (meaning a progress committee would review the candidate in detail to determine if remediation should occur) to "unconditional" (meaning the student passes the class without further discussion). After graduating four classes, we were able to model the relationship between the number of C grades and final outcome on USMLE Step 1; while C grades were correlated with lower scores, they did not predict outright failure, and thus our remediation point needed to be revised. An additional concern was that the presence of a conditional C grade evidently produced student distress by generating uncertainty as to whether a student with a C grade would be promoted to the next academic year. The high level of student distress expressed in perception surveys was an example of an unjustified negative side effect, and the overall consequences validity argument indicated a need for change.

In educational scholarship, we are familiar with using final assessment scores as the outcome measure to determine whether instructional interventions have had a positive impact on learning, but we rarely think about the impacts of the assessment itself. Cook and Lineberry (11) have proposed a framework that includes assessing impacts first on the examinee: is there evidence that the test itself promotes learning? For example, I have frequently advocated decreasing the number of summative knowledge assessments within courses to allow more time for learning and have never observed any appreciable change in final exam performance when summative quizzes were replaced with formative quizzes (i.e., I found no apparent learning benefit of making midsession quizzes summative). Another student impact to consider is whether there is evidence of improved preparation due to testing; for example, does the presence of a practical exam induce greater time spent practicing skills rather than remembering information? What are the effects on motivation, emotions, and well-being of the summative testing program?

We can also investigate impacts on faculty members. Is there evidence that the curriculum is being improved to address apparent areas of student weakness? Are teachers collaborating more effectively as a result of sharing in the planning and development process? Are scholarly projects emerging? Is there higher status attached to demonstrations of externally validated high-achieving students? How are faculty emotions and well-being affected by student performance or by resource limitations? Similarly, at the program level, evidence of consequences or impacts of testing might relate to allocation of resources or curriculum changes driven by testing outcomes.

A final special case related to consequences validity is the impact of grading classifications. This is most pronounced at the pass/fail cut point, such that standard setting requires particular attention, discussed further below. Apart from any practical considerations for repeating or remediating failed courses, there is likely to be a negative impact on self-efficacy and motivation from receiving the label of "failure" (40). In medical education, the issue of what classifications to use for grading is a hot topic, with many schools shifting from traditional letter grades to pass/fail systems (7). Given that grades are used later on to help select graduates for highly competitive postgraduate residency training, the impact of grades is potentially huge. At my school, we are actively reviewing whether to shift to a pass/fail system, with concerns that we may be hurting students who are competent to practice but have some C grades compared with students in other schools who have an undifferentiated "pass" grade. Examining the consequences of labels is therefore an important topic, especially remembering the data in Table 3, which demonstrate that the difference between a B and a C could be spurious to begin with!

Setting Standards

The cut points on exams are given special significance and can have major impact on examinees, especially around the pass/fail line. Emphasis on traditional arbitrary numbers like "70% is passing" is rather meaningless unless faculty members can justify this special status. At a minimum, it is helpful to maintain a database of examination items over time and to keep a record of student performance. This allows some degree of prediction about likely test outcomes and the ability to compare new items with old items. Once a testing database is established, faculty members must make decisions about whether it will be completely sequestered or not.
What degree of postexamination review and feedback will be allowed? Will past examinations be provided for students to review before testing? In my view, the summative testing database should be secure, since it takes a lot of faculty effort to create a validated bank that has been subject to peer review and item analysis, and it is rarely possible to generate completely new high-quality assessments each year. A secure item database is the bedrock of valid and reliable testing. However, providing students with feedback is also important and can risk the leaking of questions. In most database programs, annotation of items is possible, which allows detailed reporting of strengths and weaknesses by topic. In addition to this, our faculty members hold closed-test reviews for the purpose of coaching and apply the same security measures used during examinations. Students who fail tests are allowed to review one on one with a faculty member. We monitor item performance each time an item is reused and watch for trends, such as decreasing difficulty and discrimination, that suggest an item may be compromised.

Practice quizzes should be developed separately from the main summative item bank; they should be used liberally during the learning phase of the course and should include rich feedback.

There are several formal standard setting methods that can provide stronger justification for where cut points are defined (5a, 34). By their nature, standards are an expression of values, and all the methods rely on expert judgment in some way. The first step is to select the kind of standard desired. This can be norm referenced to the performance of examinees in a cohort or criterion referenced ahead of time. Norm-referenced standards are most suited to situations like admissions or selecting students for awards, where examinees are being ranked for selection to some category with limited availability. When we are interested in whether students are competent, criterion-referenced standards are more appropriate, although faculty members sometimes gravitate to norming scores as an easy fix when difficulty levels seem wrong after a test. Many formal standard setting methods have been described and validated and have been reviewed in more detail elsewhere (5a, 34). As an example of this type of process, one of the most commonly used is the Angoff method (5), in which a panel of judges (ideally 6–8 judges) is first asked to discuss the characteristics of the borderline (minimally competent) student. The judges then go through the whole test and indicate for each question whether such a student should get it correct or not. The mean of the judges' scores is used as the passing standard. I have used this method many times to check that a test conforms with the institutionally fixed values of passing grades (e.g., 70%) and found it to be remarkably accurate with question histories and often close to the lower limit of the 95% confidence interval of actual student scores on the test. The Angoff method has the advantage that judges can also comment on the individual items as a check on content validity and item quality. The credibility of passing standards set this way depends on who the judges are, and they should ideally be a diverse group of faculty members with good working knowledge of the curriculum and students. Standard setting represents a gold-standard ideal that is not possible in all situations. However, even having another pair of eyes on the test items in development and a colleague to help review and make decisions during item review and scoring is helpful to strengthen validity.
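The arithmetic behind the method is simple once the judgments are collected. In the sketch below, which uses invented judges and marks and follows the yes/no variant described above, each judge's marks imply an expected score for the borderline student, and the passing standard is the mean across the panel.

```python
# Hypothetical Angoff judgments: for each judge, 1 means "a borderline (minimally
# competent) student should get this item correct," 0 means "should not."
judges = {
    "Judge A": [1, 1, 0, 1, 1, 1, 1, 0, 1, 0],
    "Judge B": [1, 1, 1, 0, 1, 1, 1, 1, 1, 0],
    "Judge C": [1, 0, 1, 1, 1, 1, 1, 0, 0, 0],
}

n_items = len(next(iter(judges.values())))

# Each judge's implied passing score is the percentage of items a borderline
# student is expected to answer correctly.
judge_cuts = {name: 100 * sum(marks) / n_items for name, marks in judges.items()}
for name, cut in judge_cuts.items():
    print(f"{name}: implied passing score = {cut:.0f}%")

# The panel's passing standard is the mean of the judges' implied scores.
passing_standard = sum(judge_cuts.values()) / len(judge_cuts)
print(f"Angoff passing standard = {passing_standard:.1f}%")
```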
Other Aspects of Assessment Utility and the Need to Compromise

Fairness of assessment is one aspect not completely addressed through consideration of validity evidence. The Standards (2a) describe fairness in several ways: lack of bias, equitable treatment of all in the testing process, and equality in outcomes and in opportunities to learn. Assuring that tests are as fair as possible requires a combination of planning and data gathering. Before testing, all examinees should have equal access to learning materials, practice opportunities, and test instructions. In the test development process, there should be an effort to avoid introducing bias. For example, if the construct of interest is knowledge of physiology, then unnecessarily complex language or complex mathematical treatments beyond the prerequisite course level should be avoided. Monitoring differential item functioning between ethnic or other groups is advisable where possible to evaluate possible sources of construct irrelevant variance affecting certain groups. Where direct interactions between the examiner and examinee are involved in the testing process, the examiner needs to be particularly conscious of potential bias and of introducing construct irrelevant variance through factors such as undue stress on examinees. For example, I can vividly remember as a student not doing well on a pharmacology oral exam given by two rather angry and probably very tired examiners; only after the encounter did I realize that I knew the correct answers to most of the questions they had asked. In cases where examinees have a learning or physical disability, the law requires that appropriate accommodations are provided, and students should be made aware early in the program how to access such services and reminded of the process in each course syllabus.

There are several important elements to judging assessment utility that require qualitative data, such as student and faculty surveys, focus groups, or interviews. Student input on the validity of content sampling, the quality of items, the difficulty level, and a global sense of acceptance is easily obtained through perception-of-instruction surveys. Student perception is not the final determinant but is a valuable perspective to consider. Similarly, debrief meetings in team-taught courses can quickly establish the faculty viewpoint. Feasibility and cost effectiveness are areas that faculty members are usually quite vocal about, particularly in relation to the time demands on them for setting, supervising, and grading the assessment. These are very real concerns that often mean compromise is needed. A common problem is the introduction of construct underrepresentation, discussed earlier, because of feasibility concerns. For example, practical or clinical examinations are resource intensive and often end up with too few stations. Just as learning is contextual, so is assessment, and results from one testing station do not generalize to the whole construct (3, 41). For instance, the ability to solve a problem about cardiac function does not help to determine if the student can solve problems about the gastrointestinal tract, a problem known as case specificity. On the other hand, if we were to cancel the practical examination because of concerns for reliability of data, this could have a disastrous educational impact by leading students to avoid practicing the very skills that are needed to meet the learning outcomes.
In such cases, creative solutions are needed, such as having a shorter practical exam that is extended with supplementary written items to bolster reliability (45). The overall utility considerations for assessment often demand compromise and judgment.

Summary and Practice Points

High-stakes assessment is among the biggest responsibilities we have, given the potential impacts the results have on students from a social, emotional, and financial perspective as well as the long-term impact on our profession and future clients. In summary, some basic elements of practice for excellent assessment are as follows:

• Use backward design that starts by defining the learning outcomes and what types of assessment are most suitable to measure the outcomes.
• Document a testing blueprint that shows what domains will be tested and how this matches the learning outcomes; share the blueprint with all stakeholders.
• Engage as much as possible in faculty peer review during the test development process to avoid introducing construct underrepresentation and construct irrelevant variance.
• Include enough items, and items of high quality, to assure adequate test reliability and defensibility of scores.
• Apply standard setting methods.
• Provide students with clear instructions and practice materials and develop a plan to assure the integrity of data throughout the testing process.
• Monitor the fairness, acceptability, and impact of testing over time with routine surveying of stakeholders and comparison of test scores with other measures of student outcomes.

DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the author(s).

AUTHOR CONTRIBUTIONS

J.D.K. conceived and designed research; analyzed data; interpreted results of experiments; drafted manuscript; edited and revised manuscript; and approved final version of manuscript.

REFERENCES

1. Allen D, Tanner K. Rubrics: tools for making learning goals and evaluation criteria explicit for both teachers and learners. CBE Life Sci Educ 5: 197–203, 2006. doi:10.1187/cbe.06-06-0168.
2. American Association for the Advancement of Science. Vision and Change in Undergraduate Biology Education: a Call to Action (online). http://visionandchange.org/files/2013/11/aaas-VISchange-web1113.pdf [04 July 2016].
2a. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association, 1999.
3. Amin Z, Seng CY, Eng KH (editors). Practical Guide to Medical Student Assessment. Singapore: World Scientific, 2006. doi:10.1142/6109.
4. Anderson LW, Krathwohl D, Cruikshank KA, Mayer RE, Pintrich PR, Raths J, Wittrock MC. A Taxonomy for Learning, Teaching, and Assessing: a Revision of Bloom's Taxonomy of Educational Objectives (Complete Edition). New York: Longman, 2001, p. 508–600.
5. Angoff WH. Scales, norms and equivalent scores. In: Educational Measurement, edited by Thorndike RL. Washington, DC: American Council on Education, 1971.
5a. Ben-David MF. AMEE Guide No. 18: Standard setting in student assessment. Med Teach 22: 120–130, 2000. doi:10.1080/01421590078526.
5b. Ben-David MF, Davis MH, Harden RM, Howie PW, Ker J, Pippard MJ. AMEE Medical Education Guide No. 24: Portfolios as a method of student assessment. Med Teach 23: 535–551, 2001. doi:10.1080/01421590120090952.
6. Bloch R, Norman G. Generalizability theory for the perplexed: a practical introduction and guide: AMEE Guide No. 68. Med Teach 34: 960–992, 2012. doi:10.3109/0142159X.2012.703791.
7. Bloodgood RA, Short JG, Jackson JM, Martindale JR. A change to pass/fail grading in the first two years at one medical school results in improved psychological well-being. Acad Med 84: 655–662, 2009. doi:10.1097/ACM.0b013e31819f6d78.
8. Bloom BS, Krathwohl DR, Masia BB. Taxonomy of Educational Objectives: the Classification of Educational Goals. New York: McKay, 1956.
9. Center for Excellence in Learning and Teaching, Iowa State University. Revised Bloom's Taxonomy (online). http://www.celt.iastate.edu/teaching/effective-teaching-practices/revised-blooms-taxonomy [15 July 2016].
10. Coderre S, Woloschuk W, McLaughlin K. Twelve tips for blueprinting. Med Teach 31: 322–324, 2009. doi:10.1080/01421590802225770.
11. Cook DA, Lineberry M. Consequences validity evidence: evaluating the impact of educational assessments. Acad Med 91: 785–795, 2016. doi:10.1097/ACM.0000000000001114.
12. Cramer N, Asmar A, Gorman L, Gros B, Harris D, Howard T, Hussain M, Salazar S, Kibble JD. Application of a utility analysis to evaluate a novel assessment tool for clinically oriented physiology and pharmacology. Adv Physiol Educ 40: 304–312, 2016. doi:10.1152/advan.00140.2015.
13. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 16: 297–334, 1951. doi:10.1007/BF02310555.
14. Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychol Bull 52: 281–302, 1955. doi:10.1037/h0040957.
15. Crowe A, Dirks C, Wenderoth MP. Biology in bloom: implementing Bloom's Taxonomy to enhance student learning in biology. CBE Life Sci Educ 7: 368–381, 2008. doi:10.1187/cbe.08-05-0024.
16. Cruess RL, Cruess SR, Steinert Y. Amending Miller's pyramid to include professional identity formation. Acad Med 91: 180–185, 2016. doi:10.1097/ACM.0000000000000913.
17. Downing SM. Assessment of knowledge with written test forms. In: International Handbook of Research in Medical Education, edited by Norman GR, van der Vleuten CP, Newble DI. Dordrecht: Kluwer Academic, 2002, p. 647–672. doi:10.1007/978-94-010-0462-6_25.
18. Downing SM. Validity: on meaningful interpretation of assessment data. Med Educ 37: 830–837, 2003. doi:10.1046/j.1365-2923.2003.01594.x.
19. Downing SM. Reliability: on the reproducibility of assessment data. Med Educ 38: 1006–1012, 2004. doi:10.1111/j.1365-2929.2004.01932.x.
20. Ebel RL. Procedures for the analysis of classroom tests. Educ Psychol Meas 14: 352–364, 1954. doi:10.1177/001316445401400215.
24. Hopkins K. Educational and Psychological Measurement and Evaluation. Needham Heights, MA: Allen and Bacon, 1998.
25. Johnson TR, Khalil MK, Peppler RD, Davey DD, Kibble JD. Use of the NBME Comprehensive Basic Science Examination as a progress test in the preclerkship curriculum of a new medical school. Adv Physiol Educ 38: 315–320, 2014. doi:10.1152/advan.00047.2014.
26. Khalil MK, Kibble JD. Faculty reflections on the process of building an integrated preclerkship curriculum: a new school perspective. Adv Physiol Educ 38: 199–209, 2014. doi:10.1152/advan.00055.2014.
27. Kibble JD, Halsey C. The Big Picture: Medical Physiology. New York: McGraw-Hill Professional, 2009.
28. Kibble JD, Johnson T. Are faculty predictions or item taxonomies useful for estimating the outcome of multiple-choice examinations? Adv Physiol Educ 35: 396–401, 2011. doi:10.1152/advan.00062.2011.
29. Kramer GA, Albino JE, Andrieu SC, Hendricson WD, Henson L, Horn BD, Neumann LM, Young SK. Dental student assessment toolbox. J Dent Educ 73: 12–35, 2009.
29a. Liaison Committee on Medical Education. Functions and Structure of a Medical School: Standards for Accreditation of Medical Education Programs Leading to the MD Degree (online). http://lcme.org/publications/ [04 July 2016].
30. Madden T. Supporting Student e-Portfolios: a Physical Sciences Practice Guide. United Kingdom: The Higher Education Academy Physical Science Center, 2007.
31. Malau-Aduli BS, Zimitat C. Peer review improves the quality of MCQ examinations. Assess Eval High Educ 37: 919–931, 2011. doi:10.1080/02602938.2011.586991.
32. Miller GE. The assessment of clinical skills/competence/performance. Acad Med 65, Suppl: S63–S67, 1990. doi:10.1097/00001888-199009000-00045.
33. Naeem N, van der Vleuten C, Alfaris EA. Faculty development on item writing substantially improves item quality. Adv Health Sci Educ Theory Pract 17: 369–376, 2012. doi:10.1007/s10459-011-9315-2.
34. Norcini JJ. Setting standards on educational tests. Med Educ 37: 464–469, 2003. doi:10.1046/j.1365-2923.2003.01495.x.
35. Norcini J, Anderson B, Bollela V, Burch V, Costa MJ, Duvivier R, Galbraith R, Hays R, Kent A, Perrott V, Roberts T. Criteria for good assessment: consensus statement and recommendations from the Ottawa 2010 Conference. Med Teach 33: 206–214, 2011. doi:10.3109/0142159X.2011.551559.
36. Nunnally J. Psychometric Theory. New York: McGraw-Hill, 1978.
38. Roediger HL, Karpicke JD. Test-enhanced learning: taking memory tests improves long-term retention. Psychol Sci 17: 249–255, 2006. doi:10.1111/j.1467-9280.2006.01693.x.
39. Rolfe I, McPherson J. Formative assessment: how am I doing? Lancet 345: 837–839, 1995. doi:10.1016/S0140-6736(95)92968-1.
40. Schunk DH. Self-efficacy and academic motivation. Educ Psychol 26: 207–231, 1991. doi:10.1080/00461520.1991.9653133.
41. Schuwirth LW, van der Vleuten CP. General overview of the theories used in assessment: AMEE Guide No. 57. Med Teach 33: 783–797, 2011. doi:10.3109/0142159X.2011.611022.
42. Shumway JM, Harden RM; Association for Medical Education in Europe. AMEE Guide No. 25: The assessment of learning outcomes for the competent and reflective physician. Med Teach 25: 569–584, 2003. doi:10.1080/0142159032000151907.
43. Thissen D, Steinberg L. Item response theory. In: The SAGE Handbook of Quantitative Methods in Psychology, edited by Millsap RE, Maydeu-Olivares A. London, UK: SAGE, 2009, p. 148–177. doi:10.4135/9780857020994.n7.
44. Van Der Vleuten CP. The assessment of professional competence: developments, research and practical implications. Adv Health Sci Educ Theory Pract 1: 41–67, 1996. doi:10.1007/BF00596229.
45. Verhoeven BH, Hamers JG, Scherpbier AJ, Hoogenboom RJ, van der Vleuten CP. The effect on reliability of adding a separate written assessment component to an objective structured clinical examination. Med Educ 34: 525–529, 2000. doi:10.1046/j.1365-2923.2000.00566.x.
46. Wallach PM, Crespo LM, Holtzman KZ, Galbraith RM, Swanson DB. Use of a committee review process to improve the quality of course examinations. Adv Health Sci Educ Theory Pract 11: 61–68, 2006. doi:10.1007/s10459-004-7515-8.