
Language assessment – session 2

• Principles of language assessment


• Types of test
Brown, H. Douglas (2004), Chapters 2 and 3.

Course instructor: Dr. Nguyen Thi Hong Tham.


Email: [email protected]
Principles of language assessment
Brown, H. Douglas (2004) Chapter 2, pp. 19-41.

Five cardinal criteria for ‘testing a test’:


1. Practicality
2. Reliability
3. Validity
4. Authenticity
5. Washback
1. Practicality

• is concerned with test implementation rather than the
meaning of test scores.

• An effective test is practical. It
• is not excessively expensive,
• stays within appropriate time constraints,
• is relatively easy to administer, and
• has a scoring/evaluation procedure that is specific and time-efficient.
2. Reliability

is the consistency of test scores across facets of the test:
if the same test is given to the same student or matched
students on two different occasions, it should yield
similar results.
Example and similarity of ‘Reliability’
(http://peoplelearn.homestead.com/beduc/chapter_10.pdf)
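The ‘similar results’ in the definition above can be made concrete with a small calculation. The sketch below is not from Brown (2004) and the scores are invented; it estimates test-retest reliability as the Pearson correlation between two sittings of the same test, where values close to 1.0 suggest consistent scoring and values near 0 suggest the scores fluctuate.

```python
# A minimal sketch with hypothetical data: test-retest reliability
# estimated as the Pearson correlation between two sittings of the same test.
from statistics import mean, stdev

occasion_1 = [72, 85, 64, 90, 78, 55, 81]  # scores from the first sitting
occasion_2 = [70, 88, 60, 92, 75, 58, 84]  # same students, second sitting

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# A value close to 1.0 indicates the two administrations rank and spread
# the students similarly, i.e. the test behaves consistently for this group.
print(f"Test-retest reliability estimate: {pearson_r(occasion_1, occasion_2):.2f}")
```

In practice reliability coefficients are computed over whole cohorts and with more elaborate procedures (split-half, Cronbach's alpha), but the underlying logic is the same: consistent scores across facets of the test.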
Reliability (cont.)
Factors contributing to the unreliability of a test:

➢Student-related reliability: fluctuations in the student
(e.g. illness, anxiety, etc.) can cause unreliability.

➢Test administration reliability: unreliability due to the
conditions under which the test is administered (e.g. street noise,
photocopying conditions, light, temperature, etc.).

➢Test reliability: the nature of the test itself can cause
measurement error (e.g. a test that is too long, poorly written test
items, etc.).

➢Rater reliability: unreliability due to human error,
subjectivity, or bias in the scoring process.

• Inter-rater reliability: unreliability arises from inconsistency
in the scores given by two or more raters to the same test.

• Intra-rater reliability: unreliability is a common occurrence
for classroom teachers because of unclear scoring criteria,
fatigue, bias towards particular ‘good’ and ‘bad’ students,
or carelessness (one way of checking rater agreement is
sketched below).
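A practical check on rater reliability is simply to compare the scores two raters give to the same set of performances. The sketch below uses invented band scores and is an illustration, not a procedure from Brown (2004): it reports the percentage of exact agreement and Cohen's kappa, which adjusts that percentage for the agreement the raters would reach by chance.

```python
# Hypothetical band scores (1 to 5) given by two raters to the same ten essays.
from collections import Counter

rater_a = [4, 3, 5, 2, 4, 3, 3, 5, 2, 4]
rater_b = [4, 3, 4, 2, 5, 3, 3, 5, 3, 4]

n = len(rater_a)
# Exact agreement: the proportion of essays on which both raters gave the same band.
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: the probability that both raters would pick the same band
# at random, based on how often each rater actually uses each band.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
expected = sum((freq_a[band] / n) * (freq_b[band] / n)
               for band in set(rater_a) | set(rater_b))

# Cohen's kappa: agreement corrected for chance (1.0 = perfect agreement).
kappa = (observed - expected) / (1 - expected)
print(f"Exact agreement: {observed:.0%}, Cohen's kappa: {kappa:.2f}")
```

A kappa well below 1.0 would be a signal to clarify the scoring criteria or retrain the raters before the scores are used for decisions.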
3. Validity

• means discovering whether a test ‘measures accurately what it is
intended to measure’ (Hughes, 1989: 22).

• ‘the extent to which inferences made from assessment results are
appropriate, meaningful, and useful in terms of the purpose of the
assessment’ (Gronlund 1998: 226).
Five types of evidence of validity

1. Content-related evidence/content validity: any
attempt to show that the content of the test is a
representative sample from the domain that is to be
tested (Fulcher and Davidson 2007: 6).

• To achieve content validity in classroom assessment, test performance
directly.
2. Criterion-related evidence/criterion-related validity

• is the extent to which the ‘criterion’ of the test has actually
been reached.

• It includes two categories:

• Concurrent validity: the results of a test are supported by other
concurrent performance beyond the assessment itself.

• Predictive validity: the test scores are used to predict some future
criterion, such as academic success (e.g. placement tests).
3. Construct-related evidence/construct validity:

• is the extent to which we can interpret a given test score as
an indicator of the ability(ies), or construct(s), we want to
measure.

• A construct is any theory, hypothesis, or model that attempts to
explain observed phenomena in our universe of perception
(e.g. proficiency, communicative competence).

• Constructs may or may not be directly or empirically
measured.
Construct-related evidence/construct validity (cont.)

• Construct validity is a major issue in validating large-scale
standardized tests of proficiency.

• Construct validity has to do with the domain of
generalisation to which our score interpretations generalise.

(The domain of generalisation is the set of tasks in the target
language use domain to which the test tasks correspond.)
4. Consequential validity: encompasses all consequences of
a test, including:

• its accuracy in measuring intended criteria,
• its impact on the preparation of test-takers,
• its effect on the learner, and
• the intended and unintended social consequences of a test’s interpretation and
use.
5. Face validity:

• the extent to which ‘students view the assessment as fair, relevant,
and useful for improving learning’ (Gronlund 1998: 210).

• ‘the degree to which a test looks right and appears to measure the
knowledge or ability it claims to measure, based on the subjective
judgement of the examinees who take it, the administrative
personnel who decide on its use, and other psychometrically
unsophisticated observers’ (Mousavi 2002: 244).
Validity and reliability: Examples
(http://peoplelearn.homestead.com/beduc/chapter_10.pdf)
4. Authenticity
• is defined as the relationship between test task
characteristics and the characteristics of tasks in
the real world.
5. Washback

➢A facet of consequential validity.

➢is ‘the effect of testing on teaching and learning’ (Hughes 2003: 1).
Applying principles to the evaluation of classroom tests

1. Are the test procedures practical?
2. Is the test reliable?
3. Does the procedure demonstrate content validity?
Two steps to evaluating the content validity of a classroom test:
a. Are classroom objectives identified and appropriately framed?
b. Are lesson objectives represented in the form of test specifications?
4. Is the procedure face valid and ‘biased for best’?
5. Are the test tasks as authentic as possible?
6. Does the test offer beneficial washback to the learner?

(Please read more about the application of these principles in Brown,
H. Douglas 2004, pp. 31-38.)
The five questions you need to bear in mind when you start designing tests

1. What is the purpose of the test?
2. What are the objectives of the test?
3. How will the test specifications reflect both the purpose and the
objectives?
4. How will the test tasks be selected and the separate items
arranged?
5. What kind of scoring, grading, and/or feedback is expected?
Types of test
Type of test and purpose:
1. Proficiency Test: to assess general ability in a second language.
2. Achievement Test: to evaluate how much a learner knows from a defined amount of course or class work.
3. Diagnostic Test: to identify a student’s strengths or weaknesses in specific areas of language.
4. Placement Test: to determine which would be the most appropriate class, stream, or level in which to place a student, so that subsequent language teaching is appropriate to their needs.
Types of test
1. LANGUAGE APTITUDE TESTS

• Predict a person’s success prior to exposure to the second
language.
• Measure capacity/general ability to learn a foreign
language and ultimate success in that undertaking.
• Two standardized aptitude tests: the Modern Language
Aptitude Test (MLAT) and the Pimsleur Language Aptitude
Battery (PLAB).
• Limitation: both measure success through similar processes of
mimicry, memorisation, and puzzle-solving; there is no research to
show unequivocally that these tasks predict
communicative success in a language, especially untutored
acquisition of the language.
2. PROFICIENCY TESTS
• Not limited to any one course, curriculum, or single skill in
the language.
• Test overall ability.
• Summative and norm-referenced.
• Provide a single score, which plays a gate-keeping role.
• Not equipped to provide diagnostic feedback.
• The tasks must be legitimate samples of English language
use in the defined context.
• Creating these tasks and validating them with research is a
time-consuming and costly process.
• Examples: TOEFL, IELTS
3. PLACEMENT TESTS
• Proficiency tests can act in the role of placement tests.
• Purpose: to place a student into a particular level or section of a
language curriculum or school.
• Example: The English as a Second Language Placement Test
(ESLPT)
• Desirable qualities of a placement test include face validity,
diagnostic information on students’ performance, and authenticity.
4. DIAGNOSTIC TESTS

• Diagnose specified aspects of a language.

• Diagnostic and placement tests may sometimes be
indistinguishable from each other.
5. ACHIEVEMENT TESTS

• Are related directly to classroom lessons, units, or even a total
curriculum.
• Primary role: to determine whether course objectives have
been met – and appropriate knowledge and skills acquired – by
the end of a period of instruction.
• Often summative, but also formative in offering washback
about the quality of a learner’s performance in subsets of the
unit or course.
Distinguishing a diagnostic test and a general
achievement test:

a. Achievement tests: analyse the extent to which
students have acquired language features that have
already been taught.

b. Diagnostic tests: should elicit information on what
students need to work on in the future.
