A Basic Primer for Understanding Standardized Tests and Using
Test Scores
April Zenisky, Lisa Keller, Stephen G. Sireci
UMass Amherst
Introduction
It's nearly impossible to live in American society today without
having to take some kind of standardized test. You have to pass
a test to get a driver's license, get American citizenship, receive
a General Educational Development (GED) certificate, get into college,
and be considered for certain kinds of jobs. Here in Massachusetts,
our children's teachers have to pass state tests to be licensed,
and the children themselves have to pass the Grade 10 version of
the Massachusetts Comprehensive Assessment System test, or MCAS,
to graduate from high school.
Why do we have to take all these tests? Basically, because there
is widespread agreement (but not complete agreement) that tests
can tell if a person has the knowledge or skills needed for a diploma,
a certificate, a school class level, or a job. But it's not just
any test we're talking about here - it's standardized tests. Standardized
tests are used because people feel that if you're going to judge
someone's abilities, you'd better use a means that's reliable and
fair, and standardized tests are designed to be reliable and fair
- though people might disagree about whether they succeed in those
goals.
We will not debate that issue here; the purpose of this article
is to equip readers with a basic understanding of what goes into
standardized test making and what test scores purport to show about
learners' skills and abilities. We welcome any constructive use
of this knowledge, whether it be better instruction or better policies,
but all constructive uses start with accurate knowledge.
Meeting the reliability and validity criteria
Federal policies now require that states prove that Adult Basic
Education (ABE) funds result in learner gains in reading, writing,
language acquisition, and math. In addition, they require that
states measure these gains
with valid and reliable tests. After months of reviewing many standardized
assessments and their respective alignment with the Curriculum Frameworks,
Massachusetts policymakers and education professionals have agreed
to use the TABE for ABE Reading, Writing, and Math; the BEST for
ESOL Speaking and Listening; and the REEP for ESOL Writing. Scores
in each of these tests are meant to represent what students know
or can do in those areas. What does it mean when we say these tests
are reliable and valid? Let's take up each of these concepts in
turn.
Reliability
The consistency of scores across different administrations or scorers
is known as reliability. It is crucial that test scores be adequately
reliable in representing a person's knowledge and skills. Some level
of error is always a factor in testing and in test scores (more on
this later). If a person takes the same test on different days,
we expect the results to be slightly different, but the more error
there is in the test's make-up, the more different the two test
scores are likely to be. If the two test scores are very different,
it is reasonable to conclude that the difference is due to test
error and that the scores do not really reflect what the test taker
knows and is able to do.
Inconsistencies in scoring tests might also undercut reliability.
Some tests are composed of multiple-choice questions, while others
require that the test taker construct a response, such as an essay.
Scoring a multiple-choice question is straightforward, because there
is one right answer; the answer provided is either correct or incorrect.
Therefore, regardless of who scores the test, the score on that
question will be the same. Essay-type questions, however, require
human judgment and are therefore more difficult to score. If two
people read the same essay, it's likely that each person will give
the essay a slightly different score. However, if the two scores
given by the two scorers, or "raters," are very different,
then the score on that essay is not very consistent, or reliable.
The measure of consistency between scorers is called inter-rater
reliability.
The closer the scores assigned to an essay by different raters,
the higher the inter-rater reliability of that test. While it might
seem impossible to get different raters to assign exactly the same
score, it is possible to train raters so that they all score in
a very similar way. If this goal is accomplished, there can be more
confidence that the score assigned to the essay reflects the ability
of the student.
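To make the idea of rater consistency concrete, here is a minimal sketch of two simple agreement checks. The essay scores, the 0-to-6 scale, and the function name `agreement_rates` are all made up for illustration; they do not come from any real scoring session, and testing programs typically use more refined statistics than these.

```python
# Hypothetical essay scores (0-6 scale) given by two raters to the same eight essays.
rater_a = [4, 3, 5, 2, 6, 3, 4, 1]
rater_b = [4, 4, 5, 2, 5, 3, 2, 1]


def agreement_rates(scores_a, scores_b):
    """Return (exact, adjacent) agreement as fractions of all essays.

    exact:    both raters gave the same score.
    adjacent: the two scores differ by at most one point.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent


exact, adjacent = agreement_rates(rater_a, rater_b)
print(f"exact agreement: {exact:.3f}, within one point: {adjacent:.3f}")
# prints: exact agreement: 0.625, within one point: 0.875
```

In this invented example the raters match exactly on five of eight essays and land within one point on seven of eight. The one essay scored 4 by one rater and 2 by the other is the kind of disagreement that rater training is meant to reduce.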

Validity
How do we know whether a test measures the ability we are interested
in? Even if a test is perfectly reliable and virtually error-free,
how do we know if it is measuring the abilities we want it to and
not something else? This is the central concern of validity, and
ultimately involves the kinds of judgments that can be based on
test scores.
Let's consider a math test consisting only of word problems. The
test score could appropriately be used to indicate the student's
ability to solve math problems that require reading; that would
be a valid use of the test score. However, using the test score
as a representation of the student's math ability in general would
not be valid.
People who develop tests analyze them in several ways to determine
the appropriate (i.e., valid) use of test scores. Let's review some
of the issues considered in determining the valid use of test scores:
- Do the questions on the test represent the entire subject matter
about which conclusions are to be drawn? For instance, if a test
is designed to measure general arithmetic ability, there should
be questions about addition, subtraction, multiplication, and
division. If there are no questions about division, the test does
not measure the entire content of arithmetic, so the test score
cannot be said to reflect general arithmetic ability.
- Is the student required to demonstrate the skill that the test
is intended to measure? A test should be directly targeted to the
skill being measured, and that skill should affect test performance.
For example, a test designed to measure writing proficiency should
ask test takers to write something, and better writers should
be shown to receive higher scores.
- Are the test scores consistent with other indicators of the
same knowledge and skills? Suppose a student takes a test designed
to measure writing ability. If the student does well on writing
assignments in class, then he or she should also do well on the
writing test, so long as the type of writing on the test is consistent
with that done in class. On the other hand, students who do not
perform well on writing assignments in class should not do
as well on the test. The validity of using that test score as
an indication of the person's ability is questionable if there
is inconsistency between the score and classroom performance.
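One common way to check the last question in the list above is to see how closely test scores track another indicator of the same skill. The sketch below computes a Pearson correlation between classroom writing grades and writing test scores. All of the numbers are invented for illustration, and a correlation is only one piece of validity evidence, not a verdict on the test.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data for six students: classroom writing grades (0-100)
# and scores on a writing test reported on a 0-6 scale.
class_grades = [62, 70, 75, 81, 88, 93]
test_scores = [2.5, 3.0, 3.5, 3.5, 4.5, 5.0]

r = pearson_r(class_grades, test_scores)
print(f"correlation between class grades and test scores: {r:.2f}")
```

A value near 1 means students who do well in class also tend to do well on the test, which supports using the score as an indicator of writing ability. A value near 0 would be a warning sign that the two are measuring different things.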
Using test scores
By itself, a test score is just a number. Elsewhere in this issue,
you'll discover how teachers are finding ways to apply elements
of goal setting and assessment to classroom practice; our purpose
here, however, is to provide readers with a basic understanding
of standardized tests and scoring. When teachers, students, and
others who use test scores are looking at a test score for a particular
student, there are a few additional pieces of information they
can use to make that number mean something. In the next few pages,
some of these pieces of information are explained to help you understand
what test scores do and do not mean.

Test score scales
A score scale is the range of possible scores on the test.
Score scales come in all shapes and sizes. On the TABE, for example,
different students might get scores as divergent as 212 and 657.
In contrast, on the REEP, scores range only from 0 to 6. A student
who takes the BEST, depending on his or her ability to comprehend
and speak English, will score from 0 to 65 or higher. Is a 212 on
the TABE a "better" score than a 5.4 on the REEP? Even
though 212 is a bigger number, these two scores come from tests
that are very different and are designed to test very different
things. For this reason, comparing scores across different tests
is generally not a good idea.
Because scores from the REEP, the TABE, and the BEST are all on
different score scales, the number a person gets as a score on one
of those tests has meaning for that test only. It might be confusing
to have different score scales, but the people who develop tests
do this on purpose to make sure that users do not interpret scores
on a particular test according to some other standard or yardstick.
For example, in the United States the score scale of 0 to 100 is
commonly used in many classrooms, but people who make standardized
tests often avoid that score scale because many people would assume
that such scores mean the same thing they do in the classroom. Sometimes
test developers work really hard to create a unique score scale:
e.g., on one test used in the United States for admission to medical
school, scores are graded from J (the lowest score) to T (the highest
score)!

Error in test scores
As we explained at the beginning of this article, some error is
always a factor in test score interpretation. In fact, tests simply
cannot provide information that is 100% accurate. This might sound
surprising, but it is true for many reasons; for example:
- The extent to which a student has learned the breadth and depth
of a subject will influence how she or he performs on a test.
On a reading test, for example, a student might do well with questions
about word meaning and finding the main idea of a passage but
have had less practice distinguishing fact from opinion. The experience
(or lack thereof) that a test-taker brings to the test represents
a source of error in terms of using the test score to generalize
about the student's reading ability.
- Sometimes a student taking a test is just plain unlucky. If
a student is tired, hungry, nervous, or too warm, he or she might
do worse on the test than if the circumstances were different.
- A test might have questions that seem tricky or confusing. If
a student is not clear about the meaning of a question, he or
she will have trouble finding the correct answer.
- As we mentioned earlier, mistakes may be made in scoring a test.
When students are not given credit for correct answers or are
given credit for incorrect answers, score accuracy suffers.
Standard error of measurement
The score a person gets on the test is meant to indicate how well
that person knows the information being tested. One way of looking
at a test score is to think of it as consisting of two parts. One
part represents the real but unknowable true ability of a person.
This part is unknowable because it is never possible to get inside
someone's head and have a perfect measure of their ability in the
area of interest. The other part of a test score represents the
error, all the things that make the test a less-than-perfect snapshot
of someone's knowledge at one moment in time. Unlike the way we
can manufacture a yardstick that is exactly three feet long to measure
length, even the best tests can provide scores that are only approximations
of the true ability.
Unfortunately, it is impossible to break these two pieces of a
test score (the true ability and error) apart. But it is important
to understand that any test score contains a certain amount of error,
and as we've illustrated the error might be due to things that are
going on with the test taker or things that involve how the test
is created or scored. Errors in test scores cannot be completely
eliminated, but fortunately there are techniques that can be used
to provide some idea about how much the score is affected by error.
For example, testing specialists can calculate the standard error
of measurement, which can be thought of as the typical spread of the
scores the same person would obtain by taking the same test at
different times. It is a statistical "best guess" about how close
the test comes to measuring a person's knowledge or skill with 100%
accuracy: an estimate of how far from the true score the test score
is likely
to be.
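The two-part view of a test score can be imitated with a small simulation. The sketch below is only an illustration under stated assumptions: we pretend to know a person's true score (450, a made-up number), we assume the error on each testing occasion is random and bell-shaped with a spread of 17 points, and we imagine the person could retake the test thousands of times without learning or tiring. None of this is possible with real test takers, which is why the standard error of measurement has to be estimated statistically.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_administrations, seed=0):
    """Simulate observed score = true score + random error on each occasion.

    Assumes the errors are independent and normally distributed around zero,
    which is a simplification chosen for this illustration.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_administrations)]


# Hypothetical values: a true score of 450 and an error spread of 17 points.
scores = simulate_observed_scores(true_score=450, error_sd=17, n_administrations=10_000)

average_score = statistics.mean(scores)
estimated_sem = statistics.stdev(scores)  # spread of the repeated scores

print(f"average of observed scores: {average_score:.1f}")
print(f"spread of observed scores:  {estimated_sem:.1f}")
```

Over many imagined retakes the observed scores average out close to the true score, and their spread comes out close to the error spread that was built in. That spread is what the standard error of measurement is meant to capture.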
Let's take the TABE as an example. Suppose a student takes the
TABE Reading Test, Level 7E and gets a score of 447. First of all,
that score isn't very low or very high. The next piece of information
that will be helpful in understanding this TABE score is the standard
error of measurement. The statistics of test development have shown
that the standard error of measurement associated with 447 is 17
points, which means that the student's true score is probably between
430 and 464. This score range was calculated by subtracting 17, the
standard error of measurement, from the score of 447 and adding 17 to it. The
standard error of measurement gives us a good idea of score accuracy.
In the last example the true score was described as probably falling
within 17 points of the score the student got on the test; for a
score of 630 on the same test, the standard error of measurement
is a much bigger number: 64. In this case, the student's true score
probably falls between 566 and 694. There is likely a very big difference
in TABE reading knowledge between a 566 score and a 694, so it would
be harder to interpret a student's knowledge within such a large
range. The size of the standard error of measurement is in large
part dependent on the reliability of the test, which was explained
previously.
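The arithmetic behind these score ranges can be sketched in a few lines. The function `score_band` simply reproduces the add-and-subtract step from the two TABE examples above. The function `sem_from_reliability` uses a standard relationship from classical test theory that this article does not spell out: the standard error of measurement equals the standard deviation of the scores times the square root of one minus the reliability. The function names, the standard deviation of 50, and the reliability values are hypothetical and are not taken from TABE documentation.

```python
import math


def score_band(observed_score, sem, n_sems=1):
    """Range of scores within n_sems standard errors of the observed score."""
    return observed_score - n_sems * sem, observed_score + n_sems * sem


def sem_from_reliability(score_sd, reliability):
    """Classical test theory relationship: SEM = SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * math.sqrt(1 - reliability)


# The two examples from the text.
print(score_band(447, 17))  # (430, 464)
print(score_band(630, 64))  # (566, 694)

# Hypothetical illustration: same score spread, two different reliabilities.
for reliability in (0.91, 0.75):
    sem = sem_from_reliability(score_sd=50, reliability=reliability)
    print(f"reliability {reliability:.2f} -> standard error of about {sem:.0f} points")
```

With the same spread of scores, the less reliable test produces the larger standard error and therefore the wider score range, which is the connection between reliability and score accuracy described above.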

Conclusions
Concepts like reliability, validity, test score scales, and standard
error of measurement give meaning to numbers that on their own might
not mean much. Of course, the score that someone gets on a test
is just one piece of information that tells what he or she knows
and is able to do in one very specific and carefully defined subject
area. While tests and test scores are important, and it is important
to try your best on any test you take, it is also important to remember
that any one test score is just that: one test score. The sidebar
rules for interpreting test scores given in this article might help
you use test scores in meaningful ways.
Are all tests as good as they should be? Do all tests provide useful
information? Unfortunately the answer is "no," but researchers
at UMass, working in collaboration with the Massachusetts Department
of Education, Adult and Community Learning Services, are striving
to create tests for ABE students that produce scores that are reliable
and can help us make valid decisions about students and programs.
Our efforts are focused on making sure the numbers that are test
scores - whether from the REEP, the BEST, the TABE, or any new tests
that will be developed - are as meaningful and dependable as possible.
April L. Zenisky is Senior Research Fellow and Project Manager
at the Center for Educational Assessment at UMass Amherst. Her research
interests include computer-based testing methods and designs, applications
of item response theory, and innovations in test item formats.
Lisa A. Keller is Assistant Professor in the Research and Evaluation
Methods Program and Assistant Director of the Center for Educational
Assessment at UMass Amherst. Her research interests include Bayesian
statistics, computerized adaptive testing and item response theory.
Stephen G. Sireci is Associate Professor in the Research and Evaluation
Methods Program and Co-Director of the Center for Educational Assessment
in the School of Education at UMass Amherst. He is known for his
research in evaluating test fairness, particularly issues related
to content validity, test bias, cross-lingual assessment, standard
setting, and sensitivity review.
Originally published in Adventures in Assessment,
Volume 16 (Spring 2004),
SABES/World Education, Boston, MA, Copyright 2004.
Funding support for the publication of this document
on the Web provided in part by the Ohio State Literacy Resource
Center as part of the LINCS
Assessment Special Collection.