Volume 16 Spring 2004

A Basic Primer for Understanding Standardized Tests and Using Test Scores

April Zenisky, Lisa Keller, Stephen G. Sireci
UMass Amherst

Introduction

It's nearly impossible to live in American society today without having to take some kind of standardized test. You have to pass a test to get a driver's license, get American citizenship, receive a General Educational Development (GED) certificate, get into college, and be considered for certain kinds of jobs. Here in Massachusetts, our children's teachers have to pass state tests to be licensed, and the children themselves have to pass the Grade 10 version of the Massachusetts Comprehensive Assessment System test, or MCAS, to graduate from high school.

Why do we have to take all these tests? Basically, because there is widespread agreement (but not complete agreement) that tests can tell if a person has the knowledge or skills needed for a diploma, a certificate, a school class level, or a job. But it's not just any test we're talking about here - it's standardized tests. Standardized tests are used because people feel that if you're going to judge someone's abilities, you'd better use a means that's reliable and fair, and standardized tests are designed to be reliable and fair - though people might disagree about whether they succeed in those goals.

We will not debate that issue here; the purpose of this article is to equip readers with a basic understanding of what goes into standardized test making and what test scores purport to show about learners' skills and abilities. We welcome any constructive use of this knowledge, whether it be better instruction or better policies, but all constructive uses start with accurate knowledge.

Meeting the reliability and validity criteria

Federal policies now require that states prove that ABE funds result in learner gains in reading, writing, language acquisition, and math. In addition, they require that states measure these gains with valid and reliable tests. After months of reviewing many standardized assessments and their respective alignment with the Curriculum Frameworks, Massachusetts policymakers and education professionals have agreed to use the TABE for ABE Reading, Writing, and Math; the BEST for ESOL Speaking and Listening; and the REEP for ESOL Writing. Scores in each of these tests are meant to represent what students know or can do in those areas. What does it mean when we say these tests are reliable and valid? Let's take up each of these concepts in turn.

Reliability

The consistency of scores across different administrations or scorers is known as reliability. It is crucial that test scores be adequately reliable in representing a person's knowledge and skills. Some level of error is always a factor in testing and in test scores (more on this later). If a person takes the same test on different days, we expect the results to be slightly different, but the more error there is in the test's make-up, the more different the two test scores are likely to be. If the two test scores are very different, it is reasonable to conclude that the difference is due to test error and that the scores do not really reflect what the test taker knows and is able to do.
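
One common way to put a number on this kind of consistency is to correlate the scores from two administrations of the same test. The sketch below is only an illustration: the scores are made up, the function name is our own, and the Pearson correlation is just one of several statistics that testing specialists use. A value near 1 suggests that students were ranked in much the same way on both days.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for five students tested on two different days.
day_one = [410, 447, 465, 502, 530]
day_two = [418, 440, 470, 495, 541]

retest_reliability = pearson(day_one, day_two)
print(round(retest_reliability, 2))
```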

Inconsistencies in scoring tests might also undercut reliability. Some tests are composed of multiple-choice questions, while others require that the test taker construct a response, such as an essay. Scoring a multiple-choice question is straightforward, because there is one right answer; the answer provided is either correct or incorrect. Therefore, regardless of who scores the test, the score on that question will be the same. Essay-type questions, however, require human judgment and are therefore more difficult to score. If two people read the same essay, it's likely that each person will give the essay a slightly different score. However, if the two scores given by the two scorers, or "raters," are very different, then the score on that essay is not very consistent, or reliable.

The measure of consistency between scorers is called inter-rater reliability. The closer the scores assigned to an essay by different raters, the higher the inter-rater reliability of that test. While it might seem impossible to get different raters to assign exactly the same score, it is possible to train raters so that they all score in a very similar way. If this goal is accomplished, there can be more confidence that the score assigned to the essay reflects the ability of the student.
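
A simple way to see how close two raters are is to count how often they give the same score, and how often their scores are within one point of each other. The sketch below assumes a 0 to 6 essay scale and uses invented scores; the function name and the one-point tolerance are our own choices for illustration, not a description of how any particular test is checked.

```python
def agreement_rates(rater_a, rater_b, tolerance=1):
    """Return (exact, adjacent) agreement as fractions between 0 and 1.

    exact: share of essays given the same score by both raters.
    adjacent: share of essays whose two scores differ by at most `tolerance`.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length score lists")
    pairs = list(zip(rater_a, rater_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)
    return exact, adjacent

# Hypothetical scores from two raters on the same eight essays (0 to 6 scale).
rater_a = [3, 4, 2, 5, 4, 1, 3, 6]
rater_b = [3, 4, 3, 5, 2, 1, 3, 5]

exact, adjacent = agreement_rates(rater_a, rater_b)
print(exact, adjacent)  # 0.625 0.875
```

In this made-up case the raters agree exactly on five of eight essays and are within one point on seven of eight. Rater training aims to push both numbers up.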

See PDF page 30

Validity

How do we know whether a test measures the ability we are interested in? Even if a test is perfectly reliable and virtually error-free, how do we know if it is measuring the abilities we want it to and not something else? This is the central concern of validity, and ultimately involves the kinds of judgments that can be based on test scores.

Let's consider a math test consisting only of word problems. The test score could appropriately be used to indicate the student's ability to solve math problems that require reading; that would be a valid use of the test score. However, using the test score as a representation of the student's math ability in general would not be valid.

People who develop tests analyze them in several ways to determine the appropriate (i.e., valid) use of test scores. Let's review some of the issues considered in determining the valid use of test scores:

  • Do the questions on the test represent the entire subject matter about which conclusions are to be drawn? For instance, if a test is designed to measure general arithmetic ability, there should be questions about addition, subtraction, multiplication, and division. If there are no questions about division, the test does not measure the entire content of arithmetic, so the test score cannot be said to reflect general arithmetic ability.
  • Is the student required to demonstrate the skill that the test is intended to measure? Tests should directly target the skill being measured, and that skill should affect test performance. For example, a test designed to measure writing proficiency should ask test takers to write something, and better writers should be shown to receive higher scores.
  • Are the test scores consistent with other indicators of the same knowledge and skills? Suppose a student takes a test designed to measure writing ability. If the student does well on writing assignments in class, then he or she should also do well on the writing test, so long as the type of writing on the test is consistent with that done in class. On the other hand, students who do not perform well on writing assignments in class should not do as well on the test. The validity of using that test score as an indication of the person's ability is questionable if there is inconsistency between the score and classroom performance.

Using test scores

By itself, a test score is just a number. Elsewhere in this issue, you'll discover how teachers are finding ways to apply elements of goal setting and assessment to classroom practice; our purpose here, however, is to provide readers with a basic understanding of standardized tests and scoring. When teachers, students, and others who use test scores are looking at a test score for a particular student, there are a few additional pieces of information they can use to make that number mean something. In the next few pages, some of these pieces of information are explained to help you understand what test scores do and do not mean.


Test score scales

A score scale is the range of possible scores on the test. Score scales come in all shapes and sizes. On the TABE, for example, different students might get scores as divergent as 212 and 657. In contrast, on the REEP, scores range only from 0 to 6. A student who takes the BEST, depending on his or her ability to comprehend and speak English, will score from 0 to 65 or higher. Is a 212 on the TABE a "better" score than a 5.4 on the REEP? Even though 212 is a bigger number, these two scores come from tests that are very different and are designed to test very different things. For this reason, comparing scores across different tests is generally not a good idea.

Because scores from the REEP, the TABE, and the BEST are all on different score scales, the number a person gets as a score on one of those tests has meaning for that test only. It might be confusing to have different score scales, but the people who develop tests do this on purpose to make sure that users do not interpret scores on a particular test according to some other standard or yardstick.

For example, in the United States the score scale of 0 to 100 is commonly used in many classrooms, but people who make standardized tests often avoid that score scale because many people would assume that such scores mean the same thing they do in the classroom. Sometimes test developers work really hard to create a unique score scale: e.g., on one test used in the United States for admission to medical school, scores are graded from J (the lowest score) to T (the highest score)!

See PDF page 32

Error in test scores

As we explained at the beginning of this article, some error is always a factor in test score interpretation. In fact, tests simply cannot provide information that is 100% accurate. This might sound surprising, but this is true for many reasons; for example:

  • The extent to which a student has learned the breadth and depth of a subject will influence how she or he performs on a test. On a reading test, for example, a student might do well with questions about word meaning and finding the main idea of a passage but have had less practice distinguishing fact from opinion. The experience (or lack thereof) that a test-taker brings to the test represents a source of error in terms of using the test score to generalize about the student's reading ability.
  • Sometimes a student taking a test is just plain unlucky. If a student is tired, hungry, nervous, or too warm, he or she might do worse on the test than if the circumstances were different.
  • A test might have questions that seem tricky or confusing. If a student is not clear about the meaning of a question, he or she will have trouble finding the correct answer.
  • As we mentioned earlier, mistakes may be made in scoring a test. When students are not given credit for correct answers or are given credit for incorrect answers, score accuracy suffers.


Standard error of measurement

The score a person gets on the test is meant to indicate how well that person knows the information being tested. One way of looking at a test score is to think of it as consisting of two parts. One part represents the real but unknowable true ability of a person. This part is unknowable because it is never possible to get inside someone's head and have a perfect measure of their ability in the area of interest. The other part of a test score represents the error, all the things that make the test a less-than-perfect snapshot of someone's knowledge at one moment in time. Unlike the way we can manufacture a yardstick that is exactly three feet long to measure length, even the best tests can provide scores that are only approximations of the true ability.

Unfortunately, it is impossible to break these two pieces of a test score (the true ability and error) apart. But it is important to understand that any test score contains a certain amount of error and, as we've illustrated, the error might be due to things that are going on with the test taker or things that involve how the test is created or scored. Errors in test scores cannot be completely eliminated, but fortunately there are techniques that can be used to provide some idea about how much the score is affected by error.

For example, testing specialists can calculate the standard error of measurement, which can be thought of as describing how much the scores obtained by the same person taking the same test at different times would typically vary. The standard error of measurement is a "best guess" about how close the test is to measuring a person's knowledge or skill with 100% accuracy. In other words, it is a statistical estimate of how far off the true score the test score is likely to be.
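
The idea of a score made up of a true part plus an error part can be pictured with a small simulation. The sketch below is a thought experiment, not a description of how any real test is analyzed: it assumes a hypothetical student with a true score of 500, assumes the errors follow a bell-shaped (normal) curve with a spread of 20 points, and imagines the student retaking the test many times without learning or forgetting anything.

```python
import random
from statistics import mean, stdev

def simulate_retests(true_score, error_sd, n_retests, seed=0):
    """Simulate observed scores as the true score plus random error."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_retests)]

# Hypothetical values chosen only for illustration.
observed = simulate_retests(true_score=500, error_sd=20, n_retests=10_000)

# The average of many retests lands near the true score, and the spread
# of the retest scores plays the role of the standard error of measurement.
print(round(mean(observed)), round(stdev(observed)))
```

In real life nobody takes the same test thousands of times, which is why the standard error of measurement has to be estimated statistically rather than observed directly.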

Let's take the TABE as an example. Suppose a student takes the TABE Reading Test, Level 7E and gets a score of 447. First of all, that score isn't very low or very high. The next piece of information that will be helpful in understanding this TABE score is the standard error of measurement. The statistics of test development have shown that the standard error of measurement associated with 447 is 17 points, which means that the student's true score is probably between 430 and 464. This score range was calculated by subtracting 17, the standard error of measurement, from the score of 447 and then adding 17 to it. The standard error of measurement gives us a good idea of score accuracy.
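
The arithmetic behind this score range is simple enough to write down as a short function. The sketch below uses the score and standard error of measurement from the example; the function name and the `n_sems` option are our own additions. Under the usual assumption that measurement errors are roughly bell-shaped, a band of one standard error on each side captures the true score about two times out of three, and a band of two standard errors about 95 times out of 100.

```python
def score_band(score, sem, n_sems=1):
    """Return (low, high): the score minus and plus n_sems standard errors."""
    if sem < 0 or n_sems < 0:
        raise ValueError("sem and n_sems must be non-negative")
    margin = n_sems * sem
    return score - margin, score + margin

low, high = score_band(447, 17)
print(low, high)  # 430 464

# A wider, more cautious band of two standard errors (an assumption of
# this sketch, not a range given in the article).
wide_low, wide_high = score_band(447, 17, n_sems=2)
print(wide_low, wide_high)  # 413 481
```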

In the last example the true score was described as probably falling within 17 points of the score the student got on the test; for a score of 630 on the same test, the standard error of measurement is a much bigger number: 64. In this case, the student's true score probably falls between 566 and 694. There is probably a very big difference in TABE reading knowledge between a 566 score and a 694, so it would be harder to interpret a student's knowledge within such a large range. The size of the standard error of measurement is in large part dependent on the reliability of the test, which was explained previously.
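
The link between reliability and the standard error of measurement can be made concrete with a textbook formula from classical test theory: the standard error equals the spread (standard deviation) of test scores multiplied by the square root of one minus the reliability. The sketch below uses invented numbers, not TABE statistics. Note also that this formula gives a single overall value for a test, whereas the TABE figures above differ from one score level to another, so they were evidently worked out in a more detailed way.

```python
from math import sqrt

def sem_from_reliability(score_sd, reliability):
    """Overall standard error of measurement: SD * sqrt(1 - reliability)."""
    if score_sd < 0:
        raise ValueError("score_sd must be non-negative")
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * sqrt(1 - reliability)

# Hypothetical test whose scores have a standard deviation of 50 points.
for reliability in (0.75, 0.91, 0.96):
    print(reliability, round(sem_from_reliability(50, reliability), 1))
```

The higher the reliability, the smaller the standard error: with these made-up figures it shrinks from 25 points at a reliability of 0.75 to about 10 points at 0.96.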



Conclusions

Concepts like reliability, validity, test score scales, and standard error of measurement give meaning to numbers that on their own might not mean much. Of course, the score that someone gets on a test is just one piece of information that tells what he or she knows and is able to do in one very specific and carefully defined subject area. While tests and test scores are important, and it is important to try your best on any test you take, it is also important to remember that any one test score is just that: one test score. The sidebar rules for interpreting test scores given in this article might help you use test scores in meaningful ways.

Are all tests as good as they should be? Do all tests provide useful information? Unfortunately the answer is "no," but researchers at UMass, working in collaboration with the Massachusetts Department of Education, Adult and Community Learning Services, are striving to create tests for ABE students that produce scores that are reliable and can help us make valid decisions about students and programs. Our efforts are focused on making sure the numbers that are test scores - whether from the REEP, the BEST, the TABE, or any new tests that will be developed - are as meaningful and dependable as possible.


April L. Zenisky is Senior Research Fellow and Project Manager at the Center for Educational Assessment at UMass Amherst. Her research interests include computer-based testing methods and designs, applications of item response theory, and innovations in test item formats.

Lisa A. Keller is Assistant Professor in the Research and Evaluation Methods Program and Assistant Director of the Center for Educational Assessment at UMass Amherst. Her research interests include Bayesian statistics, computerized adaptive testing, and item response theory.

Stephen G. Sireci is Associate Professor in the Research and Evaluation Methods Program and Co-Director of the Center for Educational Assessment in the School of Education at UMass Amherst. He is known for his research in evaluating test fairness, particularly issues related to content validity, test bias, cross-lingual assessment, standard setting, and sensitivity review.

Originally published in Adventures in Assessment, Volume 16 (Spring 2004),
SABES/World Education, Boston, MA, Copyright 2004.

Funding support for the publication of this document on the Web provided in part by the Ohio State Literacy Resource Center as part of the LINCS Assessment Special Collection.

 
