Jerry W. Larson
Harold S. Madsen
Tests, with their many inexact measurements, have long been the concern of psychometricians. The two issues of adequate precision and a common 'yardstick' for measuring persons of different abilities have been particularly difficult to deal with. Computer-adaptive testing holds great promise in dealing with these and related issues in the field of testing. This paper describes several test variables of concern and explains the concept of computer-adaptive testing and its relationship to these variables.
KEYWORDS: testing, CALT, computer-adaptive, test variables, item response theory, proficiency, precision, ability, Rasch, Mediax, Microscale Plus, outfit, infit, logit.
While the role of computers in language teaching and materials development has been widely discussed in professional literature, much less has been written about the role of computers in language testing (Henning 1985, Perkins and Miller 1984). After briefly surveying applications of computers in language testing, we will focus on the promising specialized area of computerized adaptive testing.
Before discussing computer applications in testing, it seems prudent to identify some of the possible limitations as well as the advantages of this relatively new technology.
Disadvantages of Computers in Language Testing
An obvious concern in many institutions (notably in public schools) is the cost of both the hardware and the software. Although there is considerable interest in computers around the country, most schools have not yet been able to budget for enough hardware to handle instructional or testing needs. There is also the need to acquaint examinees with the operation of a computer. Lack of facility with the computer could lead to a double-jeopardy evaluation situation in which results reflect not only language proficiency but also computer proficiency. For some, there is the possibility of unanticipated results (see Cohen 1984), such as anxiety in manipulating the machine. This in turn suggests the potential for test bias, favoring those with experience on computers.
A final limitation is the tendency for users of the new technology to utilize objectively scored language tests almost exclusively. This means that sound approaches to evaluation including essays, dictation, and holistically-scored oral interviews might well be neglected by persons eager to employ computers in language testing.
Advantages of Computers in Language Testing
If the concerns of cost and appropriate balance in test format can be dealt with satisfactorily, for example, by using computer-related evaluation to complement but not supplant other forms of testing, there is much to be gained in the process. For one thing, computer assisted testing (CAT) is obviously compatible with computer assisted instruction (CAI). Where CAI terminals are available for students, it is logical to employ these same terminals for evaluation as well. Since CAI is typically self-paced, exams could also be introduced when the CAI student is ready to be evaluated, thus reducing the tension associated with group paper-and-pencil tests. Relatively frequent evaluation is possible without constituting a drain on class time. This enables students not only to monitor and corroborate their progress but also to avoid tiring and threatening make-or-break examinations. Another plus is the possibility of getting test results immediately rather than having to wait for days until the teacher has time to grade the tests. This advantage not only relieves the teacher of a considerable burden but also provides feedback to students while they still remember the tasks they have been engaging in.
CAT saves time for the teacher in revising the tests as well as in scoring the items. In fact, the ease of editing the test encourages revision, since there is no longer any need to retype the entire exam. The computer is also able to maintain banks of test items, which can be easily accessed.
Test types in addition to multiple-choice can be managed on the computer, such as the cloze, where words are to be typed into blanks in a prose passage; scaled items (for example, examiner ratings that range from 0 to 4 of student responses on an oral interview); and even essay exams. While the latter are not scored objectively, compositions written on the computer have been shown under experimental conditions to be superior to others since students voluntarily revise their work repeatedly on the computer (Nickell 1985, Strong-Krause and Smith 1985). And teachers typically have a cleaner copy to evaluate.
Later we will discuss additional advantages of using computers in language testing, such as individualizing test items and providing flexibility in test length. But first we will review some of their most significant applications.
The Computer as Adjunct to Testing
Computers, notably mainframes, have been used for some time as adjuncts to language evaluation. They have been particularly useful to those concerned with test research. Factor analysis, ANOVA, multiple regression, and other statistical procedures have helped facilitate investigations into such matters as construct validation (Bachman and Palmer 1983) and test affect and test bias (Madsen 1982).
Educational institutions, particularly colleges and universities, have used computers for a number of years as a convenient record keeper of exam results. Some schools maintain item banks in various disciplines; others provide for the machine scoring of tests as well as computer-generated item analyses of multiple-choice questions, and statistical graphs of results (Frey 1984).
Computer-Assisted Test Applications
In its least imaginative form CAT serves merely as a page turner. But more creative applications are readily apparent: As in CAI, cues to examinees can be personalized. At the conclusion of the test, scores can be provided as well as explanatory information on suggested review material or placement sections. The computer can track student testing patterns, recording not only the scores but also the time spent and even the sequence of items attempted.
Depending on how the test is constructed, helpful diagnostic information on areas of strength or weakness can be instantly provided (Jones and McKay 1983). In addition, there is the possibility of capitalizing on calls for communicative interaction in testing (Carroll 1980). Just as interactive video is rapidly being recognized as a significant tool in instruction (Rowe, Scott, and Benigni 1985), there are fascinating possibilities for interaction in language testing. These range from a request to have a word glossed or translated to a request for a phrase or item to be repeated on a listening test (with appropriate point adjustment). The student might even be provided with a menu allowing for branching to less challenging items. The possibilities seem almost limitless.
The Special Role of Computerized-Adaptive Language Testing
As intriguing as the potential of CAT is, computerized-adaptive language testing (CALT) provides solutions to problems that have vexed psychometricians for decades: notably the two issues of adequate precision and a common 'yardstick' for measuring persons of differing ability.
While providing the means of resolving these two concerns of test experts, CALT also helps satisfy some of the practical concerns faced by language teachers, students, and test administrators. These include the matter of identifying item bias, as well as the problems of test length and precision, which is related to validity. Most are aware of lengthy tests (created to ensure that coverage has been truly adequate) and the resulting boredom on the part of advanced learners or frustration on the part of less proficient students. Language teachers are also painfully aware of their imprecise tests, which identify with certainty only the very brightest and the weakest, leaving a large number of in-between students inadequately evaluated.
Two Key Concerns of Test Experts
One concern of test experts is the need for real precision in tests. Over half a century ago psychometricians began formulating what is known as item response theory (IRT). An important finding was that the most effective test was one
that presented items within a range of difficulty close to the ability level of the examinee. This concept led very early to the notion of tailored testing. Binet's mental measurement test, for example, was administered one-on-one precisely to capitalize on this notion. When larger numbers of students needed to be evaluated, this procedure was modified and finally abandoned. But using Wainer's track and field metaphor (1983), we can see the logic of this tailored-testing concept: If one's high-jumping range were four to five feet, hurdles of only one and two feet would contribute essentially nothing to our measurement of the individual, nor would impossibly high hurdles of seven and eight feet. Like contemporary language tests, they would only bore or frustrate the person being evaluated.
In language testing, the tailored-testing concept has been applied in the Foreign Service Institute's classical oral interview (the FSI) and retained in the government's Interagency Language Roundtable exam (ILR). However, this test is so costly in terms of examiner preparation and examinee administration time that only limited application has been made of the item response theory concept, i.e., that the most precise measure of a person's proficiency is one derived from items at or near one's level of ability.
IRT also addresses itself to psychometricians' concern for a common scale of measurement. Contemporary item response theory manifests itself in various statistical models under the rubric of latent trait analysis. The one most appropriate for relatively small-scale language test analysis (with an N of only 100 or sometimes less) is the one-parameter Rasch model, which focuses on item difficulty and student ability. Rasch analysis, like other latent-trait procedures, comes to grips with one of the most frustrating problems in language testing: the development of a suitable, invariable 'yardstick' for measuring student ability (Wright 1977).
Teachers and test experts alike are well aware that on any given grammar test, for example, a 'high' score for one group of students may be 'low' for another group. And whether a test item is difficult or easy depends on who is being evaluated. How absurd it would be to get differing weights for a bag of sugar on two different scales!
Now, with the logarithmic calculations built into the Rasch latent-trait model, it is at last possible to calibrate test items independent of the persons being measured; and it is likewise possible to place examinees on an absolute scale, independent of the measure being used. In short, a common scale of measurement is now within the reach of test experts and language teachers. Moreover, by utilizing computer applications mentioned in the next section of this article, the other concern of test experts can be met: namely, providing precise evaluation by selecting items tailored to the ability level of each examinee. And, finally, certain needs of language teachers and administrators can simultaneously be dealt with through computerized Rasch analyses. These needs include improved precision and validity, identification of item bias, and up to an 80 percent reduction in test length.
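The calculation at the heart of the Rasch model is simple enough to sketch. In the dichotomous case, the probability of a correct response depends only on the difference, in logits, between person ability and item difficulty; it is this property that allows items and persons to be placed on one common scale. (The Python function below illustrates the standard textbook formula; it is not part of Microscale.)

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability that a person of the given ability (in logits)
    answers an item of the given difficulty correctly, under the
    one-parameter (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A person whose ability exactly matches an item's difficulty has an
# even chance of success; the probability rises as ability exceeds
# difficulty and falls as the item moves out of the person's range.
p_even = rasch_probability(1.0, 1.0)    # → 0.5
p_hard = rasch_probability(0.0, 3.0)    # an item far above ability
```

Note that only the difference (ability minus difficulty) enters the formula, which is why the same logit scale serves as the 'yardstick' for both persons and items.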
Computer Application of the Rasch Model
For many years educators have used computers to organize and analyze data using a variety of spreadsheet and statistical packages. Early in the computer era, programs on mainframe computers were all that were available to educators. Recently, however, several companies have produced statistical software for micro-computers. One of these new programs, Microscale Plus!, from Mediax Interactive Technologies, Inc., has been developed expressly to perform Rasch analysis. Presently the program is somewhat limited in the amount of data that can be processed at one time (254 students' responses on 62 items or 62 students' responses on 254 items). Reportedly, however, it is soon to be upgraded to accommodate approximately 9000 students.
Microscale uniquely draws attention to misfitting items or students by identifying those that do not 'fit' on the uniformly calibrated scale. The program analyzes responses scored as either right or wrong as well as responses on a rating scale, provided the interval of the rating scale is identical across all items.
Data Entry and Calibration
Item responses are recorded onto the Supercalc3! spreadsheet program, which serves as the data organizer. Using the spreadsheet, one is able to make entry corrections, change the order of the data, or do other desired data manipulations before actually performing the analysis routines of Microscale. Major headings as well as subheadings can be specified in order to clearly understand the charts and graphs produced during the analysis.
Data can be entered manually using the keyboard of the micro-computer or downloaded from a mainframe to the PC. Student responses to each item are entered into the data matrix of the spreadsheet as either 1 (correct) or 0 (incorrect). Unlike many statistical analyses, Microscale will handle up to 50% missing data, estimating what a student's score would be, based on his performance on the rest of the test.
Once the data have been entered, two statistical algorithms (PROX and UCON) are applied to calculate estimates of student ability and item difficulty. Convergence iterations cease when the expected score for each student or item reaches a difference of 0.1 from the actual score on the test.
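The stopping rule just described can be made concrete with a simplified sketch. The routine below performs a crude joint Newton-Raphson estimation of abilities and difficulties for a complete right/wrong matrix, iterating until every expected raw score is within 0.1 of the observed score. It is a hypothetical illustration in Python, not Mediax's actual PROX or UCON code, and it assumes no person or item has a perfect or zero score.

```python
import math

def prob(theta, b):
    """Rasch probability of a correct response for ability `theta`
    and item difficulty `b`, both in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def calibrate(responses, tol=0.1, max_iter=500):
    """Crude joint estimation for a complete 0/1 response matrix
    (rows = persons, columns = items). Newton-Raphson updates are
    repeated until every expected raw score is within `tol` of the
    observed raw score, mirroring the stopping rule described above."""
    n_persons, n_items = len(responses), len(responses[0])
    theta, b = [0.0] * n_persons, [0.0] * n_items
    for _ in range(max_iter):
        converged = True
        for p in range(n_persons):          # ability updates
            obs = sum(responses[p])
            probs = [prob(theta[p], b[i]) for i in range(n_items)]
            expected = sum(probs)
            if abs(obs - expected) > tol:
                converged = False
                info = sum(q * (1 - q) for q in probs)
                theta[p] += (obs - expected) / info
        for i in range(n_items):            # difficulty updates
            obs = sum(row[i] for row in responses)
            probs = [prob(theta[p], b[i]) for p in range(n_persons)]
            expected = sum(probs)
            if abs(obs - expected) > tol:
                converged = False
                info = sum(q * (1 - q) for q in probs)
                b[i] -= (obs - expected) / info
        if converged:
            break
    shift = sum(b) / n_items                # centre the item scale at zero
    return [t - shift for t in theta], [x - shift for x in b]
```

Items answered correctly by fewer students come out with higher difficulty estimates, and students with higher raw scores come out with higher ability estimates, all expressed in logits on the same centred scale.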
Once the analysis routines have finished, results tables and graphs are generated and can be displayed on the computer screen or printed (using a dot matrix printer). Table I illustrates the results table for items. (The data presented here are from a recent study in which we analyzed the results of an ESL reading test. The test contained 60 items and was administered to 183 subjects.)
The same kinds of data are available in a results table for students as are shown in Table I for the items. (Note, however, that while the Item Results Table portrays item difficulty, the Student Results Table indicates student ability.) The first column indicates the item number on the test. The next column reveals how many students answered that item correctly. The third column gives the estimated difficulty of that item in logits. The error statistic in the fourth column refers to the standard error of the difficulty estimate. The next two columns present the fit statistics, infit and outfit, which are more easily understood, perhaps, by referring to their graphic representations (see Figures 1 and 2). These two statistics indicate how well each item is able to measure the individuals taking the test.
The infit and outfit statistics are based on all persons taking the item. The latter, however, is significantly affected by unexpected responses, or misfits. Information about item goodness of fit can be helpful in identifying problematic items, such as those that are biased and not suited for persons of a particular ethnic background. This information also enables one to uncover items that are redundant or dependent upon earlier responses and therefore of questionable value.
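In standard Rasch practice these fit statistics are mean-square residuals: outfit is the unweighted average of squared standardized residuals, which is why a single wildly unexpected response inflates it, while infit weights each residual by its information and so is dominated by persons near the item's difficulty. A hypothetical Python sketch of the usual formulas follows; Microscale's exact computation may differ in detail.

```python
import math

def item_fit(responses, abilities, difficulty):
    """Infit and outfit mean-squares for one item, given each person's
    0/1 response and ability estimate (in logits). Values near 1.0
    indicate good fit; values well above 1.0 flag misfit."""
    residuals_sq, variances, z_sq = [], [], []
    for x, theta in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
        w = p * (1 - p)                  # model variance of the response
        residuals_sq.append((x - p) ** 2)
        variances.append(w)
        z_sq.append((x - p) ** 2 / w)    # squared standardized residual
    outfit = sum(z_sq) / len(z_sq)               # unweighted: outlier-sensitive
    infit = sum(residuals_sq) / sum(variances)   # information-weighted
    return infit, outfit
```

For example, if a very weak student answers a hard item correctly while a very strong student misses it, both statistics rise above 1.0, but the outfit rises far more sharply, reflecting its sensitivity to such surprising responses.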
In addition to the outfit and infit graphs, Microscale produces a map that juxtaposes items and students on the same scale (see Figure 3). The distribution of students is represented above the dotted line, and the spread of items is shown below the line. The map proves to be very useful for determining areas of student ability that the test items are not measuring. Notice, for example, that there are several items to the far left (very easy items) that do not help to evaluate student ability. These items may be deleted in an effort to refine this test. On the other hand, there is a shortage
of items that discriminate among students at the other end of the scale; therefore, we are less confident of our measurement of student proficiency at the highest range of ability. Additional items in this difficulty range should be added to the test.
Item Linking and Banking
Being able to calibrate items with extreme precision makes it possible to link together items from separate measures to form item banks. These linked or coordinated questions, which all serve to define a given variable, e.g., reading comprehension, provide a pool of items from which alternate test forms can be generated without compromising accuracy. Using an anchoring procedure, Microscale makes it possible to calibrate a new set of items to the difficulty level of those in the bank. This is done by including in the new test a small number of selected items (about 10 to 20) from the initial, calibrated test administered to a different group of students. While a few of the new items might not fit and have to be discarded, the potential for expanding the item bank is virtually unlimited. It may even become feasible to share items among institutions.
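The arithmetic of anchoring reduces, in its simplest form, to shifting the new calibrations by the average difference observed on the shared items. The function below is a simplified mean-shift illustration in Python, under the assumption that the anchor items fit well; it is not Microscale's own procedure, and the item identifiers in the usage note are hypothetical.

```python
def link_calibrations(bank, new):
    """Translate a fresh set of item difficulties onto a bank's scale
    using the items the two calibrations share (the 'anchor' items).
    `bank` and `new` map item identifiers to difficulties in logits."""
    common = set(bank) & set(new)
    if not common:
        raise ValueError("no anchor items shared between calibrations")
    # Average discrepancy on the anchors gives the scale shift.
    shift = sum(bank[i] - new[i] for i in common) / len(common)
    return {item: difficulty + shift for item, difficulty in new.items()}
```

If a bank placed anchor items r1 and r2 at 0.0 and 1.0 logits, and a new calibration placed the same items at -0.5 and 0.5, every new item would be shifted up by half a logit before entering the bank.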
Being able to create large banks of items calibrated to various levels of difficulty permits the development of tests designed to assess specific levels of ability. One such test is a computerized adaptive language test.
The Computer and Computerized Adaptive Language Tests
The underlying assumption of an adaptive test is that the test must be able to adjust to the ability level of the person taking it. We have noted that asking examinees questions at or near their level of ability results in a much more precise measure of their competence than subjecting them to questions ranging far above or beneath their capability.
Requisite to constructing an adaptive language test is having items whose difficulty level has been accurately identified. These precise calibrations are necessary in order to be certain that students are presented with an easier item than the one just missed or a more difficult item than the one just answered correctly. Once the calibrated items are available, tests can be created that will adjust to the ability level of the student.
The computer is able to play a primary role in adaptive testing because of its capability to branch quickly from item to item, selecting either more difficult questions or easier questions according to the responses of the examinees. Though that kind of operation is easily handled by the computer, determining the branching logic is a bit more difficult. Programming and authoring decisions must first be made, such as how far up or down the difficulty scale the computer should search for the next item, or how many or what percentage of correct responses are needed at a given level to determine with a high degree of confidence the student's level of ability.
Once these issues are resolved, the computer is able to function as administrator of the test, somewhat similar to the role of an oral proficiency examiner: the examinees begin the test at a relatively low, comfortable level, allowing them to gain confidence and to feel a little more at ease; the computer then begins to probe in order to assess the students' highest competency level; after the examinees have successfully completed a predetermined number of items, they may be given another item or two slightly below their ability level to end the test on a positive note.
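The administration sequence described above can be sketched as a simple up/down driver. Everything here is a hypothetical illustration rather than a production CALT engine: the item bank maps calibrated difficulties (in logits) to questions, and the branching rule simply moves half a logit up the scale after a correct response and half a logit down after an incorrect one.

```python
def adaptive_test(item_bank, answer, start=-1.0, step=0.5, length=5):
    """Minimal up/down adaptive test driver.  `item_bank` maps item
    difficulty (in logits) to the question itself; `answer(question)`
    returns True for a correct response.  The test begins at an easy
    level, then seeks harder items after correct answers and easier
    items after incorrect ones."""
    target = start
    administered = []
    available = dict(item_bank)
    for _ in range(min(length, len(item_bank))):
        # choose the unused item closest in difficulty to the target
        d = min(available, key=lambda diff: abs(diff - target))
        question = available.pop(d)
        correct = answer(question)
        administered.append((d, correct))
        target = d + step if correct else d - step
    return administered
```

A real engine would add a stopping rule based on the standard error of the ability estimate and would pick items to maximize information at the current estimate, but the branching skeleton is the same.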
While it is true that there are some restrictions and disadvantages in using computers for language testing, the potential for computer application in this area is enormous. Both computer assisted and computerized adaptive testing warrant careful consideration as language testers devise measures to evaluate proficiency or achievement and to provide information for placement or diagnostic purposes.
Further information about Microscale Plus! can be obtained by writing to Mr. David Trojanowski, Product Manager, Mediax Interactive Technologies, Inc., 21 Charles Street, Westport, Connecticut 06880-5889.
Logit is a unit of measure used by Microscale. For further explanation, see Benjamin D. Wright and Mark H. Stone, Best Test Design. (Chicago: Mesa Press, 1979) 16-17.
Bachman, Lyle F. and Adrian S. Palmer. "The Construct Validity of the FSI Oral Interview," in John W. Oller, Jr. (ed.), Issues in Language Testing Research. Rowley, Massachusetts: Newbury House Publishers, Inc., 1983.
Carroll, Brendan J. Testing Communicative Performance: An Interim Study. Oxford, England: Pergamon Press, 1980.
Cohen, Andrew D. "Fourth ACROLT Meeting on Language Testing," TESOL Newsletter, 18:2 (April 1984), 23.
Frey, Gerard. "Computer-assisted Testing for the Tests, Measurement and Evaluation Division," Medium: Pedagogical Journal 9:3 (December 1984), 143-145.
Henning, Grant. "Advantages of Latent Trait Measurement in Language Testing." Forthcoming, 1985.
Jones, Randall L. and Brian McKay, "An Apple Computer-based ESL Diagnostic Testing Program," Paper presented at the Seventeenth Annual TESOL Convention, Toronto, Canada; March 17, 1983.
Madsen, Harold S. "Determining the Debilitative Impact of Test Anxiety," Language Learning, 32:1 (June 1982), 133-143.
Nickell, Samilia Sturgell. "Computer-assisted Writing Conferences," Paper presented at the CALICO '85 Symposium, Baltimore, Maryland; February 2, 1985.
Perkins, Kyle and Leah D. Miller. "Comparative Analyses of English as a Second Language Reading Comprehension Data: Classical Test Theory and Latent Trait Measurement," Language Testing, 1:1 (June 1984), 21-32.
Rowe, A. Alan, Al Scott, George Benigni. "Interactive Videodisc Courseware Development," Workshop presented at the CALICO '85 Symposium, Baltimore, Maryland; January 30, 1985.
Strong-Krause, Dianne and Kim L. Smith. "ESL Students Learn to Write with a Word Processor," Paper presented at the CALICO '85 Symposium, Baltimore, Maryland; February 1, 1985.
Wainer, Howard. "On Item Response Theory and Computerized Adaptive Tests," Journal of College Admissions, 28:4 (April 1983), 9-16.
Wright, Benjamin D. "Solving Measurement Problems with the Rasch Model," Journal of Educational Measurement, 14:2 (Summer 1977), 97-116.