

DOCUMENT RESUME

ED 224 811    TM 820 813

AUTHOR       Hogan, Thomas P.
TITLE        Relationship between Free-Response and Choice-Type Tests of Achievement: A Review of the Literature.
SPONS AGENCY National Inst. of Education (ED), Washington, DC.
PUB DATE     [81]
NOTE         51p.
PUB TYPE     Information Analyses (070)
EDRS PRICE   MF01/PC03 Plus Postage.
DESCRIPTORS  *Achievement Tests; Correlation; Essay Tests; *Measurement Techniques; Multiple Choice Tests; *Objective Tests; *Test Format; *Test Selection
IDENTIFIERS  *Free Response Test Items; National Assessment of Educational Progress

ABSTRACT
Do choice-type tests (multiple-choice, true-false, etc.) measure the same abilities or traits as free-response (essay, recall, completion, etc.) tests? A large number of studies conducted with several different methodologies and spanning a long period of time have addressed this question. In this review, attention will be focused almost exclusively on the measurement of the traditional product of education, namely knowledge. This review is limited to empirical studies of the equivalence of free-response and choice-type tests. The major methods used to study the relationship between free-response and choice-type measures are the direct correlation, the criterion correlation and the treatment effect. Contrary to widely held beliefs about choice-type tests, the studies indicate that the two types of tests do generally measure the same traits or abilities. To the extent that there are minor differences, the choice-type measures tend to be more valid; and use of choice-type measures does not seem to have adverse effects on study habits. However, the generalizations are limited by insufficient diversity in the groups studied and may not apply to certain types of more divergent processes. Aspect of National Assessment (NAEP) dealt with in this document: Assessment Instrument (Multiple Choice Exercises) (Open Ended Exercises).

RELATIONSHIP BETWEEN FREE-RESPONSE AND CHOICE-TYPE TESTS OF ACHIEVEMENT:
A Review of the Literature

Thomas P. Hogan
University of Wisconsin-Green Bay

This review was prepared for the National Assessment of Educational Progress, a project of the Education Commission of the States, funded by the National Institute of Education.

Abstract

Do choice-type tests (multiple-choice, true-false, etc.) measure the same abilities or traits as free-response (essay, recall, completion, etc.) tests? A large number of studies conducted with several different methodologies and spanning a long period of time have addressed this question. Contrary to widely held beliefs about choice-type tests, the studies indicate that the two types of tests do generally measure the same traits or abilities; to the extent that there are minor differences, the choice-type measures tend to be more valid; and use of choice-type measures does not seem to have adverse effects on study habits. However, the generalizations are limited by insufficient diversity in the groups studied and may not apply to certain types of more divergent processes.

CONTENTS

THE PROBLEM
METHODS OF STUDY
  Direct Correlation
  Criterion Correlation
  Treatment Effect
  Other Methodological Issues
THE CORRECTION FOR ATTENUATION
THE EARLY YEARS
  Historical Background for Early Studies
  An Overview of the Classical References
  Major Studies of the Early Years: Direct Correlation Studies; Criterion Correlation Studies; Treatment Effect Studies
  Other Studies
  Literature Reviews
  The "Can't Let Go" Syndrome
MORE RECENT INVESTIGATIONS
  Direct Correlation Studies
  Criterion Correlation Studies
  The Difficulty Issue Revisited
  The Effect of Testing
BIBLIOGRAPHY

RELATIONSHIP BETWEEN FREE-RESPONSE AND CHOICE-TYPE TESTS OF ACHIEVEMENT:
A Review of the Literature

THE PROBLEM

To determine with some degree of accuracy and objectivity how much students know (or don't know) is the central problem of educational testing. The determination may be made for a variety of purposes--grading, diagnosis, program evaluation, etc.--thus tilting a particular application in this direction or that, but the basic problem remains the same. The problem, of course, is not new: it goes hand-in-glove with the educational process. Educators, or their external supervisors, have always "tested" students (DuBois, 1970; Ebel, 1972), although in times long past (say before 1850) testing, as well as instructional methodology, was not open to question. Everyone knew precisely how to do it! We're not so confident today.

For reasons which may some day be divined by cognitive psychologists or evolutionary biologists, there is an overwhelming inclination to believe that the "natural," "correct" or "direct" way to assess student knowledge is to put a question to the student and have him/her respond in a free and open manner. Such responses, referred to in testing jargon variously as free-response, open-ended, or constructed responses, may be given orally or in writing. In the latter mode, they are sometimes referred to generically as "essay" tests, although in some instances the "essay" may be as short as a word, a phrase, or a number.

In contrast to the free-response method of testing, we have the now familiar choice-type items, one of the most distinctive contributions of the behavioral sciences to contemporary society.

In choice-type items, the examinee is presented with a number of alternative answers to the question and chooses, most typically, one of these answers as correct. The most popular forms of choice-type items are the multiple-choice variety and the true-false item, which is really just a specific case of a multiple-choice item, i.e. one with two choices (true or false) as possible answers. In addition to the multiple-choice and true-false types of items, numerous other choice-type items have been devised and experimented with. Much research has been conducted with variations in formats, directions, and scoring procedures for this or that choice-type item.

The enduring question for the choice-type items is whether or not these seemingly artificial contrivances measure the same thing as the more "natural and direct" free-response types of item. Popular opinion on this question is rather well formulated and almost universally negative, i.e. the two types of items do not measure the same thing. One can hear multiple-choice and true-false questions castigated in nearly any teachers' lounge in the country on a daily basis, and they are lampooned with regular frequency in cartoon strips (perhaps the best "social indicator" of the pervasiveness of the choice-type item). In addition, professional journals and books in the education field routinely lambaste the alleged triviality, ambiguity, and assorted other evils of the choice-type question, not infrequently reaching a feverish pitch.

But at root the question of whether free-response and choice-type tests are measuring the same thing (trait, ability, level of knowledge) is an empirical one, not a philosophical or polemical one. The concepts and methodology to determine the equivalence of the two types of measures have been available for a little over 50 years and have, in fact, been applied in dozens of studies. In this review, we wish to "pull these studies together" to determine to what extent research has provided an answer to the question regarding the equivalence of the two types of measures.

In this review, attention will be focused almost exclusively on the measurement of the traditional product of education, viz. knowledge. No reference at all is made to studies in the realm of affect or personality. Furthermore, we have chosen to exclude the communication skills of reading, writing and speaking, not because these skills fall outside our concern for the products of education, but because they seem to present sufficiently unique cases to warrant coverage in separate reviews. Finally, we limit the review to empirical studies of the equivalence of free-response and choice-type tests; no attempt will be made to review strictly rhetorical analyses of the question.

Before concluding this introductory section, it might be noted that the question under review is of considerably more than academic interest. The question has substantial financial implications. It turns out that in most, though not necessarily all, instances which involve large-scale testing, it is much less expensive to use choice-type items than to use free-response items, due to differences in scoring costs. Although detailed cost comparisons would have to be made within the context of a specific project, it would not be unusual for an assessment endeavor depending heavily upon free-response measures to cost twice as much as a similar project depending heavily upon choice-type measures or, conversely, for the choice-oriented project to cost half what the free-response project would cost. When one contemplates an assessment project costing, say, $500,000, the importance of knowing whether the two types of measures are equivalent becomes poignantly clear.

METHODS OF STUDY

Although the methodologies employed in specific studies will be discussed as each study is introduced in subsequent sections, it will be convenient to outline first the major methods which have been used to study the relationship between free-response and choice-type measures.

There appear to be three basic methodologies in use to attack the problem. We shall refer to them as the direct correlation, the criterion correlation, and the treatment effect methods.

Direct Correlation. In the direct correlation method, the correlation (usually the Pearson) between a free-response and choice-type measure is determined; most frequently, the correlation is corrected for attenuation (see next section). If the corrected correlation approaches a value of 1.00, it is concluded that the two types of tests are measuring the same trait, variable, ability or skill. If the corrected correlation departs substantially from unity, obviously one concludes that the two types of tests are measuring somewhat different things, and the authors usually express a preference for one or the other type of measure based on criteria other than what is being measured. Such other criteria include reliability, breadth of content sampling, efficiency, face validity, examinee preference, and effect on students' study habits, as well as many other matters.

The direct correlation approach has been, by far and away, the most frequently used methodology in this area. It is simple to apply and yields data that are relatively easy to interpret. However, interpretation of results from studies using this approach is subject to some personal inclinations. For example, a correlation between free-response and choice-type tests, corrected for attenuation, of .90 in one person's book is high enough to warrant the conclusion that the two tests are measuring the same thing for all practical purposes, while in another's book it is low enough to show that the two tests are not measuring precisely the same thing.

Criterion Correlation. In the criterion correlation approach, both free-response and choice-type tests are correlated with some external criterion which is taken to be, in some sense, a better measure or precisely the measure of the variable of interest. The type of test which yields the higher correlation with the criterion is considered the better measure.

Note that in this approach one assumes at the outset that the two types of tests (free-response and choice-type) are probably measuring somewhat different things, the only question being which yields the better approximation to the external criterion.

The correction for attenuation may also be applied in this approach, although the correction is usually made for unreliability in the test only, not in the criterion. Use of the correction for attenuation in this approach may present a potentially thorny problem of interpretation. Let us say that F represents a free-response measure, C a choice-type measure, and X a criterion; r(FF') = .50, r(CC') = .90, r(XF) = .40, and r(XC) = .50. At this point, the choice-type test is better because it correlates more highly with the criterion. However, when the correlations between the tests and the criterion are corrected for unreliability in the tests, the free-response test becomes better, i.e. correlates more highly with the criterion (r'(XF) = .57, r'(XC) = .53). But is it reasonable to suppose that the free-response test can be made substantially more reliable than it already is? Although there is no simple solution to this type of problem, we should at least be aware of the difficulty as we review studies of this type.
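The arithmetic behind these corrected values can be made concrete with a short sketch (ours, in Python; the coefficients are the hypothetical ones from the example above, not data from any study reviewed here). Note that only the test's own reliability enters the denominator, since the criterion is left uncorrected:

```python
import math

def corrected_for_test_only(r_xt, r_tt):
    """Disattenuate a test-criterion correlation for unreliability in the
    test alone, leaving the criterion uncorrected: r' = r / sqrt(r_tt)."""
    return r_xt / math.sqrt(r_tt)

# Hypothetical coefficients from the example above.
r_FF, r_CC = 0.50, 0.90  # reliabilities of free-response (F) and choice-type (C) tests
r_XF, r_XC = 0.40, 0.50  # raw correlations of each test with the criterion X

print(round(corrected_for_test_only(r_XF, r_FF), 2))  # 0.57 -- free-response now ahead
print(round(corrected_for_test_only(r_XC, r_CC), 2))  # 0.53
```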

A special problem encountered in the criterion correlation approach is that of "criterion contamination," in which one of the measures being investigated (free-response or choice-type) directly or indirectly affects status on the criterion variable. (The problem of criterion contamination has long been discussed in clinical research on test validity. See Anastasi (1976) for a general treatment of the topic.) When the contamination is direct, e.g. when the free-response measure being studied is an essay test used for a final exam which will contribute 50% to the final grade which will serve as the criterion, then the problem is usually recognized, although rarely does the author attempt to disentangle the effects of the contamination. Potentially more hazardous for clear interpretation of results is indirect contamination, in which certain irrelevant sources of variance affect status on both the criterion and one of the measures being investigated, even though the latter does not directly enter into the criterion. For example, final grade in a course may be determined by 10 quizzes, all of the short-answer essay variety; then, at the end of the course and not entering into the determination of the grade, we obtain an essay and a multiple-choice measure of knowledge of course content and correlate scores on these measures with final grade. If we are willing to grant that the criterion itself is not perfect, i.e. not the best possible measure of knowledge of course content because it is affected to some extent by abilities peculiar to taking essay tests (e.g. ability to bluff, snow, etc.) which, in turn, also influence status on the essay test being investigated, then we have a case of indirect criterion contamination. Obviously, it is quite difficult to unravel the influence of all such indirect contaminants but, again, we should at least be aware of this problem.

Treatment Effect. A third possible methodology is to apply some treatment to a group, the effect of which should be to increase scores on some trait or ability, then determine which of several measures is most sensitive in detecting the intended change. The measure which detects the largest change is considered the best. Use of this technique prescinds entirely from any assumption about psychometric equivalence of the measures being studied, but does have considerable intuitive appeal in educational contexts since education may be thought of as the application of a treatment.

Let us provide a practical example. A group of 200 students is divided by random means into two subgroups of 100 each. One subgroup, call it the treatment group, is taught economics one hour per day for two weeks, at the end of which both groups take an essay test and a multiple-choice test on economics. Which test better distinguishes the treatment group from the control group? Or are the two tests equally sensitive to the group difference?
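One simple way to frame such a comparison of sensitivity is as a standardized mean difference for each test. The sketch below (ours, in Python, with fabricated scores standing in for real data) illustrates the idea:

```python
import random
import statistics

def effect_size_d(treated, control):
    """Standardized mean difference (Cohen's d): how far apart the
    treatment and control groups are, in pooled-SD units, on one test."""
    n_t, n_c = len(treated), len(control)
    pooled_var = ((n_t - 1) * statistics.variance(treated) +
                  (n_c - 1) * statistics.variance(control)) / (n_t + n_c - 2)
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_var ** 0.5

random.seed(0)
# Fabricated scores: 100 treated and 100 control examinees per test format.
essay_treated = [random.gauss(62, 10) for _ in range(100)]
essay_control = [random.gauss(55, 10) for _ in range(100)]
mc_treated    = [random.gauss(64, 10) for _ in range(100)]
mc_control    = [random.gauss(55, 10) for _ in range(100)]

# The test showing the larger d is the more sensitive measure.
print("essay d:", round(effect_size_d(essay_treated, essay_control), 2))
print("m-c   d:", round(effect_size_d(mc_treated, mc_control), 2))
```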

Quite obviously, the two tests could be measuring very different things and still show equal differences between the groups, or differences favoring either one or the other type of test.

Actually, this third methodology could be considered as a special case of the criterion correlation method in which the criterion is considered a dichotomous variable (treatment = 1, control = 0), with results being expressed as a biserial correlation. However, it will be more convenient to treat the two methods separately. It might be noted that the treatment methodology clearly skirts the issue of criterion contamination: assignment to the treatment group in no way depends on any test-taking ability, either directly or indirectly.

Given the usual uses of educational tests, the "treatment effect" methodology, while lacking psychometric precision, has a certain intuitive appeal. It is surprising, therefore, that it has been used in only a few studies.

Other Methodological Issues. Although the great majority of studies to be considered later employ one of the three methods just described, there are, as one might expect, a number of other methodological issues of a general nature which merit comment. First and foremost, it should be noted that while each of the three basic methods just described has a simple, direct relevance to the problem in question, they all seem to lack any high-powered theoretical underpinnings. This difficulty has been attacked rather recently by Lord with his discussions of tau-equivalent measurements (Lord, 1971; Lord and Novick, 1968). The exposition does not yet seem complete, but a few studies employing basic notions from this line of reasoning have appeared, e.g. Traub and Fisher (1977) and Ward et al. (1980). There appears to be much unfinished work in this area. Perhaps of special importance is the investigation of equivalence of measures for various subpopulations, since by definition tau-equivalent measures must show equivalence within all subpopulations (Lord and Novick, 1968).

Only one study could be identified which treated this issue directly (Longstreth, 1978) and one which treated it indirectly (Peters and Martz, 1931). An appallingly large proportion of the research on the relationship between free-response and choice-type tests is based on the proverbial "college sophomores in general psychology class" or their look-alikes, with not the least hint that conclusions from the study might be limited to this rather unusual segment of humanity. Educational researchers have apparently never outgrown their belief in the infinite generalizability of results from such subgroups: one of the most recent studies in our subject domain (Gay, 1980) is based on two groups of 14 students each in one introductory educational research course.

Finally, we note that on the issue of the effect of different testing methodologies on students' study habits and student preference for different types of tests, the evidence is generally "soft," being based mostly on students' self-reports. However, a few studies have attempted to investigate these matters with more sophisticated techniques (Sax and Collet, 1968; Gay, 1980).

THE CORRECTION FOR ATTENUATION

The correlation between two measures, r(XY), is limited, lessened, or "attenuated" by the imperfect reliability of the two measures, r(XX') and r(YY'). If reliabilities of the two measures are known, the correlation between the measures may be "corrected for attenuation" or "corrected for unreliability" by use of the formula r(XY) / √[r(XX') · r(YY')], yielding an estimate of the correlation between true scores on X and Y. As Lord and Novick (1968, p. 69) point out, the attenuation problem is of "fundamental importance" and "... is one which first motivated the development of test theory as a distinct discipline."
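The correction itself amounts to a one-line computation; a minimal sketch (ours, in Python, with illustrative numbers) follows:

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    """Estimate the correlation between true scores on X and Y:
    r'(XY) = r(XY) / sqrt(r(XX') * r(YY'))."""
    return r_xy / math.sqrt(r_xx * r_yy)

# e.g. an observed correlation of .60 between measures whose
# reliabilities are .80 and .70:
print(round(disattenuate(0.60, 0.80, 0.70), 2))  # 0.8
```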

As might be expected, the correction for attenuation is used prolifically in the literature on the relationship between free-response and choice-type measures. Indeed, it was precisely this problem which provided one of the early applications for the attenuation formula, as well as for correlation methodology in general; many authors of the 1920 era applied the correction with the delight of a new-found toy, although some authors then and even today seemed oblivious to the formula itself as well as to the underlying notion that imperfect reliability of a measure affects its relationship with other measures.

It is assumed that readers of this paper have some familiarity with the theoretical justification for and application of the correction for attenuation; hence, we do not intend to provide a review of the issue here, beyond the brief outline given above and the two special problems in its application noted below. Readers wishing more information about the correction are referred to such standard sources as Guilford (1954), Gulliksen (1950), Lord and Novick (1968), and Stanley (1971).

Applications of the correction for attenuation often run afoul of one basic assumption underlying the rationale for the correction, viz. that the determination of r(XY) and of the two reliability coefficients, r(XX') and r(YY'), be subject to the same sources of error variance. Specifically, it often happens that the reliability coefficients are affected by fewer sources of unreliability than is r(XY). For example, the X and Y measures, which in the particular application of interest in this paper ordinarily represent a free-response measure and a choice-type measure, not only differ in format but also are usually obtained at different times (say two weeks apart) and sometimes under different motivational circumstances (one of the measures being a real final exam, the other being an experimental measure). Hence, a variety of sources of variation may be affecting r(XY), in addition to the difference in test formats.

In contrast, for purposes of applying the correction for attenuation in the latter situation, the study's author may calculate the odd-even reliability of the choice-type measure and the scorer reliability of the free-response measure, the resulting reliability coefficients being (probably) substantially higher than appropriate for use in the attenuation formula, and the resulting corrected r(XY) being lower than it ought to be.

To comment extensively on the appropriate application of the correction for attenuation for each study taken up in later sections would be unwieldy, but the reader should at least be forewarned of the problem; and we will comment on some apparently egregious misapplications of the formula.

A second problem to which we should be alert is caused by attempts to equate the X and Y measures in terms of content, thus avoiding specific content as one source of variation. In a surprising number of studies, this problem is "solved" by using precisely the same questions (test items) in free-response and choice-type formats. Invariably, the free-response format is presented first, followed by one or more choice-type formats, all using the same item stems. While these item stems are physically the same when they are seen by examinees for a second, third, or fourth time, it is difficult to believe that the stems are the same from time to time in terms of the psychological and experiential make-up of the examinees. When one has been asked four times in succession over a period of several weeks, "In what year did Columbus discover America?" is the only difference in content the fact that the question is followed by a fill-in blank the first time and three choices of dates the fourth time? Only one study which used this type of design commented on the odd situation in which examinees are placed: Traub and Fisher (1977, p. 360) note "... the difficulty that is encountered in sustaining student motivation when tests are administered repeatedly," but they neglect to speculate about how this obvious problem--which contributed to reducing their number of usable cases by half--might have affected their conclusions.

We might note that both of the problems just reviewed (undercorrecting for attenuation and repeated use of identical content with an unknown effect upon examinees) would tend to reduce the reported degree of correspondence between free-response and choice-type measures. With respect to the first problem, it is sometimes possible, by reference to information from outside a particular study, to make an intelligent guess about the magnitude of the undercorrection. For example, multiple-choice tests (in the cognitive domain) with split-half reliabilities of .90 usually have alternate-form reliabilities of about .85; so, if a split-half reliability was used in applying the correction for attenuation, and it seems more appropriate to use an alternate-form reliability, it may be possible to recalculate the correction. (To illustrate with hypothetical figures: an observed correlation of .55, corrected using reliabilities of .60 and .90, yields r' = .75; recalculated with .85 in place of .90, it yields r' = .77.) With respect to the second problem, we do not know of any method to estimate its effect; furthermore, its effect may vary substantially from one study to another. Contrasted with these latter problems besetting the interpretation of a disattenuated correlation (r'), discussions of the theoretically most defensible method for testing whether r' differs significantly from 1.00 (e.g. Forsyth and Feldt, 1970; Lord, 1957; McNemar, 1958) seem to pale by comparison. In the vast majority of instances, it seems r' can be interpreted only in a rather rough-and-tumble fashion.

THE EARLY YEARS

Approximately two-thirds (some 40 studies) of all published research on the equivalence of free-response and choice-type measures was conducted in the 1920's and 1930's. By the end of this era, there were some well-formulated generalizations about the equivalence of the two types of measures; these generalizations were passed along in the textbooks on tests and measurements, but gradually references to the original studies began to cease, so that eventually--even up to the present--textbook recommendations began to sound more like doctrinaire nostrums than empirically based conclusions.

The 1940's, '50's, and '60's saw relatively few published studies on the relationship between free-response and choice-type tests. The 1970's, especially the latter part of the decade, and on into the early '80's, witnessed a renewed interest in the issue, with some new twists to both the formulation of the question and the methodology employed for investigating it. In addition, recent years have seen the emergence of machine scoring technology, which has provided a somewhat novel flavor to the topic, while leaving its substance unaffected. In the light of this rather peculiar historical pattern of research on our topic, it will be convenient to divide the studies into those of the early years (the '20's and '30's) and the later years (1940 to the present). And it will be helpful to introduce the studies from the early years with a brief historical sketch.

Historical Background for Early Studies. Retracing the history of investigations on the relationship between free-response and choice-type measures involves a nostalgic trip through a nether world of education in which professors "regarded" (scored, graded) their students' papers; instructors, in addition to asking the most vaguely worded essay questions, could fire off two hundred items the likes of "Name all the states which border on Kansas" and "What mythological beauty was the cause of the Trojan war?" with nary a twinge of conscience about slighting higher mental processes; and woe betide the student who didn't correctly answer 80% of such questions when his paper was regarded.

What, in this long-gone world, motivated the emergence of what was then called the "new type" test, i.e. the choice-type test? It is supposed by many that choice-type tests are the product of machine scoring demands. Nothing could be more absurd: machine scoring, even in its most primitive state, did not even exist until the mid-1930's and did not become widely available until the mid-1950's (Baker, 1971; DuBois, 1968). Others suppose that the new type test was developed as a result of some passion to engage in mass, large-scale testing. Equally wrong.

It is, of course, true that development of the first objectively scored intelligence tests, i.e. the various Otis tests, was motivated by a need and/or desire for mass testing (DuBois, 1968; Robertson, 1972). However, the early literature on educational testing is remarkably free from any reference to the need or desire for mass testing. Indeed, even references to the work of Otis were rather rare in the literature on educational uses of the new type achievement test, although actual use of the Otis tests, it was apparent from these same sources, was widespread.

The key concern in early investigations of the relationship between free-response and choice-type tests was that of reliability. Odell (1928, pp. 5-6) states the concern succinctly:

    Undoubtedly the chief cause contributing to raise the question
    [the best form of examinations] was the publication of the results
    from a number of investigations which showed, or appeared to show,
    great unreliability and variability of the marks given examination
    papers by teachers. Prominent in making the studies referred to were
    Johnson, Starch and Elliott, Kelley, and Dearborn. Their work, and
    also that of others along this line, is too well known and has
    produced too similar results to justify detailed accounts of the
    various studies here.

Odell goes on to illustrate some results from the Starch and Elliott reports, which appeared in 1912 and 1913 (without ever giving exact references). This practice of referring to the "well-known fact that ..." or "the many studies showing that ..." traditional, free-response examinations were unreliably scored, without citing particular studies, is rampant in the literature of the day. We may assume that the problem was widely discussed in professional meetings and the informal literature of that period.

It is important to note in this connection that the earliest studies of interest in our review were not concerned primar
