A Comparison of Free-Response and Multiple-Choice Forms of Verbal Aptitude Tests

Transcription

A Comparison of Free-Response and Multiple-Choice Forms of Verbal Aptitude Tests

William C. Ward
Educational Testing Service

Three verbal item types employed in standardized aptitude tests were administered in four formats—a conventional multiple-choice format and three formats requiring the examinee to produce rather than simply to recognize correct answers. For two item types—Sentence Completion and Antonyms—the response format made no difference in the pattern of correlations among the tests. Only for a multiple-answer open-ended Analogies test were any systematic differences found; even the interpretation of these is uncertain, since they may result from the speededness of the test rather than from its response requirements. In contrast to several kinds of problem-solving tasks that have been studied, discrete verbal item types appear to measure essentially the same abilities regardless of the format in which the test is administered.

Tests in which an examinee must generate answers may require different abilities than do tests in which it is necessary only to choose among alternatives that are provided. A free-response test of behavioral science problem solving, for example, was found to have a very low correlation with a test employing similar problems presented in a machine-scorable (modified multiple-choice) format; it differed from the latter in its relations to a set of reference tests for cognitive factors (Ward, Frederiksen, & Carlson, 1980). Comparable differences were obtained between free-response and machine-scorable tests employing nontechnical problems, which were designed to simulate tasks required in making medical diagnoses (Frederiksen, Ward, Case, Carlson, & Samph, 1981).

There is also suggestive evidence that the use of free-response items could make a contribution in standardized admissions testing. The open-ended behavioral science problems were found to have some potential as predictors of the professional activities and accomplishments of first-year graduate students in psychology; the Graduate Record Examination Aptitude and Advanced Psychology tests are not good predictors of such achievements (Frederiksen & Ward, 1978).

Problem-solving tasks like these, however, provide very inefficient measurement. They require a large investment of examinee time to produce scores with acceptable reliability, and they yield complex responses, the evaluation of which is demanding and time consuming. It was the purpose of the present investigation to explore the effects of an open-ended format with item types like those used in conventional examinations. The content area chosen was verbal knowledge and verbal reasoning, as represented by three item types—Antonyms, Sentence Completions, and Analogies.

The selection of these item types has several bases. First, their relevance for aptitude assessment needs no special justification, given that they make up one-half of verbal ability tests such as the Graduate Record Examination and the Scholastic Aptitude Test (SAT). Thus, if it can be shown that recasting these item types into an open-ended format makes a substantial difference in the abilities they measure, a strong case will be made for the importance of the response format in the mix of items that enter into such tests. Second, such items produce reliable scores with relatively short tests. Finally, open-ended forms of these item types require only single-word or, in the case of Analogies, two-word answers. They should thus be easy to score, in comparison with free-response problems whose responses may be several sentences in length and may embody two or three ideas. Although not solving the difficulties inherent in the use of open-ended items in large-scale testing, therefore, they would to some extent reduce their magnitude.

Surprisingly, no published comparisons of free-response and multiple-choice forms of these item types are available. Several investigators have, however, examined the effects of response format on Synonyms items—items in which the examinee must choose or generate a word with essentially the same meaning as a given word (Heim & Watts, 1967; Traub & Fisher, 1977; Vernon, 1962). All found high correlations across formats, but only Traub and Fisher attempted to answer the question of whether the abilities measured in the two formats were identical or only related. They concluded that the test format does affect the attribute measured by a test and that there was evidence of a factor specific to open-ended verbal items. Unfortunately, they did not have scores on a sufficient variety of tests to provide an unambiguous test for the existence of such a verbal factor.

The present study was designed to allow a factor-analytic examination of the influence of response format. Each of three item types was presented in each of four formats, which varied in the degree to which they require of an examinee the production of answers. It was thus possible to examine the fit of the data to each of two "ideal" models of factor structure: one in which only item-type factors would be found, indicating that items of a given type measure essentially the same thing regardless of the format; and one involving only format factors, indicating that the response requirements of the task are of more importance than are differences in the kind of knowledge being tested.

Method

Description of the Tests

Three item types were employed. Antonyms (when given in the standard multiple-choice format) required the examinee to select the one of five words that was most nearly opposite in meaning to a given word. Sentence Completions required the identification of the one word which, when inserted into a blank space in a sentence, best fit the meaning of the sentence as a whole. Analogies, finally, asked for the selection of the pair of words expressing a relationship most similar to that expressed in a given pair.

Three formats in addition to the multiple-choice one were used. For Antonyms, for example, the "single-answer" format required the examinee to think of an opposite and to write that word in an answer space. The "multiple-answer" format was still more demanding: the examinee was to think of and write up to three different opposites for each word given. Finally, the "keylist" format required the examinee to think of an opposite, to locate this word in a 90-item alphabetized list, and to record its number on the answer sheet. This latter format was included as a machine-scorable surrogate for a truly free-response test.

With two exceptions, all open-ended items were ones requiring single-word answers. The exceptions were the single-answer and multiple-answer Analogies tests.

Here the examinee was to produce pairs of words having the same relationship to one another as that shown by the two words in the stem of the question.

Instructions for each test closely paraphrased those employed in the GRE Aptitude Test, except as dictated by the specific requirements of each format. With each set of instructions was given one sample question and a brief rationale for the answer or answers suggested. For the open-ended tests, two or three fully acceptable answers were provided for each sample question.

The tests varied somewhat in number of items and in time limits. Each multiple-choice test consisted of 20 items to be completed in 12 minutes. Slightly longer times (15 minutes) were allowed for forms including 20 single-answer or 20 keylist items. The multiple-answer format allowed still more time per item—15 minutes for 15 Antonyms or Analogies items or for 18 Sentence Completion items. On the basis of extensive pretesting, it was expected that these time limits would be sufficient to avoid problems of test speededness and that the number of items would be sufficient to yield scores with reliabilities on the order of .7.

The tests were presented in a randomized order, subject to the restriction that no two successive tests should involve either the same item type or the same response format. Four systematic variations of this order were employed to permit an examination of, and adjustment for, possible practice or fatigue effects. Each of four groups tested, comprising 51 to 60 subjects, received the tests in one of these sequences; the remainder of the sample, tested in groups of 30 to 40, were all given the tests in the first of the four orders.

Test Administration

Subjects were 315 paid volunteers from a state university. Somewhat more than two-thirds were juniors and seniors. The small number (13%) for whom GRE Aptitude Test scores were obtained were a somewhat select group, with means of 547 and 616 on the Verbal and Analytical scales, respectively. It appears that the sample is a somewhat more able one than college students in general but probably less select than the graduate school applicant pool.

Each student participated in one 4-hour test session. Included in the session were 12 tests representing all combinations of the three item types with the four response formats, and a brief questionnaire relating to the student's academic background, accomplishments, and interests.

Scoring. For each of the open-ended tests, scoring keys were developed that distinguished two degrees of appropriateness of an answer. Answers in one set were judged fully acceptable, while those in the second were of marginal appropriateness. An example of the latter would be an Antonyms response that identified the evaluation implied by a word but failed to capture an important nuance or the force of the evaluation. It was determined through a trial scoring that partial credits were unnecessary for two of the keylist tests—Antonyms and Analogies. Responses to the remaining tests were coded to permit computer generation of several different scores, depending on the credit to be given marginally acceptable answers.

Preliminary scorings were checked for accuracy by an examination of about 20% of the answer sheets. Most of the tests were then scored by a highly experienced clerk and checked by her supervisor. Two tests, however, presented more complex scoring problems. For both single-answer and multiple-answer Analogies, the scoring keys consisted of rationales and examples rather than a list of all possible answers. Many decisions therefore involved a substantial exercise of judgment. A trained research assistant scored each of these tests, and the author independently scored 25 answer sheets of each.

Total scores derived from the two scorings correlated .95 for one test and .97 for the other.

Results

Completeness of data. No instances were found in which subjects appeared not to take their task seriously. Three answer sheets were missing or spoiled; sample mean scores were substituted for these. On 32 occasions a subject failed to attempt at least half the items on a test, but no individual subject was responsible for more than two of these. It appeared that data from all subjects were of acceptable quality.

Score derivation. The three multiple-choice tests were scored using a standard correction for guessing: for a five-choice item, the score was the number correct minus one-fourth the number incorrect. Two of the keylist tests were simply scored for number correct. It would have been possible to treat those tests as 90-alternative multiple-choice tests and to apply the guessing correction, but the effect on the scores would have been of negligible magnitude.

For the remaining tests, scores were generated in several ways. In one, scoring credit was given only for answers deemed fully acceptable; in a second, the same credit was given to both fully and marginally acceptable answers; and in a third, marginal answers received half the credit given to fully acceptable ones. This third approach was found to yield slightly more reliable scores than either of the others and was therefore employed for all further analyses (a computational sketch of these scoring conventions is given below, following the reliability results).

Test order. Possible differences among groups receiving the tests in different orders were examined in two ways. One analysis was concerned with the level of performance; another considered the standard error of measurement, a statistic that combines information about both the standard deviation and the reliability of a test score and that indicates the precision of measurement. In neither case were there systematic differences associated with the order in which the tests were administered. Order was therefore ignored in all further analyses.

Test difficulty. Test means and standard deviations are shown in Table 1. Most of the tests were of middle difficulty for this sample; two of the keylist tests were easy, whereas multiple-choice Antonyms was very difficult. Means for the multiple-answer tests were low in relation to the maximum possible score but represent one to one-and-a-half fully acceptable answers per item.

Test speededness. Tests such as the GRE Aptitude Test are considered unspeeded if at least 75% of the examinees attempt all items and if virtually everyone attempts at least three-fourths of the items. By these criteria only one of the tests, multiple-answer Analogies, had any problems with speededness: About 75% of the sample reached the last item, but 14% failed to attempt the 12th item, which represents the three-fourths point. For all the remaining tests, 95% or more of the subjects reached at least all but the final two items. Table 1 shows the percent of the sample completing each test.

Test reliability. Reliabilities (coefficient alpha) are also shown in Table 1. They ranged from .45 to .80, with a median of .69. There were no differences in reliabilities associated with the response format of the test—the medians ranged from .68 for multiple-choice tests to .75 for multiple-answer forms. There were differences associated with item type; medians were .75 for Antonyms, .71 for Sentence Completions, and .58 for Analogies. The least reliable of all the tests was the multiple-choice Analogies. The differences apparently represent somewhat less success in creating good Analogies items rather than any differences inherent in the open-ended formats.
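The scoring rules and reliability statistics described above are standard computations. The following is a minimal Python sketch of them, assuming a hypothetical examinee-by-item score matrix; the function names, the random data, and the 20-item five-choice test are illustrative and are not taken from the study.

```python
import numpy as np

def corrected_for_guessing(n_correct, n_incorrect, n_choices=5):
    """Standard correction for guessing: rights minus wrongs/(choices - 1)."""
    return n_correct - n_incorrect / (n_choices - 1)

def open_ended_score(n_full, n_marginal, marginal_credit=0.5):
    """Full credit for fully acceptable answers, half credit for marginal ones."""
    return n_full + marginal_credit * n_marginal

def coefficient_alpha(item_scores):
    """Cronbach's alpha for an examinees-by-items score matrix."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_var = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

def standard_error_of_measurement(sd, reliability):
    """SEM combines the standard deviation and the reliability of a score."""
    return sd * np.sqrt(1 - reliability)

# Illustrative use on random 0/1 item responses for a 20-item test
# (independent random items give alpha near zero; real item data would be correlated)
rng = np.random.default_rng(0)
items = rng.integers(0, 2, size=(315, 20))
alpha = coefficient_alpha(items)
totals = items.sum(axis=1)
print(corrected_for_guessing(n_correct=14, n_incorrect=6))
print(open_ended_score(n_full=12, n_marginal=4))
print(alpha, standard_error_of_measurement(totals.std(ddof=1), alpha))
```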
Correlations among the tests. Zero-order correlations among the 12 tests are shown in the upper part of Table 2. The correlations range from .29 to .69, with a median of .53. The seven lowest coefficients in the table, the only ones below .40, are correlations involving the multiple-answer Analogies test.

[Table 1. Descriptive Statistics for Tests]

[Table 2. Zero-Order and Attenuated Correlations Among Tests. Note: Zero-order correlations are presented above the main diagonal, while correlations corrected for attenuation are presented below. Decimal points omitted.]

Correlations corrected for attenuation are shown in the lower part of the table; the correction is based on coefficient alpha reliabilities. These correlations range from .45 to .97 and have a median of .80. The coefficients indicate that the various tests share a substantial part of their true variance, but they do not permit a conclusion as to whether there are systematic differences among the tests. Three analyses that address this question are presented below.

Factor analyses. A preliminary principal components analysis produced the set of eigenvalues displayed in Table 3. The first component was very large, accounting for 57% of the total variance, while the next largest accounted for only 7% of the variance. By one rule of thumb for the number of factors, that of the number of eigenvalues greater than one, there is only a single factor represented in these results. By another, that of differences in the magnitude of successive eigenvalues, there is some evidence for a second factor but none at all for more than two.

[Table 3. Principal Components of the Correlation Matrix]

It was originally intended to use a confirmatory factor analytic approach to the analysis (Jöreskog, 1970) in order to contrast two idealized models of test relations—one involving three item-type factors and one involving four response-format factors. In view of the results of the principal components analysis, however, either of these models would clearly be a distortion of the data. It was decided, therefore, to use an exploratory factor analysis, which could be followed by confirmatory analyses comparing simpler models if such a comparison seemed warranted from the results. The analysis was a principal axes factor analysis with iterated communalities.

A varimax (orthogonal) rotation of the two-factor solution produced unsatisfactory results—10 of the 12 scores had appreciable loadings on both factors. The results of the oblimin (oblique) rotation for two factors are presented in Table 4. The two factors were correlated (r = .67). Ten of the 12 scores had their highest loading on Factor I, one (single-answer Analogies) divided about equally between the two, and only one (multiple-answer Analogies) had its highest loading on the second factor. For two item types, Sentence Completion and Antonyms, these results leave no ambiguity as to the effects of response format: The use of an open-ended format makes no difference in the attribute the test measures. The interpretation for the Analogies tests is less clear. The second factor is small (just under 5% of the common factor variance), and it is poorly defined, with only one test having its primary loading on that factor. Moreover, the one test that did load heavily on Factor II was also the only test in the battery that was at all speeded. There is a reasonable interpretation of Factor II as a speed factor (Donlon, 1980); the rank-order correlation between Factor II loadings and the number of subjects failing to attempt the last item of a test was .80 (p < .01).
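For concreteness, the two computations underlying this part of the analysis—the correction of a correlation for attenuation and the extraction of principal-component eigenvalues from a correlation matrix—can be sketched briefly in Python. The data below are simulated stand-ins (one shared factor, reliabilities fixed at .70), not values from the study.

```python
import numpy as np

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for unreliability in both measures."""
    return r_xy / np.sqrt(rel_x * rel_y)

def pc_eigenvalues(R):
    """Eigenvalues of a correlation matrix, largest first (principal components)."""
    return np.sort(np.linalg.eigvalsh(R))[::-1]

# Toy data: 12 scores sharing one common factor, assumed reliabilities of .70
rng = np.random.default_rng(1)
X = rng.normal(size=(315, 12)) + rng.normal(size=(315, 1))   # shared factor added to noise
R = np.corrcoef(X, rowvar=False)
alphas = np.full(12, 0.70)
R_corrected = disattenuate(R, alphas[:, None], alphas[None, :])
print(R[0, 1], R_corrected[0, 1])    # observed vs. disattenuated correlation
print(pc_eigenvalues(R)[:3])         # a dominant first eigenvalue, smaller remainder
```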

[Table 4. Factor Pattern for the Two-Factor Solution]

Factor analyses were also performed taking into account the academic level of the student. The sample included two groups large enough to be considered for separate analyses—seniors (N = 75) and juniors (N = 141). For each group a one-factor solution was indicated. A combined analysis was also carried out after adjusting for mean and variance differences in the data for the two groups. The eigenvalues suggested either a one- or a two-factor solution; in the two-factor solution, however, all tests had their highest loading on the first factor, and only multiple-answer Analogies showed an appreciable division of its variance between the two factors. Thus, there was no strong evidence for the existence of a format factor in the data. There were weak indications that the multiple-answer Analogies and, to a much lesser extent, the single-answer Analogies provided somewhat distinct measurement from the remainder of the tests. In any case, the evidence is clear that the Sentence Completion and Antonyms item types measure the same attribute regardless of the format in which the item is administered.

Multitrait-multimethod analysis. The data may also be considered within the framework provided by multitrait-multimethod analysis (Campbell & Fiske, 1959). Each of the three item types constitutes a "trait," while each of the four response formats constitutes a "method." The data were analyzed following a scheme suggested by and Werts (1966): all the correlations relevant for each comparison were corrected for attenuation and then averaged using Fisher's r-to-z transformation. Results are summarized in Table 5.

Each row in the upper part of the table provides the average of all those correlations that represent relations for a given item type as measured in different formats and the average of all those correlations that represent relations between that item type and other item types when the two tests employed different response formats. Thus, for the Sentence Completion item type, the entry in the first column is an average of all six correlations among Sentence Completion scores from the four formats. The entry in the second column is an average of 24 correlations: for each of the four Sentence Completion scores, the six correlations representing relations to each item type other than Sentence Completion in each of three formats. The lower part of the table is organized similarly; it compares, for each response format, a set of average correlations within format with those between formats for all test pairs involving different item types.
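The averaging scheme just described can be expressed in a few lines. This is a sketch with made-up correlation values rather than the study's data, showing the Fisher r-to-z averaging for one hypothetical item type.

```python
import numpy as np

def fisher_z_average(rs):
    """Average a set of correlations via Fisher's r-to-z transformation."""
    z = np.arctanh(np.asarray(rs, dtype=float))
    return np.tanh(z.mean())

# Hypothetical disattenuated correlations for one item type ("trait"):
# the 6 same-trait/different-format rs, and a few of the 24 cross-trait rs
same_trait_diff_format = [0.85, 0.80, 0.78, 0.90, 0.82, 0.79]
diff_trait_diff_format = [0.70, 0.65, 0.72, 0.68]

print(fisher_z_average(same_trait_diff_format),
      fisher_z_average(diff_trait_diff_format))
```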

[Table 5. Multitrait-Multimethod Summary of Average Correlations. *By Mann-Whitney U test, the two entries in a row are significantly different at the 5% level of confidence.]

Results in the upper part of the table show that there was some trait variance associated with both the Sentence Completion and Antonyms item types (by Mann-Whitney U test, p < .05). Analogies tests did not, however, relate to one another any more strongly than they related to tests of other item types.

The lower part of the table shows differences attributable to response format. There is an apparent tendency toward stronger relations among multiple-choice tests than those tests have with tests in other formats, but this tendency did not approach significance. For the truly open-ended response formats there were no differences whatsoever. Like the factor analyses, this approach to correlational comparison showed no tendency for open-ended tests to cluster according to response format; to the slight degree that any differences were found, they represented clustering on the basis of the item type rather than the response format employed in a test.

Correlations corrected for "alternate forms" reliabilities. The multitrait-multimethod correlational comparison made use of internal consistency reliability coefficients to correct correlations for their unreliability. Several interesting comparisons can also be made using a surrogate for alternate forms reliability coefficients. The battery, of course, contained only one instance of each item-type by response-format combination, so that no true alternate-form examinations could be made. It may be reasonable, however, to consider the two truly open-ended forms of a test—multiple-answer and single-answer—as two forms of the same test given under "open" conditions, and the two remaining forms—multiple-choice and keylist—as two forms of the same test given under "closed" conditions. On this assumption, relations across open and closed formats for a given item type can be estimated by the average of the four relevant correlations, corrected for the reliabilities represented by the correlations within open and within closed formats.

The corrected correlations were .97 for Sentence Completion, .88 for Analogies, and 1.05 for Antonyms. It appears that relations across the two kinds of formats did not differ from 1.0, except for error in the data, for two of the item types. Analogies tests may fail to share some of their reliable variance across open and closed formats but still appear to share most of it.
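A minimal sketch of this "alternate forms" correction, under the stated assumption that the two open formats and the two closed formats behave as parallel forms; the correlation values below are placeholders rather than the study's data.

```python
import numpy as np

def cross_format_corrected(open_closed_rs, r_within_open, r_within_closed):
    """Average open-vs-closed correlation for one item type, corrected using
    the within-open and within-closed correlations as surrogate alternate-forms
    reliabilities. Values slightly above 1.0 can occur, reflecting sampling error."""
    mean_r = np.mean(open_closed_rs)
    return mean_r / np.sqrt(r_within_open * r_within_closed)

# Placeholder values for a single item type (four open-vs-closed correlations)
print(cross_format_corrected([0.62, 0.58, 0.60, 0.65],
                             r_within_open=0.66, r_within_closed=0.61))
```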

Correlations with Other Variables

Students completed a questionnaire dealing with their academic background, accomplishments, and interests. Included were questions concerning (1) plans for graduate school attendance and advanced degrees, (2) undergraduate grade-point average overall and in the major field of study, (3) preferred career activities, (4) self-assessed skills and competencies within the major field, and (5) independent activities and accomplishments within the current academic year. Correlations were obtained between the questionnaire variables and scores on the 12 verbal tests.

Most of the correlations were very low. Only four of the questions produced a correlation with any test as high as .20; these were level of degree planned, self-reported grade-point average (both overall and for the major field of study), and the choice of writing as the individual's single most preferred professional activity. No systematic differences in correlations associated with item type or response format were evident.

Information was also available on the student's gender and year in school. No significant correlations with gender were obtained. Advanced students tended to obtain higher test scores, with no evidence of differences among the tests in the magnitude of the relations.

GRE Aptitude Test scores were available for a small number of students (N = 41). Correlations with the GRE Verbal score were substantial in magnitude, ranging from .50 to .74 with a median of .59. Correlations with the GRE Quantitative and Analytical scores were lower but still appreciable, having medians of .36 and .47, respectively. Here also there were no systematic differences associated with item types or test formats. These results, like the analyses of correlations among the experimental tests, suggest that response format has little effect on the nature of the attributes measured by the item types under examination.

Discussion

This study has shown that it is possible to develop open-ended forms of several verbal aptitude item types that are approximately as good, in terms of score reliability, as multiple-choice items and that require only slightly greater time limits than do the conventional items. These open-ended items, however, provide little new information. There was no evidence whatsoever for a general factor associated with the use of a free-response format. There was strong evidence against any difference in the abilities measured by Antonyms or Sentence Completion items as a function of the response format of the task. Only Analogies presented some ambiguity in interpretation, and there is some reason to suspect that that difference should be attributed to the slight speededness of the multiple-answer Analogies test employed.

It is clear that an open-ended response format was not in itself sufficient to determine what these tests measured. Neither the requirement to generate and write a single response, nor the more difficult task of producing several different answers to an item, could alone change the abilities that were important for successful performance.

What, then, are the characteristics of an item that will measure different attributes depending on the response format employed? A comparison of the present tests with those employed in the earlier problem-solving research of Ward et al. (1980) and Frederiksen et al. (1981) suggests a number of possibilities. In the problem-solving work, subjects had to read and to comprehend passages containing a number of items of information relevant to a problem. They were required to determine the relevance of such information for themselves and often to apply reasoning and inference to draw conclusions from several items of information.
Moreover, they needed to draw on information not presented—specialized knowledge concerning the design and interpretation of research studies, for the behavioral science problems, and more general knowledge obtained from everyday life experiences, for the nontechnical problems. Finally, subjects composed responses that often entailed relating several complex ideas to one another.

The verbal aptitude items, in contrast, are much more self-contained. The examinee has to deal with the meaning of one word, of a pair of words, or at most of the elements of a short sentence. In a sense, the statement of the problem includes a specification of what information is relevant for a solution and of what kind of solution is appropriate. Thus, the verbal tests might be described as "well-structured" and the problem-solving tests as "ill-structured" problems (Simon, 1973). The verbal tests also, of course, require less complex responses—a single word or, at most, a pair of words.

Determining which of these features are critical in distinguishing tests in which an open-ended format makes a difference will require comparing a number of different item types in multiple-choice and free-response formats. It will be of particular interest to develop item types that eliminate the confounding of complexity in the information search required by a problem with complexity in the response that is to be produced.

For those concerned with standardized aptitude testing, the present results indicate that one important component of existing tests amounts to sampling from a broader range of possible test questions than had previously been demonstrated. The discrete verbal item types presently employed by the GRE and other testing programs appear to suffer no lack of generality because of exclusive use of a multiple-choice format; for these item types at least, use of open-ended questions would not lead to measurement of a noticeably different ability cutting across the three item types examined here. It remains to be seen whether a similar statement can be made about other kinds of questions employed in the standardized tests and whether there are ways in which items that will tap "creative" or "divergent thinking" abilities can be presented so as to be feasible for inclusion in large-scale testing programs.

References

Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.

Donlon, T. F. An exploratory study of the implications of test speededness (GRE Board Professional Report GREB No. 76-9P). Princeton, NJ: Educational Testing Service, 1980.
