THE DESIGN OF EXPERIMENTS - Jas Sekhon, UC Berkeley

Transcription

THE DESIGN OF EXPERIMENTS

Chapters 1 and 2 from The Design of Experiments by Ronald A. Fisher. 8th Edition, 1966.

I INTRODUCTION

1. The Grounds on which Evidence is Disputed

WHEN any scientific conclusion is supposed to be proved on experimental evidence, critics who still refuse to accept the conclusion are accustomed to take one of two lines of attack. They may claim that the interpretation of the experiment is faulty, that the results reported are not in fact those which should have been expected had the conclusion drawn been justified, or that they might equally well have arisen had the conclusion drawn been false. Such criticisms of interpretation are usually treated as falling within the domain of statistics. They are often made by professed statisticians against the work of others whom they regard as ignorant of or incompetent in statistical technique; and, since the interpretation of any considerable body of data is likely to involve computations, it is natural enough that questions involving the logical implications of the results of the arithmetical processes employed, should be relegated to the statistician. At least I make no complaint of this convention. The statistician cannot evade the responsibility for understanding the processes he applies or recommends. My immediate point is that the questions involved can be dissociated from all that is strictly technical in the statistician's craft, and, when so detached, are questions only of the right use of human reasoning powers, with which all intelligent people, who hope to be intelligible, are equally concerned, and on which the statistician, as such, speaks with no special authority. The statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference, but equally no other thinking man can avoid a like obligation.

The other type of criticism to which experimental results are exposed is that the experiment itself was ill designed, or, of course, badly executed. If we suppose that the experimenter did what he intended to do, both of these points come down to the question of the design, or the logical structure of the experiment. This type of criticism is usually made by what I might call a heavyweight authority. Prolonged experience, or at least the long possession of a scientific reputation, is almost a pre-requisite for developing successfully this line of attack. Technical details are seldom in evidence. The authoritative assertion "His controls are totally inadequate" must have temporarily discredited many a promising line of work; and such an authoritarian method of judgment must surely continue, human nature being what it is, so long as theoretical notions of the principles of experimental design are lacking, notions just as clear and explicit as we are accustomed to apply to technical details.

Now the essential point is that the two sorts of criticism I have mentioned are aimed only at different aspects of the same whole, although they are usually delivered by different sorts of people and in very different language. If the design of an experiment is faulty, any method of interpretation which makes it out to be decisive must be faulty too.
It is true that there are a great many experimental procedures which are well designed in that they may lead to decisive conclusions, but on other occasions may fail to do so; in such cases, if decisive conclusions are in fact drawn when they are unjustified, we may say that the fault is wholly in the interpretation, not in the design. But the fault of interpretation, even in these cases, lies in overlooking the characteristic features of the design which lead to the result being sometimes inconclusive, or conclusive on some questions but not on all. To understand correctly the one aspect of the problem is to understand the other. Statistical procedure and experimental design are only two different aspects of the same whole, and that whole comprises all the logical requirements of the complete process of adding to natural knowledge by experimentation.

2. The Mathematical Attitude towards Induction

In the foregoing paragraphs the subject-matter of this book has been regarded from the point of view of an experimenter, who wishes to carry out his work competently, and having done so wishes to safeguard his results, so far as they are validly established, from ignorant criticism by different sorts of superior persons. I have assumed, as the experimenter always does assume, that it is possible to draw valid inferences from the results of experimentation; that it is possible to argue from consequences to causes, from observations to hypotheses; as a statistician would say, from a sample to the population from which the sample was drawn, or, as a logician might put it, from the particular to the general. It is, however, certain that many mathematicians, if pressed on the point, would say that it is not possible rigorously to argue from the particular to the general; that all such arguments must involve some sort of guesswork, which they might admit to be plausible guesswork, but the rationale of which, they would be unwilling, as mathematicians, to discuss. We may at once admit that any inference from the particular to the general must be attended with some degree of uncertainty, but this is not the same as to admit that such inference cannot be absolutely rigorous, for the nature and degree of the uncertainty may itself be capable of rigorous expression. In the theory of probability, as developed in its application to games of chance, we have the classic example proving this possibility. If the gamblers' apparatus are really true or unbiased, the probabilities of the different possible events, or combinations of events, can be inferred by a rigorous deductive argument, although the outcome of any particular game is recognised to be uncertain. The mere fact that inductive inferences are uncertain cannot, therefore, be accepted as precluding perfectly rigorous and unequivocal inference.

Naturally, writers on probability have made determined efforts to include the problem of inductive inference within the ambit of the theory of mathematical probability, developed in discussing deductive problems arising in games of chance. To illustrate how much was at one time thought to have been achieved in this way, I may quote a very lucid statement by Augustus de Morgan, published in 1838, in the preface to his essay on probabilities in The Cabinet Cyclopædia. At this period confidence in the theory of inverse probability, as it was called, had reached, under the influence of Laplace, its highest point. Boole's criticisms had not yet been made, nor the more decided rejection of the theory by Venn, Chrystal, and later writers.
De Morgan is speaking of the advances in the theory which were leading to its wider application to practical problems.

"There was also another circumstance which stood in the way of the first investigators, namely, the not having considered, or, at least, not having discovered the method of reasoning from the happening of an event to the probability of one or another cause. The questions treated in the third chapter of this work could not therefore be attempted by them. Given an hypothesis presenting the necessity of one or another out of a certain, and not very large, number of consequences, they could determine the chance that any given one or other of those consequences should arrive; but given an event as having happened, and which might have been the consequence of either of several different causes, or explicable by either of several different hypotheses, they could not infer the probability with which the happening of the event should cause the different hypotheses to be viewed. But, just as in natural philosophy the selection of an hypothesis by means of observed facts is always preliminary to any attempt at deductive discovery; so in the application of the notion of probability to the actual affairs of life, the process of reasoning from observed events to their most probable antecedents must go before the direct use of any such antecedent, cause, hypothesis, or whatever it may be correctly termed. These two obstacles, therefore, the mathematical difficulty, and the want of an inverse method, prevented the science from extending its views beyond problems of that simple nature which games of chance present."

Referring to the inverse method, he later adds:

"This was first used by the Rev. T. Bayes, and the author, though now almost forgotten, deserves the most honourable remembrance from all who treat the history of this science."

3. The Rejection of Inverse Probability

Whatever may have been true in 1838, it is certainly not true to-day that Thomas Bayes is almost forgotten. That he seems to have been the first man in Europe to have seen the importance of developing an exact and quantitative theory of inductive reasoning; of arguing from observational facts to the theories which might explain them, is surely a sufficient claim to a place in the history of science. But he deserves honourable remembrance for one fact, also, in addition to those mentioned by de Morgan. Having perceived the problem and devised an axiom which, if its truth were granted, would bring inverse inferences within the scope of the theory of mathematical probability, he was sufficiently critical of its validity to try to avoid the axiomatic approach, and, perhaps for the same reason, to withhold his entire treatise from publication until his doubts should have been satisfied. In the event, the work was published after his death by his friend, Price, and we cannot say what views he ultimately held on the subject.

The discrepancy of opinion among historical writers on probability is so great that to mention the subject is unavoidable. It would, however, be out of place here to argue the point in detail. I will only state three considerations which will explain why, in the practical applications of the subject, I shall not assume the truth of Bayes' axiom. Two of these reasons would, I think, be generally admitted, but the first, I can well imagine, might be indignantly repudiated in some quarters. The first is this: The axiom leads to apparent mathematical contradictions.
In explaining these contradictions away, advocates of inverse probability seem forced to regard mathematical probability, not as an objective quantity measured by observable frequencies, but as measuring merely psychological tendencies, theorems respecting which are useless for scientific purposes.

My second reason is that it is the nature of an axiom that its truth should be apparent to any rational mind which fully apprehends its meaning. The axiom of Bayes has certainly been fully apprehended by a good many rational minds, including that of its author, without carrying this conviction of necessary truth. This, alone, shows that it cannot be accepted as the axiomatic basis of a rigorous argument.

My third reason is that inverse probability has been only very rarely used in the justification of conclusions from experimental facts, although the theory has been widely taught, and is widespread in the literature of probability. Whatever the reasons are which give experimenters confidence that they can draw valid conclusions from their results, they seem to act just as powerfully whether the experimenter has heard of the theory of inverse probability or not.

4. The Logic of the Laboratory

In fact, in the course of this book, I propose to consider a number of different types of experimentation, with especial reference to their logical structure, and to show that when the appropriate precautions are taken to make this structure complete, entirely valid inferences may be drawn from them, without using the disputed axiom. If this can be done, we shall, in the course of studies having directly practical aims, have overcome the theoretical difficulty of inductive inferences.

Inductive inference is the only process known to us by which essentially new knowledge comes into the world. To make clear the authentic conditions of its validity is the kind of contribution to the intellectual development of mankind which we should expect experimental science would ultimately supply. Men have always been capable of some mental processes of the kind we call "learning by experience." Doubtless this experience was often a very imperfect basis, and the reasoning processes used in interpreting it were very insecure; but there must have been in these processes a sort of embryology of knowledge, by which new knowledge was gradually produced. Experimental observations are only experience carefully planned in advance, and designed to form a secure basis of new knowledge; that is, they are systematically related to the body of knowledge already acquired, and the results are deliberately observed, and put on record accurately. As the art of experimentation advances the principles should become clear by virtue of which this planning and designing achieve their purpose.

It is as well to remember in this connection that the principles and method of even deductive reasoning were probably unknown for several thousand years after the establishment of prosperous and cultured civilisations. We take a knowledge of these principles for granted, only because geometry is universally taught in schools. The method and material taught is essentially that of Euclid's text-book of the third century B.C., and no one can make any progress in that subject without thoroughly familiarising his mind with the requirements of a precise deductive argument. Assuming the axioms, the body of their logical consequences is built up systematically and without ambiguity. Yet it is certainly something of an accident historically that this particular discipline should have become fashionable in the Greek Universities, and later embodied in the curricula of secondary education. It would be difficult to overstate how much the liberty of human thought has owed to this fortunate circumstance.
Since Euclid's time there have been very long periods during which the right of unfettered individual judgment has been successfully denied in legal, moral, and historical questions, but in which it has, none the less, survived, so far as purely deductive reasoning is concerned, within the shelter of apparently harmless mathematical studies.

The liberation of the human intellect must, however, remain incomplete so long as it is free only to work out the consequences of a prescribed body of dogmatic data, and is denied the access to unsuspected truths, which only direct observation can give. The development of experimental science has therefore done much more than to multiply the technical competence of mankind; and if, in these introductory lines, I have seemed to wander far from the immediate purpose of this book, it is only because the two topics with which we shall be concerned, the arts of experimental design and of the valid interpretation of experimental results, in so far as they can be technically perfected, must constitute the core of this claim to the exercise of full intellectual liberty.

The chapters which follow are designed to illustrate the principles which are common to all experimentation, by means of examples chosen for the simplicity with which these principles are brought out. Next, to exhibit the principal designs which have been found successful in that field of experimentation, namely agriculture, in which questions of design have been most thoroughly studied, and to illustrate their applicability to other fields of work. Many of the most useful designs are extremely simple, and these deserve the greatest attention, as showing in what ways, and on what occasions, greater elaboration may be advantageous. The careful reader should be able to satisfy himself not only, in detail, why some experiments have a complex structure, but also how a complex observational record may be handled with intelligibility and precision.

The subject is a new one, and in many ways the most that the author can hope is to suggest possible lines of attack on the problems with which others are confronted. Progress in recent years has been rapid, and the few sections devoted to the subject in the author's Statistical Methods for Research Workers, first published in 1925, have, with each succeeding edition, come to appear more and more inadequate. On purely statistical questions the reader must be referred to that book; on logic, and the analysis of meaning, to Statistical Methods and Scientific Inference. The present volume is an attempt to do more thorough justice to the problems of planning and foresight with which the experimenter is confronted.

REFERENCES AND OTHER READING

T. BAYES (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society, liii. 370.
A. DE MORGAN (1838). An essay on probabilities and on their application to life contingencies and insurance offices. Preface, vi. Longman & Co.
R. A. FISHER (1930). Inverse probability. Proceedings of the Cambridge Philosophical Society, xxvi. 528-535.
R. A. FISHER (1932). Inverse probability and the use of likelihood. Proceedings of the Cambridge Philosophical Society, xxviii. 257-261.
R. A. FISHER (1935). The logic of inductive inference. Journal Royal Statistical Society, xcviii. 39-54.
R. A. FISHER (1936). Uncertain inference. Proceedings of the American Academy of Arts and Sciences, 71. 245-258.
R. A. FISHER (1925-1963). Statistical methods for research workers. Oliver and Boyd Ltd., Edinburgh.
R. A. FISHER (1956, 1959). Statistical methods and scientific inference. Oliver and Boyd Ltd., Edinburgh.

II THE PRINCIPLES OF EXPERIMENTATION, ILLUSTRATED BY A PSYCHO-PHYSICAL EXPERIMENT

5. Statement of Experiment

A LADY declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. For this purpose let us first lay down a simple form of experiment with a view to studying its limitations and its characteristics, both those which appear to be essential to the experimental method, when well developed, and those which are not essential but auxiliary.

Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. The subject has been told in advance of what the test will consist, namely that she will be asked to taste eight cups, that these shall be four of each kind, and that they shall be presented to her in a random order, that is in an order not determined arbitrarily by human choice, but by the actual manipulation of the physical apparatus used in games of chance, cards, dice, roulettes, etc., or, more expeditiously, from a published collection of random sampling numbers purporting to give the actual results of such manipulation. Her task is to divide the 8 cups into two sets of 4, agreeing, if possible, with the treatments received.
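
Editorial illustration (not part of Fisher's text): the random order of presentation described above can be sketched in a few lines of Python. The text calls for physical apparatus or a published table of random sampling numbers; the software pseudo-random generator used here is a modern stand-in and is an assumption, as are the function name, the treatment labels and the seed.

```python
import random


def random_presentation_order(n_each=4, rng=None):
    """Return the cups in a random order of presentation.

    n_each cups are prepared 'milk first' and n_each 'tea first'.
    The labels and the use of a software generator are illustrative
    assumptions; the text specifies cards, dice or random number tables.
    """
    if rng is None:
        rng = random.Random()
    cups = ["milk first"] * n_each + ["tea first"] * n_each
    rng.shuffle(cups)  # the order is left to chance, not to human choice
    return cups


# A fixed seed (arbitrary) only makes the illustration repeatable.
order = random_presentation_order(rng=random.Random(1935))
for position, treatment in enumerate(order, start=1):
    print(position, treatment)
```

Whatever order is drawn, four cups of each kind are presented, which is the condition the subject has been told of in advance.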

6. Interpretation and its Reasoned Basis

In considering the appropriateness of any proposed experimental design, it is always needful to forecast all possible results of the experiment, and to have decided without ambiguity what interpretation shall be placed upon each one of them. Further, we must know by what argument this interpretation is to be sustained. In the present instance we may argue as follows. There are 70 ways of choosing a group of 4 objects out of 8. This may be demonstrated by an argument familiar to students of "permutations and combinations," namely, that if we were to choose the 4 objects in succession we should have successively 8, 7, 6, 5 objects to choose from, and could make our succession of choices in 8 × 7 × 6 × 5, or 1680 ways. But in doing this we have not only chosen every possible set of 4, but every possible set in every possible order; and since 4 objects can be arranged in order in 4 × 3 × 2 × 1, or 24 ways, we may find the number of possible choices by dividing 1680 by 24. The result, 70, is essential to our interpretation of the experiment. At best the subject can judge rightly with every cup and, knowing that 4 are of each kind, this amounts to choosing, out of the 70 sets of 4 which might be chosen, that particular one which is correct. A subject without any faculty of discrimination would in fact divide the 8 cups correctly into two sets of 4 in one trial out of 70, or, more properly, with a frequency which would approach 1 in 70 more and more nearly the more often the test were repeated. Evidently this frequency, with which unfailing success would be achieved by a person lacking altogether the faculty under test, is calculable from the number of cups used. The odds could be made much higher by enlarging the experiment, while if the experiment were much smaller even the greatest possible success would give odds so low that the result might, with considerable probability, be ascribed to chance.

7. The Test of Significance

It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him. Thus, if he wishes to ignore results having probabilities as high as 1 in 20 (the probabilities being of course reckoned from the hypothesis that the phenomenon to be demonstrated is in fact absent), then it would be useless for him to experiment with only 3 cups of tea of each kind. For 3 objects can be chosen out of 6 in only 20 ways, and therefore complete success in the test would be achieved without sensory discrimination, i.e. by "pure chance," in an average of 5 trials out of 100. It is usual and convenient for experimenters to take 5 per cent. as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results. No such selection can eliminate the whole of the possible effects of chance coincidence, and if we accept this convenient convention, and agree that an event which would occur by chance only once in 70 trials is decidedly "significant," in the statistical sense, we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the "one chance in a million" will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.

Returning to the possible results of the psycho-physical experiment, having decided that if every cup were rightly classified a significant positive result would be recorded, or, in other words, that we should admit that the lady had made good her claim, what should be our conclusion if, for each kind of cup, her judgments are 3 right and 1 wrong? We may take it, in the present discussion, that any error in one set of judgments will be compensated by an error in the other, since it is known to the subject that there are 4 cups of each kind. In enumerating the number of ways of choosing 4 things out of 8, such that 3 are right and 1 wrong, we may note that the 3 right may be chosen, out of the 4 available, in 4 ways and, independently of this choice, that the 1 wrong may be chosen, out of the 4 available, also in 4 ways. So that in all we could make a selection of the kind supposed in 16 different ways. A similar argument shows that, in each kind of judgment, 2 may be right and 2 wrong in 36 ways, 1 right and 3 wrong in 16 ways and none right and 4 wrong in 1 way only. It should be noted that the frequencies of these five possible results of the experiment make up together, as it is obvious they should, the 70 cases out of 70.

It is obvious, too, that 3 successes to 1 failure, although showing a bias, or deviation, in the right direction, could not be judged as statistically significant evidence of a real sensory discrimination.
For its frequency of chance occurrence is 16 in 70, or more than 20 per cent. Moreover, it is not the best possible result, and in judging of its significance we must take account not only of its own frequency, but also of the frequency of any better result. In the present instance "3 right and 1 wrong" occurs 16 times, and "4 right" occurs once in 70 trials, making 17 cases out of 70 as good as or better than that observed. The reason for including cases better than that observed becomes obvious on considering what our conclusions would have been had the case of 3 right and 1 wrong only 1 chance, and the case of 4 right 16 chances of occurrence out of 70. The rare case of 3 right and 1 wrong could not be judged significant merely because it was rare, seeing that a higher degree of success would frequently have been scored by mere chance.

8. The Null Hypothesis

Our examination of the possible results of the experiment has therefore led us to a statistical test of significance, by which these results are divided into two classes with opposed interpretations. Tests of significance are of many different kinds, which need not be considered here. Here we are only concerned with the fact that the easy calculation in permutations which we encountered, and which gave us our test of significance, stands for something present in every possible experimental arrangement; or, at least, for something required in its interpretation. The two classes of results which are distinguished by our test of significance are, on the one hand, those which show a significant discrepancy from a certain hypothesis; namely, in this case, the hypothesis that the judgments given are in no way influenced by the order in which the ingredients have been added; and on the other hand, results which show no significant discrepancy from this hypothesis. This hypothesis, which may or may not be impugned by the result of an experiment, is again characteristic of all experimentation. Much confusion would often be avoided if it were explicitly formulated when the experiment is designed. In relation to any experiment we may speak of this hypothesis as the "null hypothesis," and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.

It might be argued that if an experiment can disprove the hypothesis that the subject possesses no sensory discrimination between two different sorts of object, it must therefore be able to prove the opposite hypothesis, that she can make some such discrimination. But this last hypothesis, however reasonable or true it may be, is ineligible as a null hypothesis to be tested by experiment, because it is inexact. If it were asserted that the subject would never be wrong in her judgments we should again have an exact hypothesis, and it is easy to see that this hypothesis could be disproved by a single failure, but could never be proved by any finite amount of experimentation. It is evident that the null hypothesis must be exact, that is free from vagueness and ambiguity, because it must supply the basis of the "problem of distribution," of which the test of significance is the solution.
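
Editorial illustration (not part of Fisher's text): the counting argument and the test of significance described above can be checked numerically. The following is a minimal sketch in Python; the function names ways_right and chance_at_least are hypothetical, and the calculation assumes only the exact null hypothesis stated in the text, under which every division of the cups into two sets is equally frequent.

```python
from fractions import Fraction
from math import comb, factorial


def ways_right(k, n_each=4):
    """Number of selections with k cups right in each kind.

    Choose k of the n_each cups truly of one kind, and the remaining
    n_each - k from the n_each cups of the other kind.
    """
    return comb(n_each, k) * comb(n_each, n_each - k)


def chance_at_least(k, n_each=4):
    """Chance, under the null hypothesis, of a result as good as or better than k right."""
    total = comb(2 * n_each, n_each)
    favourable = sum(ways_right(j, n_each) for j in range(k, n_each + 1))
    return Fraction(favourable, total)


# 8 x 7 x 6 x 5 ordered choices, divided by the 4 x 3 x 2 x 1 orders of each set.
total_sets = (8 * 7 * 6 * 5) // factorial(4)
counts = [ways_right(k) for k in range(5)]  # none right up to all 4 right

print(total_sets)                     # 70
print(counts)                         # [1, 16, 36, 16, 1]
print(chance_at_least(4))             # 1/70
print(chance_at_least(3))             # 17/70
print(chance_at_least(3, n_each=3))   # 1/20
```

On these figures only the result "4 right" falls below the 5 per cent. level with eight cups, while with three cups of each kind even complete success has a chance of exactly 1 in 20.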
A null hypothesis may, indeed, contain arbitrary elements, and in more complicated cases often does so: as, for example, if it should assert that the death-rates of two groups of animals are equal, without specifying what these death-rates actually are. In such cases it is evidently the equality rather than any particular values of the death-rates that the experiment is designed to test, and possibly to disprove.

In cases involving statistical "estimation" these ideas may be extended to the simultaneous consideration of a series of hypothetical possibilities. The notion of an error of the so-called "second kind," due to accepting the null hypothesis "when it is false," may then be given a meaning in reference to the quantity to be estimated. It has no meaning with respect to simple tests of significance, in which the only available expectations are those which flow from the null hypothesis being true. Problems of the more elaborate type involving estimation are discussed in Chapter IX.

9. Randomisation; the Physical Basis of the Validity of the Test

We have spoken of the experiment as testing a certain null hypothesis, namely, in this case, that the subject possesses no sensory discrimination whatever of the kind claimed; we have, too, assigned as appropriate to this hypothesis a certain frequency distribution of occurrences, based on the equal frequency of the 70 possible ways of assigning 8 objects to two classes of 4 each; in other words, the frequency distribution appropriate to a classification by pure chance. We have now to examine the physical conditions of the experimental technique needed to justify the assumption that, if discrimination of the kind under test is absent, the result of the experiment will be wholly governed by the laws of chance. It is easy to see that it might well be otherwise. If all those cups made with the milk first had sugar added, while those made with the tea first had none, a very obvious difference in flavour would have been introduced which
