Static And Dataflow Analysis - Electrical Engineering And Computer Science

Transcription

Static andDataflowAnalysis(two part lecture)

The Story So Far Quality assurance is critical to softwareengineering.Testing is the most common dynamic approachto QA. But: race conditions, information flow, profiling Code review and code inspection are the mostcommon static approaches to QA.What other static analyses are commonly usedand how do they work?2

One-Slide Summary Static analysis is the systematic examinationof an abstraction of program state space withrespect to a property. Static analyses reasonabout all possible executions but they areconservative.Dataflow analysis is a popular approach tostatic analysis. It tracks a few broad values(“secret information” vs. “publicinformation”) rather than exact information. Itcan be computed in terms of a local transferof information.3

Fundamental Concepts Abstraction Capture semantically-relevant details Elide other details Handle “I don't know”: think about developersPrograms As Data Programs are just trees, graphs or stringsAnd we know how to analyze and manipulate those(e.g., visit every node in a graph)4

goto fail;5

“Unimportant” SSL Examplestatic t *ctx, bool isRsa,SSLBuffer signedParams,uint8 t *signature,UInt16 signatureLen) {OSStatus err; if ((err SSLHashSHA1.update(&hashCtx, &serverRandom)) ! 0)goto fail;if ((err SSLHashSHA1.update(&hashCtx, &signedParams)) ! 0)goto fail;goto fail;if ((err SSLHashSHA1.final(&hashCtx, &hashOut)) ! 0)goto fail; ashCtx);return err;}6

Linux Driver Example/* from Linux 2.3.99 drivers/block/raid5.c */static struct buffer head *get free buffer(struct stripe head * sh,int b size) {struct buffer head *bh;unsigned long flags;save flags(flags);cli(); // disables interruptsif ((bh sh- buffer pool) NULL)return NULL;sh- buffer pool bh - b next;bh- b size b size;restore flags(flags); // enables interruptsreturn bh;}7

Could We Have Found Them? How often would those bugs trigger? Linux example: What happens if you return from a device driverwith interrupts disabled?Consider: that's just one function in a 2,000 LOC file in a 60,000 LOC module in the Linux kernel Some defects are very difficult to find viatesting or manual inspection8

9

Many Interesting Defects are on uncommon or difficult-to-exerciseexecution paths Thus it is hard to find them via testingExecuting or dynamically analyzing all pathsconcretely to find such defects is not feasibleWe want to learn about “all possible runs” ofthe program for particular properties Without actually running the program! Bonus: we don't need test cases!10

Static Analyses Often Focus On Defects that result from inconsistentlyfollowing simple, mechanical design rules Security: buffer overruns, input validation Memory safety: null pointers, initialized data Resource leaks: memory, OS resources API Protocols: device drivers, GUI frameworks Exceptions: arithmetic, library, user-defined Encapsulation: internal data, private functions Data races (again!): two threads, one variable11

How And Where Should We Focus?12

Static Analysis Static analysis is the systematic examinationof an abstraction of program state space An abstraction is a selective representation ofthe program that is simpler to analyze Static analyses do not execute the program!Abstractions have fewer states to exploreAnalyses check if a particular property holds Liveness: “some good thing eventually happens” Safety: “some bad thing never happens”13

Syntactic Analysis Example Find every instance of this pattern:public foo() { logger.debug(“We have ” conn “connections.”);}public foo() { if (logger.inDebug()) {logger.debug(“We have ” conn “connections.”);}} What could go wrong? First attempt:grep logger\.debug -r source dir14

Abstraction: Abstract Syntax Tree An AST is a treerepresentation of thesyntactic structure ofsource code Parsers convertconcrete syntax intoabstract syntaxExample: 5 (2 3) 5 23Records onlysemantically-relevantinformation Abstracts away (, etc.15

Programs As Data “grep” approach: treat program as string AST approach: treat program as tree The notion of treating a program as data isfundamentalIt relates to the notion of a Universal TuringMachine. A Turing Machine description (finitestate controller, initial tape) can itself berepresented as a string (and thus placed on atape as input to another TM)16

Dataflow Analysis Dataflow analysis is a technique for gatheringinformation about the possible set of valuescalculated at various points in a programWe first abstract the program to an AST or CFGWe then abstract what we want to learn (e.g.,to help developers) down to a small set ofvaluesWe finally give rules for computing thoseabstract values Dataflow analyses take programs as input17

Two Exemplar Analyses Definite Null Dereference “Whenever execution reaches *ptr at programlocation L, ptr will be NULL”Potential Secure Information Leak “We read in a secret string at location L, but thereis a possible future public use of it”18

Discussion These analyses are not trivial to check“Whenever execution reaches” “all paths” includes paths around loops and through branchesof conditionalsWe will use (global) dataflow analysis to learnabout the program Global an analysis of the entire method body, notjust one { block }19

Analysis ExampleIs ptr always null when it is dereferenced?ptr new AVL();if (B 0)ptr 0;X 2 * 3;print(ptr- data);20

Correctness (Cont.)To determine that a use of x is always null, we mustknow this correctness condition:On every path to the use of x, the lastassignment to x is x : 0 **21

Analysis Example RevisitedIs ptr always null when it is dereferenced?ptr new AVL();if (B 0)ptr 0;X 2 * 3;print(ptr- data);22

Static Datalfow AnalysisStatic dataflow analyses share several traits: The analysis depends on knowing a property P at aparticular point in program executionProving P at any point requires knowledge of theentire method bodyProperty P is typically undecidable!23

Undecidabilityof Program Properties Rice’s Theorem: Most interesting dynamic properties ofa program are undecidable: Does the program halt on all (some) inputs? Is the result of a function F always positive? Assume we can answer this question preciselyOops: We can now solve the halting problem.Take function H and find out if it halts by testing functionF(x) { H(x); return 1; } to see if it has a positive resultContradiction!Syntactic properties are decidable! This is called the halting probleme.g., How many occurrences of “x” are there?Programs without looping are also decidable!24

Looping Almost every important program has a loop Often based on user inputAn algorithm always terminatesSo a dataflow analysis algorithm mustterminate even if the input program loopsThis is one source of imprecision Suppose you dereference the null pointer on the500th iteration but we only analyze 499 iterations25

Conservative Program Analyses We cannot tell for sure that ptr is always null So how can we carry out any sort of analysis?It is OK to be conservative.26

Conservative Program Analyses We cannot tell for sure that ptr is always null So how can we carry out any sort of analysis?It is OK to be conservative. If the analysis depends onwhether or not P is true, then want to know either– P is definitely true– Don’t know if P is true Let's call this truthiness27

Conservative Program Analyses It is always correct to say “don’t know” We try to say don’t know as rarely as possibleAll program analyses are conservativeMust think about your software engineering process Bug finding analysis for developers? They hate “falsepositives”, so if we don't know, stay silent. Bug finding analysis for airplane autopilot? Safety iscritical, so if we don't know, give a warning.28

Definitely Null AnalysisIs ptr always null when it is dereferenced?ptr new AVL();ptr 0;if (B 0)if (B 0)ptr 0;X 2 * 3;print(ptr- data);foo myAVL;ptr 0;print(ptr- data);29

Definitely Null AnalysisIs ptr always null when it is dereferenced?ptr new AVL();ptr 0;if (B 0)if (B 0)ptr 0;X 2 * 3;print(ptr- data);foo myAVL;ptr 0;print(ptr- data);30

Definitely Null AnalysisIs ptr always null when it is dereferenced?ptr new AVL();ptr 0;if (B 0)if (B 0)ptr 0;X 2 * 3;print(ptr- data);No, not always.foo myAVL;ptr 0;print(ptr- data);Yes, always.On every path to the use of ptr, thelast assignment to ptr is ptr : 0 **31

Definitely Null Information We can warn about definitely null pointers at anypoint where ** holdsConsider the case of computing ** for a singlevariable ptr at all program pointsValid points cannot hide!We will find you! (sometimes)32

Definitely Null Analysis (Cont.) To make the problem precise, we associateone of the following values with ptr at everyprogram point Recall: abstraction and propertyvalueinterpretation#This statement isnot reachablecX constant c*Don’t know if X is aconstant33

ExampleGet out a piece of paper. Let's fill in these blanks now.X : 3X X B 0X *X X Y : Z WY : 0X : 4X A : 2 * XX X Recall: # not reachable, c constant, * don't know.34

Example AnswersX : 3X 3X 3B 0X *X 3X 3Y : Z WY : 0X : 4X 4A : 2 * XX *X 3X *35

Real-World Languages The official language of Sri Lanka and Singapore isspoken by over 66 million and boasts a richliterature stretching back over 2000 years. Unlikemost Indian languages, it does not distinguishbetween aspirated and unaspirated consonants. Ituses suffices to mark number, case and verb tenseand uses a flexible S-O-V ordering. It usespostpositions rather than prepositions. Example: வணகககமக36

Fictional Magicians In Greek Mythology, thissorceress transforms herenemies into animals. InHomer's Odyssey she tangleswith Odysseus (who defeatsher magic); she ultimatelysuggests that he travelbetween Scylla and Charybdisto reach Ithaca.37

Psychology: Predictions You are asked to read about a conflict and aregiven two alternative ways of resolving it.You are then asked to do: Say which option you would pick Guess which option other people will pick Describe the attributes of a person who wouldchoose each of the two options(Actually, let's be more specific )38

Psychology: Prediction Would you be willing to walk around campusfor 30 minutes holding a sign that says “Eat atJoe's”? (No information about Joe's restaurant is provided,you are free to refuse, but we claim you will learn“something useful” from the study.)Would you do it?39

Psychology: False Consensus Effect Of those who agreed to carry the sign, 62%thought others would also agreeOf those who refused, 67% thought otherswould also refuseWe think others will do the same as us,regardless of what we actually do We make extreme predictions about thepersonalities of those who chose differently But choosing “like me” does not imply anything: it'scommon!“Must be something wrong with you!”40

Psychology: False Consensus Effect Replications with 200 college students, etc. [Kathleen Bauman, Glenn Geher. WE think you agree: the detrimentalimpact of the false consensus effect on behavior. J. CurrentPsychology, 2002, 21(4). ]Implications for SE: Myriad, whenever youdesign something someone else will use.Example: Do you think this static analysisshould report possible defects or certaindefects? By the way, what do you think themajority of our customers want?41

Using Abstract Information Given analysis information (and a policy aboutfalse positives/negatives), it is easy to decidewhether or not to issue a warning Simply inspect the x ? associated with a statementusing xIf x is the constant 0 at that point, issue a warning!But how can an algorithm compute x ?42

The IdeaThe analysis of a complicated program can beexpressed as a combination of simple rulesrelating the change in information betweenadjacent statements43

Explanation The idea is to “push” or “transfer” informationfrom one statement to the nextFor each statement s, we compute informationabout the value of x immediately before andafter sCin(x,s) value of x before sCout(x,s) value of x after s44

Transfer Functions Define a transfer function that transfersinformation from one statement to another45

Rule 1X ?x : cX cCout(x, x : c) c if c is a constant46

Rule 2X #sX #Cout(x, s) # if Cin(x, s) #Recall: # “unreachable code”47

Rule 3X ?x : f( )X *Cout(x, x : f( )) *This is a conservative approximation! It might be possibleto figure out that f(.) always returns 0, but we won't even try!48

Rule 4X ay : . . .X aCout(x, y : ) Cin(x, y : ) if x y49

The Other Half Rules 1-4 relate the in of a statement to the out ofthe same statement Now we need rules relating the out of one statementto the in of the successor statement they propagate information across statementsto propagate information forward along pathsIn the following rules, let statement s haveimmediate predecessor statements p1, ,pn50

Rule 5X ?X ?X *sX ?X *if Cout(x, pi) * for some i, then Cin(x, s) *51

Rule 6X cX ?X dsX ?X *if Cout(x, pi) c and Cout(x, pj) d and d cthen Cin (x, s) *52

Rule 7X cX #X csX #X cif Cout(x, pi) c or # for all i,then Cin(x, s) c53

Rule 8X #X #X #sX #X #if Cout(x, pi) # for all i,then Cin(x, s) #54

Static Analysis Algorithm For every entry s to the program, setCin(x, s) * Set Cin(x, s) Cout(x, s) # everywhere else Repeat until all points satisfy 1-8:Pick s not satisfying 1-8 and update using theappropriate rule55

The Value # To understand why we need #, look at a loopX : 3X 3X *X 3B 0X 3Y : Z WY : 0X 3A : 2 * XA B56

The Value # To understand why we need #, look at a loopX : 3X 3X *X 3B 0X 3Y : Z WY : 0X 3X ?A : 2 * XX ?X ?A B57

The Value # (Cont.) Because of cycles, all points must have values atall times during the analysisIntuitively, assigning some initial value allows theanalysis to break cyclesThe initial value # means “we have not yetanalyzed control reaching this point”58

Sometimesall pathslead to thesame place.Thus youneed #.59

Another ExampleX : 3B 0Y : Z WLet's do it on paper!Analyze the value of X Y : 0A : 2 * XX : 4A B60

Another Example: AnswerX : 3X #3X #B 03X #Y : Z WX #X *3Y : 03X #A : 2 * XX : 4A BX #43 *X #3 *X #4Must continueuntil all rulesare satisfied !61

OrderingsWe can simplify the presentation of theanalysis by ordering the values# c *Making a picture with “lower” values drawnlower, we get * -101 I am calleda lattice!#62

Orderings (Cont.) * is the greatest value, # is the least Let lub be the least-upper bound in this ordering All constants are in between and incomparablecf. “least common ancestor” in Java/C Rules 5-8 can be written using lub:Cin(x, s) lub { Cout(x, p) p is a predecessor of s }63

Termination Simply saying “repeat until nothing changes”doesn’t guarantee that eventually nothingchangesThe use of lub explains why the algorithmterminates Values start as # and only increase# can change to a constant, and a constant to *– Thus, C (x, s) can change at most twice64

Number CrunchingThe algorithm is polynomial in program size:Number of steps Number of C ( .) values changed * 2 2(Number of program statements) * 265

“Potential Secure Information Leak”AnalysisCould sensitive information possibly reach an insecure use?str : get password()If B 0str : sanitize(str)Y : 0display(str)In this example, the password contents canpotentially flow into a public display(depending on the value of B)66

Live and Dead The first value of x isdead (never used)X : 3The second value of x islive (may be used)X : 4Liveness is an importantconcept We can generalize it toreason about“potential secureinformation leaks”Y : X67

Sensitive InformationA variable x at stmt s is a possible sensitive (highsecurity) information leak if There exists a statement s’ that uses xThere is a path from s to s’That path has no intervening low-security assignmentto x68

Computing Potential Leaks We can express the high- or low-security status ofa variable in terms of information transferredbetween adjacent statements, just as in our“definitely null” analysisIn this formulation of security status we only careabout “high” (secret) or “low” (public), not theactual value We have abstracted away the valueThis time we will start at the public display ofinformation and work backwards69

Secure Information Flow Rule 1X truedisplay(x)X ?Hin(x, s) true if s displays x publiclytrue means “if this ends up being a secret variablethen we have a bug!”70

Secure Information Flow Rule 2X falsex : sanitize(x)X ?Hin(x, x : e) false(any subsequent use is safe)71

Secure Information Flow Rule 3X asX aHin(x, s) Hout(x, s) if s does not refer to x72

Secure Information Flow Rule 4pX ?X ?X trueX trueX ?Hout(x, p) { Hin(x, s) s a successor of p }(if there is even one way to potentially have a leak,we potentially have a leak!)73

Secure Information Flow Rule 5(Bonus!)Y ax : yX aHin(y, x : y) Hout(x, x : y)(To see why, imagine the next statement isdisplay(x). Do we care about y above?)74

Algorithm Let all H ( ) false initially Repeat process until all statements s satisfyrules 1-4 :Pick s where one of 1-4 does not hold and updateusing the appropriate rule75

Secure Information Flow ExampleX : passwd()X : sanitize(X)B 0H(X) falseH(X) falseH(X) falseH(X) falseY : Z WY : 0H(X) falseH(X) falseH(X) falsedisplay(X)X : passwd()A BH(X) falseH(X) falseH(X) falseH(X) false76

Secure Information Flow ExampleX : passwd()X : sanitize(X)B 0H(X) falseH(X) falseH(X) falseH(X) falseY : Z WY : 0H(X) falseH(X) falseH(X) falsedisplay(X)X : passwd()A BH(X) TRUEH(X) falseH(X) falseH(X) false77

Secure Information Flow ExampleX : passwd()X : sanitize(X)B 0H(X) TRUEH(X) falseH(X) TRUEH(X) TRUEY : Z WY : 0H(X) TRUEH(X) TRUEH(X) TRUEdisplay(X)X : passwd()A BH(X) TRUEH(X) TRUEH(X) TRUEH(X) TRUE78

Secure Information Flow ExampleX : passwd()leakelsib eresoNo p rting hStaX : sanitize(X)B 0H(X) TRUEH(X) TRUEH(X) TRUEY : Z WY : 0H(X) TRUEH(X) TRUEH(X) TRUEdisplay(X)X : passwd()AKELLE curityBISPOS high-se heregFrome startinvaluH(X) falseA BH(X) TRUEH(X) TRUEH(X) TRUEH(X) TRUE79

Termination A value can change from false to true, but notthe other way aroundEach value can change only once, so terminationis guaranteedOnce the analysis is computed, it is simple toissue a warning at a particular entry point forsensitive information80

Static Analysis Limitations Where might a static analysis go wrong?If I asked you to construct the shortestprogram you can that causes one of our staticanalyses to get the “wrong” answer, whatwould you do?81

Static Analysis Discuss with your neighbor; I will call on youYou are asked to design a static analysis todetect bugs related to file handles A file starts out closed. A call to open() makes it open;open() may only be called on closed files. read() andwrite() may only be called on open files. A call toclose() makes a file closed; close may only be calledon open files.Report if a file handle is potentially used incorrectly What abstract information do you track? What do your transfer functions look like?82

Abstract Information We will keep track of an abstract value for agiven file handle variableValues and Interpretations*file handle state is unknown#haven't reached here yetclosed file handle is closedopenfile handle is open83

Rules Previously: “null ptr” Now: “file handles”f closedptr 0read(f)*ptrReportError!ReportError!84

Rules: openf * or openf closedopen(f)f openopen(f)ReportError!85

Rules: closef * or closedf openclose(f)f closedclose(f)ReportError!86

Rules: read/write(write is identical)f * or closedf openread(f)f openread(f)ReportError!87

Rules: Assignmentf af ag : fg : ff ag a88

Rules: Multiple Possibilitiesf af #f bf af af *f a89

A Tricky Programstart:switch (a)case 1: open(f); read(f); close(f); goto startdefault: open(f);do {write(f) ;if (b): read(f);else: close(f);} while (b)open(f);close(f);90

lose(f)#close(f)###read(f)##close(f)#open(f)91

lose(f)#open(f)92

)open#closedread(f)open#close(f)#open(f)93

)open*closedread(f)open*close(f)#open(f)94

ad(f)**close(f)*open(f)95

ad(f)**close(f)*open(f)96

Is There Really A Bug?start:switch (a)case 1: open(f); read(f); close(f); goto startdefault: open(f);do {write(f) ;if (b): read(f);else: close(f);} while (b)open(f);close(f);97

Forward vs. Backward AnalysisWe’ve seen two kinds of analysis:Definitely null (cf. constant propagation) is aforwards analysis: information is pushed frominputs to outputsSecure information flow (cf. liveness) is abackwards analysis: information is pushed fromoutputs back towards inputs98

Questions? How's the homework going?99

Static analysis is the systematic examination of an abstraction of program state space with respect to a property. Static analyses reason about all possible executions but they are conservative. Dataflow analysis is a popular approach to static analysis. It tracks a few broad values ("secret information" vs. "public