Towards A Universal Code Formatter Through Machine Learning

Transcription

Terence Parr
University of San Francisco, USA
parrt@cs.usfca.edu

Jurgen Vinju
Centrum Wiskunde & Informatica, Netherlands
Jurgen.Vinju@cwi.nl

Abstract

There are many declarative frameworks that allow us to implement code formatters relatively easily for any specific language, but constructing them is cumbersome. The first problem is that "everybody" wants to format their code differently, leading to either many formatter variants or a ridiculous number of configuration options. Second, the size of each implementation scales with a language's grammar size, leading to hundreds of rules. In this paper, we solve the formatter construction problem using a novel approach, one that automatically derives formatters for any given language without intervention from a language expert. We introduce a code formatter called CODEBUFF that uses machine learning to abstract formatting rules from a representative corpus, using a carefully designed feature set. Our experiments on Java, SQL, and ANTLR grammars show that CODEBUFF is efficient, has excellent accuracy, and is grammar invariant for a given language. It also generalizes to a 4th language tested during manuscript preparation.

Categories and Subject Descriptors: D.2.3 [Software Engineering]: Coding - Pretty printers

Keywords: Formatting algorithms, pretty-printer

1. Introduction

The way source code is formatted has a significant impact on its comprehensibility [9], and manually reformatting code is just not an option [8, p.399]. Therefore, programmers need ready access to automatic code formatters or "pretty printers" in situations where formatting is messy or inconsistent. Many program generators also take advantage of code formatters to improve the quality of their output.

Because the value of a particular code formatting style is a subjective notion, often leading to heated discussions, formatters must be highly configurable. This allows, for example, current maintainers of existing code to improve their effectiveness by reformatting the code per their preferred style. There are plenty of configurable formatters for existing languages, whether in IDEs like Eclipse or standalone tools like Gnu indent, but specifying style is not easy. The emergent behavior is not always obvious, there exists interdependency between options, and the tools cannot take context information into account [13]. For example, here are the options needed to obtain K&R C style with indent:

    -nbad -bap -bbo -nbc -br -brs -c33 -cd33 -ncdb -ce
    -ci4 -cli0 -cp33 -cs -d0 -di1 -nfc1 -nfca -hnl -i4
    -ip0 -l75 -lp -npcs -nprs -npsl -saf -sai -saw -nsc
    -nsob -nss

New languages pop into existence all the time and each one could use a formatter. Unfortunately, building a formatter is difficult and tedious. Most formatters used in practice are ad hoc, language-specific programs but there are formal approaches that yield good results with less effort. Rule-based formatting systems let programmers specify phrase-formatting pairs, such as the following sample specification for formatting the COBOL MOVE statement using ASF+SDF [3, 12, 13, 15].

    MOVE IdOrLit TO Id-list =
      from-box( H [ "MOVE"
                    H ts 25 [to-box(IdOrLit)]
                    H ts 49 ["TO"]
                    H ts 53 [to-box(Id-list)] ])

This rule maps a parse tree pattern to a box expression. A set of such rules, complemented with default behavior for the unspecified parts, generates a single formatter with a specific style for the given language. Section 6 has other related work.

There are a number of problems with rule-based formatters.
First, each specification yields a formatter for one specific style. Each new style requires a change to those rules or the creation of a new set. Some systems allow the rules to be parametrized, and configured accordingly, but that leads to higher rule complexity. Second, minimal changes to the associated grammar usually require changes to the formatting rules, even if the grammar changes do not affect the language recognized.

Finally, formatter specifications are big. Although most specification systems have builtin heuristics for default behavior in the absence of a specification for a given language phrase, specification size tends to grow with the grammar size. A few hundred rules are no exception.

Formatting is a problem solved in theory, but not yet in practice. Building a good code formatter is still too difficult and requires way too much work; we need a fresh approach. In this paper, we introduce a tool called CODEBUFF [11] that uses machine learning to produce a formatter entirely from a grammar for language L and a representative corpus written in L. There is no specification work needed from the user other than to ensure reasonable formatting consistency within the corpus. The statistical model used by CODEBUFF first learns the formatting rules from the corpus, which are then applied to format other documents in the same style. Different corpora effectively result in different formatters. From a user perspective the formatter is "configured by example."

Contributions and roadmap. We begin by showing sample CODEBUFF output in Section 2 and then explain how and why CODEBUFF works in Section 3. Section 4 provides empirical evidence that CODEBUFF learns a formatting style quickly and using very few files. CODEBUFF approximates the corpus style with high accuracy for the languages ANTLR, Java and SQL, and it is largely insensitive to language-preserving grammar changes. To adjust for possible selection bias and model overfitting to these three well-known languages, we tested CODEBUFF on an unfamiliar language (Quorum) in Section 5, from which we learned that CODEBUFF works similarly well, yet improvements are still possible. We position CODEBUFF with respect to the literature in Section 6.

2. Sample Formatting

This section contains sample SQL, Java, and ANTLR code formatted by CODEBUFF, including some that are poorly formatted to give a balanced presentation. Only the formatting style matters here so we use a small font for space reasons. Github [11] has a snapshot of all input corpora and formatted versions (corpora, testing details in Section 4). To arrive at the formatted output for document d in corpus D, our test rig removes all whitespace tokens from d and then applies an instance of CODEBUFF trained on the corpus without d, D \ {d}.

The examples are not meant to illustrate "good style." They are simply consistent with the style of a specific corpus. In Section 4 we define a metric to measure the success of the automated formatter in an objective and reproducible manner. No quantitative research method can capture the qualitative notion of style, so we start with these examples. (We use "..." for immaterial text removed to shorten samples.)

SQL is notoriously difficult to format, particularly for nested queries, but CODEBUFF does an excellent job in most cases. For example, here is a formatted query from file IPMonVerificationMaster.sql (trained with sqlite grammar on sqlclean corpus):

    [formatted SQL sample: a SELECT DISTINCT over t_server joined with t_server_type_assoc, containing a nested NOT IN subquery over the ipmongroups/ipmonmonitors tables, followed by UNION ALL]

And here is a complicated query from dmart_bits_IAPPBO510.sql with case statements:

    [formatted SQL sample: a nested SELECT with CASE WHEN ... THEN ... ELSE ... END expressions, SUM and COUNT aggregates, and GROUP BY SSISInstanceID]

Here is a snippet from Java, our 2nd test language, taken from STLexer.java (trained with java grammar on st corpus):

    [formatted Java sample: a switch statement whose default case contains nested if statements, early returns such as return newToken(IF), and the construction of a RecognitionException]

Here is an example from STViz.java that indents a method declaration relative to the start of an expression rather than the first token on the previous line:

    [formatted Java sample: Thread t = new Thread() { @Override public void run() {...} };, with the anonymous class body indented relative to the start of the Thread expression]

Formatting results are generally excellent for ANTLR, our third test language. E.g., here is a snippet from Java.g4:

    [formatted ANTLR sample: the classOrInterfaceModifier rule, with its annotation and 'public' ... 'final' ... 'strictfp' alternatives and their trailing // class or interface, // class only comments]

(Comments are passed through; see Section 3.5.) Among the formatted files for the three languages, there are a few regions of inoptimal or bad formatting. CODEBUFF does not capture all formatting rules and occasionally gives puzzling formatting. For example, in the Java8.g4 grammar, the following rule has all elements packed onto one line:

    [formatted ANTLR sample: the unannClassOrInterfaceType rule from Java8.g4, with all of its alternatives packed onto a single long line]

CODEBUFF does not consider line length during training or formatting, instead mimicking the natural line breaks found among phrases of the corpus. For Java and SQL this works very well, but not always with ANTLR grammars.

Here is an interesting Java formatting issue from Compiler.java that is indented too far to the right (column 102); it is indented from the {{. That is a good decision in general, but here the left side of the assignment is very long, which indents the put() code too far to be considered good style.

    [formatted Java sample: a field initialized with a double-brace instance initializer, where the put("anchor", "true") call is indented to column 102]

In STGroupDir.java, the prefix token is aligned improperly:

    [formatted Java sample: an if ( verbose ) error("loadFile(" + ... + ") in groupdir ..." + prefix) call, with the wrapped prefix argument aligned improperly]

We also note that some of the SQL expressions are incorrectly aligned, as in this sample from SQLQuery23.sql:

    [formatted SQL sample: a run of AND ... NOT LIKE '...' predicates with inconsistent alignment]

Despite a few anomalies, CODEBUFF generally reproduces a corpus' style well. Now we describe the design used to achieve these results. In Section 4 we quantify them.

3. The Design of an AI for Formatting

Our AI formatter mimics what programmers do during the act of entering code. Before entering a program symbol, a programmer decides (i) whether a space or line break is required and if line break, (ii) how far to indent the next line. Previous approaches (see Section 6) make a language engineer define whitespace injection programmatically.

A formatting engine based upon machine learning operates in two distinct phases: training and formatting. The training phase examines a corpus of code documents, D, written in language L to construct a statistical model that represents the formatting style of the corpus author. The essence of training is to capture the whitespace preceding each token, t, and then associate that whitespace with the phrase context surrounding t. Together, the context and whitespace preceding t form an exemplar. Intuitively, an exemplar captures how the corpus author formatted a specific, fine-grained piece of a phrase, such as whether the author placed a newline before or after the left curly brace in the context of a Java if-statement.

Training captures the context surrounding t as an m-dimensional feature vector, X, that includes t's token type, parse-tree ancestors, and many other features (Section 3.3). Training captures the whitespace preceding t as the concatenation of two separate operations or directives: a whitespace ws directive followed by a horizontal positioning hpos directive if ws is a newline (line break).
The ws directive generates spaces, newlines, or nothing while hpos generates spaces to indent or align t relative to a previous token (Section 3.1). As a final step, training presents the list of exemplars to a machine learning algorithm that constructs a statistical model. There are N exemplars (X_j, w_j, h_j) for j = 1..N where N is the number of total tokens in all documents of corpus D and w_j ∈ ws, h_j ∈ hpos. Machine learning models are typically both a highly-processed condensation of the exemplars and a classifier function that, in our case, classifies a context feature vector, X, as needing a specific bit of whitespace. Classifier functions predict how the corpus author would format a specific context by returning a formatting directive. A model for formatting needs two classifier functions, one for predicting ws and one for hpos (consulted if ws prediction yields a newline).

CODEBUFF uses a k-Nearest Neighbor (kNN) machine learning model, which uses the list of exemplars as the actual model. A kNN's classifier function compares unknown context vector X to the X_j from all N exemplars and finds the k nearest. Among these k, the classifier predicts the formatting directive that appears most often (details in Section 3.4). It's akin to asking a programmer how they normally format the code in a specific situation. Training requires a corpus D written in L, a lexer and parser for L derived from grammar G, and the corpus indentation size to identify indented phrases; e.g., one of the Java corpora we tested indents with 2 not 4 spaces. Let F_{D,G} = (X, W, H, indentSize) denote the formatting model contents with context vectors forming rows of matrix X and formatting directives forming elements of vectors W and H. Function 1 (see appendix) embodies the training process, constructing F_{D,G}.
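To make the kNN idea above concrete, here is a minimal sketch of an exemplar store with a majority-vote classifier. It is only an illustration under assumed names: the Exemplar and KnnClassifier classes, the choice of k, and the simple mismatch-count distance are inventions for this sketch, not CODEBUFF's actual implementation (its real distance function is described in Section 3.4).

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // An exemplar pairs a context feature vector X_j with the packed
    // formatting directive (w_j or h_j) observed at that token in the corpus.
    class Exemplar {
        final int[] features;
        final int directive;
        Exemplar(int[] features, int directive) {
            this.features = features;
            this.directive = directive;
        }
    }

    // kNN "model": the exemplar list itself, plus a majority vote over the
    // k exemplars whose contexts are most similar to the unknown context.
    class KnnClassifier {
        private final List<Exemplar> exemplars;
        private final int k;

        KnnClassifier(List<Exemplar> exemplars, int k) {
            this.exemplars = exemplars;
            this.k = k;
        }

        // Assumed distance: number of mismatched feature values.
        private static int distance(int[] a, int[] b) {
            int d = 0;
            for (int i = 0; i < a.length; i++) if (a[i] != b[i]) d++;
            return d;
        }

        // Predict the directive appearing most often among the k nearest contexts.
        int predict(int[] x) {
            List<Exemplar> byDistance = new ArrayList<>(exemplars);
            byDistance.sort(Comparator.comparingInt((Exemplar e) -> distance(e.features, x)));
            Map<Integer, Integer> votes = new HashMap<>();
            for (Exemplar e : byDistance.subList(0, Math.min(k, byDistance.size()))) {
                votes.merge(e.directive, 1, Integer::sum);
            }
            return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
        }
    }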

Once the model is complete, the formatting phase can begin. Formatting operates on a single document d to be formatted and functions with guidance from the model. At each token t_i ∈ d, formatting computes the feature vector X_i representing the context surrounding t_i, just like training does, but does not add X_i to the model. Instead, the formatter presents X_i to the ws classifier and asks it to predict a ws directive for t_i based upon how similar contexts were formatted in the corpus. The formatter "executes" the directive and, if a newline, presents X_i to the hpos classifier to get an indentation or alignment directive. After emitting any preceding whitespace, the formatter emits the text for t_i. Note that any token t_i is identified by its token type, string content, and offset, i, within a specific document.

The greedy, "local" decisions made by the formatter give "globally" correct formatting results; selecting features for the X vectors is critical to this success. Unlike typical machine learning tasks, our predictor functions do not yield trivial categories like "it's a cat." Instead, the predicted ws and hpos directives are parametrized. The following sections detail how CODEBUFF captures whitespace, computes feature vectors, predicts directives, and formats documents.
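The formatting phase just described is a single pass over the document's tokens. The following sketch shows that control flow only; the Token, FeatureExtractor, Classifier, and DirectiveExecutor interfaces and their method names are assumptions made for the illustration, not CODEBUFF's actual API.

    import java.util.List;

    // Sketch of the formatting pass: for each token, build its context vector,
    // predict a ws directive, and only when that directive is a newline,
    // additionally predict an hpos (indent/align) directive.
    class FormattingLoopSketch {
        interface Token { String text(); }
        interface FeatureExtractor { int[] contextAt(List<Token> doc, int i); }
        interface Classifier { int predict(int[] context); }   // returns a packed directive
        interface DirectiveExecutor {
            String whitespace(int wsDirective);                 // "executes" a ws directive
            boolean isNewline(int wsDirective);
            String horizontalPosition(int hposDirective, int tokenIndex);
        }

        static String format(List<Token> doc, FeatureExtractor features,
                             Classifier ws, Classifier hpos, DirectiveExecutor exec) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < doc.size(); i++) {
                int[] x = features.contextAt(doc, i);   // X_i, computed exactly as in training
                int w = ws.predict(x);                  // how similar contexts were formatted
                out.append(exec.whitespace(w));
                if (exec.isNewline(w)) {
                    int h = hpos.predict(x);            // indentation/alignment for the new line
                    out.append(exec.horizontalPosition(h, i));
                }
                out.append(doc.get(i).text());          // finally emit the token text
            }
            return out.toString();
        }
    }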
For example, in the following Javafragment, the proper ws directive at "a is (sp, 1), meaning“inject 1 space,” the directive at "b is none, and "c is (nl , 1),meaning “inject 1 newline.”return x y z; // align with first token of previous lineWhile (align, y) is valid, that directive is not available because of limitations in how hpos directives identify previoustokens, as discussed next.3.2x y;"a"dAt position "d , both (indent, for) and (align, ‘(’) capturethe formatting, but training chooses indentation over alignment directives when both are available. We experimentedwith the reverse choice, but found this choice better. Here,(align, ‘(’) inadvertently captures the formatting becausefor happens to be 3 characters.To illustrate the need for (indent, tj ) versus plain indent,consider the following Java method fragment where the firststatement is not indented from the previous line.Capturing Whitespace as Directivesz ;"cfor (int i 0; .x i;"bIn order to reproduce a particular style, formatting directivesmust encode the information necessary to reproduce whitespace encountered in the training corpus. There are five canonical formatting directives:1.2.3.4.5."a}f(x,y)How Directives Refer to Earlier TokensThe manner in which training identifies previous tokens forhpos directives is critical to successfully formatting documents and is one of the key contributions of this paper. Thegoal is to define a “token locator” that is as general as possible but that uses the least specific information. The moregeneral the locator, the more previous tokens directives can"b"cThe hpos directives align or indent token ti relative tosome previous token, tj for j i as computed by Function3. When a suitable tj is unavailable, there are hpos directives140

3.2 How Directives Refer to Earlier Tokens

The manner in which training identifies previous tokens for hpos directives is critical to successfully formatting documents and is one of the key contributions of this paper. The goal is to define a "token locator" that is as general as possible but that uses the least specific information. The more general the locator, the more previous tokens directives can identify. But, the more specific the locator, the less applicable it is in other contexts. Consider the indicated positions within the following Java fragments where align directives must identify the first token of previous function arguments.

    f(x,          f(x+1,          f(x+1,
      y)            y)              y,
      ↑a            ↑b              -z)
                                    ↑c

The absolute token index within a document is a completely general locator but is so specific as to be inapplicable to other documents or even other positions within the same document. For example, all positions ↑a, ↑b, and ↑c could use a single formatting directive, (align, i), but x's absolute index, i, is valid only for a function call at that specific location.

The model also cannot use a relative token index referring backwards. While still fully general, such a locator is still too specific to a particular phrase. At position ↑a, token x is at delta 2, but at position ↑b, x is at delta 4. Given argument expressions of arbitrary size, no single token index delta is possible and so such deltas would not be widely applicable. Because the delta values are different, the model could not use a single formatting directive to mean "align with previous argument." The more specific the token locator, the more specific the context information in the feature vectors needs to be, which in turn, requires larger corpora (see Section 3.3).

We have designed a token locator mechanism that strikes a balance between generality and applicability. Not every previous token is reachable but the mechanism yields a single locator for x from all three positions above and has proven widely applicable in our experiments. The idea is to pop up into the parse tree and then back down to the token of interest, x, yielding a locator with two components: a path length to an ancestor node and a child index whose subtree's leftmost leaf is the target token. This alters formatting directives relative to previous tokens to be: (_, ancestorΔ, child).

Unfortunately, training can make no assumptions about the structure of the provided grammar and, thus, parse-tree structure. So, training at t_i involves climbing upwards in the tree looking for a suitable ancestor. To avoid the same issues with overly-specific elements that token indexes have, the path length is relative to what we call the earliest left ancestor, as shown in the parse tree in Figure 1 for f(x+1,y,-z).

    [Figure 1. Parse tree for f(x+1,y,-z). Node rule:n in the tree indicates the grammar rule and alternative production number used to match the subtree phrase.]

The earliest left ancestor (or just left ancestor) is the oldest ancestor of t whose leftmost leaf is t, and identifies the largest phrase that starts with t. (For the special case where t has no such ancestor, we define left ancestor to be t's parent.) It attempts to answer "what kind of thing we are looking at." For example, the left ancestor computed from the left edge of an arbitrarily-complex expression always refers to the root of the entire expression. In this case, the left ancestors of x, y, and z are siblings, thus normalizing leaves at three different depths to a common level. The token locator in a directive for x in f(x+1,y,-z) from both y and z is (_, ancestorΔ, child) = (_, 1, 0), meaning jump up 1 level from the left ancestor and down to the leftmost leaf of the ancestor's child 0.

The use of the left ancestor and the ancestor's leftmost leaf is critical because it provides a normalization factor among dissimilar parse trees about which training has no inherent structural information. Unfortunately, some tokens are unreachable using purely leftmost leaves. Consider the return x + y + z; example from the previous section and one possible parse tree for it in Figure 2. Leaf y is unreachable as part of formatting directives for z because y is not a leftmost leaf of an ancestor of z. Function capture_hpos must either align or indent relative to x or fall back on the plain align and indent.

    [Figure 2. Parse tree for x + y + z;.]

The opposite situation can also occur, where a given token is unintentionally aligned with or indented from multiple tokens. In this case, training chooses the directive with the smallest ancestorΔ, with ties going to indentation.

And, finally, there could be multiple suitable tokens that share a common ancestor but with different child indexes. For example, if all arguments of f(x+1,y,-z) are aligned, the parse tree in Figure 1 shows that (align, 1, 0) is suitable to align y and both (align, 1, 0) and (align, 1, 2) could align argument -z. Ideally, the formatter would align all function arguments with the same directive to reduce uncertainty in the classifier function (Section 3.4) so training chooses (align, 1, 0) for both function arguments.

The formatting directives capture whitespace in between tokens but training must also record the context in which those directives are valid, as we discuss next.

3.3 Token Context—Feature Vectors

For each token present in the corpus, training computes an exemplar that associates a context with a ws and hpos formatting-directive: (X, w, h). Each context has several features combined into an m-dimensional feature vector, X. The context information captured by the features must be specific enough to distinguish between language phrases requiring different formatting but not so specific that classifier functions cannot recognize any contexts during formatting. The shorter the feature vector, the more situations in which each exemplar applies. Adding more features also has the potential to confuse the classifier.

Through a combination of intuition and exhaustive experimentation, we have arrived at a small set of features that perform well. There are 22 context features computed during training for each token, but ws prediction uses only 11 of them and hpos uses 17. (The classifier function knows which subset to use.) The feature set likely characterises the context needs of the languages we tested during development to some degree, but the features appear to generalize well (Section 5).

Before diving into the feature details, it is worth describing how we arrived at these 21 features and how they affect formatter precision and generality. We initially thought that a sliding window of, say, four tokens would be sufficient context to make the majority of formatting decisions. For example, the context for ··· x = 1 * ··· would simply be the token [...] needed to achieve good generality. For example, rather than relying solely on the exact tokens after a token, it is more general to capture the fact that those tokens begin an expression. A useful metric is the percentage of unique context vectors, which we counted for several corpora and show in Figure 3. Given the features described below, there are very few unique contexts for ws decisions (a few %). The contexts for hpos decisions, however, often have many more unique contexts because ws uses 11-vectors and hpos uses 17-vectors. E.g., our reasonably clean SQL corpus has 31% and 18% unique hpos vectors when trained using SQLite and TSQL grammars, respectively.

    Corpus        N tokens    Unique ws    Unique hpos
    antlr           19,692        3.0%          4.7%
    java            42,032        3.9%         17.4%
    java8           42,032        3.4%          7.5%
    java_guava     499,029        0.8%          8.1%
    sqlite          14,758        8.4%         30.8%
    tsql            14,782        7.5%         17.9%

    Figure 3. Percentage of unique context vectors in corpora.
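The "percentage of unique context vectors" statistic in Figure 3 is a simple distinct-count over the exemplars' feature vectors. A sketch of how one might compute it follows; the class and method names are invented for this illustration.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: fraction of distinct context vectors among all exemplars, i.e.,
    // the "Unique ws" / "Unique hpos" percentages reported in Figure 3.
    final class UniqueContexts {
        static double percentUnique(int[][] contexts) {
            Set<String> distinct = new HashSet<>();
            for (int[] x : contexts) {
                distinct.add(Arrays.toString(x));   // compare by contents, not identity
            }
            return 100.0 * distinct.size() / contexts.length;
        }

        public static void main(String[] args) {
            int[][] demo = { {1, 2}, {1, 2}, {3, 4} };
            System.out.println(percentUnique(demo));   // prints 66.66...
        }
    }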
For generality, the fewer unique contexts the better, as long as the formatter performs well. At the extreme, a model with just one X context would perform very poorly because all exemplars would be of the form (X, *, *). The formatting directive appearing most often in the corpus would be the sole directive returned by the classifier function for any X. The optimal model would have the fewest unique contexts but all exemplars with the same context having identical formatting directives. For our corpora, we found that a majority of unique contexts for ws and almost all unique contexts for hpos predict a single formatting directive, as shown in Figure 4. For example, 57.1% of the unique antlr corpus contexts are associated with just one ws directive and 95.7% of the unique contexts predict one hpos directive. The higher the ambiguity associated with a single context vector, the higher the uncertainty when making decisions during formatting.

The guava corpus stands out as having very few unique contexts for ws and among the fewest for hpos. This gives a hint that the corpus might be much larger than necessary because the other Java corpora are much smaller and yield good formatting results. Figure 8 shows the effect of corpus size on classifier error rates. The error rate flattens out after training on about 10 to 15 corpus files.

In short, few unique contexts gives an indication of the potential for generality and few ambiguous decisions gives an indication of the model's potential for accuracy. These numbers do not tell the entire story because some contexts are used more frequently than others and those might all predict single directives. Further, while a single context could be associated with multiple directives, most of those could be one specific directive.

With this perspective in mind, we turn to the details of the individual features. The ws and hpos de
