HIVE dna-hexagon Tutorial


The purpose of this tutorial is to guide the user through the process of a single alignment using the HIVE dna-hexagon tool. All other variations on alignments using the tool employ this same basic process with modified inputs and parameters.

TABLE OF CONTENTS

Introduction
1. Selecting Inputs
1.1 Query Sequences (reads)
1.2 Reference Genome
2. Input Parameters
2.1 Alignment Algorithm
2.2 Algorithmic Parameters
3. Job Processing
4. Alignment Results
5. What next?

INTRODUCTION

The HIVE dna-hexagon sequence alignment tool allows the user to align reads from a high throughput experiment to a reference genome. Both reads and references may be selected from those provided by HIVE or supplied by the user. (This tutorial assumes the query sequences are already in the system. For help loading new data into HIVE, please see the Tutorial for DMDownloader.) Once query sequences are selected, a number of summary and quality-check visualization tools are available to view. Numerous parameters allow customization of the alignment to suit the user’s needs. When the alignment is finished, the tool produces a hits table and pie chart showing hits of the query sequences to genes on the reference genome, in addition to several other visualizations, all of which can be downloaded. The user may then compute SNP profiles or move on to a number of other analysis modules which can be stacked end-to-end with the dna-hexagon aligner.

For easier understanding, all text for HIVE system options, parameters or buttons is displayed in bolded blue. Any text the user should input is displayed in the Courier font. Equations are offset and italicized.

1. SELECTING INPUTS

The only two inputs required for selection are the query sequences (reads) and the reference genome (genomes). While further customization is available via specification of parameters, all other inputs have set defaults such that the alignment will proceed without the user entering any additional values.

1.1 Query Sequences (reads)

Once you are logged into HIVE and on the user home page, the second menu from the left in the header region reads HIVE-Portal (Figure 1). Hovering the mouse over this option opens a menu below it. Upon clicking the Sequence Alignment on Genome tool, you will be redirected to the HIVE alignment dna-hexagon portal.

Figure 1. HIVE Home Directory for a Logged-In User

Once in the portal (Figure 2) you should see four input boxes: Reference Genomes, Short Reads, Alignment Algorithm and Algorithmic Parameters. The Short Reads box should be visible in the upper-right of the window. To view all the short reads available to you, click on the expansion icon in the top left corner of this box. This will reveal the list (Figure 3) of all the reads you have uploaded, in addition to Demo Reads and other files shared with you by the HIVE team under the user name Biological Data Handler, or by another user.

Figure 2. dna-hexagon View

Figure 3. List of Short Reads Accessed from dna-hexagon Inputs Page

Note that the above representation is in a list view. At the bottom of the sequence reads window there is an option to toggle the organization between the hierarchy and list formats by clicking the associated icons.

Each item in the sequence reads menu is preceded by a checkbox icon. To choose a file or series of files for alignment, check the box(es) next to the desired file(s). To choose all files in the directory, check the icon at the very top of the chart next to the ID heading.

If at any point you find yourself without the desired sequence files or in the wrong place altogether, clicking the Home link at the top left of the page will always return you to your user home directory. For tutorial purposes, please click Home at this time. Any information loaded by or shared with you in HIVE will be found in this directory.

For streamlined alignment access, you may also select short reads or genomes directly from your home directory. All short reads available to you can be seen in the top page section labeled Files and Sequences under the reads tab. Here you should see a list of files identical to that accessed from the dna-hexagon page. To proceed with the tutorial, please check the two Influenza paired reads shown in Figure 4 below.

Figure 4. Demo Input Sequences - Influenza Paired Reads

Notice that the data in this case are displayed as a pair: these data come from paired-end read experiments such that one is the forward read and the other the reverse read of the template.

Only by checking the boxes will you be able to select data for alignment. Simply clicking on a file name (selecting a file) will highlight the selected file and allow you to view details about it, but will not give you the option to perform any action (such as alignment) on the file. This preview information can be viewed in the box to the right in the preview and details tabs.
For your convenience, the help tab containing topical help can also be accessed here.

Once you check the boxes next to this pair, operational icons should appear in your toolbar at the top of this list (Figure 5). Clicking on the dna-hexagon image in this toolbar will direct you again to the alignment tool. You should see that your short read sequences have already been selected and are listed in the short reads input box.

(NOTE: If the user wishes to upload new experimental sequence information into the system, click the add tab on the home page in the Files and Sequences box. This will redirect the user to the DMDownloader utility. Please see the DMDownloader tutorial for further details.)

With your query sequences selected, you are now ready to choose a reference genome.

Figure 5. HIVE Toolbar

1.2 Reference Genome

A reference genome is a comprehensive sequence of the entire genome of a given species, assembled by scientists to be representative of that species. Reference genomes for several species have been made available to users, shared by the HIVE team under the user name Biological Data Handler.

Selection of a reference genome should depend greatly on the experimental data being aligned. For example, if you think you have isolated a new strain of influenza and you want to see how it compares to other influenzas, you will want to select an influenza reference genome. Alternatively, if you are looking for SNPs in a disease-associated human protein, you will need to select a human reference genome. Perhaps you are looking for horizontal gene transfer between two species of bacteria, in which case you may want to align your sequences from one species to the reference genome of the other.

We have selected sequence data for Influenza. There may be several applicable genomes. In addition to picking the right species genome, you may need further information to choose the best possible genome for alignment. Continuing with the tutorial, we will now choose the influenza genome. Click on the expansion icon found in the upper left corner of the Reference Genomes box. Please check the box next to ID 5153, the file named Influenza Segments.fa, to select it as our reference genome.

2. INPUT PARAMETERS

Below the Input Data box are two collapsible/expandable sections, Alignment Algorithm and Alignment Parameters, which allow for complete customization of many aspects of the alignment. All parameters are populated with default values, so a user is not required to specify any values in order to run the alignment; they need only click the Align button after selecting inputs. To expand any closed section, click the icon to the left of the section header. To hide an open section, click the icon to the left of the section header of the newly expanded window.

2.1 Alignment Algorithm

Here you may select which alignment tool you would like to use to align your data. Access the menu by clicking the expansion icon in the top left of the Alignment Algorithm box. Current options include: HIVE’s dna-hexagon, NCBI-BLAST, Bowtie, TopHat, Ace View Magic and BWA. dna-hexagon is the native, default alignment tool developed and optimized for use within high-performance cloud computing environments and therefore is best adapted to efficiently use the HIVE infrastructure to enhance analysis performance. The external tools have all been adapted to the parallel HIVE environment, so you should see increased performance of most of those tools when comparing their usage within HIVE to their usage as standalone tools. We recommend dna-hexagon for best results, so this tutorial will be using dna-hexagon.

2.2 Algorithmic Parameters

The parameters shown by default on the portal page are those which are most often user-modified, including the ability to specify a name for the resultant output file.

Name: Specify a name for the alignment output file.

Minimum Match Length: Alignments produced will likely vary in length. This option allows the user to choose a preferred length of alignment by discarding all alignments shorter than the specified value. The default minimum length is set to 75.

Matches to Keep: Alignment results can be filtered by choosing one of three options. The reported matches/alignments can be returned as the Best Match or First Match, or as the list of All Matches by selecting the preferred option from the drop-down menu. The First Match option can be useful when the user is simply interested in finding the genome to which a query sequence belongs, whereas the Best Match option ranks alignments according to alignment scores. Default is set to Best Match.

Percent Mismatches Allowed: A mismatch in an alignment occurs when the query and reference align at most points but differ in nucleotides at corresponding positions in the same region.
An alignment is not disregarded merely because it contains mismatches, but the alignment score reflects a penalty associated with mismatches. The mismatch percentage is equal to the total number of mismatches divided by the total number of positions, expressed as a percentage.

    Mismatch % = (total # mismatches / total # positions) × 100

This filter will remove alignments with a mismatch % greater than the specified value. Default value is set to 15 for 15%.

The remaining, less used and therefore hidden parameters are located in the Algorithm Parameters box, accessed by clicking the expansion icon in the top left corner of the box.
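The Percent Mismatches Allowed filter described above can be sketched in a few lines of Python. This is only an illustration of the formula, not HIVE's own code: the function names and the example (mismatches, positions) pairs are assumptions made for this sketch, and only the default threshold of 15 comes from the tutorial.

```python
def mismatch_percent(mismatches: int, positions: int) -> float:
    """Return mismatches as a percentage of the aligned positions."""
    if positions <= 0:
        raise ValueError("positions must be positive")
    return 100.0 * mismatches / positions


def passes_mismatch_filter(mismatches: int, positions: int,
                           max_percent: float = 15.0) -> bool:
    """Keep an alignment unless its mismatch % is greater than the threshold."""
    return mismatch_percent(mismatches, positions) <= max_percent


# Hypothetical alignments given as (mismatches, positions) pairs.
examples = [(5, 100), (15, 100), (16, 100)]
kept = [pair for pair in examples if passes_mismatch_filter(*pair)]
print(kept)
```

An alignment sitting exactly at the threshold is kept here, because the text says only alignments with a mismatch % greater than the specified value are removed.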

Advanced Parameters:

Alignment Directionality: This parameter allows the user to choose whether sequences will be aligned in a Forward or Bi-directional (forward and reverse) manner by selection from the drop-down menu. Default is Bi-directional.

Alignment Tail Extension: Extension can be performed in Conservative Compact or Aggressive Extended mode according to the selection from the drop-down menu. The Aggressive Extended option facilitates continued alignment of sequences until the threshold of mismatches encountered is reached, resulting in lower sensitivity but higher coverage. The Conservative Compact option terminates an alignment as soon as a higher score is obtained, providing higher sensitivity but lower coverage. The Conservative Compact mode is a good choice when working with small genomes that have a higher mutation rate. Default is set to Conservative Compact.

Allowed number of gaps during extension: The user can specify the maximum number of insertions and deletions allowed during the seed extension phase of alignment. Default is set to 3.

Intelligent Diagonal: HIVE can compute only the region surrounding the diagonal of the alignment scoring matrix, under the assumption that the best score should lie near the diagonal. If a user is uncertain, they can specify to compute the Complete matrix. Default is set to Diagonal of a given size.

Intelligent K-mer Jump in Read: Increasing the seed size in large genomes improves the alignment speed significantly at a potential cost of loss of sensitivity. The value 1 means no loss in sensitivity. The size of the seed is the maximum recommended value. If auto is chosen, it will be set to 1 for small genomes and 2 for large genomes. Default is auto.

Intelligent K-mer Jump in Reference: The same logic as for Intelligent K-mer Jump in Read above applies, but with respect to the reference genome instead of the short read.
Default is auto.

K-mer Extension Minimal Length Percent: The minimal length of the extension must be equal to this percentage of the seed length. Default is set to 66 for 66%.

K-mer Extension Mismatch Allowance Percent: This parameter allows the user to define the percentage of mismatches allowed during extension of alignments. If 0 is entered, all positions are computed using the Smith-Waterman algorithm. Default is set to 25 for 25%.

Optimal Alignment Search: This option allows the user to choose whether to find the optimal alignment using the Smith-Waterman algorithm (default setting) or quickly look up highly scoring hits by selecting Only Identities from the drop-down menu. The Smith-Waterman algorithm is a dynamic programming algorithm that employs a substitution matrix to yield the highest scoring local alignment. The time required by the algorithm to return the optimal alignment greatly increases with an increase in sequence length. If the user is not interested in optimal alignments and merely wants to identify high scoring potential hits, the Only Identities option will produce results at a faster rate.

Over-represented K-mer Suppression: Some K-mers appear in data much more often than would be expected by random occurrence. This parameter allows the elimination of K-mer hits which are this percentage more abundant than expected. Default is set to 20 for 20%.

Use Read Self Similarity: Selecting the Use self similarity optimization (default) option allows the aligner to index and count identical reads instead of aligning each identical read separately.

Width of Intelligent Diagonal: As mentioned in the Intelligent Diagonal description above, HIVE allows computation of a defined region around the diagonal of the matrix as opposed to computation of the entire matrix, including regions which are known to have no potential of containing the highest scoring path. Setting the parameter to auto will specify a width equal to the length of the seed.

Alignment Algorithm: Automatically populated by the algorithm selected in Section 2.1 above.

Alignment Filters:

Repeat and Transposition Discovery: This parameter allows the user to perform more detailed multi-pass lookups in order to identify repeats and transpositions, or repeats only, by selection of the preferred option from the drop-down menu. Inclusion of this search comes at the expense of a longer run-time, so the default is set to exclude this search.

Score Filter: Each alignment produces a score based on matches, mismatches, insertions, deletions and gaps. Usually, a higher score is correlated with a better alignment. This option allows the user to eliminate alignments with scores below a specified threshold.
The default setting of None will report all alignments and discard nothing.

Alignment Parameters:

Alignment Scope: Alignment of sequences across the reference genome can be done locally or globally by choosing the appropriate value from the drop-down menu. A Local alignment tries to find the best match in particular regions of the genome, whereas a Global alignment tries matching the entire sequence length to the reference genome (Figure 6). Global alignments are good for similarity and identity searches between the sequences. Local alignments are better suited for identifying conserved residues or domains.
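To make the idea of a Local alignment concrete, here is a minimal Smith-Waterman scoring sketch in Python. It is a textbook version written for illustration and makes no claim about how dna-hexagon implements the search: the function name is hypothetical, a single linear gap penalty stands in for the separate gap opening and continuation costs, and the full matrix is computed rather than a region around the diagonal. The match and mismatch values reuse the tutorial's defaults of 5 and -4.

```python
def smith_waterman_score(query: str, reference: str,
                         match: int = 5, mismatch: int = -4,
                         gap: int = -12) -> int:
    """Return the best local alignment score between query and reference."""
    cols = len(reference) + 1
    prev = [0] * cols          # previous row of the scoring matrix
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            same = query[i - 1] == reference[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            up = prev[j] + gap        # gap in the reference
            left = curr[j - 1] + gap  # gap in the query
            # Flooring at 0 is what makes the alignment local:
            # a poor prefix is dropped instead of dragging the score down.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "TTACGTTT"))
```

Removing the floor at zero and scoring end to end would give a global alignment in the Needleman-Wunsch style instead.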

Figure 6. Local vs. Global Alignments

Alignment Costs: Alignments are ranked by comparison of alignment scores, composed by the assignment of cost values to all matches, mismatches and gaps. Matches are rewarded; mismatches and gaps are penalized.

In a biological scenario, single base insertions or deletions are less probable than larger insertions or deletions. For this reason, a gap opening is assigned a higher penalty than continuation of a gap while computing the alignment score. If gaps continue over multiple positions after the gap has already been opened, they are more likely the result of a true insertion or deletion with potential evolutionary significance. Thus, the default Gap open penalty has a more negative value than the default Gap next penalty.

Default values for Match Benefit, Mismatch Penalty, Gap Opening Cost, Gap Continuation Cost and Mismatch Continuation Penalty are 5, -4, -12, -4 and -6, respectively.

Seed K-mer: A seed is a short fragment of the query sequence used to search for nucleotide similarity. Seed length determines the size of the K-mer to look up in a hash table. Higher seed values make the alignment faster but have a potential of decreasing the sensitivity of the method. The default seed length is set at 11 letters, corresponding to the seed length employed by the BLAST algorithm. Alternative seed lengths pre-populate the drop-down menu.

Shorter Terminus Alignment: Alignment coverage near the end (terminus) of a reference segment is often less than ideal due to the Minimum Match Length threshold. For example, if the Minimum Match Length parameter is set to 30 but you have a read that exactly matches 20 base pairs up to the end of the segment, this exact match over the area that matters will be excluded because it is shorter than the specified threshold. This parameter essentially allows specification of a different match length threshold in the region of the terminus.
The default of 0 means that the standard Minimum Match Length is used.

Alignment Test Run Slice: This allows a user to conduct a test run of parameters by specifying the size of a subset of sequences to run per computational thread. The default of all indicates that all sequences in the alignment will be run. To use the test-run feature, the user must specify a number of sequences to run on each thread.
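The Alignment Costs described above can be illustrated with a short Python sketch that scores an already aligned pair of sequences. The defaults (5, -4, -12, -4) come from the tutorial; everything else is an assumption made for illustration: the function name, the use of "-" for gap positions, and the convention that the first position of a gap pays the opening cost while each following position pays the continuation cost. The Mismatch Continuation Penalty is left out because the tutorial does not say how it is applied.

```python
def alignment_score(ref_aln: str, read_aln: str,
                    match: int = 5, mismatch: int = -4,
                    gap_open: int = -12, gap_next: int = -4) -> int:
    """Score two equal-length aligned strings that use '-' for gaps."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-" or read_base == "-":
            # Assumed convention: opening a gap costs more than extending it.
            score += gap_next if in_gap else gap_open
            in_gap = True
        else:
            score += match if ref_base == read_base else mismatch
            in_gap = False
    return score


print(alignment_score("ACGTACGT", "ACG--CGT"))
```

Under this convention one two-base gap costs -16, whereas two separate single-base gaps would cost -24, which is the preference for fewer, longer gaps that the text describes.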

Low complexity filter: Some subjects are well known to have regions or entire sequences of low complexity and a high degree of repeats. With these subjects in mind, HIVE allows the user to specify masking such that regions of both reads and references can be excluded from the alignment by defining the minimal entropy threshold and a window size of the sequence to be ignored.

Reference masking:

Minimal Shannon Entropy: The drop-down menu allows the user to define the Minimal Shannon Entropy of a region of the reference genome allowed to be included in an alignment. Default is 0 – permissive, which allows all regions to be included.

Window Size: The drop-down menu allows the user to specify the Window Size of the reference genome to ignore if the calculated entropy drops below the specified Minimal Shannon Entropy. Default of do not mask low complexity regions allows inclusion of all regions regardless of calculated entropy.

Short read filtration:

Minimal Shannon Entropy: The drop-down menu allows the user to define the Minimal Shannon Entropy of a region of any short read allowed to be included in an alignment. Default is 0 – permissive, which allows all regions to be included.

Window Size: The drop-down menu allows the user to specify the Window Size of all short reads to ignore if the calculated entropy drops below the specified Minimal Shannon Entropy.
Default of do not mask low complexity regions allows inclusion of all regions regardless of calculated entropy.

Number of Computational Subjects per Single Thread: This parameter allows the user to specify the maximum number of sequences per thread to be aligned by a single compute node.

Query Data: Automatically populated by object IDs of selected input files.

Reference Genomes: Automatically populated by selected reference genome.

Reference genome serial number: Automatically populated by selected reference genome, if applicable.

For the purposes of our tutorial, we will leave all parameters set to the default values, but we will rename the output file “Demo Align 1”. Make sure both short read sequences are selected as well as the proper reference genome. Your page should now look identical to Figure 7.
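The low complexity filter described above combines an entropy threshold with a window size. The Python sketch below shows one way such masking could work. It is an assumption-laden illustration, not HIVE's implementation: the function names are hypothetical, entropy is measured in bits over nucleotide frequencies in a sliding window, and masked positions are replaced with N. How HIVE scales its entropy values and treats masked regions is not stated in this tutorial.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits of the symbol frequencies in seq."""
    total = len(seq)
    counts = Counter(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int, min_entropy: float) -> str:
    """Replace every window whose entropy is below min_entropy with N."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = "N" * window
    return "".join(masked)


# Hypothetical sequence with a homopolymer run in the middle.
example = "ACGTAAAAAAAAACGT"
print(mask_low_complexity(example, window=4, min_entropy=0.5))
```

With a threshold of 0 nothing is masked, since no window can have entropy below zero, which matches the permissive default described above.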

Figure 7. dna-hexagon Portal with Specified Inputs.

3. JOB PROCESSING

To start the job, click on the ALIGN button. This will refresh your page and present you with two new boxes: the first is titled HIVE-hexagon Alignment and the second Results.

The HIVE-hexagon Alignment box tracks the progress of your alignment. By clicking on the expand node icon found in the top left side of this box, you can view the progress of the subcomponents of this task.

The Results box will become populated as the alignment is carried out. The process status will change from Waiting to Running to Done. The whole alignment is finished when all statuses read Done and the progress bar shows 100% completion. The time elapsed clock will stop, but the run-time will continue to be displayed. At this time, click the refresh button on the left side of this box to ensure you have the complete results.

4. ALIGNMENT RESULTS

The Results area has two main sections: on the left is a directory of all the aligned reads, and on the right you can use the tabs to view various information about each read. Your Results section should now look like Figure 8.

Figure 8. Default Alignment Results View

By default your results will be presented in a list format. To view them in a pie chart format instead, click on the piechart tab. You should now see a pie chart representation of your results (see Figure 9).

In order to view detailed alignment information, you must select a particular reference genome. While some genomes may consist of a single segment or sequence, the influenza genome has 8 segments that are each presented within the genome file as a separate reference sequence. Continuing with the tutorial, select (highlight by clicking) the first aligned genome with the id of 5 and the name NS.fa.

Figure 9. Pie Chart View of Alignment Hits

The tabular views to the right include the following:

alignments: This tab shows all alignments for the selected reference sequence in a triplet format such that the top line contains information about the reference or consensus sequence, the bottom line contains information about the query sequence and the middle line shows matches and mismatches with the use of symbols (see Figure 10). The # column corresponds to the ids supplied to the reference/consensus sequence or genome and to the read id supplied to the specific read from the selected query sequence data. Similarly, Sequence contains the names given to the reference sequence file and the read sequence file. The Repeats column displays the counts of any exactly repeated reads. In the Direction column, (+) indicates alignment in the forward direction and (-) indicates alignment in reverse. The Start column displays the numeric position on each sequence where an alignment starts. In the middle line of the Alignment column, a pipe symbol | indicates a match between sequences, whereas a dot . indicates a deletion, a – indicates an insertion and a space indicates a mismatch. The End column displays the numeric position of each sequence where an alignment ends.

Figure 10. alignments View

stack: This view is similar to the alignments view but only highlights the differences between the reference and query sequence. Dots . represent matches, dashes – represent deletions and single nucleotide polymorphisms are represented by the letter code (A, C, T or G) of the base found in the read (see Figure 11). The mutation bias graph shows the distribution of base calls for a given position along query reads.
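The stack view encoding described above is easy to mimic. The Python sketch below renders one read against its reference, assuming both are given as equal-length aligned strings with "-" marking gaps. The function name and input format are hypothetical, and the handling of insertions (showing the inserted read base) is a guess, since the tutorial only specifies matches, deletions and SNPs.

```python
def stack_line(ref_aln: str, read_aln: str) -> str:
    """Render a read the way the stack view is described:
    '.' for a match, '-' for a deletion, the read base for a SNP."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have the same length")
    symbols = []
    for ref_base, read_base in zip(ref_aln, read_aln):
        if read_base == "-":
            symbols.append("-")        # base present in reference, absent in read
        elif read_base == ref_base:
            symbols.append(".")        # match
        else:
            symbols.append(read_base)  # SNP (or, by assumption, an inserted base)
    return "".join(symbols)


print(stack_line("ACGTACGT", "ACCTA-GT"))
```

Stacking several such lines under one reference makes shared differences line up in columns, which is what makes SNP positions easy to spot by eye.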

Figure 11. stack Results View

hit table: Opening the hit table provides yet another way to visualize the alignment results with a focus on scores and positional information. The table has thirteen columns, most of which have already been discussed in the alignments or stack sections above. Columns not yet covered include Alignment number, which assigns a unique id to each alignment, and Score, which is calculated as discussed under Alignment Costs above.

downloads: Clicking the downloads tab will allow you to download all results views and tables as a .csv file, in addition to alignments in SAM format. Once a download is initiated by selecting the desired data, the process will proceed as determined by your specific browser.

5. WHAT NEXT?

HIVE is built in a very modular way to allow stacking of different analytic tools end to end in a wide array of configurations. Thus, following alignment via dna-hexagon, you have a few options for the next analysis to perform on your data. On the top right of the Results box you should see text that says what can you do next (see Figure 12).

If you are unhappy with your results, you can click on Modify and Resubmit to edit your parameters or inputs and try the alignment again. To proceed to other analyses, you can hover over Profiling Tools and select the option that is best for your research workflow. Currently available downstream applications include Sequence Profiling, Reference Recombination and Population Analysis.
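Since alignments can be downloaded in SAM format, results can also be inspected outside HIVE. The Python sketch below counts mapped reads per reference sequence, roughly the information the pie chart summarizes. It relies only on the standard SAM layout (tab-separated fields, header lines starting with @, the reference name in the third field, flag bit 0x4 marking unmapped reads). The function name and the sample records are invented for illustration, and the exact contents of a HIVE SAM download may differ.

```python
from collections import Counter


def count_hits_per_reference(sam_lines) -> Counter:
    """Count mapped alignments per reference name in SAM-formatted lines."""
    hits = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue                      # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue                      # not a complete alignment record
        flag, ref_name = int(fields[1]), fields[2]
        if flag & 0x4 or ref_name == "*":
            continue                      # read is unmapped
        hits[ref_name] += 1
    return hits


# Invented records; a real download would be read from a file instead.
sample_sam = [
    "@SQ\tSN:NS\tLN:890",
    "read1\t0\tNS\t10\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tNS\t42\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tHA\t7\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
hits = count_hits_per_reference(sample_sam)
print(dict(hits))
```

Because the function accepts any iterable of lines, an open file handle for the downloaded SAM file can be passed in directly.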

Figure 12. Access to Downstream Tools

This concludes the HIVE dna-hexagon tutorial. Please see the other tutorials or the tutorial videos available on HIVE main pages for further information.
