Bioinformatics Data Skills - Codecool.ir


Bioinformatics Data Skills
REPRODUCIBLE AND ROBUST RESEARCH WITH OPEN SOURCE TOOLS

Learn the data skills necessary for turning large sequencing datasets into reproducible and robust biological findings. With this practical guide, you’ll learn how to use freely available open source tools to extract meaning from large complex biological datasets.

At no other point in human history has our ability to understand life’s complexities been so dependent on our skills to work with and analyze data. This intermediate-level book teaches the general computational and data skills you need to analyze biological data. If you have experience with a scripting language like Python, you’re ready to get started.

  Go from handling small problems with messy scripts to tackling large problems with clever methods and tools
  Process bioinformatics data with powerful Unix pipelines and data tools
  Learn how to use exploratory data analysis techniques in the R language
  Use efficient methods to work with genomic range data and range operations
  Work with common genomics data file formats like FASTA, FASTQ, SAM, and BAM
  Manage your bioinformatics project with the Git version control system
  Tackle tedious data processing tasks with Bash scripts and Makefiles

“Most existing bioinformatics texts focus on algorithms and theories. Bioinformatics Data Skills takes a refreshingly practical approach by providing a comprehensive introduction to the techniques, tools, and best practices that foster reproducible research in genome informatics. Its hands-on approach is perfect for graduate students, postdocs, professors, and hobbyists alike.”
Aaron Quinlan, Associate Professor at University of Utah and author of the BEDTools and GEMINI utilities for bioinformatics

Vince Buffalo is currently a first-year graduate student studying population genetics in Graham Coop's lab at University of California, Davis, in the Population Biology Graduate Group. Before starting his PhD in population genetics, Vince worked professionally as a bioinformatician in the Bioinformatics Core at the UC Davis Genome Center and in the Department of Plant Sciences.

US $49.99 CAN $57.99
ISBN: 978-1-449-36737-4
Vince Buffalo


Bioinformatics Data Skills
Vince Buffalo
Boston

Bioinformatics Data Skills
by Vince Buffalo

Copyright © 2015 Vince Buffalo. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Courtney Nash and Amy Jollymore
Production Editor: Nicole Shelby
Copyeditor: Jasmine Kwityn
Proofreader: Kim Cofer
Indexer: Ellen Troutman
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

June 2015: First Edition

Revision History for the First Edition
2015-06-30: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449367374 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Bioinformatics Data Skills, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-449-36737-4
[LSI]

To my (rather large) family for their continued support: Mom, Dad, Anne, Lisa, Lauren, Violet, and Dalilah; the Buffalos, the Kihns, and the Lambs.

And my earliest mentors for inspiring me to be who I am today: Randy Siverson and Duncan Temple Lang.

Table of Contents

Preface

Part I. Ideology: Data Skills for Robust and Reproducible Bioinformatics

1. How to Learn Bioinformatics
  Why Bioinformatics? Biology’s Growing Data
  Learning Data Skills to Learn Bioinformatics
  New Challenges for Reproducible and Robust Research
  Reproducible Research
  Robust Research and the Golden Rule of Bioinformatics
  Adopting Robust and Reproducible Practices Will Make Your Life Easier, Too
  Recommendations for Robust Research
  Pay Attention to Experimental Design
  Write Code for Humans, Write Data for Computers
  Let Your Computer Do the Work For You
  Make Assertions and Be Loud, in Code and in Your Methods
  Test Code, or Better Yet, Let Code Test Code
  Use Existing Libraries Whenever Possible
  Treat Data as Read-Only
  Spend Time Developing Frequently Used Scripts into Tools
  Let Data Prove That It’s High Quality
  Recommendations for Reproducible Research
  Release Your Code and Data
  Document Everything
  Make Figures and Statistics the Results of Scripts
  Use Code as Documentation
  Continually Improving Your Bioinformatics Data Skills

Part II. Prerequisites: Essential Skills for Getting Started with a Bioinformatics Project

2. Setting Up and Managing a Bioinformatics Project
  Project Directories and Directory Structures
  Project Documentation
  Use Directories to Divide Up Your Project into Subprojects
  Organizing Data to Automate File Processing Tasks
  Markdown for Project Notebooks
  Markdown Formatting Basics
  Using Pandoc to Render Markdown to HTML

3. Remedial Unix Shell
  Why Do We Use Unix in Bioinformatics? Modularity and the Unix Philosophy
  Working with Streams and Redirection
  Redirecting Standard Out to a File
  Redirecting Standard Error
  Using Standard Input Redirection
  The Almighty Unix Pipe: Speed and Beauty in One
  Pipes in Action: Creating Simple Programs with Grep and Pipes
  Combining Pipes and Redirection
  Even More Redirection: A tee in Your Pipe
  Managing and Interacting with Processes
  Background Processes
  Killing Processes
  Exit Status: How to Programmatically Tell Whether Your Command Worked
  Command Substitution

4. Working with Remote Machines
  Connecting to Remote Machines with SSH
  Quick Authentication with SSH Keys
  Maintaining Long-Running Jobs with nohup and tmux
  nohup
  Working with Remote Machines Through Tmux
  Installing and Configuring Tmux
  Creating, Detaching, and Attaching Tmux Sessions
  Working with Tmux Windows

5. Git for Scientists
  Why Git Is Necessary in Bioinformatics Projects
  Git Allows You to Keep Snapshots of Your Project
  Git Helps You Keep Track of Important Changes to Code
  Git Helps Keep Software Organized and Available After People Leave
  Installing Git
  Basic Git: Creating Repositories, Tracking Files, and Staging and Committing Changes
  Git Setup: Telling Git Who You Are
  git init and git clone: Creating Repositories
  Tracking Files in Git: git add and git status Part I
  Staging Files in Git: git add and git status Part II
  git commit: Taking a Snapshot of Your Project
  Seeing File Differences: git diff
  Seeing Your Commit History: git log
  Moving and Removing Files: git mv and git rm
  Telling Git What to Ignore: .gitignore
  Undoing a Stage: git reset
  Collaborating with Git: Git Remotes, git push, and git pull
  Creating a Shared Central Repository with GitHub
  Authenticating with Git Remotes
  Connecting with Git Remotes: git remote
  Pushing Commits to a Remote Repository with git push
  Pulling Commits from a Remote Repository with git pull
  Working with Your Collaborators: Pushing and Pulling
  Merge Conflicts
  More GitHub Workflows: Forking and Pull Requests
  Using Git to Make Life Easier: Working with Past Commits
  Getting Files from the Past: git checkout
  Stashing Your Changes: git stash
  More git diff: Comparing Commits and Files
  Undoing and Editing Commits: git commit --amend
  Working with Branches
  Creating and Working with Branches: git branch and git checkout
  Merging Branches: git merge
  Branches and Remotes
  Continuing Your Git

6. Bioinformatics Data
  Retrieving Bioinformatics Data
  Downloading Data with wget and curl
  Rsync and Secure Copy (scp)

  Data Integrity
  SHA and MD5 Checksums
  Looking at Differences Between Data
  Compressing Data and Working with Compressed Data
  gzip
  Working with Gzipped Compressed Files
  Case Study: Reproducibly Downloading Data

Part III. Practice: Bioinformatics Data Skills

7. Unix Data Tools
  Unix Data Tools and the Unix One-Liner Approach: Lessons from Programming Pearls
  When to Use the Unix Pipeline Approach and How to Use It Safely
  Inspecting and Manipulating Text Data with Unix Tools
  Inspecting Data with Head and Tail
  less
  Plain-Text Data Summary Information with wc, ls, and awk
  Working with Column Data with cut and Columns
  Formatting Tabular Data with column
  The All-Powerful Grep
  Decoding Plain-Text Data: hexdump
  Sorting Plain-Text Data with Sort
  Finding Unique Values in Uniq
  Join
  Text Processing with Awk
  Bioawk: An Awk for Biological Formats
  Stream Editing with Sed
  Advanced Shell Tricks
  Subshells
  Named Pipes and Process Substitution
  The Unix Philosophy

8. A Rapid Introduction to the R Language
  Getting Started with R and RStudio
  R Language Basics
  Simple Calculations in R, Calling Functions, and Getting Help in R
  Variables and Assignment
  Vectors, Vectorization, and Indexing
  Working with and Visualizing Data in R
  Loading Data into R

  Exploring and Transforming Dataframes
  Exploring Data Through Slicing and Dicing: Subsetting Dataframes
  Exploring Data Visually with ggplot2 I: Scatterplots and Densities
  Exploring Data Visually with ggplot2 II: Smoothing
  Binning Data with cut() and Bar Plots with ggplot2
  Merging and Combining Data: Matching Vectors and Merging Dataframes
  Using ggplot2 Facets
  More R Data Structures: Lists
  Writing and Applying Functions to Lists with lapply() and sapply()
  Working with the Split-Apply-Combine Pattern
  Exploring Dataframes with dplyr
  Working with Strings
  Developing Workflows with R Scripts
  Control Flow: if, for, and while
  Working with R Scripts
  Workflows for Loading and Combining Multiple Files
  Exporting Data
  Further R Directions and

9. Working with Range Data
  A Crash Course in Genomic Ranges and Coordinate Systems
  An Interactive Introduction to Range Data with GenomicRanges
  Installing and Working with Bioconductor Packages
  Storing Generic Ranges with IRanges
  Basic Range Operations: Arithmetic, Transformations, and Set Operations
  Finding Overlapping Ranges
  Finding Nearest Ranges and Calculating Distance
  Run Length Encoding and Views
  Storing Genomic Ranges with GenomicRanges
  Grouping Data with GRangesList
  Working with Annotation Data: GenomicFeatures and rtracklayer
  Retrieving Promoter Regions: Flank and Promoters
  Retrieving Promoter Sequence: Connection GenomicRanges with Sequence Data
  Getting Intergenic and Intronic Regions: Gaps, Reduce, and Setdiffs in Practice
  Finding and Working with Overlapping Ranges
  Calculating Coverage of GRanges Objects
  Working with Ranges Data on the Command Line with BEDTools
  Computing Overlaps with BEDTools Intersect
  BEDTools Slop and Flank
  Coverage with BEDTools

  Other BEDTools Subcommands and pybedtools

10. Working with Sequence Data
  The FASTA Format
  The FASTQ Format
  Nucleotide Codes
  Base Qualities
  Example: Inspecting and Trimming Low-Quality Bases
  A FASTA/FASTQ Parsing Example: Counting Nucleotides
  Indexed FASTA Files

11. Working with Alignment Data
  Getting to Know Alignment Formats: SAM and BAM
  The SAM Header
  The SAM Alignment Section
  Bitwise Flags
  CIGAR Strings
  Mapping Qualities
  Command-Line Tools for Working with Alignments in the SAM Format
  Using samtools view to Convert between SAM and BAM
  Samtools Sort and Index
  Extracting and Filtering Alignments with samtools view
  Visualizing Alignments with samtools tview and the Integrated Genomics Viewer
  Pileups with samtools pileup, Variant Calling, and Base Alignment Quality
  Creating Your Own SAM/BAM Processing Tools with Pysam
  Opening BAM Files, Fetching Alignments from a Region, and Iterating Across Reads
  Extracting SAM/BAM Header Information from an AlignmentFile Object
  Working with AlignedSegment Objects
  Writing a Program to Record Alignment Statistics
  Additional Pysam Features and Other SAM/BAM

12. Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks
  Basic Bash Scripting
  Writing and Running Robust Bash Scripts
  Variables and Command Arguments
  Conditionals in a Bash Script: if Statements
  Processing Files with Bash Using for Loops and Globbing
  Automating File-Processing with find and xargs
  Using find and xargs
  Finding Files with find

  find’s Expressions
  find’s -exec: Running Commands on find’s Results
  xargs: A Unix Powertool
  Using xargs with Replacement Strings to Apply Commands to Files
  xargs and Parallelization
  Make and Makefiles: Another Option for Pipelines

13. Out-of-Memory Approaches: Tabix and SQLite
  Fast Access to Indexed Tab-Delimited Files with BGZF and Tabix
  Compressing Files for Tabix with Bgzip
  Indexing Files with Tabix
  Using Tabix
  Introducing Relational Databases Through SQLite
  When to Use Relational Databases in Bioinformatics
  Installing SQLite
  Exploring SQLite Databases with the Command-Line Interface
  Querying Out Data: The Almighty SELECT Command
  SQLite Functions
  SQLite Aggregate Functions
  Subqueries
  Organizing Relational Databases and Joins
  Writing to Databases
  Dropping Tables and Deleting Databases
  Interacting with SQLite from Python
  Dumping

14. Conclusion
  Where to Go From Here?

Glossary
Bibliography
Index

Preface

This book is the answer to a question I asked myself two years ago: “What book would I want to read first when getting started in bioinformatics?” When I began working in this field, I had programming experience in Python and R but little else. I had hunted around for a terrific introductory text on bioinformatics, and while I found some good books, most were not targeted to the daily work I did as a bioinformatician. A few of the texts I found approached bioinformatics from a theoretical and algorithmic perspective, covering topics like Smith-Waterman alignment, phylogeny reconstruction, motif finding, and the like. Although they were fascinating to read (and I do recommend that you explore this material), I had no need to implement bioinformatics algorithms from scratch in my daily bioinformatics work; numerous terrific, highly optimized, well-tested implementations of these algorithms already existed. Other bioinformatics texts took a more practical approach, guiding readers unfamiliar with computing through each step of tasks like running an aligner or downloading sequences from a database. While these were more applicable to my work, much of those books’ material was outdated.

As you might guess, I couldn’t find that best “first” bioinformatics book. Bioinformatics Data Skills is my version of the book I was seeking. This book is targeted toward readers who are unsure how to bridge the giant gap between knowing a scripting language and practicing bioinformatics to answer scientific questions in a robust and reproducible way. To bridge this gap, one must learn data skills: an approach that uses a core set of tools to manipulate and explore any data you’ll encounter during a bioinformatics project.

Data skills are the best way to learn bioinformatics because these skills utilize time-tested, open source tools that continue to be the best way to manipulate and explore changing data. This approach has stood the test of time: the advent of high-throughput sequencing rapidly changed the field of bioinformatics, yet skilled bioinformaticians adapted to this new data using these same tools and skills. Next-generation data was, after all, just data (different data, and more of it), and master bioinformaticians had the essential skills to solve problems by applying their tools to this new data. Bioinformatics Data Skills is written to provide you with training in these core tools and help you develop these same skills.

The Approach of This Book

Many biologists starting out in bioinformatics tend to equate “learning bioinformatics” with “learning how to run bioinformatics software.” This is an unfortunate and misinformed idea of what bioinformaticians actually do. This is analogous to thinking “learning molecular biology” is just “learning pipetting.” Other than a few simple examples used to generate data in Chapter 11, this book doesn’t cover running bioinformatics software like aligners, assemblers, or variant callers. Running bioinformatics software isn’t all that difficult, doesn’t take much skill, and doesn’t embody any of the significant challenges of bioinformatics. I don’t teach how to run these types of bioinformatics applications in Bioinformatics Data Skills for the following reasons:

  It’s easy enough to figure out on your own
  The material would go rapidly out of date as new versions of software or entirely new programs are used in bioinformatics
  The original manuals for this software will always be the best, most up-to-date resource on how to run a program

Instead, the approach of this book is to focus on the skills bioinformaticians use to explore and extract meaning from complex, large bioinformatics datasets. Exploring and extracting information from these datasets is the fun part of bioinformatics research. The goal of Bioinformatics Data Skills is to teach you the computational tools and data skills you need to explore these large datasets as you please. These data skills give you freedom; you’ll be able to look at any bioinformatics data (in any format, and files of any size) and begin exploring data to extract biological meaning.

Throughout Bioinformatics Data Skills, I emphasize working in a robust and reproducible manner. I believe these two qualities, reproducibility and robustness, are too often overlooked in modern computational work. By robust, I mean that your work is resilient against silent errors, confounders, software bugs, and messy or noisy data. In contrast, a fragile approach is one that does not decrease the odds of some type of error adversely affecting your results. By reproducible, I mean that your work can be repeated by other researchers and they can arrive at the same results. For this to be the case, your work must be well documented, and your methods, code, and data all need to be available so that other researchers have the materials to reproduce everything. Reproducibility also relies on your work being robust: if a workflow run on a different machine yields a different outcome, it is neither robust nor fully reproducible. I introduce these concepts in more depth in Chapter 1, and these are themes that reappear throughout the book.
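
One narrow but concrete way to test the claim that a workflow "yields the same outcome" on two machines is to compare checksums of the result files. The following is a minimal sketch in Python, not something taken from the book: it assumes each run writes its results to a file, and the function names (file_sha256, same_results, write_temp) and the example table contents are hypothetical, chosen only for illustration.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large result files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_results(path_a, path_b):
    """Return True if two result files are byte-for-byte identical."""
    return file_sha256(path_a) == file_sha256(path_b)


def write_temp(text):
    """Hypothetical helper: write text to a temporary file, return its path."""
    fd, path = tempfile.mkstemp(suffix=".tsv")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    return path


if __name__ == "__main__":
    # Stand-ins for the output of the same workflow run in three places.
    run1 = write_temp("gene\tcount\ngeneA\t12\n")
    run2 = write_temp("gene\tcount\ngeneA\t12\n")
    run3 = write_temp("gene\tcount\ngeneA\t13\n")
    try:
        # Be loud: fail immediately if supposedly identical runs differ.
        assert same_results(run1, run2), "runs 1 and 2 differ"
        assert not same_results(run1, run3), "run 3 should differ"
    finally:
        for path in (run1, run2, run3):
            os.remove(path)
```

A byte-level comparison like this is deliberately strict: it also flags harmless differences such as timestamps or row order, so in practice it suits outputs that are expected to be deterministic.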

Why This Book Focuses on Sequencing Data

Bioinformatics is a broad discipline, and spans subfields like proteomics, metabolomics, structural bioinformatics, comparative genomics, machine learning, and image processing. Bioinformatics Data Skills focuses primarily on handling sequencing data for a few reasons.

First, sequencing data is abundant. Currently, no other “omics” data is as abundant as high-throughput sequencing data. Sequencing data has broad applications across biology: variant detection and genotyping, transcriptome sequencing for gene expression studies, protein-DNA interaction assays like ChIP-seq, and bisulfite sequencing for methylation studies, to name just a few examples. The ways in which sequencing data can be used to answer biological questions will only continue to increase.

Second, sequencing data is terrific for honing your data skills. Even if your goal is to analyze other types of data in the future, sequencing data serves as great example data to learn with. Developing the text-processing skills necessary to work with sequencing data will be applicable to working with many other data types.

Third, other subfields of bioinformatics are much more domain specific. The wide availability and declining costs of sequencing have allowed scientists from all disciplines to use genomics data to answer questions in their systems. In contrast, bioinformatics subdisciplines like proteomics or high-throughput image processing are much more specialized and less widespread. Still, if you’re interested in these fields, Bioinformatics Data Skills will teach you useful computational and data skills that will be helpful in your research.

Audience

In my experience teaching bioinformatics to friends, colleagues, and students of an intensive week-long course taught at UC Davis, most people wishing to learn bioinformatics are either biologists or computer scientists and programmers. Biologists wish to develop the computational skills necessary to analyze their own data, while the programmers and computer scientists wish to apply their computational skills to biological problems. Although these two groups differ considerably in biological knowledge and computational experience, Bioinformatics Data Skills covers material that should be helpful to both.

If you’re a biologist, Bioinformatics Data Skills will teach you the core data skills you need to work with bioinformatics data. It’s important to note that Bioinformatics Data Skills is not a how-to bioinformatics book; such a book on bioinformatics would quickly go out of date or be too narrow in focus to help the majority of biologists. You will need to supplement this book with knowledge of your specific research and system, as well as the modern statistical and bioinformatics methods that your subfield uses. For example, if your project involves aligning sequencing reads to a reference genome, this book won’t tell you the newest and best alignment software for your particular system. But regardless of which aligner you use, you will need to have a thorough understanding of alignment formats and how to manipulate alignment data, a topic covered in Chapter 11. Throughout this book, these general computational and data skills are meant to be a solid, widely applicable foundation on which the majority of biologists can build.

If you’re a computer scientist or programmer, you are likely already familiar with some of the computational tools I teach in this book. While the material presented in Bioinformatics Data Skills may overlap knowledge you already have, you will still learn about the specific formats, tools, and approaches bioinformaticians use in their work. Also, working through the examples in this book will give you good practice in applying your computational skills to genomics data.

The Difficulty Level of Bioinformatics Data Skills

Bioinformatics Data Skills is designed to be a thorough (and in parts, dense) book. When I started writing this book, I decided the greatest misdeed I could do would be to treat bioinformatics as a subject that’s easier than it truly is. Working as a professional bioinformatician, I routinely saw how very subtle issues could crop up and adversely change the outcome of the analysis had they not been caught. I don’t want your bioinformatics work to be incorrect because I’ve made a topic artificially simple. The depth at which I cover topics in Bioinformatics Data Skills is meant to prepare you to catch similar issues in your own work so your results are robust.

The result is that sections of this book are quite advanced and will be difficult for some readers. Don’t feel discouraged! Like most of science, this material is hard, and may take a few reads before it fully sinks in. Throughout the book, I try to indicate when certain sections are especially advanced so that you can skip over these and return to them later.

Lastly, I often use technical jargon throughout the book. I don’t like using jargon, but it’s necessary to communicate technical concepts in computing. Primarily it will help you search for additional resources and help. It’s much easier to Google successfully for “left outer join” than “data merge where null records are included in one table.”

Assumptions This Book Makes

Bioinformatics Data Skills is meant to be an intermediate book on bioinformatics. To make sure everyone starts out on the same foot, the book begins with a few simple chapters. In Chapter 2, I cover the basics of setting up a bioinformatics project, and in Chapter 3 I teach some remedial Unix topics meant to ensure that you have a solid grasp of Unix (because Unix is a large component in later chapters). Still, as an intermediate book, I make a few assumptions about you:

You know a scripting language
  This is the biggest assumption of the book. Except for a few Python programs and the R material (R is introduced in Chapter 8), this book doesn’t directly rely on using lots of scripting. However, in learning a scripting language, you’ve already encountered many important computing concepts such as working with a text editor, running and executing programs on the command line, and basic programming. If you do not know a scripting language, I would recommend learning Python while reading this book. Books like Bioinformatics Programming Using Python by Mitchell L. Model (O’Reilly, 2009), Learning Python, 5th Edition, by Mark Lutz (O’Reilly, 2013), and Python in a Nutshell, 2nd Edition, by Alex Martelli (O’Reilly, 2006) are
