Commerce Data Academy: Intro To Github And Git

Transcription

Intro to Github and Git
Sasan Bahadaran
May 9, 2017

Commerce Data Academy
A data education initiative of the Commerce Data Service.
Launched by CDS to offer data science, data engineering, and web development training to employees of the US Department of Commerce.
Course schedule and materials (e.g. slides, code, papers) produced for the Commerce Data Academy on Github.
Questions? Feel free to write us at Data Academy (dataacademy@doc.gov).

Goals
Our goals for the class
Explain and make the case for version control.
Collaboration in coding/software engineering.
Illustrate what Git software is and what it can do.
Differentiate Git (the software) and Github (the website).
Describe how we integrate Git and Github into our project workflows.

Goals
Your goals for the class
Understand what version control is and why you should use it for your projects.
Start using Git on the command line.
Experiment with pushing repos to Github.
Practice working with a team using Waffle.io.

Prerequisites
1. Create your own Github account
2. Create your own Waffle.io account
3. Download/install Git
4. Download/install Anaconda's Python distribution
5. Verify your access to Terminal (Mac) or Powershell (Windows)
Any challenges? Questions?

Open Source Installations
We use open source and free software, so they should have a minimal impact on your IT department!
DOC has provided guidance that states that Github and all the tools that we are teaching are permissible under policy.
However, it is up to the CIO of each bureau to accept this guidance policy or not.
DOC has a formalized Github policy: e.md

Review

What is data science?

“Data science is the practice of transforming raw data into insights, products, and applications to empower data-driven decision making. It combines proven, time-tested methods from fields including statistics, natural sciences, computer science, operations research, and design in ways that are particularly well-suited to the data age. These methods, which range from data mining and visualization to predictive modeling, can scale from small to large datasets and can handle structured data as well as unstructured data like text and images.”
Jeff Chen, Chief Data Scientist
U.S. Department of Commerce

How is data science different from data analytics?

What is hypothesis-driven development?

[Slide image (Commerce Data Service): ThoughtWorks hypothesis-driven development card with the prompts "We Believe That ...", "Will Result In ...", and "We Will Know We Have Succeeded When ..."; the handwritten answers are not legible in the transcription.]

What tools do data scientists use?

What is the data science pipeline?

Data Ingestion
Data Munging and Wrangling
Computation and Analyses
Reporting and Visualization
Modeling and Application

What is a data product?

How are data products different from analytical insights?

Data products are self-adapting, broadly applicable economic engines that derive their value from data and generate more data by influencing human behavior or by making inferences or predictions upon new data.
Benjamin Bengfort

What is software engineering?

What does collaboration look like in a data group?


[Slide image: example project board with task cards (e.g. "find a mechanic for the ship", "fix ship's engine problem", "retrieve cargo from train") sorted into columns such as In Progress and Done, with labels such as startup, train job, hospital job, financial, enhancement, and wontfix.]

Version Control

Examples?

Google Drive
SharePoint
Dropbox
Bitbucket
Tortoise SVN

What is version control?
Other names?
What problems does this solve?
What are the benefits?
What are some common features?

Definition:
The management of changes to electronic documents and, in particular, computer programs.

“In computer software engineering, revision control is any kind of practice that tracks and provides control over changes to source code.”
Wikipedia knows everything

Tell us about a time when you could have used some version control.

Local Version Control Systems

Version Control: A Visualization

[Diagram: local version control. On the Local Computer, a File is checked out from a Version Database that holds Version 3, Version 2, and Version 1.]

[Diagram with revisions labeled 1 through 6 and A through C]
Branches and revisions through time - example scenario

[Diagram]
Branches and revisions through time - actual workflow

Distributed vs. Centralized

Centralized
What are the benefits?
What are the weaknesses?

Decentralized
What are the benefits?
What are the weaknesses?

Git

Installing Git
[Screenshot of the git-scm.com home page]
git --distributed-is-the-new-centralized
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
Git is easy to learn and has a tiny footprint with lightning fast performance. It outclasses SCM tools like Subversion, CVS, Perforce, and ClearCase with features like cheap local branching, convenient staging areas, and multiple workflows.
Learn Git in your browser for free with Try Git.
Site sections: About, Downloads, Documentation, Community.

Installing Git
Installing on Windows
There are also a few ways to install Git on Windows. The most official build is available for download on the Git website. Just go to http://git-scm.com/download/win and the download will start automatically. Note that this is a project called Git for Windows, which is separate from Git itself; for more information on it, go to https://git-for-windows.github.io/.
Another easy way to get Git installed is by installing GitHub for Windows. The installer includes a command line version of Git as well as the GUI. It also works well with Powershell, and sets up solid credential caching and sane CRLF settings. We'll learn more about those things a little later, but suffice it to say they're things you want. You can download this from the GitHub for Windows website, .github.io/

Installing Git
http://git-scm.com/download/mac

Git - History Lesson
Originally conceived/created by Linus Torvalds (after a fight with BitKeeper)
Distributed Version Control
Open Source
Initial release: 7 April 2005
All metadata is stored in the .git directory

Git - Advantages
Speed
Simple design
Strong support for non-linear development (thousands of parallel branches)
Fully distributed
Able to handle large projects like the Linux kernel efficiently (speed and data size)

Git - “Places”
Object Database: where git stores metadata about each commit
Index / Staging Area: file snapshots to be included in next commit
Working Directory: the “physical” files on a computer

Git - “Stages”
Committed: data is safely stored in your local object database
Staged: marked such that the current state of the modified file will be included in the next commit
Modified: changed but not staged or committed
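The three states above can be watched on a scratch repository. This is a minimal sketch for a Mac/Linux terminal (or Git Bash on Windows); the temporary folder, the file name notes.txt, and the "Demo User" identity are placeholders made up for illustration, not part of the class materials.

```shell
# Hypothetical scratch repository in a temporary folder.
demo="$(mktemp -d)"
cd "$demo"
git init -q
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "first draft" > notes.txt
git status --short     # notes.txt is new and untracked

git add notes.txt
git status --short     # staged: a snapshot is now in the index

git commit -q -m "Add first draft of notes"
git status --short     # prints nothing: the snapshot is committed

echo "second draft" >> notes.txt
git status --short     # modified: changed but not staged or committed
```

Running `git add notes.txt` again would move the new change to staged, and another commit would store it in the object database.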

Git - Areas/places
[Diagram: Working Directory, Staging Area, and .git directory (Repository)]

Git Commands

Git - Basic Commands
git init: create a new git repository to manage the current folder
git clone <repository address>: downloads an existing git repository for the first time
git add <file path>: marks individual/modified files to be added to the index/staging area for next commit
git commit -m <message>: takes metadata/changes from staging and adds to the object database
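The four commands above fit together roughly like this. It is a sketch only: the project folder, README.md, and the identity settings are hypothetical, and a local folder stands in for the repository address that git clone would normally get from a server such as Github.

```shell
work="$(mktemp -d)"
cd "$work"

mkdir project
cd project
git init -q                               # create a repository for this folder
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "# Demo project" > README.md
git add README.md                         # mark the file for the next commit
git commit -q -m "Add README"             # record the staged snapshot

cd "$work"
git clone -q project project-copy         # a local path works as the address
git -C project-copy log --oneline         # the copy contains the same commit
```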

Git - Basic Commands
git fetch <server> <branch>: updates your object database but does not change the working directory
git merge <source branch>: applies the commits from source branch to the current working directory (which is the manifestation of another branch)
git pull <server> <branch>: performs a fetch and then merges those changes into your working directory
git push <server> <branch>: sends your latest branch commits to the remote server
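One way to try fetch, merge, pull, and push without a Github account is to let a bare repository on your own disk play the server. The sketch below assumes that setup; server.git, the alice and bob folders, and greeting.txt are invented names. The branch name is read from git rather than typed in, because the default may be master or main depending on your installation.

```shell
work="$(mktemp -d)"
cd "$work"
git init -q --bare server.git             # stand-in for the remote server

git clone -q server.git alice 2>/dev/null # hides the "empty repository" warning
cd alice
git config user.name "Alice"
git config user.email "alice@example.com"
echo "hello" > greeting.txt
git add greeting.txt
git commit -q -m "Add greeting"
branch="$(git rev-parse --abbrev-ref HEAD)"
git push -q origin "$branch"              # send Alice's commit to the server

cd "$work"
git clone -q --branch "$branch" server.git bob   # check out the branch Alice pushed
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"
echo "hello from bob" >> greeting.txt
git commit -q -am "Extend greeting"
git push -q origin "$branch"

cd "$work/alice"
git fetch -q origin                       # object database updated, files untouched
wc -l < greeting.txt                      # still Alice's one-line version
git merge -q "origin/$branch"             # now Bob's commit reaches the working directory
wc -l < greeting.txt

echo "hello again from alice" >> greeting.txt
git commit -q -am "Add another greeting"
git push -q origin "$branch"

cd "$work/bob"
git pull -q --ff-only origin "$branch"    # fetch and merge in one step
```

The `--ff-only` flag is a cautious choice for the sketch: the pull succeeds only when Bob has no competing commits of his own.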

Git Challenge (20 1

Github


Github
A remote git repository
A website
provides secure access
provides repository metadata & reports
provides tools for development teams
Launched: April 10, 2008
10 million users in 2015

Non-local git repositories are called “remotes”

Git - “Places”
Object Database: where git stores metadata about each commit
Index / Staging Area: file snapshots to be included in next commit
Working Directory: the “physical” files on a computer

Github: A Distributed Version Control example
[Diagram: a Server Computer, Computer A, and Computer B each hold a full Version Database containing Version 3, Version 2, and Version 1.]

Git - “Origin”
The “origin” remote is automatically created when you clone
It is the default remote to use for pushing and pulling
There is nothing special about “origin”; it is just a default name
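To see that “origin” is only a name, the sketch below clones a throwaway repository, lists the remote that the clone created, and then renames it. The folder names source and copy and the new name upstream are arbitrary choices for the example.

```shell
work="$(mktemp -d)"
cd "$work"

git init -q source
cd source
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit"

cd "$work"
git clone -q source copy
cd copy
git remote -v                             # lists "origin", created by the clone
git remote rename origin upstream         # any other name works just as well
git remote -v                             # the same address, now called "upstream"
```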

User Account

[Screenshot: a Github user profile page (Rebecca Bilbro, rebeccabilbro) showing popular repositories, repositories contributed to, public activity, followers, and the yearly summary of pull requests, issues opened, and commits.]

Repo

[Screenshot: a Github repository page ("A tour of ROC curves") with the Code, Issues, Pull requests, Wiki, Pulse, and Graphs tabs, the commit and branch counts, the SSH clone address, and a file list (data, figures, .DS_Store, .gitignore, LICENSE, README.md, classi.py, ingest.py, roc.py) where each entry shows its latest commit message and age.]

Command Line

Shifting to the command line.

Windows
On Windows we're going to use PowerShell. People used to work with a program called cmd.exe, but it's not nearly as usable as PowerShell. If you have Windows 7 or later, do this:
Click Start.
In "Search programs and files" type: powershell
Hit Enter.
Mac OSX
For Mac OSX you'll need to do this:
Hold down COMMAND and hit the spacebar.
In the top right the blue "search bar" will pop up.
Type: terminal
Click on the Terminal application that looks kind of like a black box.
This will open Terminal.
You can now go to your Dock and CTRL-click to pull up the menu, then select Options->Keep In Dock.
Now you have your Terminal open and it's in your Dock so you can get to it.

Where am I?
Windows Powershell
Mac OSX Terminal

What’s my name?
Windows Powershell
Mac OSX Terminal

Make a directory
Windows Powershell
mkdir temp/stuff/things/frank/joe/alex/john
Mac OSX Terminal

Change between directories
Windows Powershell
cd temp
pwd
Mac OSX Terminal
cd temp
pwd

List files and directories
Windows Powershell
dir
Mac OSX Terminal
ls

Make an empty file
Windows Powershell
cd temp
New-Item iamcool.txt -type file
dir
Mac OSX Terminal
cd temp
touch iamcool.txt
ls

Zed Shaw’s book

Let’s use what we’ve learned!

Merge Conflict Workshop (20 minutes): http://bit.ly/xbus501-workshop-git
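For anyone who has not met a merge conflict before the workshop, here is one small way to produce and resolve one on purpose. It is an illustrative sketch and not the workshop exercise itself; the repository name conflict-demo, the file settings.txt, and the branch name feature are made up.

```shell
work="$(mktemp -d)"
cd "$work"
git init -q conflict-demo
cd conflict-demo
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "color: blue" > settings.txt
git add settings.txt
git commit -q -m "Add settings"
main_branch="$(git rev-parse --abbrev-ref HEAD)"   # master or main

git checkout -q -b feature
echo "color: green" > settings.txt
git commit -q -am "Switch color to green"

git checkout -q "$main_branch"
echo "color: red" > settings.txt
git commit -q -am "Switch color to red"

# Both branches changed the same line, so git cannot merge automatically.
git merge feature || echo "merge stopped: settings.txt needs a manual fix"
cat settings.txt                          # shows the conflict markers git inserted

# Resolve by writing the line you want, then stage and commit the result.
echo "color: green" > settings.txt
git add settings.txt
git commit -q -m "Merge feature and keep the green setting"
```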

[Diagram of git commands; surviving labels: repo, fetch, checkout HEAD, Revert, Compare, diff-cached]

Teamwork (makes the dream work!)

Organization

[Screenshot: the Commerce Data Service organization page on Github. Description: "A startup within DOC focused on building data products with and for the bureaus." Washington DC, data@doc.gov, 20 people. Repositories shown include the Data Service web site, ITA Principal Travel, and Commerce Data Academy Courses (course materials offered by the Commerce Data Academy).]

Waffle

[Screenshot: a Waffle.io board with Backlog, Ready, In Progress, and Done columns. Cards are Github issues (e.g. "Dataset Searching", "Async Upload with Celery", "Upload Error: line contains NULL byte") tagged with labels such as type: feature, type: bug, priority: medium, and Version 0.3. The Done column notes: "Issues closed in the last week are shown in this column. Drag issues here to close them."]

Pair programming: Make your own waffle!

Communication: Commit Messages

git commit -m "try to be as helpful as possible"
(To your team and to future you)
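As one illustration of a helpful message, the sketch below uses two -m flags: git treats the first as the subject line and the second as a separate body paragraph. The file prices.csv and the wording of the message are invented for the example.

```shell
demo="$(mktemp -d)"
cd "$demo"
git init -q
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "id,price" > prices.csv
git add prices.csv

# Less helpful alternative: git commit -m "stuff"
git commit -q \
  -m "Add header row to prices.csv" \
  -m "The ingest step expects id and price columns, so the header is added before any data."

git log -1 --format="%s%n%n%b"            # subject, blank line, then the body
```

A short subject that says what changed, plus a body that says why, tends to be what a teammate (or future you) is looking for in the log.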

Why?

Why do data scientists need version control?

Where does version control fit into the data science pipeline?
Data Ingestion
Data Munging and Wrangling
Computation and Analyses
Reporting and Visualization
Modeling and Application

Folder structure conventions on Github

README.md

.gitignore

/fixtures

requirements.txt
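Of the files above, .gitignore is the one whose effect is easiest to check from the command line. The sketch below writes a small .gitignore and asks git which files it still sees; the three patterns and the file names are example choices for a Python project on a Mac, not a recommended standard.

```shell
demo="$(mktemp -d)"
cd "$demo"
git init -q

# Hypothetical patterns: compiled Python files, Mac folder metadata, raw data.
printf '%s\n' '*.pyc' '.DS_Store' 'data/raw/' > .gitignore

touch analysis.py analysis.pyc .DS_Store
mkdir -p data/raw
touch data/raw/big.csv

git status --porcelain                    # lists only .gitignore and analysis.py
git check-ignore -v analysis.pyc          # shows which pattern matched
```

Files already committed are not affected by a pattern added later; they have to be removed from the index first.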

Where to go from here?

Additional Resources
w.tutorialspoint.com/git/
Git Desktop: https://desktop.github.com/
TortoiseGit: https://tortoisegit.org/
Git Cheat Sheet: it-cheat-sheet.pdf
Getting Started: ut-Version-Control
Basics: a-Git-Repository
Branching: hes-in-a-Nutshell
Github Setup: p-and-Configuration
Git Tools: Selection
Git Commands: and-Config
