Python For Data Science For Dummies - GBV

Transcription

2nd Edition
by John Paul Mueller and Luca Massaron
For Dummies, A Wiley Brand

Table of Contents

INTRODUCTION
  About This Book
  Foolish Assumptions
  Icons Used in This Book
  Beyond the Book
  Where to Go from Here

PART 1: GETTING STARTED WITH DATA SCIENCE AND PYTHON

Chapter 1: Discovering the Match between Data Science and Python
  Defining the Sexiest Job of the 21st Century
    Considering the emergence of data science
    Outlining the core competencies of a data scientist
    Linking data science, big data, and AI
    Understanding the role of programming
  Creating the Data Science Pipeline
    Preparing the data
    Performing exploratory data analysis
    Learning from data
    Visualizing
    Obtaining insights and data products
  Understanding Python's Role in Data Science
    Considering the shifting profile of data scientists
    Working with a multipurpose, simple, and efficient language
  Learning to Use Python Fast
    Loading data
    Training a model
    Viewing a [...]

Chapter 2: [...] Python's Capabilities and Wonders
  Why Python?
    Grasping Python's Core Philosophy
    Contributing to data science
    Discovering present and future development goals

  Working with Python
    Getting a taste of the language
    Understanding the need for indentation
    Working at the command line or in the IDE
  Performing Rapid Prototyping and Experimentation
  Considering Speed of Execution
  Visualizing Power
  Using the Python Ecosystem for Data Science
    Accessing scientific tools using SciPy
    Performing fundamental scientific computing using NumPy
    Performing data analysis using pandas
    Implementing machine learning using Scikit-learn
    Going for deep learning with Keras and TensorFlow
    Plotting the data using matplotlib
    Creating graphs with NetworkX
    Parsing HTML documents using Beautiful Soup

Chapter 3: Setting Up Python for Data Science
  Considering the Off-the-Shelf Cross-Platform Scientific Distributions
    Getting Continuum Analytics Anaconda
    Getting Enthought Canopy Express
    Getting WinPython
  Installing Anaconda on Windows
  Installing Anaconda on Linux
  Installing Anaconda on Mac OS X
  Downloading the Datasets and Example Code
    Using Jupyter Notebook
    Defining the code repository
    Understanding the datasets used in this book

Chapter 4: Working with Google Colab
  Defining Google Colab
    Understanding what Google Colab does
    Considering the online coding difference
    Using local runtime support
  Getting a Google Account
    Creating the account
    Signing in
  Working with Notebooks
    Creating a new notebook
    Opening existing notebooks
    Saving notebooks
    Downloading notebooks

  Performing Common Tasks
    Creating code cells
    Creating text cells
    Creating special cells
    Editing cells
    Moving cells
  Using Hardware Acceleration
  Executing the Code
  Viewing Your Notebook
    Displaying the table of contents
    Getting notebook information
    Checking code execution
  Sharing Your Notebook
  Getting Help

PART 2: GETTING YOUR HANDS DIRTY WITH DATA

Chapter 5: Understanding the Tools
  Using the Jupyter Console
    Interacting with screen text
    Changing the window appearance
    Getting Python help
    Getting IPython help
    Using magic functions
    Discovering objects
  Using Jupyter Notebook
    Working with styles
    Restarting the kernel
    Restoring a checkpoint
  Performing Multimedia and Graphic Integration
    Embedding plots and other images
    Loading examples from online sites
    Obtaining online graphics and multimedia

Chapter 6: Working with Real Data
  Uploading, Streaming, and Sampling Data
    Uploading small amounts of data into memory
    Streaming large amounts of data into memory
    Generating variations on image data
    Sampling data in different ways
  Accessing Data in Structured Flat-File Form
    Reading from a text file
    Reading CSV delimited format
    Reading Excel and other Microsoft Office files

  Sending Data in Unstructured File Form
  Managing Data from Relational Databases
  Interacting with Data from NoSQL Databases
  Accessing Data from the Web

Chapter 7: Conditioning Your Data
  Juggling between NumPy and pandas
    Knowing when to use NumPy
    Knowing when to use pandas
  Validating Your Data
    Figuring out what's in your data
    Removing duplicates
    Creating a data map and data plan
  Manipulating Categorical Variables
    Creating categorical variables
    Renaming levels
    Combining levels
  Dealing with Dates in Your Data
    Formatting date and time values
    Using the right time transformation
  Dealing with Missing Data
    Finding the missing data
    Encoding missingness
    Imputing missing data
  Slicing and Dicing: Filtering and Selecting Data
    Slicing rows
    Slicing columns
    Dicing
  Concatenating and Transforming
    Adding new cases and variables
    Removing data
    Sorting and shuffling
  Aggregating Data at Any Level

Chapter 8: Shaping Data
  Working with HTML Pages
    Parsing XML and HTML
    Using XPath for data extraction
  Working with Raw Text
    Dealing with Unicode
    Stemming and removing stop words
    Introducing regular expressions
  Using the Bag of Words Model and Beyond
    Understanding the bag of words model

    Working with n-grams
    Implementing TF-IDF transformations
  Working with Graph Data
    Understanding the adjacency matrix
    Using NetworkX basics

Chapter 9: Putting What You Know in Action
  Contextualizing Problems and Data
    Evaluating a data science problem
    Researching solutions
    Formulating a hypothesis
    Preparing your data
  Considering the Art of Feature Creation
    Defining feature creation
    Combining variables
    Understanding binning and discretization
    Using indicator variables
    Transforming distributions
  Performing Operations on Arrays
    Using vectorization
    Performing simple arithmetic on vectors and matrices
    Performing matrix vector multiplication
    Performing matrix multiplication

PART 3: VISUALIZING INFORMATION

Chapter 10: Getting a Crash Course in MatPlotLib
  Starting with a Graph
    Defining the plot
    Drawing multiple lines and plots
    Saving your work to disk
  Setting the Axis, Ticks, Grids
    Getting the axes
    Formatting the axes
    Adding grids
  Defining the Line Appearance
    Working with line styles
    Using colors
    Adding markers
  Using Labels, Annotations, and Legends
    Adding labels
    Annotating the chart
    Creating a legend

Chapter 11: Visualizing the Data
  Choosing the Right Graph
    Showing parts of a whole with pie charts
    Creating comparisons with bar charts
    Showing distributions using histograms
    Depicting groups using boxplots
    Seeing data patterns using scatterplots
  Creating Advanced Scatterplots
    Depicting groups
    Showing correlations
  Plotting Time Series
    Representing time on axes
    Plotting trends over time
  Plotting Geographical Data
    Using an environment in Notebook
    Getting the Basemap toolkit
    Dealing with deprecated library issues
    Using Basemap to plot geographic data
  Visualizing Graphs
    Developing undirected graphs
    Developing directed graphs

PART 4: WRANGLING DATA

Chapter 12: Stretching Python's Capabilities
  Playing with Scikit-learn
    Understanding classes in Scikit-learn
    Defining applications for data science
  Performing the Hashing Trick
    Using hash functions
    Demonstrating the hashing trick
    Working with deterministic selection
  Considering Timing and Performance
    Benchmarking with timeit
    Working with the memory profiler
  Running in Parallel on Multiple Cores
    Performing multicore parallelism
    Demonstrating [...]

Chapter 13: [...] Analysis
  The EDA Approach
  Defining Descriptive Statistics for Numeric Data
    Measuring central tendency
    Measuring variance and range

    Working with percentiles
    Defining measures of normality
  Counting for Categorical Data
    Understanding frequencies
    Creating contingency tables
  Creating Applied Visualization for EDA
    Inspecting boxplots
    Performing t-tests after boxplots
    Observing parallel coordinates
    Graphing distributions
    Plotting [...]
  [...] Correlation
    Using covariance and correlation
    Using nonparametric correlation
    Considering the chi-square test for tables
  Modifying Data Distributions
    Using different statistical distributions
    Creating a Z-score standardization
    Transforming other notable distributions

Chapter 14: Reducing Dimensionality
  Understanding SVD
    Looking for dimensionality reduction
    Using SVD to measure the invisible
  Performing Factor Analysis and PCA
    Considering the psychometric model
    Looking for hidden factors
    Using components, not factors
    Achieving dimensionality reduction
    Squeezing information with t-SNE
  Understanding Some Applications
    Recognizing faces with PCA
    Extracting topics with NMF
    Recommending movies

Chapter 15: Clustering
  Clustering [...]
    [...]ding centroid-based algorithms
    Creating an example with image data
    Looking for optimal solutions
    Clustering big data
  Performing Hierarchical Clustering
    Using a hierarchical cluster solution
    Using a two-phase clustering solution
  Discovering New Groups with DBScan

Chapter 16: Detecting Outliers in Data
  Considering Outlier Detection
    Finding more things that can go wrong
    Understanding anomalies and novel data
  Examining a Simple Univariate Method
    Leveraging on the Gaussian distribution
    Making assumptions and checking out
  Developing a Multivariate Approach
    Using principal component analysis
    Using cluster analysis for spotting outliers
    Automating detection with Isolation Forests

PART 5: LEARNING FROM DATA

Chapter 17: Exploring Four Simple and Effective Algorithms
  Guessing the Number: Linear Regression
    Defining the family of linear models
    Using more variables
    Understanding limitations and problems
  Moving to Logistic Regression
    Applying logistic regression
    Considering when classes are more
  Making Things as Simple as Naive Bayes
    Finding out that Naive Bayes isn't so naive
    Predicting text classifications
  Learning Lazily with Nearest Neighbors
    Predicting after observing neighbors
    Choosing your k parameter wisely

Chapter 18: Performing Cross-Validation, Selection, and Optimization
  Pondering the Problem of Fitting a Model
    Understanding bias and variance
    Defining a strategy for picking models
    Dividing between training and test sets
  Cross-Validating
    Using cross-validation on k folds
    Sampling stratifications for complex data
  Selecting Variables Like a Pro
    Selecting by univariate measures
    Using a greedy search
  Pumping Up Your Hyperparameters
    Implementing a grid search
    Trying a randomized search

Chapter 19: Increasing Complexity with Linear and Nonlinear Tricks
  Using Nonlinear Transformations
    Doing variable transformations
    Creating interactions between variables
  Regularizing Linear Models
    Relying on Ridge regression (L2)
    Using the Lasso (L1)
    Leveraging regularization
    Combining L1 & L2: Elasticnet
  Fighting with Big Data Chunk by Chunk
    Determining when there is too much data
    Implementing Stochastic Gradient Descent
  Understanding Support Vector Machines
    Relying on a computational method
    Fixing many new parameters
    Classifying with SVC
    Going nonlinear is easy
    Performing regression with SVR
    Creating a stochastic solution with SVM
  Playing with Neural Networks
    Understanding neural networks
    Classifying and regressing with neurons

Chapter 20: Understanding the Power of the Many
  Starting with a Plain Decision Tree
    Understanding a decision tree
    Creating trees for different purposes
  Making Machine Learning Accessible
    Working with a Random Forest classifier
    Working with a Random Forest regressor
    Optimizing a Random Forest
  Boosting Predictions
    Knowing that many weak predictors win
    Setting a gradient boosting classifier
    Running a gradient boosting regressor
    Using GBM [...]

PART 6: THE PART OF TENS

Chapter 21: Ten Essential Data Resources
  Discovering the News with Subreddit
  Getting a Good Start with KDnuggets
  Locating Free Learning Resources with Quora

  Gaining Insights with Oracle's Data Science Blog
  Accessing the Huge List of Resources on Data Science [...]
  Learning New Tricks from the Aspirational Data Scientist
  Obtaining the Most Authoritative Sources at Udacity
  Receiving Help with Advanced Topics at Conductrics
  Obtaining the Facts of Open Source Data Science from Masters
  Zeroing In on Developer Resources with Jonathan Bower

Chapter 22: Ten Data Challenges You Should Take
  Meeting the Data Science London Scikit-learn Challenge
  Predicting Survival on the Titanic
  Finding a Kaggle Competition that Suits Your Needs
  Honing Your Overfit Strategies
  Trudging Through the MovieLens Dataset
  Getting Rid of Spam E-mails
  Working with Handwritten Information
  Working with Pictures
  Analyzing Amazon.com Reviews
  Interacting with a Huge Graph

INDEX
