Big Data Analytics Tutorial

Big Data Analytics

About the Tutorial

The volume of data that one has to deal with has exploded to unimaginable levels in the past decade, and at the same time, the price of data storage has systematically reduced. Private companies and research institutions capture terabytes of data about their users’ interactions, business, social media, and also from sensors in devices such as mobile phones and automobiles. The challenge of this era is to make sense of this sea of data. This is where big data analytics comes into the picture.

Big Data Analytics largely involves collecting data from different sources, munging it so that it can be consumed by analysts, and finally delivering data products useful to the organization’s business.

The process of converting large amounts of unstructured raw data, retrieved from different sources, into a data product useful for organizations forms the core of Big Data Analytics.

In this tutorial, we will discuss the most fundamental concepts and methods of Big Data Analytics.

Audience

This tutorial has been prepared for software professionals aspiring to learn the basics of Big Data Analytics. Professionals who are into analytics in general may as well use this tutorial to good effect.

Prerequisites

Before you start proceeding with this tutorial, we assume that you have prior exposure to handling huge volumes of unprocessed data at an organizational level.

Through this tutorial, we will develop a mini project to provide exposure to a real-world problem and how to solve it using Big Data Analytics. You can download the necessary files of this project from this link: http://www.tools.tutorialspoint.com/bda/

Copyright & Disclaimer

Copyright 2017 by Tutorials Point (I) Pvt. Ltd.

All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing, or republishing any contents or a part of the contents of this e-book in any manner without the written consent of the publisher.

We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness, or completeness of our website or its contents, including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at contact@tutorialspoint.com

Table of Contents

About the Tutorial
Audience
Prerequisites
Copyright & Disclaimer
Table of Contents

BIG DATA ANALYTICS – BASICS
1. Big Data Analytics – Overview
2. Big Data Analytics – Data Life Cycle
   Traditional Data Mining Life Cycle
   Big Data Life Cycle
3. Big Data Analytics – Methodology
4. Big Data Analytics – Core Deliverables
5. Big Data Analytics – Key Stakeholders
6. Big Data Analytics – Data Analyst
7. Big Data Analytics – Data Scientist

BIG DATA ANALYTICS – PROJECT
8. Big Data Analytics – Problem Definition
   Project Description
   Problem Definition
9. Big Data Analytics – Data Collection
10. Big Data Analytics – Cleansing Data
11. Big Data Analytics – Summarizing Data
12. Big Data Analytics – Data Exploration
13. Big Data Analytics – Data Visualization

BIG DATA ANALYTICS – METHODS
14. Big Data Analytics – Introduction to R
15. Big Data Analytics – Introduction to SQL
16. Big Data Analytics – Charts & Graphs
   Univariate Graphical Methods
   Multivariate Graphical Methods
17. Big Data Analytics – Data Analysis Tools
   R Programming Language
   Python for data analysis
   Julia
   SAS
   SPSS
   Matlab, Octave
18. Big Data Analytics – Statistical Methods
   Correlation Analysis
   Chi-squared Test
   T-test
   Analysis of Variance

BIG DATA ANALYTICS – ADVANCED METHODS
19. Big Data Analytics – Machine Learning for Data Analysis
   Supervised Learning
   Unsupervised Learning
20. Big Data Analytics – Naive Bayes Classifier
21. Big Data Analytics – K-Means Clustering
22. Big Data Analytics – Association Rules
23. Big Data Analytics – Decision Trees
24. Big Data Analytics – Logistic Regression
25. Big Data Analytics – Time Series Analysis
26. Big Data Analytics – Text Analytics
27. Big Data Analytics – Online Learning

Big Data Analytics Basics

Big Data Analytics – Overview

The volume of data that one has to deal with has exploded to unimaginable levels in the past decade, and at the same time, the price of data storage has systematically reduced. Private companies and research institutions capture terabytes of data about their users’ interactions, business, social media, and also from sensors in devices such as mobile phones and automobiles. The challenge of this era is to make sense of this sea of data. This is where big data analytics comes into the picture.

Big Data Analytics largely involves collecting data from different sources, munging it so that it can be consumed by analysts, and finally delivering data products useful to the organization’s business.

The process of converting large amounts of unstructured raw data, retrieved from different sources, into a data product useful for organizations forms the core of Big Data Analytics.

Big Data Analytics – Data Life Cycle

Traditional Data Mining Life Cycle

In order to provide a framework to organize the work needed by an organization and deliver clear insights from Big Data, it’s useful to think of it as a cycle with different stages. It is by no means linear, meaning all the stages are related to each other. This cycle has superficial similarities with the more traditional data mining cycle as described in the CRISP methodology.

CRISP-DM Methodology

The CRISP-DM methodology, which stands for Cross Industry Standard Process for Data Mining, is a cycle that describes the approaches data mining experts commonly use to tackle problems in traditional BI data mining. It is still used by traditional BI data mining teams.

Take a look at the following illustration. It shows the major stages of the cycle as described by the CRISP-DM methodology and how they are interrelated.

Figure: CRISP-DM life cycle

CRISP-DM was conceived in 1996, and the next year it got underway as a European Union project under the ESPRIT funding initiative. The project was led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA (an insurance company). The project was finally incorporated into SPSS. The methodology is extremely detail-oriented in how a data mining project should be specified.

Let us now learn a little more about each of the stages involved in the CRISP-DM life cycle:

Business Understanding – This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition. A preliminary plan is designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.

Data Understanding – The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

Data Preparation – The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

Modeling – In
