
DSO-428: Essentials and Digital Frontiers of Big Data

Units: 4
Fall 2017, Tuesday/Thursday, 4:00-5:50pm
Location: JFF LL103

Instructor: Tianshu Sun
Office: BRI-310B
Office Hours: Tuesday, 3:00-4:00pm, or by appointment
Contact Info:
Email: tianshu@marshall.usc.edu
Phone number: 213-821-9885

Grader: Dylan (Yile) Wu
Office: TBA
Office Hours: TBA
Contact Info: dylanyilewedso428@gmail.com

Course Description

This course covers the fundamentals of big data infrastructure (Database, Data Warehouse, Hadoop/MapReduce, SPARK, SQL), business analytics (Data Mining, Google Analytics, Tableau, A/B testing, etc.) and their applications to different business problems. An important goal of this course is to understand how data infrastructure and business analytics can address different challenges faced by an organization.

Learning Objectives

Upon completion of this course (with effort!), the student should be able to:

1. Explain the big picture and core concepts related to big data infrastructure and big data analytics, and their applications in business (Overview)
2. Design conceptual data model and data mart for business applications (Big Data Infrastructure)
3. Demonstrate strong ability to use SQL to process and analyze data in existing database (SQL)
   - Ability to formulate complex SQL queries to process data and perform business analysis
   - Implement relational data model using SQL DDL
   - Familiarity with SQL Server, data warehouse and distributed storage/processing environment
4. Utilize contemporary data mining methods, such as classification, prediction and clustering, and become familiar with advanced data mining digital tools (Big Data Analytics)
5. Demonstrate ability to use state-of-the-art business analytics tools, such as Google Analytics, Tableau, and A/B testing (Big Data Analytics)
6. Describe the landscape of big data ecosystems and related business opportunities (Big Data Landscape)

*Each goal above corresponds to one part in the “Course Map”.

Revised July 2016

COURSE MATERIALS

Required:

1. All Documents on Blackboard (under “Content” directory): Lecture, Lab, Homework, Project
2. Database Systems: A Practical Approach to Design, Implementation and Management (5th/6th)
   Authors: Thomas M. Connolly and Carolyn E. Begg
   Publisher: Addison-Wesley (ISBN-10: 0321523067 / ISBN-13: 978-0321523068)
   All the readings (specified in “Course Schedule”) will be based on the 5th edition of the book. I put the Table of Contents for the 5th edition under the “Syllabus/Schedule” folder on Blackboard.
3. Data Mining for the Masses
   Author: Dr. Matthew A. North
   Available Online for Download (Google the name of the Book)

Supplement Materials (Also ask me for more materials if you are interested in specific topics):

1. Data Science for Business (Very good book on Data Mining/Machine Learning for Business)
   Authors: Foster Provost, Tom Fawcett, Publisher: O'Reilly Media
2. Google Analytics Demystified: A Hands-On Approach (Second Edition)
   Author: Joel J. Davis, Publisher: CreateSpace Independent Publishing Platform

TECHNOLOGY SKILLS

- Microsoft SQL Server 2012 (or 2008) and SQL Database Management Studio 2012 (or 2008)
- Hadoop / Spark (Free Online)
- Google Analytics (Free Online)
- Tableau (Free Online)
- RapidMiner (Free Online) (Similar to SAS Enterprise Miner in some aspects)

WHAT YOU CAN LEARN

This is a demanding class. It covers various topics in Big Data Infrastructure, Big Data Analytics, and the Big Data Landscape. It includes both theoretical materials (lecture, homework, quiz) that require careful review and practice, and hands-on experience (lab and team project) that involves trial and error. While I will make every effort to deliver the material in a clear and organized fashion, each student must commit enough time and effort. Hard work will definitely pay off: as big data and business analytics become more and more important for organizations, the knowledge and skills you can get from this course will be crucial for a wide range of career paths.

If you are willing to make the commitment of time and effort, here is what you will accomplish, or THINGS TO PLACE ON YOUR CV:

- Tag: Database, Data Modeling, Relational Model, SQL, Data Mining, Google Analytics, Tableau, A/B testing, Data Warehouse, Big Data Infrastructure
- Familiarity with SQL Server and Database Management Tool
- Understanding of relational data model and ability to construct data mart for applications
- Ability to formulate complex SQL queries for Business Analytics and Data Science
- Ability to monitor website and campaign performance using Google Analytics
- Ability to perform basic data visualization and A/B testing
- Understanding of current Big Data Infrastructure such as Hadoop, MapReduce, and SPARK
- Team based application development

FEEDBACK

The success of this class depends on accurate and timely feedback from you. Any feedback or suggestions on any course related issues are more than welcome. Potential channels include face-to-face chat with me or the TA, email to me or the TA, and the in-class survey.

COURSE COMPONENTS

Lectures

The lectures will focus on discussing concepts with simple examples. In-class practice will often be used to facilitate learning. All lecture meetings will be in the classroom. The student will be responsible for all the material and updates presented and discussed in the lectures. The lecture notes (i.e. slides for every lecture) and textbook will provide a base of knowledge that you can use in the homework assignments and the course project.

Never hesitate to ask questions. If something seems unclear, please stop the instructor to ask for clarification. If you do not understand something in class or have a question, the chances are someone else has the same question! Please take the initiative to ask! Remember, there are no bad questions except those not asked.

Quizzes (in class)

To reinforce database concepts and encourage you to review course materials, five quizzes will be given during class time throughout the semester. Check the updated ‘course schedule’ file on BLACKBOARD for quiz dates (Note: the course schedule may be updated over time; you will receive notification on BLACKBOARD for those updates). The quizzes are closed-book, relatively simple, and based on the readings and lectures of the previous weeks. The quizzes will be held at the beginning of class. Please don’t be late, as there is no make-up time or make-up quiz except for special reasons (see “excused absence” below).

Laboratories

You will follow a series of tutorials and exercises to gain hands-on experience with cutting-edge business analytics tools including Microsoft SQL Database Management Tool, Google Analytics, Tableau, RapidMiner and so on. You will also explore tools to develop reports and applications. The lab is an integral part of the course, which is why we will meet once a week in the lab (i.e. same as the lecture room). Lab sessions will be held weekly (on Thursday) starting from the second week. Attendance is mandatory. With the guidance of the instructor and the TA, you will learn those tools more efficiently and should be able to finish a good portion of the lab assignment within the 50-min lab session. For the lab schedules, check the course schedule file.

The TA will help you during the lab. If you wait until the lab to start thinking about the assignment, you will not take advantage of the assistance made available to you. Please come prepared (by reading the lab instruction in advance). Submit each completed lab assignment via BLACKBOARD by the due time on the due date. Please see the class schedule for specific due dates.

Description and Assessment of Assignments

Homework and Lab Assignments

There will be four homework assignments and seven lab assignments to be completed individually. Each assignment will reinforce concepts related to the relevant chapters/lectures. Lab assistance is provided specifically to help you. It is not wrong to seek clarifications and minor help with completion of each assignment, but there is a fine line between seeking assistance and cheating.

Each lab assignment will require you to spend time on the computer. Please plan on it. Each submission-required assignment will be worth 20 points. Submission instructions for all assignments will be covered at the beginning of each assignment.

All assignments are due by 11:59pm on the day due (see the class schedule file for specific dates). These assignments are submitted via BLACKBOARD, and the submission must be completed by the assigned time/date. You may submit multiple times (i.e. multiple attempts). However, only the last submission (attempt) will be graded. The submission time is also determined by the last submission/attempt you made in the system. If you complete and submit lab assignments and homework assignments on time, you will receive full credit for the assignments. However, if submitted deliverables do not satisfy homework and lab requirements, you will get a grade reduction of 10% for each requirement that is not satisfied.

Late submission of assignments (by less than 24 hours) will result in a grade reduction of 25%. Assignments submitted late by more than 24 hours will receive a zero.
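The penalty rules for assignments can be sketched as a small calculation. This is only an illustration, not the grader's actual procedure. The function name `assignment_score` and its parameters are hypothetical, and the syllabus does not say how the 10% and 25% reductions combine or what happens at exactly 24 hours. The sketch assumes the 10% reductions add up as percentages of the maximum, the late penalty is then taken off what remains, and a submission exactly 24 hours late is treated like one that is more than 24 hours late.

```python
def assignment_score(max_points, unmet_requirements=0, hours_late=0.0):
    """Illustrative score under the stated penalties (hypothetical helper)."""
    # Assumption: 24 hours or more counts as "late by more than 24 hours".
    if hours_late >= 24:
        return 0.0
    # 10% off per unmet requirement, floored at zero (combination rule assumed).
    percent = max(0, 100 - 10 * unmet_requirements)
    # 25% off for a submission that is late by less than 24 hours.
    if hours_late > 0:
        percent = percent * 75 / 100
    return max_points * percent / 100


if __name__ == "__main__":
    # A 20-point assignment under a few scenarios.
    print(assignment_score(20))                        # on time, complete
    print(assignment_score(20, unmet_requirements=2))  # two requirements missed
    print(assignment_score(20, hours_late=3))          # three hours late
    print(assignment_score(20, hours_late=30))         # more than a day late
```

Under these assumptions, a 20-point assignment with two unmet requirements would score 16 points, and the same assignment submitted three hours late with no unmet requirements would score 15 points.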

Presentation on Data-driven Application

This is a big data course in a business school. Thus, it is important to understand the potential of data-driven applications in different business contexts, with a focus on the role of big data and business analytics in enabling such applications. The students will form a team of 4-5 students and identify an interesting data-driven application in the real world (e.g. a new product feature from an existing firm powered by data and analytics, or a rising startup with data-driven applications at its core). The team needs to illustrate the product feature or startup background, how its core feature/service is enabled by data and analytics, and the mechanism behind that link. I will illustrate the requirement in detail and provide guidelines for the presentation.

Course Project

A major objective of the course is to get hands-on experience in performing business analytics correctly, with a focus on both data infrastructure and advanced business analytics (e.g. data mining, data visualization). Each team will identify an interesting real world business application and define the goal of the data consulting project. Specifically, you will design a relational database schema and write a SQL script to create tables and draw data. You will also develop SQL queries and analytics reports (using RapidMiner/Tableau/Google Analytics) to illustrate useful applications of business analytics. Finally, you also need to deliver presentations to report progress and demonstrate results.

1. There will be several intermediate and final deliverables. I will illustrate them in detail in the course project description.
2. You will create a team of your choosing and submit the information on team members. Please provide the team roster (name and email of each member) before the deadline on the “course schedule”. The team should have 4-5 students. Only one member needs to submit the roster.
3. The project grade will be a team grade. Any team that does not wish to receive team grades must inform me one week prior to the final project deliverable. I will then assign the project grade using peer review. I highly recommend that you resolve your differences prior to this event or bring the problem to my attention.
4. All project deliverables are due by 11:59pm on the day due. See the course schedule file for specific dates. All deliverables are submitted via BLACKBOARD by the assigned time/date. If you submit required project deliverables on time, you will receive full credit for the deliverables. However, the quality of submitted deliverables must meet the project guideline. If submitted deliverables do not satisfy the project requirements, you will get a grade reduction of 10% for each requirement that is not satisfied.
5. Late submissions (by less than 24 hours) will result in a grade reduction of 25%. Late submissions by more than 24 hours will receive a zero.

I will illustrate the requirements and provide a detailed guideline for deliverables after the Midterm.

Exams

Exams will primarily test whether you understood concepts covered in the lecture and reading. There will be two exams, worth 250 points each. The final exam is NOT cumulative. Both exams are open textbook and course materials. However, computers, mobile devices and any forms of communication are not allowed. If your cell phone or mobile device goes off during an exam, you will receive a grade reduction of 12.5%. No extra time will be provided for late arrivals.

Exams can only be made up in the event of documented emergencies. Written permission must be obtained 48 hours before the exam if you cannot attend. In any event, make-up exams are only given at instructor discretion.

Grading Breakdown

Assignment                                      Points    % of Grade
Homework (4) and Lab Assignments (7)               220            22
In-Class Quizzes (5 @ 20 each)                     100            10
Presentation on Data-Driven Application             30             3
Course Project (presentations & deliverable)       150            15
Exams (2 @ 250 each)                               500            50
Maximum Points Possible                           1000           100

The final points will be mapped to a Final Grade.

Grading Scale (Example)

Course final grades will be determined using the following scale. This is indicative and may be adjusted at my discretion.

A     95-100
A-    90-94
B+    87-89
B     83-86
B-    80-82
C+    77-79
C     73-76
C-    70-72
D+    67-69
D     63-66
D-    60-62
F     59 and below

Grading Timeline

Scores will be posted on BLACKBOARD within a week of being handed in. The course project will be graded in the most expedient way possible at the end of the semester. Please recognize that given the extensiveness of the project, there may be a delay in posting grades at the end of the semester. If you are graduating this semester, please be sure to inform me toward the end of the semester so that your grade may be posted in time for graduation.

Additional Policies

Documentation for absences due to medical reasons must contain a statement that you were incapacitated, the phone number of the health care professional who examined you, and the dates of incapacitation (which must include the dates of the missed exam or quiz).

It is the student’s responsibility to inform the instructor of any expected excused absences ahead of time. For exams or quizzes, students are expected to inform the instructor of a conflict in writing (e-mail is acceptable) as soon as the exam is announced or the conflict is known, whichever occurs first. The written document (or email) should be submitted at least 7 days before the exam or quiz.

An excused absence does not relieve the student of the obligation to turn in homework/lab assignments and the project on time, as the assignments and project are assigned well in advance of their due dates. In cases of a lengthy illness, or other protracted emergency situations, the instructor may consider extensions on project assignments, depending on the specific circumstances.

Statement on Academic Conduct and Support Systems

Academic Conduct

Plagiarism – presenting someone else’s ideas as your own, either verbatim or recast in your own words – is a serious academic offense with serious consequences. Please familiarize yourself with the discussion of plagiarism in SCampus in Part B, Section 11, “Behavior Violating University /part-b. Other forms of academic dishonesty are equally unacceptable. See additional information in SCampus and university policies on scientific misconduct.

Discrimination, sexual assault, intimate partner violence, stalking, and harassment are prohibited by the university. You are encouraged to report all incidents to the Office of Equity and Diversity/Title IX Office http://equity.usc.edu and/or to the Department of Public Safety http://dps.usc.edu. This is important for the health and safety of the whole USC community. Faculty and staff must report any information regarding an incident to the Title IX Coordinator, who will provide outreach and information to the affected party. The sexual assault resource center webpage http://sarc.usc.edu fully describes reporting options. Relationship and Sexual Violence Services https://engemannshc.usc.edu/rsvp provides 24/7 confidential support.

Support Systems

A number of USC’s schools provide support for students who need help with scholarly writing. Check with your advisor or program staff to find out more. Students whose primary language is not English should check with the American Language Institute http://ali.usc.edu, which sponsors courses and workshops specifically for international graduate students. The Office of Disability Services and Programs http://dsp.usc.edu provides certification for students with disabilities and helps arrange the relevant accommodations. If an officially declared emergency makes travel to campus infeasible, USC Emergency Information http://emergency.usc.edu will provide safety and other updates, including ways in which instruction will be continued by means of Blackboard, teleconferencing, and other technology.

Students with Disabilities

Students need to make a request with Disability Services and Programs (DSP) for each academic term that accommodations are desired. Guidelines for the DSP accommodation process can be found ograms/dsp/registration/guidelines/guidelines general.html

Students requesting test-related accommodations will need to share and discuss their DSP recommended accommodation letter/s with their faculty and/or appropriate departmental contact person at least three weeks before the date the accommodations will be

The Office of Disability Services and Programs (www.usc.edu/disability) provides certification for students with disabilities and helps arrange the relevant accommodations. Any student requesting academic accommodations based on a disability is required to register with Disability Services and Programs (DSP) each semester. A letter of verification for approved accommodations can be obtained from DSP. Please be sure the letter is delivered to me (or to your TA) as early in the semester as possible. DSP is located in GFS (Grace Ford Salvatore Hall) 120 and is open 8:30 a.m.–5:00 p.m., Monday through Friday. The phone number for DSP is (213) 740-0776. Email: ability@usc.edu.

Tentative Course Schedule

(Please check Blackboard and Emails to see updates on schedule. This Version: Oct.17)

Week 1: Aug.23 (Tuesday); rows: Tue, Thur, Lab (Thur)
  Lecture: Syllabus and Course Overview; Introduction to Databases; Database Environment; No Lab
  Readings: Syllabus; Course Schedule; Ch. 1; Ch. 2
  Due:
    HW 1 (Introduction): Post on Blackboard: Fri. Aug.26

Week 2: Aug.30 (Tuesday); rows: Tue, Thur, Lab (Thur)
  Lecture: Relational Data Model (Tue); Quiz 1 (Ch. 1, Ch. 2), Relational Data Model (Thur); Lab 0 (Introduction to SQL Server and Management Studio) (Lab)
  Readings: Ch. 4 (Tue); Ch. 4 (Thur)
  Due:
    Lab 1 (Schema Analysis), Post: Mon. Aug.29
    HW 2 (Relational Model), Post: Wed. Aug.31
    HW 1 (Introduction), Lab 1 (Schema Analysis), Due: Fri. Sep.02 @ 11:59pm

Week 3: Sep.06; rows: Tue, Thur, Lab (Thur)
  Lecture: Relational Data Model; SQL DDL; Lab 2 (SQL DDL)
  Readings: Ch. 7.1-7.3; Ch. 6
  Due:
    Lab 2 (SQL DDL), Post: Mon. Sep.05
    Team roster for presentation, Due: Fri. Sep.09 @ 11:59pm
    Lab 1 (Deduce Schema), Due: Fri. Sep.09 @ 11:59pm

Week 4: Sep.13; rows: Tue, Thur, Lab (Thur)
  Lecture: Quiz 2 (Ch. 4, Ch. 7.1-7.3); SQL DML; Online learning module of Big Data; Bonus Lab on Code Academy (5 points); TA Office hour (4-6:15pm at JFF LL 103)
  Readings: Ch. 6
  Due:
    HW 2 (Relational Model), Due: Mon. Sep.12 @ 11:59pm
    HW 3 (SQL DML 1), Post: Wed. Sep.14
    Lab 2 (SQL DDL), Due: Sun. Sep.18 @ 11:59pm

Week 5: Sep. 20; rows: Tue, Thur, Lab (Thur)
  Lecture: SQL DML; SQL DML; Lab 3 (SQL DML)
  Readings: Ch. 6; Ch. 6, Ch. 7.4
  Due:
    Lab 3 (SQL DML), Post: Mon. Sep.19

Week 6: Sep. 27; rows: Tue, Thur, Lab (Thur)
  Lecture: Data Warehouse / Data Mart; Big Data: Hadoop/MapReduce; Quiz 3 (Ch6, Ch7.4, Ch7.6); Big Data: SPARK/Others; Lab 4 (Big Data Infrastructure)
  Readings: Ch. 32, 34; On Blackboard
  Due:
    HW 3 (SQL DML 1), Due: Wed. Sep.28 @ 11:59pm
    Lab 3 (SQL DML), Due: Fri. Sep.30 @ 11:59pm
    Lab 4 (Big Data Infrastructure), Post: Fri. Sep.30

Week 7: Oct.04; rows: Tue, Thur, Lab (Thur)
  Lecture: Big Data/Review for Midterm Exam; *** Midterm Exam ***; No Lab
  Readings: On Blackboard

Week 8: Oct. 11; rows: Tue, Thur, Lab (Thur)
  Lecture: Mid-term Review & Project Guideline; Advanced SQL; Advanced SQL; No Lab
  Readings: Ch. 6; Ch. 6
  Due:
    Lab 4 (Big Data Infrastructure), Due: Fri. Oct.14 @ 11:59pm

Week 9: Oct.18; rows: Tue, Thur, Lab (Thur)
  Lecture: Advanced SQL & Create View; Data Mining (Overview); Introduction to Lab 5 (SQL DDL & DML)
  Readings: Ch. 6, Ch. 7.4; North Ch. 1-2
  Due:
    HW 4 (SQL DML 2), Post: Mon. Oct.17
    Lab 5 (SQL DDL & DML), Post: Mon. Oct.17

Week 10: Oct.25; rows: Tue, Thur, Lab (Thur)
  Lecture: Guest Speaker: Lead Data Scientist from Netflix; More time for Lab 5 (SQL DDL & DML); Data Mining; Lab 6 (Data Mining 1)
  Readings: North
  Due:
    HW 4 (SQL DML 2), Due: Wed. Oct.26 @ 11:59pm
    Lab 6 (Data Mining 1), Post: Mon. Oct.24

Week 11: Nov. 01; rows: Tue, Thur, Lab (Thur)
  Lecture: Data Mining; Google Analytics; Quiz 4 on Data Mining; Google Analytics; Lab 7 (Data Mining 2)
  Readings: North; On Blackboard; On Blackboard
  Due:
    Lab 7 (Data Mining 2), Post: Mon. Oct.31
    Lab 5 (SQL DDL & DML), Due: Tues. Nov.01 @ 11:59pm
    Lab 6 (Data Mining 1), Due: Fri. Nov.04 @ 11:59pm

Week 12: Nov.08 (Team presentation on data-driven application starting from this week); rows: Tue, Thur, Lab (Thur)
  Lecture: Google Analytics; Guest Speaker from Microsoft; Presentation on data-driven application; Google Analytics/Tableau; No Lab from now on
  Readings: On Blackboard; On Blackboard
  Due:
    Lab 7 (Data Mining 2), Due: Fri. Nov.11 @ 11:59pm
    Slides of Team Presentation on data-driven application, Due: Fri. Nov.11 @ 11:59pm

Week 13: Nov.15; rows: Tue, Thur
  Lecture: First Presentation on Team Project; A/B testing; Quiz 5 on Google Analytics & A/B testing, and Tableau; Guest Speaker from Snapchat
  Readings: On Blackboard
  Due:
    Project Deliverable 1, Due: Fri. Nov.18 @ 11:59pm

Week 14: Nov.22; rows: Tue, Thur
  Lecture: Bonus Quiz (10 pts on 3 Guest lectures); Business Analytics; Thanksgiving

Week 15: Nov.29; rows: Tue, Thur
  Lecture: Review for Final Exam; Project Help & Office Hour; Final Presentation on Team Project; Project Help & Office Hour
  Due:
    Project Deliverable 2, Due: Fri. Dec.02 @ 11:59pm
    Project Final Deliverable, Due: Tuesday, Dec.05 @ 11:59pm

Week 16: Study Day and Final Exam on Dec.08
  Final Exam Date: Dec.08 4:30pm-6:30pm for 4pm section; Dec.08 7pm-9pm for 6pm section
  Tue: No Class (Office Hour: 4-6pm BRI 310B)
  Thur: No Class (Office Hour: 3-4pm BRI 310B)
