FIT3152 Lecture 01 PDF

Title FIT3152 Lecture 01
Author Wong Kai Jeng
Course Data Science
Institution Monash University
Pages 94
File Size 2.6 MB
File Type PDF


FIT3152 Data analytics – Lecture 1

Draft: do not circulate

Introduction to Data Science
• Recent examples
• Common themes

Introduction to FIT3152
• Unit objectives
• Unit outline
• Assessment details
• Unit management

Review of basic statistics using R

Slide 1

What is data science?

From Wikipedia:
• Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).
• https://en.wikipedia.org/wiki/Data_science (Accessed 21/07/2016)

Slide 3

Data Science: Recent examples


Slide 4

Criminal investigation

Road Shooter Found Via Mass Data Collection

http://www.spiegel.de/international/germany/spectacular-highway-shooter-investigation-raises-data-privacy-concerns-a-908006.html

Slide 5

Criminal investigation

... The case involves a truck driver who fired at least 762 shots at cars and trucks on German highways and at buildings in a shooting spree that began in 2008. In several cases, his targets were only barely able to avoid accidents as a result of the shots. In 2009, one woman was hit in the neck with a bullet fired by the truck driver, identified on Tuesday only as a 57-year-old truck driver from North Rhine-Westphalia, but survived. ...

Slide 6

Criminal investigation

... On seven sections of the autobahns in question, police erected equipment that was able to recognize and store the license plate numbers of vehicles that drove by. Using that data, they were able to identify vehicles that passed a certain section of highway at roughly the same time as did a target vehicle. ...

Slide 7

Criminal investigation

... In April, the system hit pay-dirt. In just five days, six drivers reported being shot at. Officers were able to reconstruct the likely route taken by the perpetrator and they then looked at the license plate data collected by cameras stationed along that route. By filtering through the information gathered, they were able to identify one truck that could have been at each site where shots were reported. They were then able to match up the route with the mobile phone data of the driver. "The correspondence" between the two data sets "was clear," Zierke said on Tuesday. ...
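The filtering step described in this excerpt amounts to intersecting the sets of plates recorded near each shooting site. A minimal R sketch of that idea follows, assuming hypothetical site labels and plate numbers. None of these values come from the investigation, and the real analysis also had to handle timing and location uncertainty.

```r
# Hypothetical plate reads: one row per camera sighting near a reported shooting.
sightings <- data.frame(
  site  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  plate = c("K-AB 123", "D-XY 987", "E-QR 555",
            "K-AB 123", "E-QR 555",
            "K-AB 123", "D-XY 987", "M-ZZ 001"),
  stringsAsFactors = FALSE
)

# Group the plates by site, then keep only plates seen at every site.
plates_by_site <- split(sightings$plate, sightings$site)
candidates <- Reduce(intersect, plates_by_site)

print(candidates)
```

Only a plate that appears in every group survives the repeated `intersect`, which mirrors how one truck was singled out as present at each site.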


Slide 8

Food networks

Flavor network and the principles of food pairing

http://www.nature.com/srep/2011/111215/srep00196/full/srep00196.html

Slide 9

Food networks

... do we more frequently use ingredient pairs that are strongly linked in the flavor network or do we avoid them? To test this hypothesis we need data on ingredient combinations preferred by humans, information readily available in the current body of recipes. For generality, we used 56,498 recipes provided by two American repositories (epicurious.com and allrecipes.com) and to avoid a distinctly Western interpretation of the world's cuisine, we also used a Korean repository (menupan.com). The recipes are grouped into geographically distinct cuisines (North American, Western European, Southern European, Latin American, and East Asian)...

Slide 10

Food networks
• Co-occurrence of major ingredients in 5 cuisines
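Ingredient co-occurrence can be counted directly from a list of recipes. The R sketch below uses three made-up recipes (an assumption for illustration, not data from the paper) and builds a recipe-by-ingredient incidence matrix, whose cross-product gives the number of recipes in which each pair of ingredients appears together.

```r
# Hypothetical recipes, each given as a vector of ingredient names.
recipes <- list(
  c("tomato", "basil", "garlic"),
  c("tomato", "garlic", "onion"),
  c("basil", "garlic")
)

ingredients <- sort(unique(unlist(recipes)))

# Incidence matrix: one row per recipe, one column per ingredient (1 = present).
incidence <- t(sapply(recipes, function(r) as.integer(ingredients %in% r)))
colnames(incidence) <- ingredients

# Cross-product counts shared recipes for every ingredient pair.
cooccurrence <- crossprod(incidence)
diag(cooccurrence) <- 0   # ignore an ingredient paired with itself

print(cooccurrence)
```

The resulting matrix can be read as a weighted network, with ingredients as nodes and co-occurrence counts as edge weights.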


Slide 11

Climate change

Continental-scale temperature variability during the past two millennia

http://www.nature.com/ngeo/journal/v6/n5/full/ngeo1797.html

Slide 12

Climate change

... The '2k Network' of the IGBP Past Global Changes (PAGES) project aims to produce a global array of regional climate reconstructions for the past 2000 years. ... Nine PAGES 2k working groups represent eight continental-scale regions and the oceans. Regional representation brings critical expert knowledge of individual proxy data sets, which is essential for improving palaeoclimate reconstructions. The PAGES 2k Network is coordinated with the National Oceanic and Atmospheric Administration (NOAA) World Data Center for Paleoclimatology to establish a benchmark database of proxy climate records for the past two millennia ...

Slide 13

Climate change
• Data sources and locations

Slide 14

Climate change
• Temperature variability over past 2000 years

Slide 15

Twitter – Predicting the Dow Jones

Twitter mood predicts the stock market

http://www.sciencedirect.com/science/article/pii/S187775031100007X

Slide 16

Twitter – Predicting the Dow Jones
• Research highlights
► Public mood states along 7 different dimensions of mood are measured from the text content of large-scale Twitter feeds.
► Daily variations in public mood states show statistically significant correlation to daily changes in Dow Jones Industrial Average closing values.
► Certain dimensions of public mood states, in particular Calm, increase the accuracy of a Self Organizing Fuzzy Neural Network model in predicting up and down changes in DJIA closing values to 87.6%.
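To show what testing for a "statistically significant correlation" looks like in R, here is a minimal sketch using a plain Pearson correlation test. The numbers are synthetic and the variable names are hypothetical; this is not the paper's data or its analysis method.

```r
set.seed(1)  # synthetic data only, seeded so the run is repeatable

# Hypothetical daily mood score and a daily index change that partly depends on it.
calm        <- rnorm(60)
djia_change <- 0.8 * calm + rnorm(60, sd = 0.5)

# Pearson correlation with a significance test.
result <- cor.test(calm, djia_change)

print(result$estimate)         # sample correlation coefficient
print(result$p.value < 0.05)   # significant at the 5% level?
```

A significant correlation on its own says nothing about prediction. The highlights above describe a separate predictive model for the up and down changes.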


Slide 17

Twitter – Predicting the Dow Jones
• Mood states

Slide 18

Google Flu Trends

From Wikipedia:
• Google Flu Trends was a web service operated by Google providing up-to-date estimates of influenza activity for more than 25 countries.
• By aggregating Google search queries, accurate predictions can be made about flu activity.
• This project was run from 2008 to 2015 by Google.org to help predict outbreaks of flu.
• Data from the project is publicly available.

https://www.google.org/flutrends/about/

Slide 19

Sports analytics

Data analytics in this area is exploding! Some areas currently receiving a lot of interest are:
• Individual and team performance tracking
• Wearable technologies and video tracking
• Optimizing team composition
• Analysis of supporter and fan engagement
• Training optimization, injury prevention
• Gambling: customer analysis, team analytics

Slide 20

Sports analytics

Asia-Pacific Sports Analytics Conference, Melbourne July 22nd 2016. Sponsors/Presenters

Slide 21

Sports analytics

FIT3152 alumni involved in sports analytics:
• Andreas Limberopoulos (former student), CTO Sports Performance Tracking
• Dilpreet Singh (former tutor), ‘Activity Recognition in Sports Performance’, Winner of the Victorian iAwards (Tertiary Undergrad)

Slide 22

Combatting terrorism

http://orgnet.com/tnet.html

Slide 23

Social Science

From the 2016 Federal Election. Seats of Wills, Batman and Melbourne. Red = NLP, Green = ALP polling booths.

http://www.smh.com.au/

Slide 24

Personal analytics

http://www.databetes.com

Slide 25

Personal analytics

http://blog.stephenwolfram.com/2012/03/the-personal-analytics-of-my-life/

Slide 26

Beautiful data visualization

http://xiaoji-chen.com/blog/2011/sky-color-of-10-chinese-cities/

Slide 27

Application developer analytics

https://developer.yahoo.com/analytics/features.html

Slide 28

Data science: some common themes

Previous examples illustrate:
• Complex problems of societal interest/utility
• Large data sets, multiple data sets (mashups), messy, incomplete, heterogeneous data, open data.
• Often using data repositories created for another purpose (food network): One description of Data Science is making a product out of data…
• Data collection and analysis would have been unthinkable 10 years ago (Twitter, autobahn shooter)
• Use of graphics for communicating results!

Slide 29

Data + Analytics = Data product

http://www.ukbiobank.ac.uk/
http://www.ubble.co.uk/about/

Slide 30

Data science: for business

Customer analytics
• Website tracking, click to sales conversion, marketing and pricing strategy, social media sentiment analysis, demographic information, location data and traffic monitoring, tailored products…

Operations
• Factors affecting demand, supply chain data, item tracking, sensor data, self-regulating processes (automatic systems, pre-emptive repairs), fraud detection, productivity analysis, human resource management, …

Slide 31

Data science: more broadly

Science and medicine:
• Search for habitable planets, DNA sequencing and disease genomics, prediction – e.g. weather; automatic classification, biometrics (identification by physical characteristics), …

Arts, culture and society
• Social networks: LinkedIn, Facebook, Twitter etc., national security surveillance, ...
• Data journalism, data artists – see next slides

Slide 32

Data journalism

From the Data Journalism Handbook:
... What makes data journalism different to the rest of journalism? Perhaps it is the new possibilities that open up when you combine the traditional ‘nose for news’ and ability to tell a compelling story, with the sheer scale and range of digital information now available. ... Or using software to find connections between hundreds of thousands of documents, as The Telegraph did with MPs' expenses...

http://datajournalismhandbook.org/1.0/en/introduction_0.html

Slide 33

Data artist – Ryoji Ikeda

http://www.ryojiikeda.com/project/datamatics/

Slide 34

Data science: high-level skills

Some necessary skills for a data scientist:
• Understand a problem from client’s perspective
• Collect, cleanse, manage and combine data – which may come from disparate sources
• Understand the data, most likely using visualization tools as a starting point
• Analyze and model the data using statistical and (AI) machine learning techniques
• Communicate the results simply and effectively.

Slide 35

Data science: technical skills

Some necessary technical skills include:
• Statistical analysis
• Machine learning
• Programming (e.g. R, SQL, Python, Java …)
• Data storage and data handling
• Distributed computing, distributed algorithms
• Hacking mentality/problem solving
• Imagination and versatility…

Slide 36

Kaggle

Host data science competitions
• Their motto is “turning data science into a sport”
• You are encouraged to view their current and past competitions and perhaps enter some.
• We may use some competition data in this course
• See http://www.kaggle.com/ for details

Slide 37

Jobs It seems every company is employing a data scientist. From Kaggle:


Slide 38

FIT3152 Data analytics Overview


Slide 39

Unit objectives

What the course is trying to achieve:
• We are concentrating on fundamental, generic skills for a data scientist that are independent of software platform or problem domain.
• Problem solving skills, independence and ingenuity. Good communication skills.

What it is not trying to achieve:
• Introduction to the vast range of software, techniques and computing platforms available to data scientists.

Slide 40

Unit outline

Week by week (brief and subject to change)

Date      Lecture  Topic
27/07/16  1        Intro to Data Science. Intro to R. Review of basic statistics using R
3/08/16   2        Exploring data using graphics in R
10/08/16  3        Data manipulation in R
17/08/16  4        Guest Lecture
24/08/16  5        Linear regression in R
31/08/16  6        Network analysis
7/09/16   7        Classification using decision trees
14/09/16  8        Comparing classification models, evaluating algorithms
21/09/16  9        K-Means and hierarchical clustering
28/09/16           Break
5/10/16   10       Text analysis
12/10/16  11       Student Presentations (Assignment 1 results)
19/10/16  12       Review of the course and exam preparation

Slide 41

Assessment details

Assignment 1
• Working in groups (20%). Initial report due 26th August, brief class presentation Week 11, final report due 14th October

Assignment 2
• Individual work (10%), due 7th October

Tutorial
• Participation (10%)

Examination (60%)

Slide 42

Unit Management

Lecturers:
• John Betts (CE)
• Sylvester Olubolu Orimaye (at Monash Malaysia)

Guest lecturer:
• Bernie Kruger (TAC) W4 (Clayton).

Tutors:
• Heshan Kumarage and Brian Ramirez Espinosa.
• Tutorials start Week 2. Allocate+ queries: https://secure.monash.edu/scsd/timetables/allocate/help/

Slide 43

Guest Lecture
• 18th August
• Bernie Kruger
• Data Science Lead at the Transport Accident Commission (TAC)
• 2 hour lecture on applied Data Science and consulting + Q&A with Bernie on working as a Graduate in the Data Science industry

Slide 44

Contact list
• John: x55804, [email protected]
• Sylvester: [email protected]
• Heshan: [email protected]
• Brian: [email protected]

Slide 45

R


Slide 46

Review of basic statistics using R

What is R?
Obtaining and installing R
Using R
Help and References in R
• Help
• References you should read
Review of basic statistics using R

Slide 47

What is R?
• R is a statistical computing environment and programming language
• A successor to the S language developed at AT&T Bell Laboratories
• Initially created by Ross Ihaka and Robert Gentleman at the University of Auckland (hence ‘R’).
• R is now developed by the R Development Core Team
• R is freely available under the GNU General Public License (free, open source etc.)

Slide 48

Why we are using R

R:
• Is the de facto platform for data science, independent of operating system, problem domain and data type
• Has a large number of users and active user communities, e.g. MelbURN (Melbourne Users of R Network)
• Is free, open source, user-customisable…
• Has thousands of user-contributed packages covering all conceivable applications and data types, for visualisation, machine learning and data science
• One drawback: a steep learning curve!

Slide 49

Obtaining and installing R
• Go to: http://cran.r-project.org/
• Follow the link to download the latest version of R for your operating system.
• Install as usual for your OS (Mac/Win easy)
• Use default directories if possible to make installation of RStudio easier.
• Runs from Launchpad or Start Button

LHS of main page has Documentation > Manuals
• Click to get: An Introduction to R (Release)

Slide 50

Obtaining and installing RStudio

RStudio is an IDE that makes running R a lot easier – especially I/O, managing data and variables, and scripting.
• Go to: http://www.rstudio.com/
• Click on download now
• Install as usual for your OS
• Runs from Launchpad or Start Button

RStudio also makes Shiny – have a look at it!

Slide 51

Syntax

R is command line driven (basic installation)
> Indicates a new line. Continued lines by +

R is case sensitive
> TheData is different to Thedata

Assignment
> Use: x

# denotes a comment. Anything on the line after this point is ignored.
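The syntax points on this slide can be tried out in a short script. This is a minimal sketch: the names `TheData` and `Thedata` come from the slide, while the values and the use of the standard `<-` assignment operator are assumptions made for illustration.

```r
# Assumed example values; only the two object names come from the slide.
TheData <- c(2, 4, 6)    # assignment with the usual <- operator
Thedata <- "different"   # a separate object, because R is case sensitive

# An incomplete expression carries on to the next line
# (at the console, R shows + as the continuation prompt).
total <- sum(TheData) +
  10

print(total)                         # everything after # is ignored by R
print(identical(TheData, Thedata))   # the two names refer to different objects
```

Typed at the console one line at a time, each complete line is evaluated as soon as Enter/Return is pressed.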

Slide 52

Console, Variables, Functions

The R Console shows the command line interface.
R can be used for direct calculation and interprets each line as you press the (Enter/Return) key, t...

