Title | FIT3152 Lecture 01 |
---|---|
Author | Wong Kai Jeng |
Course | Data Science |
Institution | Monash University |
FIT3152 Data analytics – Lecture 1
Introduction to Data Science
• Recent examples
• Common themes
Introduction to FIT3152
• Unit objectives
• Unit outline
• Assessment details
• Unit management
Review of basic statistics using R
Draft: do not circulate
Slide 1
What is data science?
From Wikipedia:
• Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).
• https://en.wikipedia.org/wiki/Data_science (Accessed 21/07/2016)
Slide 3
Data Science: Recent examples
Slide 4
Criminal investigation
Road Shooter Found Via Mass Data Collection
http://www.spiegel.de/international/germany/spectacular-highway-shooter-investigation-raises-data-privacy-concerns-a-908006.html
Slide 5
Criminal investigation
... The case involves a truck driver who fired at least 762 shots at cars and trucks on German highways and at buildings in a shooting spree that began in 2008. In several cases, his targets were only barely able to avoid accidents as a result of the shots. In 2009, one woman was hit in the neck with a bullet fired by the truck driver, identified on Tuesday only as a 57-year-old truck driver from North Rhine-Westphalia, but survived. ...
Slide 6
Criminal investigation
... On seven sections of the autobahns in question, police erected equipment that was able to recognize and store the license plate numbers of vehicles that drove by. Using that data, they were able to identify vehicles that passed a certain section of highway at roughly the same time as did a target vehicle. ...
Slide 7
Criminal investigation
... In April, the system hit pay-dirt. In just five days, six drivers reported being shot at. Officers were able to reconstruct the likely route taken by the perpetrator and they then looked at the license plate data collected by cameras stationed along that route. By filtering through the information gathered, they were able to identify one truck that could have been at each site where shots were reported. They were then able to match up the route with the mobile phone data of the driver. "The correspondence" between the two data sets "was clear," Zierke said on Tuesday. ...
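The filtering step described in the article can be thought of as intersecting the sets of plates recorded near each incident. The following is only a minimal sketch in R under strong assumptions: the site names and plate numbers are invented for illustration, timestamps are ignored, and nothing here is taken from the actual investigation.

```r
# Hypothetical plate sightings: one character vector per incident site.
# All plate numbers and site names are invented for illustration.
sightings <- list(
  site_A = c("K-AB 123", "D-XY 987", "E-TR 555", "BN-LM 42"),
  site_B = c("D-XY 987", "E-TR 555", "HH-QQ 10"),
  site_C = c("E-TR 555", "K-AB 123", "M-ZZ 77")
)

# Keep only the plates seen at every site: a repeated set intersection.
candidates <- Reduce(intersect, sightings)
print(candidates)
```

A real analysis would also need to restrict each site's sightings to a time window around the reported shots before intersecting.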
Slide 8
Food networks
Flavor network and the principles of food pairing
http://www.nature.com/srep/2011/111215/srep00196/full/srep00196.html
Slide 9
Food networks
... do we more frequently use ingredient pairs that are strongly linked in the flavor network or do we avoid them? To test this hypothesis we need data on ingredient combinations preferred by humans, information readily available in the current body of recipes. For generality, we used 56,498 recipes provided by two American repositories (epicurious.com and allrecipes.com) and to avoid a distinctly Western interpretation of the world's cuisine, we also used a Korean repository (menupan.com). The recipes are grouped into geographically distinct cuisines (North American, Western European, Southern European, Latin American, and East Asian)...
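Counting how often ingredient pairs appear together in recipes is one simple way to get at "ingredient combinations preferred by humans". The sketch below is illustrative only: the three recipes and the helper name `pairs_in` are made up, and it does not reproduce the paper's flavor-compound analysis.

```r
# Hypothetical mini recipe collection (invented for illustration).
recipes <- list(
  c("tomato", "basil", "garlic"),
  c("tomato", "garlic", "onion"),
  c("basil", "garlic")
)

# All unordered ingredient pairs within one recipe, as "a|b" strings.
# pairs_in is a hypothetical helper name, not from the paper.
pairs_in <- function(ingredients) {
  ing <- sort(unique(ingredients))
  if (length(ing) < 2) return(character(0))
  apply(combn(ing, 2), 2, paste, collapse = "|")
}

# Count how many recipes contain each pair.
pair_counts <- table(unlist(lapply(recipes, pairs_in)))
print(sort(pair_counts, decreasing = TRUE))
```

Sorting the ingredients before pairing means "basil|garlic" and "garlic|basil" are counted as the same pair.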
Slide 10
Food networks
• Co-occurrence of major ingredients in 5 cuisines
Slide 11
Climate change
Continental-scale temperature variability during the past two millennia
http://www.nature.com/ngeo/journal/v6/n5/full/ngeo1797.html
Slide 12
Climate change
... The '2k Network' of the IGBP Past Global Changes (PAGES) project aims to produce a global array of regional climate reconstructions for the past 2000 years. ... Nine PAGES 2k working groups represent eight continental-scale regions and the oceans. Regional representation brings critical expert knowledge of individual proxy data sets, which is essential for improving palaeoclimate reconstructions. The PAGES 2k Network is coordinated with the National Oceanic and Atmospheric Administration (NOAA) World Data Center for Paleoclimatology to establish a benchmark database of proxy climate records for the past two millennia ...
Slide 13
Climate change
• Data sources and locations
Slide 14
Climate change
• Temperature variability over past 2000 years
Slide 15
Twitter – Predicting the Dow Jones
Twitter mood predicts the stock market
http://www.sciencedirect.com/science/article/pii/S187775031100007X
Slide 16
Twitter – Predicting the Dow Jones
• Research highlights
► Public mood states along 7 different dimensions of mood are measured from the text content of large-scale Twitter feeds.
► Daily variations in public mood states show statistically significant correlation to daily changes in Dow Jones Industrial Average closing values.
► Certain dimensions of public mood states, in particular Calm, increase the accuracy of a Self Organizing Fuzzy Neural Network model in predicting up and down changes in DJIA closing values to 87.6%.
Slide 17
Twitter – Predicting the Dow Jones
• Mood states
Slide 18
Google Flu Trends
From Wikipedia:
• Google Flu Trends was a web service operated by Google providing up-to-date estimates of influenza activity for more than 25 countries.
• By aggregating Google search queries, it attempted to make accurate predictions about flu activity.
• This project was run from 2008 – 2015 by Google.org to help predict outbreaks of flu.
• Data from the project is publicly available.
https://www.google.org/flutrends/about/
Slide 19
Sports analytics
Data analytics in this area is exploding! Some areas currently receiving a lot of interest are:
• Individual and team performance tracking
• Wearable technologies and video tracking
• Optimizing team composition
• Analysis of supporter and fan engagement
• Training optimization, injury prevention
• Gambling: customer analysis, team analytics
Slide 20
Sports analytics
Asia-Pacific Sports Analytics Conference, Melbourne, July 22nd 2016.
Sponsors/Presenters
Slide 21
Sports analytics
FIT3152 alumni involved in sports analytics:
• Andreas Limberopoulos (former student), CTO, Sports Performance Tracking
• Dilpreet Singh (former tutor), ‘Activity Recognition in Sports Performance’, Winner of the Victorian iAwards (Tertiary Undergrad)
Slide 22
Combatting terrorism
http://orgnet.com/tnet.html
Slide 23
Social Science
From the 2016 Federal Election. Seats of Wills, Batman and Melbourne. Red = NLP, Green = ALP polling booths.
http://www.smh.com.au/
Slide 24
Personal analytics
http://www.databetes.com
Slide 25
Personal analytics
http://blog.stephenwolfram.com/2012/03/the-personal-analytics-of-my-life/
Slide 26
Beautiful data visualization
http://xiaoji-chen.com/blog/2011/sky-color-of-10-chinese-cities/
Slide 27
Application developer analytics
https://developer.yahoo.com/analytics/features.html
Slide 28
Data science: some common themes
Previous examples illustrate:
• Complex problems of societal interest/utility
• Large data sets, multiple data sets (mashups), messy, incomplete, heterogeneous data, open data.
• Often using data repositories created for another purpose (food network). One description of Data Science is making a product out of data…
• Data collection and analysis would have been unthinkable 10 years ago (Twitter, autobahn shooter)
• Use of graphics for communicating results!
Slide 29
Data + Analytics = Data product
http://www.ukbiobank.ac.uk/
http://www.ubble.co.uk/about/
Slide 30
Data science: for business
Customer analytics
• Website tracking, click to sales conversion, marketing and pricing strategy, social media sentiment analysis, demographic information, location data and traffic monitoring, tailored products…
Operations
• Factors affecting demand, supply chain data, item tracking, sensor data, self-regulating processes (automatic systems, pre-emptive repairs), fraud detection, productivity analysis, human resource management, …
Slide 31
Data science: more broadly
Science and medicine:
• Search for habitable planets, DNA sequencing and disease genomics, prediction (e.g., weather), automatic classification, biometrics (identification by physical characteristics), …
Arts, culture and society
• Social networks: LinkedIn, Facebook, Twitter etc., national security surveillance, ...
• Data journalism, data artists – see next slides
Slide 32
Data journalism
From the data journalism handbook:
... What makes data journalism different to the rest of journalism? Perhaps it is the new possibilities that open up when you combine the traditional ‘nose for news’ and ability to tell a compelling story, with the sheer scale and range of digital information now available. ... Or using software to find connections between hundreds of thousands of documents, as The Telegraph did with MPs' expenses...
http://datajournalismhandbook.org/1.0/en/introduction_0.html
Slide 33
Data artist – Ryoji Ikeda
http://www.ryojiikeda.com/project/datamatics/
Slide 34
Data science: high-level skills
Some necessary skills for a data scientist:
• Understand a problem from the client’s perspective
• Collect, cleanse, manage and combine data – which may come from disparate sources
• Understand the data, most likely using visualization tools as a starting point
• Analyze and model the data using statistical and (AI) machine learning techniques
• Communicate the results simply and effectively.
Slide 35
Data science: technical skills
Some necessary technical skills include:
• Statistical analysis
• Machine learning
• Programming (e.g. R, SQL, Python, Java …)
• Data storage and data handling
• Distributed computing, distributed algorithms
• Hacking mentality/problem solving
• Imagination and versatility…
Slide 36
Kaggle
• Hosts data science competitions. Their motto is “turning data science into a sport”
• You are encouraged to view their current and past competitions and perhaps enter some.
• We may use some competition data in this course
• See http://www.kaggle.com/ for details
Slide 37
Jobs It seems every company is employing a data scientist. From Kaggle:
Slide 38
FIT3152 Data analytics Overview
Slide 39
Unit objectives
What the course is trying to achieve:
• We are concentrating on fundamental, generic skills for a data scientist that are independent of software platform or problem domain.
• Problem solving skills, independence and ingenuity. Good communication skills.
What it is not trying to achieve:
• Introduction to the vast range of software, techniques and computing platforms available to data scientists.
Slide 40
Unit outline
Week by week (brief and subject to change)
Date | Lecture | Topic |
---|---|---|
27/07/16 | 1 | Intro to Data Science. Intro to R. Review of basic statistics using R |
3/08/16 | 2 | Exploring data using graphics in R |
10/08/16 | 3 | Data manipulation in R |
17/08/16 | 4 | Guest Lecture |
24/08/16 | 5 | Linear regression in R |
31/08/16 | 6 | Network analysis |
7/09/16 | 7 | Classification using decision trees |
14/09/16 | 8 | Comparing classification models, evaluating algorithms |
21/09/16 | 9 | K-Means and hierarchical clustering |
28/09/16 | | Break |
5/10/16 | 10 | Text analysis |
12/10/16 | 11 | Student Presentations (Assignment 1 results) |
19/10/16 | 12 | Review of the course and exam preparation |
Slide 41
Assessment details
Assignment 1
• Working in groups (20%). Initial report due 26th August, brief class presentation Week 11, final report due 14th October
Assignment 2
• Individual work (10%), due 7th October
Tutorial
• Participation (10%)
Examination (60%)
Slide 42
Unit Management
Lecturers:
• John Betts (CE)
• Sylvester Olubolu Orimaye (at Monash Malaysia)
Guest lecturer:
• Bernie Kruger (TAC) W4 (Clayton).
Tutors:
• Heshan Kumarage and Brian Ramirez Espinosa.
• Tutorials start Week 2. Allocate+ queries: https://secure.monash.edu/scsd/timetables/allocate/help/
Slide 43
Guest Lecture
• 18th August
• Bernie Kruger
• Data Science Lead at the Transport Accident Commission (TAC)
• 2 hour lecture on applied Data Science and consulting + Q&A with Bernie on working as a Graduate in the Data Science industry
Slide 44
Contact list
• John: x55804, [email protected]
• Sylvester: [email protected]
• Heshan: [email protected]
• Brian: [email protected]
Slide 45
R
Slide 46
Review of basic statistics using R
What is R?
Obtaining and installing R
Using R
Help and References in R
• Help
• References you should read
Review of basic statistics using R
Slide 47
What is R?
R is a statistical computing environment and programming language
• A successor to the S language developed at AT&T Bell Laboratories
• Initially created by Ross Ihaka and Robert Gentleman at the University of Auckland (hence ‘R’).
• R is now developed by the R Development Core Team
• R is freely available under the GNU General Public License (free, open source etc.)
Slide 48
Why we are using R
R:
• Is the de facto platform for data science, independent of operating system, problem domain and data type
• Has a large number of users and active user communities, e.g. MelbURN (Melbourne Users of R Network)
• Is free, open source, user-customisable…
• Has thousands of user-contributed packages covering all conceivable applications and data types, for visualisation, machine learning and data science
• One drawback: a steep learning curve!
Slide 49
Obtaining and installing R
Go to: http://cran.r-project.org/
• Follow the link to download the latest version of R for your operating system.
• Install as usual for your OS (Mac/Win easy)
• Use default directories if possible to make installation of RStudio easier.
• Runs from Launchpad or Start Button
LHS of main page has Documentation > Manuals
• Click to get: An Introduction to R (Release)
Slide 50
Obtaining and installing RStudio
RStudio is an IDE that makes running R a lot easier – especially I/O, managing data and variables, and scripting.
Go to: http://www.rstudio.com/
• Click on download now
• Install as usual for your OS
• Runs from Launchpad or Start Button
RStudio also makes Shiny – have a look at it!
Slide 51
Syntax
R is command line driven (basic installation)
• > indicates a new line; continued lines are marked by +
R is case sensitive
• TheData is different to Thedata
Assignment
• Use: x <- value
# denotes a comment. Anything on the line after this point is ignored.
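The syntax rules on this slide can be tried directly at the console. The snippet below is a small illustrative sketch, not part of the original slides; the variable names and the values 5 and 10 are arbitrary, and it assumes a fresh session in which `Thedata` has not been defined.

```r
# Assignment: <- binds a value to a name (the value 5 is arbitrary).
TheData <- 5

# R is case sensitive: Thedata is a different name and is not defined here.
print(exists("TheData"))
print(exists("Thedata"))

# A command can span several lines; at the console R shows + on the
# continuation line until the expression is complete.
total <- TheData +
  10
print(total)  # everything after # on a line is ignored
```

Typing `Thedata` on its own would raise an "object not found" error, which is why the sketch checks for the name with `exists()` instead.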
Slide 52
Console, Variables, Functions
The R Console shows the command line interface
R can be used for direct calculation and interprets each line as you press (Enter/Return) key, t...