OIDD 245 notes - Session 6 PDF

Title: OIDD 245 notes - Session 6
Course: Analytics & the Digital Economy
Institution: University of Pennsylvania

Summary

OIDD245 - Spring 2021 - Professor Tambe...


Description

Session 6 - Data Wrangling (Mon Feb 8th)

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights - The New York Times
- “big data”: the modern abundance of digital data from many sources can be mined with clever software for discoveries and insights - smarter, data-driven decision-making in every field
- why data scientist is the economy’s hot new job
- too much handcrafted work - “data wrangling,” “data munging,” “data janitor work” - is still required; data scientists spend from 50 percent to 80 percent of their time on this more mundane labor
- combining different data sets: data from sensors, documents, the web, and conventional databases all come in different formats and must be cleaned up and converted into a unified form that the algorithm can understand
- startups are creating software tools to address this problem: ClearStory Data; Trifacta makes a tool for data professionals; Paxata is focused squarely on automating data preparation

Data Scientist: The Dirtiest Job of the 21st Century
- 40% a vacuum, 40% a janitor, and 20% a fortune-teller
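The “combining different data sets” point above can be sketched in a few lines. This is a minimal, hypothetical example (the field names, sample records, and mapping are made up for illustration): two sources deliver the same kind of records in different formats - a CSV export and a JSON feed - and each is converted into one unified record shape before any analysis.

```python
import csv
import io
import json

# Hypothetical raw inputs from two sources in different formats.
csv_text = "id,name,signup\n1,Alice,2021-02-08\n2,Bob,2021-02-09\n"
json_text = '[{"user_id": 3, "full_name": "Cara", "signup_date": "2021-02-10"}]'

def from_csv(text):
    # CSV already uses the target field names; just fix the id type.
    return [
        {"id": int(r["id"]), "name": r["name"], "signup": r["signup"]}
        for r in csv.DictReader(io.StringIO(text))
    ]

def from_json(text):
    # JSON uses different field names; map them onto the unified shape.
    return [
        {"id": r["user_id"], "name": r["full_name"], "signup": r["signup_date"]}
        for r in json.loads(text)
    ]

# One list of records with identical keys, ready for downstream code.
unified = from_csv(csv_text) + from_json(json_text)
```

The design choice is the one the article implies: write one small adapter per source so that everything downstream sees a single schema, rather than letting every analysis handle every format.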


- Data lakes are centralised repositories that store all of a company’s data. Unfortunately, some people assume that data lakes are data dumping grounds: many organisations started implementing data lakes without a clear idea of what to do with the data collected, but it is still vital to design a data lake with specific project needs in mind.
- Data scientists often find themselves contacting different departments for data. Merely storing data without cataloguing it is a big mistake; the key to having a useful data lake is to ensure that the metadata is well-defined.
- Dirty data takes away the integrity of the dataset:
  - Incomplete data is when some essential features are empty.



  - Inaccurate and inconsistent data is when the values are technically correct but wrong based on the context. For example, an employee changed his address, and it wasn’t updated.
  - Duplicate data is when the same record appears more than once.
- The 80/20 rule: data scientists spend only 20 percent of their time on building models and the other 80 percent gathering, analysing, cleaning, and reorganising data.
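The three kinds of dirty data above can be handled in a single cleaning pass. This is a hedged sketch, not a general recipe: the employee records, field names, and the `city_map` normalisation table are all invented for the example.

```python
# Hypothetical records illustrating the three kinds of dirty data.
records = [
    {"id": 1, "name": "Ann", "city": "Philadelphia"},
    {"id": 2, "name": "",    "city": "Boston"},        # incomplete: name is empty
    {"id": 3, "name": "Raj", "city": "Phila."},        # inconsistent spelling
    {"id": 1, "name": "Ann", "city": "Philadelphia"},  # exact duplicate
]

def clean(rows):
    city_map = {"Phila.": "Philadelphia"}  # normalise inconsistent values
    seen = set()
    out = []
    for r in rows:
        if not r["name"]:                  # drop incomplete rows
            continue
        r = {**r, "city": city_map.get(r["city"], r["city"])}
        key = tuple(sorted(r.items()))     # hashable fingerprint of the row
        if key in seen:                    # drop exact duplicates
            continue
        seen.add(key)
        out.append(r)
    return out

cleaned = clean(records)
```

Note the ordering: inconsistent values are normalised *before* duplicate detection, since two rows that differ only in spelling would otherwise slip past the duplicate check.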

