Syllabus-3280 - Syllabus of this new STAT Course at UVa PDF

Title	Syllabus-3280 - Syllabus of this new STAT Course at UVa
Course	Data Visualization and Management
Institution	University of Virginia
Pages	3
File Size	75.6 KB
File Type	PDF

Description

STAT 3280: Data Visualization and Management
Spring 2021

Instructor: Tianxi Li, Email: [email protected]
Lecture: Asynchronous Lectures, Video Recordings (available in Collab)
Instructor office hour: Wed 2:30 pm – 4:30 pm, Zoom (available in Collab)
TA: Tonghao Zhang, Email: [email protected]
TA office hour: Mon 3 pm – 4 pm, Zoom ID: 929 095 3095

Course description: We will focus on two skills that are critical to a statistician's data analysis: visualization and database operations. Visualization is a crucial component of understanding your data and of presenting the findings of a data analysis. Since, in many cases, data are stored in a database, understanding how to access and manipulate data in a database environment is a useful skill for a statistician. We will split our effort roughly 6:4 between visualization and databases in this course.

Visualization will be the focus of the early part of the course. We will mainly use R as the software but introduce visualization methods, principles, and ideas in a much broader context. The oral presentation of visualization results is also part of the requirement. The later part of the course will introduce the basic concepts of a database system, relational design, and SQL operations. We will then introduce how to use R as a statistical interface for database operations.

Prerequisites: Any prior statistics course and prior experience with R programming.

Expectations: At the end of the course, students are expected to
1. understand data visualization results in articles and academic papers easily
2. given a specific data set, be able to extract the needed information and select proper ways to visualize it
3. be able to produce clear and informative visualizations for reports and presentations
4. have a basic understanding of the design and structure of a database system
5. be able to use SQL for basic data manipulation and extraction in a database
6. be able to use R for SQL operations in connection with a database

What you should expect to experience in the course: The course targets senior undergraduates who may enter the job market soon. Therefore, we will focus on the practical settings of real statistician jobs. A real-world situation means you should not expect a one-step operation with a standard solution, as in entry-level homework assignments. There is nothing like “this is the data set with x and y ready for you, so please generate a scatter plot between x and y”. The realistic situation is, “There is a US census database with about 500 economic factors of Virginia households over the past twenty years; could you grab some relevant data there, explore it, and see how to measure and visualize the economic recovery since the 2007 financial crisis?” Usually, your clients and managers will not give you a well-defined problem; they have only a vague idea of what they want. You, the statistician, should understand the data, explore the information hidden there, figure out a concrete problem, and provide a solution to it.

Moreover, the data set available to you is never clean and informative. You need to access the raw data. There may be tons of errors, missing values, and non-informative variables. So it is your job to process the data, clean it, identify potential data quality problems, and then explore the information strategically to discover the answer to your original question.

Be prepared! This is the type of training you will have in this course. That is why we do not have exams or final projects, and the grade will be based on homework submissions: I hope all students taking this course will concentrate on this kind of process, which needs focus, patience, and critical and creative thinking. Sometimes, even if I know the specific way to go, I will give you only vague or less informative answers, because this is what you may experience in real jobs. So a successful strategy is to get your hands dirty with the data and think carefully about your problem and the information you want to get. Do not expect to solve the problem with five lines of direct function calls. Such trivial situations do occur in real-world applications, but not often. The procedure from raw data to a good visualization can be called “data grind”. If you believe that you would not enjoy such a thing, you should consider whether this course is right for you.

Comparison with CS 4750 (and other database courses): CS 4750 is a thorough course on database systems that systematically covers the concepts and ideas behind the design, construction, and operation of a database. In contrast, the current course is a statistics course and puts only about half of its emphasis on databases. Clearly, it is not realistic to cover the same scope of topics in this course as in CS 4750. As a statistics course, we will touch only on the parts that are most relevant to statisticians. We will introduce database concepts and design only lightly. Most of the discussion will be on using basic SQL operations to get, insert, and manipulate data, and then on how to connect the database with R for more sophisticated downstream analysis. In the end, a statistician’s job is to analyze data with statistical methods, which will not be applicable in a database system anyway. Therefore, this course will only teach you how to use a database as a statistician.
Students who aim for a systematic and more in-depth course on database systems are advised to take CS 4750 or a similar course instead.

Lectures: For most of the course, the lectures will be given in the form of real-time coding in R, and you are supposed to follow along in your own R environment at the same time. Therefore, you are expected to be familiar with basic R operations. To make it easy for you to pause and go back, the lectures will be asynchronous. This also makes it easier to keep the two sections at the same pace. Lecture recordings will be uploaded to Collab on Mon & Wed mornings. Meanwhile, the GTA (Tonghao) and I will hold our Q&A sessions during the regular lecture time. The GTA will hold a 1-hour session on Monday, and I will hold a 2-hour session on Wed. So you can go through the lectures by yourself, ponder the parts that you do not understand, and come to the Q&A sessions to ask questions. You are strongly encouraged to spend time exploring answers before asking for help. It takes more time in the short term, but it will be much more beneficial in the long run, especially when you take a statistician job.

Piazza: We will use Piazza as our online discussion platform. You can talk about homework, interesting tools, and ideas about the course there. The GTA and I will try to participate in the discussion. You are encouraged to post your questions; if you feel puzzled about something, it is very likely that other students have the same question. Active problem solvers and helpers on Piazza will earn bonus points at the end of the semester.

Grading: The A/B/C grades correspond to 90%, 75%, and 65%, respectively. The grade is mainly based on the five homework assignments, with weights of 15%, 20%, 25%, 20%, and 20%, respectively. There are two categories of bonus points that you can potentially earn on top of the 100%. The first is for actively helping others on Piazza. You may earn up to 5% bonus points from this category.
The second category is the optional presentation (see details in the homework section). You can earn up to 5% bonus points from this category.

Homework: There will be five homework assignments. HW 1-3 will be about visualization, and HW 4-5 will be about databases. The dates on which they are assigned may depend on our progress. You can expect to have 2-3 weeks for each of HW 1, 2, 4, and 5. HW 3 takes longer, and you may have roughly 4 weeks for it (which may overlap with HW 4 to some extent). Assignments may be submitted by individuals or by groups of 2-3 students. You can work with different people on different assignments, but more than three members for one assignment is not acceptable. You can collaborate with students from the other section as well. Make sure you list all group members in the submission. The same grade will be given to all group members.

Regarding the optional presentation: Towards the end of the semester, we will schedule a live presentation session. For each of HW 1 and HW 2, I will work with our GTA to select the 1-3 best submissions and invite the teams to present their results. This is completely optional; a selected team does not have to accept the invitation. For HW 3, the larger data analysis and visualization task, all teams can volunteer to present their work. However, the decision should be made by April 2, so that we can plan a proper length for the presentation session and our progress on the database part. Each team will have 5-10 minutes to explain their analysis through visualization, including Q&A at the end. The audience will then vote on a bonus grade (1-5) for the team, which will be counted towards the final grade.

Homework submission policy: Please email your submission to the GTA. To ensure fairness in grading, late homework submissions cannot be accepted. Because all assignments are graded at the same time, late “arrivals” make it hard to ensure that all are graded in the same way. Exceptions such as conference deadlines (with a minimum of two weeks’ notice) or family or medical emergencies can be accommodated. “Another course deadline / exam” is not an acceptable “emergency.” Reasonable extensions will be granted for genuine emergencies only.

Reference textbooks: We do not follow any specific textbook.
But if you would like a book that covers some of the topics of the course, you can try the following:
1. Wickham, H., 2016. ggplot2: Elegant Graphics for Data Analysis. Springer.
2. Trueblood, R.P. and Lovett, J.N., 2001. Data Mining and Statistical Analysis Using SQL (Vol. 1). Berkeley, CA: Apress.
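
Worked example for the Grading section: the weighted total and bonus points can be sketched as below. This is only an illustration, not an official grade calculator. It assumes each homework is scored on a 0-100 scale, that bonus points are added as percentage points with each category capped at 5, and that the 90/75/65 cutoffs are inclusive lower bounds. The function names are hypothetical, and the syllabus does not say what happens below 65%.

```python
# Homework weights from the Grading section (HW 1-5), in percent.
HW_WEIGHTS = [15, 20, 25, 20, 20]

# Each bonus category (Piazza help, optional presentation) is capped at 5%.
MAX_BONUS_PER_CATEGORY = 5

# Letter-grade cutoffs from the syllabus, assumed to be inclusive lower bounds.
CUTOFFS = [(90, "A"), (75, "B"), (65, "C")]


def final_percentage(hw_scores, piazza_bonus=0.0, presentation_bonus=0.0):
    """Return the weighted homework total plus capped bonus points.

    hw_scores: five scores on a 0-100 scale (an assumption of this sketch).
    """
    if len(hw_scores) != len(HW_WEIGHTS):
        raise ValueError("expected one score per homework assignment")
    base = sum(w * s for w, s in zip(HW_WEIGHTS, hw_scores)) / 100
    bonus = (min(piazza_bonus, MAX_BONUS_PER_CATEGORY)
             + min(presentation_bonus, MAX_BONUS_PER_CATEGORY))
    return base + bonus


def letter_grade(percentage):
    """Map a final percentage to the A/B/C bands given in the syllabus."""
    for cutoff, letter in CUTOFFS:
        if percentage >= cutoff:
            return letter
    return "below C (not specified in the syllabus)"
```

For instance, homework scores of 90, 85, 80, 95, and 88 give a weighted total of 87.1, which falls in the B band under these assumptions. Adding 3 bonus points from Piazza would raise it to 90.1, which falls in the A band.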
