HW03-1 - HW3

Author	Anonymous User
Course	Data Management for Analytics
Institution	Northeastern University

Assignment 3 DS 5110: Introduction to Data Management

Instructions: Create a directory with the following structure:

`hw3-your-name`

In this directory you must add your Python source code, your report (hw3-your-name.pdf), and a README.txt (or *.md) file describing the code files. Also include the data in the directory.

Compress the directory as `.zip`. Your solution should include all of the code necessary to answer the problems. All your code should run (assuming the data is available). All plots should be generated using functions taught in class. Missing values should be handled appropriately. Axes should be labeled clearly and accurately.

## Part A

Problems 1-2 use data from the US Department of Education's Civil Rights Data Collection. Download the zipped 2015-2016 data from https://www2.ed.gov/about/offices/list/ocr/docs/crdc-2015-16.html. The Public Use Data File User's Manual should be included in the zipped files, or can be downloaded at the same location. Use it as a reference to help you understand the dataset.

Use the appropriate Pandas functions to import the dataset into your Python program. Check the Data File Manual (within the zip version of the dataset) for how missing values were reported, and handle them appropriately. Treat all reserve codes as missing.

### Problem 1

We would like to investigate whether Hispanic and Native American (American Indian / Alaska Native) students are over- or under-represented in Gifted & Talented programs. Create a new `DataFrame` containing only schools with a Gifted & Talented program with the following columns:

• The total number of students enrolled at each school
• The number of Hispanic students and Native American students at each school
• The total number of students in the school's GT program
• The number of students in the GT program who are Hispanic or Native American
• The proportion of students at each school who are Hispanic or Native American among all students
• The proportion of students in the GT program who are Hispanic or Native American among students in the GT program
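As a minimal sketch of the steps above (not a reference solution), the snippet below treats every negative count as a reserve code, keeps only schools that report a GT program, and derives the six columns. The column names (`SCH_GT_IND`, `TOT_ENR_M`, `SCH_ENR_HI_M`, and so on), the `"Yes"` indicator value, and the idea that reserve codes are negative numbers are all assumptions to be checked against the User's Manual. The inline CSV is made-up data standing in for the real file, and `build_gt_frame` is a hypothetical helper name.

```python
import io
import pandas as pd

# Made-up rows standing in for the CRDC file; column names are assumptions.
SAMPLE_CSV = """SCH_GT_IND,TOT_ENR_M,TOT_ENR_F,SCH_ENR_HI_M,SCH_ENR_HI_F,SCH_ENR_AM_M,SCH_ENR_AM_F,TOT_GTENR_M,TOT_GTENR_F,SCH_GTENR_HI_M,SCH_GTENR_HI_F,SCH_GTENR_AM_M,SCH_GTENR_AM_F
Yes,100,100,20,20,5,5,10,10,1,1,0,0
No,50,50,10,10,0,0,-9,-9,-9,-9,-9,-9
Yes,200,200,-5,40,10,10,20,20,4,4,1,1
"""

ENR_HI_AM = ["SCH_ENR_HI_M", "SCH_ENR_HI_F", "SCH_ENR_AM_M", "SCH_ENR_AM_F"]
GT_HI_AM = ["SCH_GTENR_HI_M", "SCH_GTENR_HI_F", "SCH_GTENR_AM_M", "SCH_GTENR_AM_F"]


def build_gt_frame(raw):
    """Return one row per GT school with counts and proportions."""
    counts = raw.drop(columns="SCH_GT_IND")
    # Assumption: reserve codes are negative, so any negative count becomes NaN.
    counts = counts.where(counts >= 0)
    gt = counts[raw["SCH_GT_IND"] == "Yes"]

    # skipna=False keeps a total missing when any of its parts is missing.
    out = pd.DataFrame({
        "total_enrolled": gt[["TOT_ENR_M", "TOT_ENR_F"]].sum(axis=1, skipna=False),
        "hi_am_enrolled": gt[ENR_HI_AM].sum(axis=1, skipna=False),
        "gt_total": gt[["TOT_GTENR_M", "TOT_GTENR_F"]].sum(axis=1, skipna=False),
        "gt_hi_am": gt[GT_HI_AM].sum(axis=1, skipna=False),
    })
    out["prop_hi_am"] = out["hi_am_enrolled"] / out["total_enrolled"]
    out["prop_gt_hi_am"] = out["gt_hi_am"] / out["gt_total"]
    return out.reset_index(drop=True)


gt_frame = build_gt_frame(pd.read_csv(io.StringIO(SAMPLE_CSV)))
print(gt_frame)
```

Summing with `skipna=False` is a deliberate choice here: a school with one suppressed count gets a missing total rather than an undercounted one.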

Plot the proportion of Hispanic and Native American students at each school (on the x-axis) versus the proportion of GT students who are Hispanic and Native American (on the y-axis). Include a smoothing line on the plot.
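The smoothing function taught in class is not named here, so the sketch below uses a centered rolling mean as a stand-in (LOWESS would be a common alternative). It only computes the smoothed curve; the column names and the helper `rolling_smooth` are hypothetical, and the data is invented.

```python
import pandas as pd


def rolling_smooth(df, x, y, window=3):
    """Sort by x and add a centered rolling mean of y as column 'smooth'."""
    ordered = df[[x, y]].dropna().sort_values(x).reset_index(drop=True)
    ordered["smooth"] = (
        ordered[y].rolling(window, center=True, min_periods=1).mean()
    )
    return ordered


# Invented per-school proportions; the row with a missing x value is dropped.
points = pd.DataFrame({
    "prop_hi_am": [0.5, 0.1, 0.3, None, 0.2, 0.4],
    "prop_gt_hi_am": [0.4, 0.1, 0.2, 0.9, 0.3, 0.3],
})
curve = rolling_smooth(points, "prop_hi_am", "prop_gt_hi_am")
print(curve)
```

The `smooth` column can then be drawn as a line over the scatter of the raw points. On the real data a much larger `window` would be needed, since thousands of schools are involved.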

What do you observe in the plot? Does the plot indicate an over- or under-representation of Hispanic and Native American students in Gifted & Talented programs?

Calculate the overall proportion of Hispanic and Native American students across all schools and the overall proportion of GT students who are Hispanic and Native American. Are Hispanic and Native American students over- or under-represented in Gifted & Talented programs?

### Problem 2

We would like to investigate whether disabled students are disproportionately referred to law enforcement for discipline. Create a new `DataFrame` with the following columns:

• The total number of students enrolled at each school
• The number of disabled students (served by IDEA) at each school
• The total number of students who were referred to law enforcement
• The number of disabled students (served by IDEA) who were referred to law enforcement
• The proportion of disabled students (served by IDEA) at each school among all students
• The proportion of students who were referred to law enforcement and are disabled (served by IDEA) among all students referred to law enforcement

Plot the proportion of disabled students at each school (on the x-axis) versus the proportion of students who were referred to law enforcement and are disabled (on the y-axis). Include a smoothing line on the plot.

What do you observe in the plot? Does the plot indicate an over- or under-representation of disabled students among students who are referred to law enforcement?

Calculate the overall proportion of disabled students across all schools and the overall proportion of students who were referred to law enforcement and are disabled (served by IDEA) among all students referred to law enforcement across all schools. Are disabled students disproportionately referred to law enforcement?

************************************************************************************
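Problems 1 and 2 both ask for an overall proportion across all schools. One reasonable reading is the ratio of summed counts rather than the average of per-school proportions, since the average would weight small and large schools equally. The sketch below follows that reading on invented numbers; the column names and the helper `overall_proportion` are hypothetical.

```python
import pandas as pd


def overall_proportion(df, part, whole):
    """Ratio of summed counts, using only rows where both counts are present."""
    valid = df[[part, whole]].dropna()
    return valid[part].sum() / valid[whole].sum()


# Invented school-level counts; None marks a value treated as missing.
schools = pd.DataFrame({
    "total_enrolled": [1000, 100, 400],
    "idea_enrolled": [100, 30, None],
    "total_referred": [10, 5, 4],
    "idea_referred": [4, 3, 1],
})

share_enrolled = overall_proportion(schools, "idea_enrolled", "total_enrolled")
share_referred = overall_proportion(schools, "idea_referred", "total_referred")

# A ratio above 1 would suggest over-representation among referrals.
representation_ratio = share_referred / share_enrolled
print(share_enrolled, share_referred, representation_ratio)
```

Dropping rows where either count is missing keeps the numerator and denominator over the same set of schools.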

## Part B

Problems 3-5 use a subset of the DBLP database of bibliographic information on major computer science journals and proceedings, available from https://data.mendeley.com/datasets/3p9w84t5mr. The dataset has been processed to include predictions of the authors' genders using the open-source Genderize API. The processed data has been made available in the form of `SQL` scripts that import the data into a MySQL database. We are primarily interested in the "general" and "authors" tables created by the "main.sql" and "authors.sql" scripts, respectively.

There are many options for loading a SQL database into Python. In class we briefly discussed the sqlite3 and sqlalchemy modules, which can read SQLite databases. Other methods exist and can be used, such as "MySQL Connector", which can read MySQL databases directly.
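As a rough illustration of the sqlite3 route, the sketch below reads a filtered table into a pandas `DataFrame` with `pd.read_sql_query`. The in-memory database and the `authors` columns (`k`, `pos`, `name`, `gender`, `prob`) are assumptions made for the example; the real schema should be checked in the converted database.

```python
import sqlite3
import pandas as pd

# In-memory stand-in; with the real data this would be sqlite3.connect("dblp.db").
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (k TEXT, pos INTEGER, name TEXT, gender TEXT, prob REAL);
INSERT INTO authors VALUES
  ('conf/a/1', 0, 'A. Author', 'female', 0.97),
  ('conf/a/1', 1, 'B. Author', 'male', 0.55),
  ('conf/b/2', 0, 'C. Author', 'male', 0.99);
""")

# The ? placeholder keeps the threshold out of the SQL string.
authors = pd.read_sql_query(
    "SELECT k, pos, name, gender, prob FROM authors "
    "WHERE prob >= ? ORDER BY name",
    conn,
    params=(0.90,),
)
conn.close()
print(authors)
```

With sqlalchemy, an engine created by `sqlalchemy.create_engine("sqlite:///dblp.db")` can be passed to `pd.read_sql_query` in place of the sqlite3 connection.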

To use sqlite3 and sqlalchemy we first need to convert the database to SQLite format. One possible tool for this (if you are using a Unix-like operating system) is the script at https://github.com/dumblob/mysql2sqlite. The dataset comes with a "DBLP-CSR-README.pdf" file describing how to access the data using MySQL.

If you are using a POSIX-compliant operating system, you could import the relevant tables into a SQLite database named `dblp.db` using the following commands in a compatible shell:

`./mysql2sqlite main.sql | sqlite3 dblp.db`

`./mysql2sqlite authors.sql | sqlite3 dblp.db`

and use the sqlite3 and sqlalchemy modules to work with the data in Python.

### Problem 3

Filter the data to include only the authors for whom a gender was predicted as 'male' or 'female' with a probability of 0.90 or greater, and then create a bar plot showing the total number of *distinct* male and female authors who published each year. Comment on the visualization.

### Problem 4

Still including only the authors for whom a gender was predicted with a probability of 0.90 or greater, create a stacked bar plot showing the *proportions* of distinct male authors vs. distinct female authors who published each year. (The stacked bars for each year will sum to one.) Comment on the visualization.

### Problem 5

Still including only the authors for whom a gender was predicted with a probability of 0.90 or greater, create faceted bar plots showing the *proportions* of female *first authorships* for each year for each domain (CS, DE, SE, and TH). (If a conference belongs to multiple domains, you may include it for both of them.) Comment on the visualization. Which domains have the highest and lowest representation of papers with female first authors?...
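The counting behind Problems 3 and 4 can be sketched in SQL, under several assumptions: that `general` and `authors` share a paper key `k`, that `general` has a `year` column, that `authors` has `name`, `gender`, and `prob` columns, that gender is stored as 'male'/'female', and that distinct authors can be identified by name. The tiny in-memory tables below are invented to show the filter and the distinct count; they are not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the converted dblp.db
conn.executescript("""
CREATE TABLE general (k TEXT PRIMARY KEY, year INTEGER);
CREATE TABLE authors (k TEXT, pos INTEGER, name TEXT, gender TEXT, prob REAL);
INSERT INTO general VALUES ('p1', 2000), ('p2', 2000), ('p3', 2001);
INSERT INTO authors VALUES
  ('p1', 0, 'Ada', 'female', 0.99),
  ('p2', 0, 'Ada', 'female', 0.99),
  ('p1', 1, 'Bob', 'male', 0.95),
  ('p2', 1, 'Cy', 'male', 0.92),
  ('p3', 0, 'Dee', 'female', 0.60),
  ('p3', 1, 'Bob', 'male', 0.95);
""")

# Ada appears on two papers in 2000 but is counted once; Dee is filtered out
# because her predicted probability is below 0.90.
QUERY = """
SELECT g.year, a.gender, COUNT(DISTINCT a.name) AS n_authors
FROM authors AS a
JOIN general AS g ON g.k = a.k
WHERE a.gender IN ('male', 'female') AND a.prob >= 0.90
GROUP BY g.year, a.gender
ORDER BY g.year, a.gender
"""
counts = conn.execute(QUERY).fetchall()
conn.close()

# Per-year proportions (Problem 4): each year's shares sum to one.
totals = {}
for year, _gender, n in counts:
    totals[year] = totals.get(year, 0) + n
proportions = {(year, gender): n / totals[year] for year, gender, n in counts}

print(counts)
print(proportions)
```

A year with no authors of one gender simply has no row for that gender, so a plotting step may need to fill in zeros before stacking.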