Midterm 2 Solution PDF

Title Midterm 2 Solution
Course Computing for Data Analysis
Institution Georgia Institute of Technology
Pages 37
File Size 967.2 KB
File Type PDF

Summary

CSE 6040 Midterm 2 exam solution (Fall 2020)...


Description


GT learners: Sample solutions

Problem 0: Soccer Guru (Version 1.5)

Soccer season is on and teams need to start preparing for the World Cup 2022. We need your help as a Soccer Guru to analyse different statistics and come up with insights to help the teams prepare better. This problem tests your understanding of Pandas and SQL concepts. Important note. Due to a limitation in Vocareum's software stack, this notebook is set to use the Python 3.5 kernel (rather than a more up-to-date 3.6 or 3.7 kernel). If you are developing on your local machine and are using a different version of Python, you may need to adapt your solution before submitting to the autograder.

Exercise 0 (0 points). Run the code cell below to load the data, which is a SQLite3 database containing results and fixtures of various soccer matches that have been played around the globe since 1980. Observe that the code loads all rows from the table, soccer_results, contained in the database file, prob0.db. You do not need to do anything for this problem other than run the next two code cells and familiarize yourself with the resulting dataframe, which is stored in the variable df.

In [1]: import sqlite3 as db
        import pandas as pd
        from datetime import datetime
        from collections import defaultdict

        disk_engine = db.connect('file:prob0.db?mode=ro', uri=True)

        def load_data():
            df = pd.read_sql_query("SELECT * FROM soccer_results", disk_engine)
            return df

In [2]: # Test: Exercise 0 (exposed)
        df = load_data()
        assert df.shape[0] == 22851, "Row counts do not match. Try loading the data again"
        assert df.shape[1] == 9, "You don't have all the columns. Try loading the data again"
        print("\n(Passed!)")
        df.head()

(Passed!)

Out[2]:
   date        home_team   away_team     home_score  away_score  tournament  city        country     neutral
0  1994-01-02  Barbados    Grenada       0           0           Friendly    Bridgetown  Barbados    FALSE
1  1994-01-02  Ghana       Egypt         2           1           Friendly    Accra       Ghana       FALSE
2  1994-01-05  Mali        Burkina Faso  1           1           Friendly    Bamako      Mali        FALSE
3  1994-01-09  Mauritania  Mali          1           3           Friendly    Nouakchott  Mauritania  FALSE
4  1994-01-11  Thailand    Nigeria       1           1           Friendly    Bangkok     Thailand    FALSE

Each row of this dataframe is a game, which is played between a "home team" (column home_team) and an "away team" (away_team). The number of goals scored by each team appears in the home_score and away_score columns, respectively.
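Exercise 1, below, poses an "average away goals" question in SQL. For readers who prefer to reason about the data in Pandas first, here is a rough sketch of the same idea directly on df. It is not the graded solution, and it assumes the date column holds ISO-formatted 'YYYY-MM-DD' strings, the score columns are numeric, and neutral holds the string 'FALSE' for non-neutral sites, as in the preview above.

    # Sketch only: average away goals at non-neutral sites since 2000,
    # restricted to teams with at least 30 such games.
    away = df[(df['neutral'] == 'FALSE') & (df['date'] >= '2000-01-01')]
    stats = away.groupby('away_team')['away_score'].agg(['mean', 'count'])
    top10 = (stats[stats['count'] >= 30]['mean']
             .round(3)
             .sort_values(ascending=False)
             .head(10))
    print(top10)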


Exercise 1 (1 point): Write an SQL query to find the ten (10) teams that have the highest average away-scores since the year 2000. Your query should satisfy the following criteria:

- It should return two columns: team, the name of the team, and ave_goals, the team's average number of goals in "away" games. An "away game" is one in which the team's name appears in away_team and the game takes place at a "non-neutral site" (the neutral value equals FALSE).
- It should only include teams that have played at least 30 away matches.
- It should round the average goals value (ave_goals) to three decimal places.
- It should only return the top 10 teams, in descending order by average away-goals.
- It should only consider games played since 2000 (including the year 2000).

Store your query string as the variable, query_top10_away, below. The test cell will run this query string against the input dataframe, df, defined above and return the result in a dataframe named offensive_teams. (See the test cell.)

Note. The following exercises have hidden test cases and you'll be awarded full points for passing both the exposed and hidden test cases.

In [3]: query_top10_away = ''

        # Write your query here!

        ### BEGIN SOLUTION
        query_top10_away = """
            SELECT away_team AS team, ROUND(AVG(away_score), 3) AS ave_goals
            FROM soccer_results
            WHERE STRFTIME('%Y', date) >= '2000' AND neutral = 'FALSE'
            GROUP BY away_team
            HAVING COUNT(*) >= 30
            ORDER BY ave_goals DESC
            LIMIT 10"""
        ### END SOLUTION

        print(query_top10_away)

SELECT away_team AS team, ROUND(AVG(away_score), 3) AS ave_goals
FROM soccer_results
WHERE STRFTIME('%Y', date) >= '2000' AND neutral = 'FALSE'
GROUP BY away_team
HAVING COUNT(*) >= 30
ORDER BY ave_goals DESC
LIMIT 10

In [4]: # Test: Exercise 1 (exposed)
        offensive_teams = pd.read_sql_query(query_top10_away, disk_engine)
        df_cols = offensive_teams.columns.tolist()
        df_cols.sort()
        desired_cols = ['team', 'ave_goals']
        desired_cols.sort()
        print(offensive_teams.head(10))
        assert offensive_teams.shape[0] == 10, "Expected 10 rows but returned dataframe has {}".format(offensive_teams.shape[0])
        assert offensive_teams.shape[1] == 2, "Expected 2 columns but returned dataframe has {}".format(offensive_teams.shape[1])
        assert df_cols == desired_cols, "Column names should be: {}. Returned dataframe has: {}".format(desired_cols, df_cols)
        tolerance = .001
        team_4 = offensive_teams.iloc[3].team
        team_4_ave = offensive_teams.iloc[3].ave_goals
        desired_team_4_ave = 1.763
        assert team_4 == "England" and abs(team_4_ave - desired_team_4_ave) <= tolerance

…

        def determine_winners(game_df):
            ### BEGIN SOLUTION
            def who_won(row):
                if row['home_score'] > row['away_score']:
                    return row['home_team']
                if row['home_score'] < row['away_score']:
                    return row['away_team']
                return 'Draw'
            winners_df = game_df.copy()
            winners_df['winner'] = game_df.apply(who_won, axis=1)
            return winners_df
            ### END SOLUTION

In [10]: # Test: Exercise 3 (exposed)
         game_df = pd.read_sql_query("SELECT * FROM soccer_results", disk_engine)
         winners_df = determine_winners(game_df)
         game_winner = winners_df.iloc[1].winner
         assert game_winner == "Ghana", "Expected Ghana to be winner. Got {}".format(game_winner)
         game_winner = winners_df.iloc[2].winner
         assert game_winner == "Draw", "Match was Draw. Got {}".format(game_winner)
         game_winner = winners_df.iloc[3].winner
         assert game_winner == "Mali", "Expected Mali to be winner. Got {}".format(game_winner)
         print("\n(Passed!)")

(Passed!)

In [11]: # Hidden test cell: exercise3_hidden
         print("""
         In addition to the tests above, this cell will include some hidden tests.
         You will only know the result when you submit your solution to the autograder.
         """)
         ### BEGIN HIDDEN TESTS
         def get_winner(row):
             home_score = row['home_score']
             away_score = row['away_score']
             winner = 'Draw'
             if home_score > away_score:
                 winner = row['home_team']
             elif home_score < away_score:
                 winner = row['away_team']
             return(winner)

         def sol_3(game_df_sol):
             game_df_sol = game_df_sol.assign(winner = "None")
             dummy_df = game_df.copy()
             dummy_df['winner'] = dummy_df.apply(lambda x: get_winner(x), axis=1)
             return dummy_df

         game_df = pd.read_sql_query("SELECT * FROM soccer_results", disk_engine)
         game_df_sol = sol_3(game_df)
         winners_df = determine_winners(game_df)
         winners_df = winners_df.reset_index(drop=True)
         game_df_sol = game_df_sol.reset_index(drop=True)
         merged = game_df_sol.merge(winners_df, indicator=True, how='outer')

         diffs = merged[merged['_merge'] == 'right_only']
         assert len(diffs) == 0, "The dataframe doesn't match the expected output on \n{}".format(diffs)
         assert winners_df.equals(game_df_sol), "The dataframe doesn't match the expected output"
         print("\nPassed!")
         ### END HIDDEN TESTS

In addition to the tests above, this cell will include some hidden tests.
You will only know the result when you submit your solution to the autograder.

Passed!

Exercise 4 (3 points): Given a team, its home advantage ratio is the number of home games it has won divided by the number of home games it has played. For this exercise, we'll try to answer the question, how important is the home advantage in soccer? Its importance is reflected in how competitions are drawn; for example, teams want to play the second leg of high-stakes matches, such as tournament knockouts, at home. (This exercise has a prerequisite of finishing Exercise 3, as we'll be using the resulting dataframe from that exercise in this one.)

Complete the function, calc_home_advantage(winners_df), below, so that it returns the top 5 countries, among those that have played at least 50 home games, having the highest home advantage ratio. It should return a dataframe with two columns, team and ratio, holding the name of the team and its home advantage ratio, respectively. The ratio should be rounded to three decimal places. The rows should be sorted in descending order of ratio. If two teams have the same winning ratio, they should appear in alphabetical order by name.

Note 0. As with our definition of away games, a team plays a home game if it is the home team (home_team) and the field is non-neutral (i.e., neutral is FALSE).

Note 1. You should find, for example, that Brazil is the number two team, with a home advantage ratio of 0.773.
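As a quick sanity check of the arithmetic, the ratio computation on made-up counts (not taken from the dataset) looks like this:

    # Hypothetical team: 41 wins in 53 non-neutral home games (53 >= 50, so it qualifies).
    home_wins, home_games = 41, 53
    ratio = round(home_wins / home_games, 3)
    print(ratio)   # 0.774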

In [12]: def calc_home_advantage(winners_df):
             ### BEGIN SOLUTION
             return calc_home_advantage__v1(winners_df)

         # Solution method 0 (explicit loop)
         def calc_home_advantage__v0(winners_df):
             home_teams = winners_df.home_team.unique()
             final_df = []
             for team in home_teams:
                 home_game_df = winners_df[(winners_df.home_team == team) & (winners_df.neutral == 'FALSE')]
                 number_of_home_games = home_game_df.shape[0]
                 if number_of_home_games < 50:
                     continue
                 home_team_winners = home_game_df[home_game_df.winner == team]
                 num_home_team_wins = home_team_winners.shape[0]
                 winning_percentage = round(num_home_team_wins / number_of_home_games, 3)
                 data_point = {"team": team, "ratio": winning_percentage}
                 final_df.append(data_point)
             final_df = pd.DataFrame(final_df)
             final_df = final_df.sort_values(by=['ratio', 'team'], ascending=[False, True])
             final_df = final_df[['team', 'ratio']]
             return final_df.iloc[0:5]

         # Solution method 1 (no explicit loops)
         def calc_home_advantage__v1(winners_df):
             home_games_played = winners_df.groupby('home_team')['home_team'].count()
             home_teams_50 = home_games_played[home_games_played >= 50]
             wdf_50 = winners_df.set_index('home_team').loc[home_teams_50.index].reset_index()
             wdf_50 = wdf_50[wdf_50['neutral'] == 'FALSE']
             wdf_50['home_team_won'] = wdf_50['home_team'] == wdf_50['winner']
             wdf_50_played = wdf_50.groupby('home_team')['home_team'].count()
             wdf_50_won = wdf_50.groupby('home_team')['home_team_won'].sum()
             wdf_50_ratio = (wdf_50_won / wdf_50_played).round(3)
             return wdf_50_ratio.sort_values(ascending=False).head(5).to_frame().reset_index() \
                 .rename(columns={'home_team': 'team', 0: 'ratio'}) \
                 .head(5)
         ### END SOLUTION

In [13]: # Test: Exercise 4 (exposed)
         from IPython.display import display
         win_perc = calc_home_advantage(winners_df)
         print("The solution, according to you:")
         display(win_perc)

         df_cols = win_perc.columns.tolist()
         df_cols.sort()
         desired_cols = ['team', 'ratio']
         desired_cols.sort()
         assert win_perc.shape[0] == 5, "Expected 5 rows, got {}".format(win_perc.shape[0])
         assert win_perc.shape[1] == 2, "Expected 2 columns, got {}".format(win_perc.shape[1])
         assert df_cols == desired_cols, "Expected {} columns but got {} columns".format(desired_cols, df_cols)
         tolerance = .001
         sec_team = win_perc.iloc[1].team
         sec_perc = win_perc.iloc[1].ratio
         assert sec_team == "Brazil" and abs(sec_perc - .773) <= tolerance

…

(14054, 144) (14053, 144)

Exercise 1 (1 point). Several of the survey questions offered the choice, Other. These are encoded in the column names. For instance, Question 31 has 12 parts and an "other" option (Q31_OTHER_TEXT), as does Question 32 (Q32_OTHER). Remove these columns from the data dataframe. (That is, overwrite data with a version of itself omitting columns that contain the substring OTHER.) Note. There are also non-split questions having an option named Other. For instance, see Q5 in row 1 or Q6 in row 2. Do not remove these columns.
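One convenient tool here is DataFrame.filter(regex=...), which selects columns whose names match a pattern. A toy illustration (the column names below are made up to mimic the survey's naming scheme, and this is not the graded solution itself):

    import pandas as pd

    toy = pd.DataFrame(columns=['Q31_Part_1', 'Q31_OTHER_TEXT', 'Q32_OTHER', 'Q5'])
    # filter(regex='OTHER') keeps only the columns whose names contain 'OTHER',
    # so listing them gives exactly the columns to drop.
    print(list(toy.filter(regex='OTHER')))   # ['Q31_OTHER_TEXT', 'Q32_OTHER']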

In [4]: ### BEGIN SOLUTION
        data.drop(list(data.filter(regex='OTHER')), axis=1, inplace=True)
        ### END SOLUTION

In [5]: # Test cell : `exercise1` (exposed)
        assert data.shape[0] == 14053, "Incorrect number of rows in the dataframe!"
        if data.shape[1] != 138:
            how_many = "few" if data.shape[1] < 138 else "many"

…

        print("==> Took %g seconds." % elapsed_time)

==> Took 9.0582 seconds.

Exercise 1 (5 points). Let's start by seeing if we can make the FC_naive() function faster by rewriting its two innermost loops, i.e., the d1 and d2 loops:

    for d1 in range(H):
        for d2 in range(W):
            out[ni, mi] += x[ni, d1, d2] * w[mi, d1, d2]
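For reference, a fully naive FC layer consistent with this loop nest and with the array shapes described in the next cell might look like the sketch below. The name FC_naive_sketch and the exact body are illustrative assumptions; the notebook's actual FC_naive() is defined earlier and may differ in detail.

    import numpy as np

    # Illustrative sketch of a naive fully-connected layer with four explicit loops.
    def FC_naive_sketch(x, w, b):
        N, H, W = x.shape       # x: (N, H, W) batch of images
        M, _, _ = w.shape       # w: (M, H, W) weights; b: (M,) biases
        out = np.zeros((N, M))
        for ni in range(N):
            for mi in range(M):
                out[ni, mi] = b[mi]          # start from the bias term
                for d1 in range(H):
                    for d2 in range(W):
                        out[ni, mi] += x[ni, d1, d2] * w[mi, d1, d2]
        return out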


For this exercise, complete the function two_inner_loops(x_i, w_l, b_l), below, so that it implements the same computation as these two d1 and d2 loops, but is much faster. It should return out[ni, mi]. The input x_i is the i-th image, w_l is the l-th weight matrix, and b_l is the l-th component of the bias vector. The test code will check your results and benchmark a complete FC layer using the function FC_two_loops(), defined below. You'll see that it calls your two_inner_loops() routine to implement the two innermost loops. To get credit on this exercise, FC_two_loops() must run at least 100 times faster than FC_naive() on the problem sizes being tested below when running on the Vocareum platform. There is no partial credit for smaller speedups. Having said that, a little bit of basic Numpy should go a long way.

In [4]: def two_inner_loops(x_i, w_l, b_l):
            """
            Inputs:
            - x_i: A numpy array of images of shape (H, W)
            - w_l: A numpy array of weights of shape (H, W)
            - b_l: A float (single number)
            Returns:
            - out: A float (single number)
            """
            ### BEGIN SOLUTION
            return np.sum(np.multiply(x_i, w_l)) + b_l
            ### END SOLUTION

In [5]: # Test cell: 'FC_two_loops_1' (5 points)
        def FC_two_loops(x, w, b):
            """
            Inputs:
            - x: A numpy array of images of shape (N, H, W)
            - w: A numpy array of weights of shape (M, H, W)
            - b: A numpy vector of biases of size M
            Returns:
            - out: a numpy array of shape (N, M)
            """
            N, H, W = x.shape
            M, _, _ = w.shape
            out = np.zeros((N, M))
            for ni in range(N):
                for mi in range(M):
                    out[ni, mi] = two_inner_loops(x[ni, :, :], w[mi, :, :], b[mi])
            return out

        num_inputs = 50
        input_shape = (128, 256)
        output_dim = 10
        x = np.random.rand(num_inputs, *input_shape)
        w = np.random.rand(output_dim, *input_shape)
        b = np.random.rand(output_dim)

        start_time = time.time()
        for i in range(5):
            out_fast = FC_two_loops(x, w, b)
        elapsed_time_fast = (time.time() - start_time)/5
        print("==> Took %g seconds." % elapsed_time_fast)

        start_time = time.time()
        for i in range(5):
            out_naive = FC_naive(x, w, b)
        elapsed_time_naive = (time.time() - start_time)/5
        print("==> Took %g seconds." % elapsed_time_naive)

        error = rel_error(out_naive, out_fast)
        print("Output error:", error)
        assert error < 1e-12, "The value of your output is incorrect or not accurate enough"
        speed_up = elapsed_time_naive/elapsed_time_fast
        print("Speed-up:", speed_up)
        assert speed_up > 100, "The speed-up of your method is less than 100"
        print("\n(Passed!)")

==> Took 0.0204387 seconds.
==> Took 9.16408 seconds.
Output error: 7.529419728091092e-15

Speed-up: 448.3683867942356

(Passed!)

Question 2 (5 points). Now, completely rewrite the FC_naive() function so that it runs at least 1,000 times faster. This improvement can be attained with basic Numpy and no explicit loops.

In [6]: def FC_no_loop(x, w, b):
            """
            Inputs:
            - x: A numpy array of images of shape (N, H, W)
            - w: A numpy array of weights of shape (M, H, W)
            - b: A numpy vector of biases of size M
            Returns:
            - out: a numpy array of shape (N, M)
            """
            N, H, W = x.shape
            M, _, _ = w.shape
            out = np.zeros((N, M))
            ### BEGIN SOLUTION
            x = np.reshape(x, (N, H*W))
            w = np.reshape(w, (M, H*W))
            out = x @ w.T + b
            ### END SOLUTION
            return out

In [7]: # Test cell: 'FC_no_loop' (5 points)
        num_inputs = 50
        input_shape = (128, 256)
        output_dim = 10
        x = np.random.rand(num_inputs, *input_shape)
        w = np.random.rand(output_dim, *input_shape)
        b = np.random.rand(output_dim)

        start_time = time.time()
        for i in range(5):
            out_fast = FC_no_loop(x, w, b)
        elapsed_time_fast = (time.time() - start_time)/5
        print("==> Took %g seconds." % elapsed_time_fast)

        start_time = time.time()
        for i in range(5):
            out_naive = FC_naive(x, w, b)
        elapsed_time_naive = (time.time() - start_time)/5
        print("==> Took %g seconds." % elapsed_time_naive)

        error = rel_error(out_naive, out_fast)
        print("Output error:", error)
        assert error < 1e-12, "The value of your output is incorrect or not accurate enough"
        speed_up = elapsed_time_naive/elapsed_time_fast
        print("Speed-up:", speed_up)
        assert speed_up > 1000, "The speed-up of your method is less than 1000"
        print("\n(Passed!)")

==> Took 0.00248585 seconds.
==> Took 9.00437 seconds.
Output error: 6.669391769098386e-15
Speed-up: 3622.2518031151694

(Passed!)
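The no-loop solution works because out[n, m] = b[m] + the sum over (h, w) of x[n, h, w] * w[m, h, w], which is exactly entry (n, m) of the product of the flattened image matrix (N, H*W) with the transpose of the flattened weight matrix (M, H*W). An equivalent formulation with np.einsum, shown purely as an illustration and not as part of the graded solution:

    import numpy as np

    def FC_einsum(x, w, b):
        # Contract the shared H and W axes of x (N, H, W) and w (M, H, W),
        # leaving an (N, M) array, then broadcast-add the bias vector.
        return np.einsum('nhw,mhw->nm', x, w) + b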

Fin! You've reached the end of this problem. Don't forget to restart the kernel and run the entire notebook from top-to-bottom to make sure you did everything correctly. If that is working, try submitting this problem. (Recall that you must submit and pass the autograder to get credit for your work!)


Problem 3: Data cleaning with Pandas (Version 1.8)

Pro-tips. If your program behavior seems strange, try resetting the kernel and rerunning everything. If you mess up this notebook or just want to start from scratch, save copies of all your partial responses and use Actions → Reset Assignment to get a fresh, original copy of this notebook. (Resetting will wipe out any answers you've written so far, so be sure to stash those somewhere safe if you intend to keep or reuse them!)

Often (nearly always) the data we are working with will need some type of cleaning or rearranging before it can be used. In this problem, you will be working with some messy data. The dataset that we will be working with is called "sales.csv". It contains some ticket sale data, accessed from DataCamp's R Tutorial (https://www.datacamp.com/courses/importing-cleaning-data-in-r-case-studies). You will be completing this problem with Pandas. Our goal will be to take this messy dataset and, after doing some data cleaning, determine which 5 cities have the highest ticket prices on average, after accounting for inflation. The exercises will step you through this process.
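As a rough preview of where the exercises lead, the final step amounts to a group-by-city average of inflation-adjusted prices. The snippet below is only a sketch: the column names 'city' and 'adjusted_price' are hypothetical placeholders, not the actual schema of sales.csv, which the exercises will establish.

    import pandas as pd

    # Sketch only; 'city' and 'adjusted_price' are placeholder column names.
    sales = pd.read_csv('sales.csv')
    top5 = (sales.groupby('city')['adjusted_price']
                 .mean()
                 .sort_values(ascending=False)
                 .head(5))
    print(top5)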

Exercise 0 (ungraded). Before we can begin, let's import the dataset, called sales.csv, into a pandas dataframe. We will alias i...

