HW3 PDF

Title HW3
Author Minh Nhật
Course Data Warehousing For Analytics
Institution Baruch College CUNY
Pages 16
File Size 759.4 KB
File Type PDF
Total Downloads 72
Total Views 162

Summary

Cloud Hadoop Assignment...


Description

Dang 1

Minh Dang

Professor Richard Holowczak CIS 4400 Section CMWA

Assignment #3 – Cloud Computing / Hadoop Assignment

11 December 2019

Dang 2 Query 0. Create an appropriate EXTERNAL table for the taxi trip files. • SQL: CREATE EXTERNAL TABLE yellow_taxi (VendorID INT, tpep_pickup_datetime TIMESTAMP, tpep_dropoff_datetime TIMESTAMP, Passenger_count INT, Trip_distance DOUBLE, RateCodeID INT, Store_and_fwd_flag CHAR(1), PULocationID INT, DOLocationID INT, Payment_type INT, Fare_amount DOUBLE, Extra DOUBLE, MTA_tax DOUBLE, Tip_amount DOUBLE, Tolls_amount DOUBLE, Improvement_surcharge DOUBLE, Total_amount DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://yellow-taxi-bucket/data/';



Output (after running DESCRIBE yellow_taxi): VendorID tpep_pickup_datetime tpep_dropoff_datetime Passenger_count Trip_distance RateCodeID Store_and_fwd_flag PULocationID DOLocationID Payment_type Fare_amount Extra MTA_tax Tip_amount Tolls_amount Improvement_surcharge Total_amount



Execution time: 4.493 seconds

int timestamp timestamp int double int string int int int double double double double double double double

NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL

Dang 3 •

Cluster Monitoring: o Cluster Status:

o Node Status:

Dang 4 o IO:

Dang 5 Question 1. For each month of 2018, find the total fare amount for yellow taxi trips taken charged at the standard rate, with 2 or more passengers who paid with a credit card or cash. (12) • SQL: SELECT MONTH(tpep_pickup_datetime) AS Month_of_2018, FORMAT_NUMBER(SUM(Fare_amount),2) AS Total_Fare_Amount FROM yellow_taxi WHERE RateCodeID = 1 AND Passenger_count >= 2 AND Payment_type IN (1, 2) AND Trip_distance = 0 AND Tolls_amount >= 0 AND Improvement_surcharge IN (0,0.3) AND Total_amount > 0 GROUP BY Month(tpep_pickup_datetime) ORDER BY Month(tpep_pickup_datetime);



Output: Query1



Month_of_2018

Total_Fare_Amount

1

26,698,121.12

2

25,780,643.47

3

29,570,741.41

4

29,955,538.13

5

30,225,146.45

6

28,535,262.05

7

25,559,459.15

8

25,459,718.99

9

26,400,844.77

10

28,475,398.60

11

26,445,826.93

12

27,275,036.86

Execution time: 200.167 seconds

Dang 6 •

Cluster Monitoring: o Cluster Status:

o Node Status:

Dang 7 o IO

Dang 8 Question 2. What was the average cost (fare amount only) per mile of a yellow taxi standard rate ride in 2018 that traveled more than 10 miles but less than 25 miles (not including trips to airports)? (1) •

SQL: SELECT ROUND(AVG(Fare_amount/Trip_distance),3) AS Average_Cost_Per_Mile FROM

yellow_taxi

WHERE

RateCodeID = 1

AND

Trip_distance > 10

AND

Trip_distance < 25

AND

DOLocationID NOT IN (1,132,138)

AND (unix_timestamp(tpep_dropoff_datetime) - unix_timestamp(tpep_pickup_datetime))/3600 0 AND Extra IN (0,0.5,1) AND MTA_tax IN (0,0.5) AND Tip_amount >= 0 AND Tolls_amount >= 0 AND Improvement_surcharge IN (0,0.3) AND Total_amount > 0;



Output: Query2 Average_Cost_Per_Mile 3.122



Execution time: 187.68 seconds

Dang 9 •

Cluster Monitoring: o Cluster Status:

o Node Status:

Dang 10 o IO:

Dang 11 Question 3. Which day of the week in 2018 has the lowest number of single rider trips on average? (1) •

SQL: SELECT DATE_FORMAT(tpep_pickup_datetime, 'EEEE') AS Day_of_Week, ROUND(SUM(CASE WHEN passenger_count = 1 THEN 1 ELSE 0 END)/ COUNT(DISTINCT TO_DATE(tpep_pickup_datetime)),3) AS Average_Single_Rider_Trips FROM

yellow_taxi

WHERE

RateCodeID = 1

AND

Trip_distance 0 AND Extra IN (0,0.5,1) AND MTA_tax IN (0,0.5) AND Tip_amount >= 0 AND Tolls_amount >= 0 AND Improvement_surcharge IN (0,0.3) AND Total_amount > 0 GROUP BY DATE_FORMAT(tpep_pickup_datetime, 'EEEE') ORDER BY Average_Single_Rider_Trips ASC LIMIT 1;



Output: Query3 Day_of_Week

Average_Single_Rider_Trips

Sunday

162293.904

Dang 12 •

Execution time: 326.78 seconds



Cluster Monitoring: o Cluster Status:

o Node Status:

Dang 13 o IO:

Dang 14 Conclusion Paragraph. I chose Amazon Web Services (AWS) since I have already had some experiences working with Google Cloud Platform (GCP.) Also the challenging set up of AWS catches my attention and makes me want to try. The lengths of time and dollars in services spent on the tutorial and assignment are shown in the table below: Tutorial

Assignment

Time Needed (Hours)

1.5

8.0

Cost (Dollars in Services)

$0

$10.25

If I could change/improve something about the assignment, I would tried running the queries with the January 2018 dataset as soon as I had it formed. I tried to write the queries without testing it in fear of losing credits since I know with the large amount of data, it would be very costly; and errors in queries are unavoidable so that I had to rerun it several times. However, a large portion of the assignments ended up using the AWS Free Tier, and only cost me a bit more than 10% of the given credit. If I had known this in advance, I would have saved a lot of time instead of forming and fixing queries while being uncertain what the exact problems and errors were.

Dang 15 Exploratory Data Analysis. 1. Check the parameters regarding trips: SELECT COUNT(*) AS Number_of_Records, COUNT(DISTINCT VendorID) AS Number_of_Vendors, MIN(TO_DATE(tpep_pickup_datetime)) AS Oldest_Pickup, MAX(TO_DATE(tpep_pickup_datetime)) AS Newest_Pickup, MIN(TO_DATE(tpep_dropoff_datetime)) AS Oldest_Dropoff, MAX(TO_DATE(tpep_dropoff_datetime)) AS Newest_Dropoff, MIN(passenger_count) AS Min_Passengers, MAX(passenger_count) AS Max_Passengers, AVG(passenger_count) AS Average_Passengers, MIN(trip_distance) AS Min_Distance, MAX(trip_distance) AS Max_Distance, AVG(trip_distance) AS Average_Distance, COUNT(DISTINCT RateCodeID) AS Number_of_RateCodes, COUNT(DISTINCT store_and_fwd_flag) AS Types_of_Store_Forward_Flag, COUNT(DISTINCT PULocationID) as Number_of_Pickup_Zones, COUNT(DISTINCT DOLocationID) as Number_of_Dropoff_Zones, COUNT(DISTINCT payment_type) as Number_of_Payment_Types FROM yellow_taxi;

There are 102,804,274 records in total, which matches with the given raw record counts. There are 3 distinct vendors returned – contradictory with the 2 given vendors. o Vendors 1 and 2 should be kept. • The pickup/dropoff dates range from 1 January 2001/10 August 1926 respectively to 4 November 2084 for both, where we only want the 2018 records. o The records fall into January to December 2018 should be retained. • The minimum, maximum and average of number of passenger is 0, 192 and 1.59. 0 passenger doesn’t make sense since there can’t be a trip without passenger. The maximum number of passengers allowed by law is 6 with the 7-seater containing the driver. o The number of passenger greater than 0 and smaller than or equal to 6 is reasonable. • The minimum, maximum and average of trip distance is 0, 189,438.84 and 2.94. 0 mile makes no sense since a trip with such mile cannot be charged. Per the requirement, trips that have distances greater than 50 will be crossed out. o The trip distance greater than 0 and smaller than or equal to 50 is satisfactory. • There are 7 distinct rate codes, while the data dictionary only shows 6. o The rate codes from 1 through 6 should be kept. • There are 3 distinct flags for stored and forward trips, while only ‘Y’ and ‘N’ values are shown. o The flag of ‘Y’ and ‘N’ should be retained. • There are 264 locations for both pickup and dropoff. • There are 5 distinct payment types, agrees with the data dictionary. However, to eliminate the 24 NULL values, data checks for the fields that satisfactory are still presented in the WHERE clauses. • •

Dang 16 2. Check the parameters regarding fares: SELECT MIN(fare_amount) AS min_fare_charge, MAX(fare_amount) AS max_fare_charge, AVG(fare_amount) AS average_fare_charge, MIN(Extra) AS min_extra_charge, MAX(Extra) AS max_extra_charge, AVG(Extra) AS average_extra_charge, MIN(MTA_Tax) AS min_mta_tax_charge, MAX(MTA_Tax) AS max_mta_tax_charge, AVG(MTA_Tax) AS average_mta_tax_charge, MIN(tip_amount) AS min_tip_amount, MAX(tip_amount) AS max_tip_amount, AVG(tip_amount) AS average_tip_amount, MIN(tolls_amount) AS min_toll_charge, MAX(tolls_amount) AS max_toll_charge, AVG(tolls_amount) AS average_toll_charge, MIN(improvement_surcharge) AS min_surcharge, MAX(improvement_surcharge) AS max_surcharge, AVG(improvement_surcharge) AS average_surcharge, MIN(total_amount) AS min_total_charge, MAX(total_amount) AS max_total_charge, AVG(total_amount) AS average_total_charge FROM

yellow_taxi; MIN

MAX

AVG

Fare_amount

–800.0

907,070.24

13.06

Extra

–80.0

96.64

0.33

MTA_tax

–0.5

150.0

0.50

–322.42

945.97

1.87

Tolls_amount

–52.5

1650.0

0.35

Improvement_surcharge

–0.3 –800.3

4000.3 907,071.04

0.30 16.41

Fields

Tip_amount

Total_amount



All the maximum outliers are retained with caution.



Fare and total amount should be strictly greater than 0.



Tips and tolls amount that are lower than 0 (negative) should be eliminated.



Extra charges should only be $0 for no extra, $0.5 for rush hour and $1 for overnight.



MTA tax should only be either $0 (no tax) or $0.5, which are the retained values.



Improvement surcharge should only be $0 (no surcharge) or $0.3, whose values are kept....


Similar Free PDFs