Novice to Grandmaster Kaggle PDF

Title Novice to Grandmaster Kaggle
Author Galina Dovgyi
Course First-Year Interdisciplinary Seminar: From Mishima to Murakami: Postwar Japanese Fiction and Film
Institution New York University
Pages 48
File Size 3.4 MB
File Type PDF
Total Downloads 28
Total Views 124



Description

30/01/2020

Novice to Grandmaster | Kaggle

2017 Kaggle ML & DS Survey


https://www.kaggle.com/ash316/novice-to-grandmaster


Novice to Grandmaster: What Do Data Scientists Say?

Kaggle is the world's largest data science platform, with more than 1 million users, and it is an excellent place for students like me to learn and grow in data science and machine learning. Its users come from various backgrounds, such as statisticians, data scientists and machine learning practitioners. This dataset, published by Kaggle, is a gem for people like me who like to analyse and investigate data. In this notebook, we will try to answer some common questions that every budding data scientist would like answered, such as which tools are used most and which resources are best for learning data science. The biggest problem we might face is fake or bogus responses. As this is a survey, not everyone will answer truthfully, so I assume there will be quite a few outliers. Let's dive straight into the pool of data and gain some insights.

Introduction

Who are data scientists? A data scientist is a statistician or a programmer who cleans, manages and organizes data, performs descriptive statistics and analysis to develop insights, builds predictive models and solves business-related problems. Let's see what data scientists on Kaggle say.

In[1]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import squarify
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import base64
import io
# scipy.misc.imread was removed in SciPy 1.2; newer environments need imageio.imread instead
from scipy.misc import imread
import codecs
from IPython.display import HTML
from matplotlib_venn import venn2
from subprocess import check_output
# list the input files shipped with the dataset
print(check_output(["ls", "../input"]).decode("utf8"))

RespondentTypeREADME.txt
conversionRates.csv
freeformResponses.csv
multipleChoiceResponses.csv
schema.csv

In[2]:

response=pd.read_csv('../input/multipleChoiceResponses.csv',encoding ='ISO-8859-1')

In[3]:

response.head()

Out[3]:

GenderSelect                                    Country        Age  EmploymentStatus                             StudentStatus  LearningDataScienc
Non-binary genderqueer or gender nonconforming  NaN            NaN  Employed full-time                           NaN            NaN
Female                                          United States  …    Not employed but looking for work            NaN            NaN
Male                                            Canada         …    Not employed but looking for work            NaN            NaN
Male                                            United States  …    Independent contractor freelancer or selfem  NaN            NaN
Male                                            Taiwan         …    Employed full-time                           NaN            NaN

5 rows × 228 columns

Some Basic Analysis

In[4]:

print('The total number of respondents:',response.shape[0])
print('Total number of Countries with respondents:',response['Country'].nunique())
print('Country with highest respondents:',response['Country'].value_counts().index[0],'with',response['Country'].value_counts().values[0],'respondents')
print('Youngest respondent:',response['Age'].min(),' and Oldest respondent:',response['Age'].max())

The total number of respondents: 16716
Total number of Countries with respondents: 52
Country with highest respondents: United States with 4197 respondents
Youngest respondent: 0.0  and Oldest respondent: 100.0

Seriously?? The youngest respondent is not even a year old. LOL!! And how come grandpa is still coding at the age of 100? These are most likely fake responses.
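One way to keep such implausible ages out of later statistics is to restrict the column to a plausible range. The following is only a minimal sketch: the toy `ages` series and the bounds `MIN_AGE` and `MAX_AGE` are assumptions made for illustration, not values used anywhere in the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey's 'Age' column.
ages = pd.Series([0.0, 17.0, 25.0, 34.0, 100.0, None], name='Age')

# Assumed plausibility bounds (not from the notebook); adjust as needed.
MIN_AGE, MAX_AGE = 16, 75

# between() is inclusive on both ends and treats missing values as False,
# so NaN ages are dropped together with the out-of-range ones.
plausible = ages[ages.between(MIN_AGE, MAX_AGE)]

print('Kept', len(plausible), 'of', ages.notna().sum(), 'non-missing ages')
print('Youngest:', plausible.min(), 'Oldest:', plausible.max())
```

On the real data, the same filter could be applied to response['Age'] before computing the minimum and maximum.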

Gender Split

In[5]:

plt.subplots(figsize=(22,12))
sns.countplot(y=response['GenderSelect'],order=response['GenderSelect'].value_counts().index)
plt.show()

The graph clearly shows that there are far more male respondents than female ones. It seems that ladies were either busy with their coding, or ladies don't code... :p Just kidding.

Respondents By Country

In[6]:

# top 15 countries; passing x/y as keywords from the Series keeps this working across seaborn and pandas versions
resp_coun=response['Country'].value_counts()[:15]
sns.barplot(x=resp_coun.values,y=resp_coun.index,palette='inferno')
plt.title('Top 15 Countries by number of respondents')
plt.xlabel('')
fig=plt.gcf()
fig.set_size_inches(10,10)
plt.show()
# treemap of all countries, sized by number of respondents
tree=response['Country'].value_counts()
squarify.plot(sizes=tree.values,label=tree.index,color=sns.color_palette('RdYlGn_r',52))
plt.rcParams.update({'font.size':20})
fig=plt.gcf()
fig.set_size_inches(40,15)
plt.show()


The USA and India account for the most respondents, about 1/3 of the total, while Chile has the fewest. Is this graph sufficient to say that the majority of Kaggle users are from India and the USA? I don't think so, as Kaggle has more than 1 million users in total, while there are only about 16k respondents.
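The arithmetic behind this caution can be sketched directly from the figures printed earlier (16716 respondents, 4197 of them from the United States). The user count is an assumption: the introduction only says "more than 1 million", so it is treated here as a lower bound, which makes the computed coverage an upper bound.

```python
# Figures taken from the notebook's own printed output.
total_respondents = 16716
us_respondents = 4197

# Assumption: "more than 1 million users" is used as a lower bound on the user base.
kaggle_users_lower_bound = 1_000_000

us_share = us_respondents / total_respondents
coverage_upper_bound = total_respondents / kaggle_users_lower_bound

print(f'US share of respondents: {us_share:.1%}')
print(f'Survey covers at most {coverage_upper_bound:.2%} of Kaggle users')
```

With under 2% of users answering, and no guarantee that they were sampled at random, the country split of respondents says little about the country split of all Kaggle users.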

Compensation

Data scientists are among the highest-paid individuals. Let's check what the respondents say.

In[7]:

response['CompensationAmount']=response['CompensationAmount'].str.replace(',','')
response['CompensationAmount']=response['CompensationAmount'].str.replace('-','')
rates=pd.read_csv('../input/conversionRates.csv')
rates.drop('Unnamed: 0',axis=1,inplace=True)
salary=response[['CompensationAmount','CompensationCurrency','GenderSelect','Country','CurrentJobTitleSelect']].dropna()
salary=salary.merge(rates,left_on='CompensationCurrency',right_on='originCountry',how='left')
salary['Salary']=pd.to_numeric(salary['CompensationAmount'])*salary['exchangeRate']
print('Maximum Salary is USD $',salary['Salary'].dropna().astype(int).max())
print('Minimum Salary is USD $',salary['Salary'].dropna().astype(int).min())
print('Median Salary is USD $',salary['Salary'].dropna().astype(int).median())

Maximum Salary is USD $ 28297400000
Minimum Salary is USD $ 0
Median Salary is USD $ 53812.0

Look at that humongous salary!! That's even larger than the GDP of many countries. Another example of a bogus response. The minimum salary may be the case of a student. The median salary shows that data scientists enjoy good salary benefits.
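A small sketch of why the median is the safer summary here: a single bogus entry drags the mean far away, while the median barely moves. The rows and exchange rates below are made up for illustration; only the column names (CompensationAmount, CompensationCurrency, originCountry, exchangeRate) come from the notebook.

```python
import pandas as pd

# Hypothetical responses; amounts arrive as strings with thousands separators.
responses = pd.DataFrame({
    'CompensationAmount': ['100,000', '50000', '9,999,999,999'],
    'CompensationCurrency': ['USD', 'EUR', 'USD'],
})

# Hypothetical conversion table shaped like conversionRates.csv (rates are made up).
rates = pd.DataFrame({
    'originCountry': ['USD', 'EUR'],
    'exchangeRate': [1.0, 1.2],
})

# Strip separators, attach the exchange rate for each currency, convert to USD.
responses['CompensationAmount'] = responses['CompensationAmount'].str.replace(',', '', regex=False)
salary = responses.merge(rates, left_on='CompensationCurrency', right_on='originCountry', how='left')
salary['Salary'] = pd.to_numeric(salary['CompensationAmount']) * salary['exchangeRate']

# The one bogus entry dominates the mean but leaves the median untouched.
print('Mean  :', salary['Salary'].mean())
print('Median:', salary['Salary'].median())
```

Here the median stays at the middle salary, whereas the mean is in the billions because of the one implausible row.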

In[8]:

plt.subplots(figsize=(15,8))
salary=salary[salary['Salary']...

