Title | Novice to Grandmaster Kaggle |
---|---|
Author | Galina Dovgyi |
Course | First-Year Interdisciplinary Seminar: From Mishima to Murakami: Postwar Japanese Fiction and Film |
Institution | New York University |
Pages | 48 |
File Size | 3.4 MB |
File Type | |
Total Downloads | 28 |
Total Views | 124 |
30/01/2020
Novice to Grandmaster | Kaggle
2017 Kaggle ML & DS Survey
https://www.kaggle.com/ash316/novice-to-grandmaster
Novice to Grandmaster- What Data Scientists say?
Kaggle is the world's largest data science platform, with more than 1 million users, and it is an excellent place for students like me to learn and grow in the fields of data science and machine learning. Its users come from various domains, such as statisticians, data scientists and machine learning practitioners. This dataset published by Kaggle is a gem for people like me who like to analyse and investigate data. In this notebook, we will try to answer some common questions that every budding data scientist would like to know, such as the most used tools, the resources for learning data science, etc. The biggest problem we might face is fake and bogus responses. As it is a survey, not everyone will answer honestly, and thus I assume that there will be a lot of outliers. Let's dive straight into the pool of data and gain some insights.
Introduction
Who are data scientists? A data scientist is a statistician or a programmer who cleans, manages and organizes data, performs descriptive statistics and analysis to develop insights, builds predictive models and solves business-related problems. Let's see what the data scientists on Kaggle say.
In[1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import squarify
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import base64
import io
from scipy.misc import imread
import codecs
from IPython.display import HTML
from matplotlib_venn import venn2
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
RespondentTypeREADME.txt
conversionRates.csv
freeformResponses.csv
multipleChoiceResponses.csv
schema.csv
In[2]:
response=pd.read_csv('../input/multipleChoiceResponses.csv',encoding ='ISO-8859-1')
In[3]:
response.head()
Out[3]:
GenderSelect | Country | Age | EmploymentStatus | StudentStatus | LearningDataScienc |
---|---|---|---|---|---|
Non-binary genderqueer or gender nonconforming | NaN | NaN | Employed full-time | NaN | NaN |
Female | United States | … | Not employed but looking for work | NaN | NaN |
Male | Canada | … | Not employed but looking for work | NaN | NaN |
Male | United States | … | Independent contractor freelancer or selfem | NaN | NaN |
Male | Taiwan | … | Employed full-time | NaN | NaN |
5 rows × 228 columns
Some Basic Analysis
In[4]:
print('The total number of respondents:',response.shape[0])
print('Total number of Countries with respondents:',response['Country'].nunique())
print('Country with highest respondents:',response['Country'].value_counts().index[0],'with',response['Country'].value_counts().values[0],'respondents')
print('Youngest respondent:',response['Age'].min(),' and Oldest respondent:',response['Age'].max())
The total number of respondents: 16716
Total number of Countries with respondents: 52
Country with highest respondents: United States with 4197 respondents
Youngest respondent: 0.0  and Oldest respondent: 100.0
Seriously?? The youngest respondent is not even a year old. LOL!! And how come grandpa is still coding at the age of 100? These are most likely fake responses.
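One simple way to keep such entries out of later age statistics is to restrict the column to a plausible range. The sketch below is only an illustration: the toy `ages` frame stands in for the survey's `Age` column, and the 16 to 80 window is an assumed cut-off, not something the notebook prescribes.

```python
import pandas as pd

# Toy stand-in for the survey's Age column (illustrative values, not survey data).
ages = pd.DataFrame({"Age": [0.0, 23.0, 35.0, 100.0, None, 41.0]})

# Assumed plausibility window; the bounds are an arbitrary choice for this sketch.
MIN_AGE, MAX_AGE = 16, 80

# between() is inclusive on both ends and evaluates to False for missing values,
# so rows with NaN ages are dropped along with the implausible ones.
plausible = ages[ages["Age"].between(MIN_AGE, MAX_AGE)]

print("Rows kept:", len(plausible), "of", len(ages))
print("Youngest:", plausible["Age"].min(), "Oldest:", plausible["Age"].max())
```

Whether to drop such rows or only exclude them from age-based plots is a judgment call; dropping them everywhere also discards their answers to other questions.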
Gender Split
In[5]:
plt.subplots(figsize=(22,12))
sns.countplot(y=response['GenderSelect'],order=response['GenderSelect'].value_counts().index)
plt.show()
The graph clearly shows that there are far more male respondents than female ones. It seems that the ladies were either busy with their coding, or ladies don't code... :p. Just kidding.
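To put a number on the imbalance rather than reading it off the bar lengths, the counts can be turned into percentages. A minimal sketch on a made-up `GenderSelect` column (the values below are illustrative, not survey data):

```python
import pandas as pd

# Made-up answers standing in for response['GenderSelect']; None marks a skipped question.
gender = pd.Series(
    ["Male", "Male", "Female", "Male", "A different identity", "Female", "Male", None],
    name="GenderSelect",
)

# value_counts() ignores missing answers by default; normalize=True returns fractions.
shares = gender.value_counts(normalize=True).mul(100).round(1)

print(shares)
```

On the real data the same two lines would apply to `response['GenderSelect']`, assuming the column is loaded as in the cells above.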
Respondents By Country
In[6]:
# Use the value_counts() Series directly, so the code does not depend on the
# column name that to_frame() assigns (it differs between pandas versions).
resp_coun=response['Country'].value_counts()[:15]
sns.barplot(x=resp_coun.values,y=resp_coun.index,palette='inferno')
plt.title('Top 15 Countries by number of respondents')
plt.xlabel('')
fig=plt.gcf()
fig.set_size_inches(10,10)
plt.show()
# Treemap of all countries, sized by number of respondents
tree=response['Country'].value_counts()
squarify.plot(sizes=tree.values,label=tree.index,color=sns.color_palette('RdYlGn_r',52))
plt.rcParams.update({'font.size':20})
fig=plt.gcf()
fig.set_size_inches(40,15)
plt.show()
The USA and India account for the most respondents, about 1/3 of the total, while Chile has the fewest. Is this graph sufficient to say that the majority of Kaggle users are from India and the USA? I don't think so, as Kaggle has more than 1 million users in total while the number of respondents is only about 16k.
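The sampling caveat can be made concrete with the figures already printed in this notebook. The sketch below uses only those numbers; treating "more than 1 million" as exactly 1,000,000 is an assumption, so the coverage it prints is an upper bound.

```python
# Figures taken from the notebook's own output and introduction.
total_respondents = 16716   # "The total number of respondents"
us_respondents = 4197       # respondents from the United States
kaggle_users = 1_000_000    # "more than 1 million" users, used here as a lower bound

us_share = us_respondents / total_respondents
coverage = total_respondents / kaggle_users

print(f"US share of respondents: {us_share:.1%}")
print(f"Respondents as a share of Kaggle users: at most {coverage:.2%}")
```

With fewer than 2 in 100 users answering, and no guarantee that they answered at random, the country ranking describes the survey sample rather than the whole user base.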
Compensation
Data scientists are among the highest-paid individuals. Let's check what the respondents say.
In[7]:
response['CompensationAmount']=response['CompensationAmount'].str.replace(',','')
response['CompensationAmount']=response['CompensationAmount'].str.replace('-','')
rates=pd.read_csv('../input/conversionRates.csv')
rates.drop('Unnamed: 0',axis=1,inplace=True)
salary=response[['CompensationAmount','CompensationCurrency','GenderSelect','Country','CurrentJobTitleSelect']].dropna()
salary=salary.merge(rates,left_on='CompensationCurrency',right_on='originCountry',how='left')
salary['Salary']=pd.to_numeric(salary['CompensationAmount'])*salary['exchangeRate']
print('Maximum Salary is USD $',salary['Salary'].dropna().astype(int).max())
print('Minimum Salary is USD $',salary['Salary'].dropna().astype(int).min())
print('Median Salary is USD $',salary['Salary'].dropna().astype(int).median())
Maximum Salary is USD $ 28297400000
Minimum Salary is USD $ 0
Median Salary is USD $ 53812.0
Look at that humongous salary!! That's even larger than the GDP of many countries. Another example of a bogus response. The minimum salary may be the case of a student. The median salary shows that data scientists enjoy good salary benefits.
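The median is a sensible summary here precisely because a single bogus entry barely moves it, whereas the mean is dragged far off. The sketch below illustrates this on made-up salaries plus the bogus maximum reported above; the `LOW` and `HIGH` cut-offs are assumptions chosen for the example, not values from the notebook.

```python
import pandas as pd

# Illustrative USD salaries (made up), plus a zero and the bogus maximum from the output above.
salary = pd.Series(
    [0, 25_000, 48_000, 54_000, 61_000, 120_000, 28_297_400_000],
    name="Salary",
)

# The mean is dominated by the single extreme value; the median is not.
print("Mean  :", salary.mean())
print("Median:", salary.median())

# Assumed plausibility bounds; both cut-offs are arbitrary choices for this sketch.
LOW, HIGH = 1_000, 500_000
trimmed = salary[salary.between(LOW, HIGH)]

print("Rows kept:", len(trimmed), "of", len(salary))
print("Median after trimming:", trimmed.median())
```

Trimming mostly matters for plots and means; on this toy series the median is the same before and after the extreme rows are removed.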
In[8]:
plt.subplots(figsize=(15,8))
salary=salary[salary['Salary']...