HackerRank is a programming community platform for coders! It holds competitions and programming challenges to help you hone your coding skills in various languages (including Java, C++, PHP, Python, SQL, JavaScript)! Not unlike Kaggle, which focuses on data scientists and machine learning engineers, HackerRank is a good way to practice and show your skills to potential employers. It is part of the growing gamification trend within competitive computer programming. We could ask ourselves what insights about women in tech the data from the HackerRank survey reveals!
As a young 2017 Graduate in Computer Science and Data Science and Woman in Tech myself, I am curious to see which trends we'll uncover :) Plus, I also wanted to gain more experience in data viz with Python. ^^
**RECAP** The data set we are releasing here is the full dataset of 25K responses from the HackerRank developer survey, which includes both students and professionals.
**Methodology for the survey**
- A total of 25,090 professional and student developers completed our 10-minute online survey.
- The survey was live from October 16 through November 1, 2017.
- The survey was hosted by SurveyMonkey and we recruited respondents via email from our community of over 3.4 million members and through social media sites.
- We removed responses that were incomplete as well as obvious spam submissions.
- Not every question was shown to every respondent, as some questions were specifically for those involved in hiring. The codebook (HackerRank-Developer-Survey-2018-Codebook.csv) highlights under what conditions some questions were shown.
- The Women In Tech 2018 report is based only on the 14K responses from professionals.
- Respondents who identified as students (q8Student=1; N=10351) were excluded from this report.
- Respondents who identify as “non-binary” (q3Gender=3; N=76) were excluded from the male-female comparisons.
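The two exclusion rules above can be sketched on the numeric-coded file. This is a minimal illustration, not the report's actual code: the column names `q8Student` and `q3Gender` and the codes `1` and `3` come from the methodology, while the helper names, the toy rows, and the way non-students are coded are assumptions.

```python
import pandas as pd


def report_sample(df_numeric: pd.DataFrame) -> pd.DataFrame:
    """Keep professionals only (q8Student == 1 marks students)."""
    # Assumption: any value other than 1 (including missing) means "not a student".
    return df_numeric[df_numeric["q8Student"] != 1]


def binary_gender_sample(df_numeric: pd.DataFrame) -> pd.DataFrame:
    """Drop respondents coded as non-binary (q3Gender == 3)."""
    return df_numeric[df_numeric["q3Gender"] != 3]


# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4, 5],
    "q8Student": [1, 0, 0, 1, 0],
    "q3Gender": [1, 2, 3, 2, 1],
})

professionals = report_sample(toy)
comparison = binary_gender_sample(professionals)
print(len(professionals), len(comparison))
```

On the real data one would pass `df_n` (the numeric file loaded further down) instead of `toy`.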
Women in Tech
We know that women in tech are a minority, but how has the situation evolved in recent years? More and more countries are making an effort to bring women into tech; has the situation improved? Let us dig deeper with this survey dataset!
Summary
- Q1 - Which languages are the most popular?
- Q2 - Age distribution?
- Q3 - At which age do they begin coding, differences between genders?
- Q4 - Countries of respondents?
- Q5 - Top countries characteristics - age began coding?
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
import plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
#import plotly.plotly as py
import plotly.graph_objs as go
df=pd.read_csv('HackerRank-Developer-Survey-2018-Values.csv', parse_dates=['StartDate','EndDate'])
df_n = pd.read_csv('HackerRank-Developer-Survey-2018-Numeric.csv', parse_dates=['StartDate','EndDate'])
df_women = df[df.q3Gender == 'Female']
df_men = df[df.q3Gender != 'Female']
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2717: DtypeWarning:
Columns (10,19,137,138) have mixed types. Specify dtype option on import or set low_memory=False.
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2717: DtypeWarning:
Columns (10,19,137,138,250) have mixed types. Specify dtype option on import or set low_memory=False.
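One caveat about the split above: `df.q3Gender != 'Female'` keeps every row that is not labelled `'Female'`, which includes missing answers and any non-binary label, not only men. The sketch below shows the difference on made-up rows; the `'Non-Binary'` label is an assumption about how the values file spells it.

```python
import numpy as np
import pandas as pd

# Toy data: the 'Non-Binary' label is assumed, the rows are invented.
toy = pd.DataFrame({"q3Gender": ["Female", "Male", "Non-Binary", np.nan, "Male"]})

not_female = toy[toy.q3Gender != "Female"]  # the filter used in this notebook
male_only = toy[toy.q3Gender == "Male"]     # a stricter alternative

print(len(not_female), len(male_only))
```

Switching the notebook to the stricter filter would slightly change the male counts printed later, so the figures below should be read with the broader definition in mind.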
df.shape
(25093, 251)
df = df.dropna(axis=0, how='all')
df.shape
(25092, 251)
#c = 0
#for i in df.columns :
# print(i + " "+ str(c))
# c+= 1
df.head(1)
| | RespondentID | StartDate | EndDate | CountryNumeric | q1AgeBeginCoding | q2Age | q3Gender | q4Education | q0004_other | q5DegreeFocus | ... | q30LearnCodeOther | q0030_other | q31Level3 | q32RecommendHackerRank | q0032_other | q33HackerRankChallforJob | q34PositiveExp | q34IdealLengHackerRankTest | q0035_other | q36Level4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.464454e+09 | 2017-10-19 11:51:00 | 2017-10-20 12:05:00 | South Korea | 16 - 20 years old | 18 - 24 years old | Female | Some college | NaN | Computer Science | ... | Other (please specify) | datacamp | num%2 == 0 | Yes | NaN | No | NaN | #NULL! | NaN | Queue |
1 rows × 251 columns
**Let's explore which languages are the most popular amongst the respondents, by gender!**
I will come back to this section later...
# Copy the language columns so that adding 'Gender' does not write into a slice of df
prog = df[df.columns[139:163]].copy()
prog['Gender'] = df['q3Gender']
prog = prog.dropna(axis=0, how='all')
prog.columns
Index(['q25LangC', 'q25LangCPlusPlus', 'q25LangJava', 'q25LangPython',
'q25LangRuby', 'q25LangJavascript', 'q25LangCSharp', 'q25LangGo',
'q25Scala', 'q25LangPerl', 'q25LangSwift', 'q25LangPascal',
'q25LangClojure', 'q25LangPHP', 'q25LangHaskell', 'q25LangLua',
'q25LangR', 'q25LangRust', 'q25LangTypescript', 'q25LangKotlin',
'q25LangJulia', 'q25LangErlang', 'q25LangOcaml', 'q25LangOther',
'Gender'],
dtype='object')
prog[0:5]
| | q25LangC | q25LangCPlusPlus | q25LangJava | q25LangPython | q25LangRuby | q25LangJavascript | q25LangCSharp | q25LangGo | q25Scala | q25LangPerl | ... | q25LangLua | q25LangR | q25LangRust | q25LangTypescript | q25LangKotlin | q25LangJulia | q25LangErlang | q25LangOcaml | q25LangOther | Gender |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Will Learn | Will Learn | Know | Will Learn | Will Learn | Will Learn | Will Learn | Will Learn | Will Learn | Will Learn | ... | Will Learn | Know | Will Learn | Will Learn | Will Learn | Will Learn | Will Learn | Will Learn | NaN | Female |
1 | NaN | NaN | Know | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Will Learn | NaN | NaN | NaN | NaN | Male |
2 | Will Learn | Will Learn | Will Learn | Know | Will Learn | Know | Will Learn | Will Learn | Will Learn | Will Learn | ... | Will Learn | Will Learn | Will Learn | Will Learn | Will Learn | Will Learn | Will Learn | Will Learn | NaN | Female |
3 | NaN | Know | Will Learn | Will Learn | Know | Will Learn | Know | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Male |
4 | NaN | NaN | NaN | NaN | NaN | Know | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Female |
5 rows × 25 columns
# Number of missing answers per language column (the last column is 'Gender')
for i in prog.columns[:-1]:
    print(i + ": " + str(prog[i].isnull().sum()))
q25LangC: 5904
q25LangCPlusPlus: 6218
q25LangJava: 3847
q25LangPython: 3440
q25LangRuby: 14793
q25LangJavascript: 4695
q25LangCSharp: 12500
q25LangGo: 14665
q25Scala: 16921
q25LangPerl: 18456
q25LangSwift: 16734
q25LangPascal: 19084
q25LangClojure: 19958
q25LangPHP: 12663
q25LangHaskell: 18584
q25LangLua: 19888
q25LangR: 15862
q25LangRust: 19340
q25LangTypescript: 16449
q25LangKotlin: 17186
q25LangJulia: 20517
q25LangErlang: 19933
q25LangOcaml: 20503
q25LangOther: 24011
colors = ["blue", "orange", "greyish", "faded green", "dusty purple"]
fig, ax = plt.subplots(figsize=(20,20), ncols=5, nrows=5)
# One countplot per language column, filling the 5x5 grid row by row
for idx, col in enumerate(prog.columns[:-1]):
    row, column = divmod(idx, 5)
    sns.countplot(x=col, hue="Gender", data=prog, palette=sns.xkcd_palette(colors), ax=ax[row][column])
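Since men far outnumber women among the respondents, raw counts are hard to compare between genders. A possible next step is to compute, for each language, the share of each gender that answered `Know`. The helper name `know_share_by_gender` and the toy rows are hypothetical; only the `Know` / `Will Learn` values and the column names are taken from the table above. Missing answers are counted as "does not know", which is an assumption.

```python
import numpy as np
import pandas as pd


def know_share_by_gender(prog: pd.DataFrame, gender_col: str = "Gender") -> pd.DataFrame:
    """Percentage of each gender answering 'Know', one row per language column."""
    lang_cols = [c for c in prog.columns if c != gender_col]
    knows = prog[lang_cols].eq("Know")  # NaN and 'Will Learn' both become False
    return knows.groupby(prog[gender_col]).mean().mul(100).T


# Invented rows, for illustration only.
prog_toy = pd.DataFrame({
    "q25LangPython": ["Know", "Will Learn", "Know", "Know"],
    "q25LangRuby": [np.nan, "Know", "Will Learn", np.nan],
    "Gender": ["Female", "Female", "Male", "Male"],
})

shares = know_share_by_gender(prog_toy)
print(shares)
```

Calling `know_share_by_gender(prog)` on the real frame would give one row per `q25Lang...` column, which can then be sorted to rank languages within each gender.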
To be continued
Let's see how many women there are and how age is distributed for both genders. The q1AgeBeginCoding value might also be interesting.
# Share (in %) of each age bracket among men and among women
men_age_pct = df_men['q2Age'].value_counts(normalize=True) * 100
women_age_pct = df_women['q2Age'].value_counts(normalize=True) * 100
trace1 = go.Bar(
    x=men_age_pct.index.tolist(),
    y=men_age_pct.tolist(),
    name='Men Respondents'
)
trace2 = go.Bar(
    x=women_age_pct.index.tolist(),
    y=women_age_pct.tolist(),
    name='Female Respondents'
)
data = [trace1, trace2]
layout = go.Layout(
    barmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='grouped-bar')
# Share (in %) of each "age began coding" bracket among men and among women
men_begin_pct = df_men['q1AgeBeginCoding'].value_counts(normalize=True) * 100
women_begin_pct = df_women['q1AgeBeginCoding'].value_counts(normalize=True) * 100
trace1 = go.Bar(
    x=men_begin_pct.index.tolist(),
    y=men_begin_pct.tolist(),
    name='Men Respondents'
)
trace2 = go.Bar(
    x=women_begin_pct.index.tolist(),
    y=women_begin_pct.tolist(),
    name='Female Respondents'
)
data = [trace1, trace2]
layout = go.Layout(
    barmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='grouped-bar')
We can see that women tend to start learning later than men, especially in the "11 - 15 years old" beginners category (22% of men versus 13.8% of women). More than half of the women started coding between 16 and 20 years old.
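The percentages quoted above are read off the bar chart. A small table makes them easier to check. Here is a sketch using `pd.crosstab` with column-wise normalisation; the helper name `pct_by_gender` and the toy rows are made up, and the bracket labels are assumed to follow the "16 - 20 years old" pattern seen in the data preview.

```python
import pandas as pd


def pct_by_gender(df: pd.DataFrame, column: str, gender_col: str = "q3Gender") -> pd.DataFrame:
    """Percentage of each answer within each gender (every gender column sums to 100)."""
    return pd.crosstab(df[column], df[gender_col], normalize="columns").mul(100).round(1)


# Invented rows, for illustration only.
toy = pd.DataFrame({
    "q3Gender": ["Female"] * 4 + ["Male"] * 4,
    "q1AgeBeginCoding": [
        "16 - 20 years old", "16 - 20 years old", "16 - 20 years old", "11 - 15 years old",
        "16 - 20 years old", "16 - 20 years old", "11 - 15 years old", "11 - 15 years old",
    ],
})

table = pct_by_gender(toy, "q1AgeBeginCoding")
print(table)
```

On the survey data, `pct_by_gender(df, 'q1AgeBeginCoding')` would list every bracket with one column per gender label.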
#df['time']=(df['EndDate']-df['StartDate']).astype('timedelta64[m]')
focus_country = df['CountryNumeric'].value_counts().to_frame()
print("our TOP 10 country respondents is :")
print(focus_country.head(10).index)
our TOP 10 country respondents is :
Index(['Ghana', 'India', 'United States', 'Sudan', 'Malaysia', 'Brazil',
'Russian Federation', 'United Kingdom', 'Canada', 'Indonesia'],
dtype='object')
data = [ dict(
type = 'choropleth',
locations = focus_country.index,
locationmode = 'country names',
z = focus_country['CountryNumeric'],
text = focus_country['CountryNumeric'],
colorscale = [[0,"rgb(5, 10, 172)"],[0.35,"rgb(40, 60, 190)"],[0.5,"rgb(70, 100, 245)"],\
[0.6,"rgb(90, 120, 245)"],[0.7,"rgb(106, 137, 247)"],[1,"rgb(220, 220, 220)"]],
autocolorscale = False,
reversescale = True,
marker = dict(
line = dict (
color = 'rgb(180,180,180)',
width = 1
) ),
colorbar = dict(
autotick = False,
tickprefix = '',
title = 'Respondents'),
) ]
layout = dict(
title = 'Number of respondents by country',
geo = dict(
showframe = True,
showcoastlines = True,
projection = dict(
type = 'mercator'
)
)
)
fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='d3-world-map' )
Source here : https://plot.ly/python/choropleth-maps/
It's surprising to see Ghana winning the race. A map of the age at which respondents began coding, per country, would be useful to see whether every country needs to make an effort (?). I will also explore the careers, school degrees and specialties of the individuals. #To follow
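As a starting point for such a map, one could compute, per country, the share of respondents who began coding before the age of 16. This is only a sketch: the helper name `early_starter_share`, the `min_respondents` threshold, the toy rows, and the exact bracket labels in `EARLY_BRACKETS` are assumptions to be checked against the real values of `q1AgeBeginCoding`.

```python
import pandas as pd

# Assumed labels for the brackets below 16; verify against the actual column values.
EARLY_BRACKETS = {"5 - 10 years old", "11 - 15 years old"}


def early_starter_share(df: pd.DataFrame, min_respondents: int = 1) -> pd.DataFrame:
    """Per country: number of respondents and % who began coding in an early bracket."""
    answered = df.dropna(subset=["CountryNumeric", "q1AgeBeginCoding"])
    is_early = answered["q1AgeBeginCoding"].isin(EARLY_BRACKETS)
    grouped = is_early.groupby(answered["CountryNumeric"])
    out = pd.DataFrame({
        "respondents": grouped.size(),
        "early_pct": grouped.mean() * 100,
    })
    # Skip countries with too few answers to say anything meaningful.
    out = out[out["respondents"] >= min_respondents]
    return out.sort_values("early_pct", ascending=False)


# Invented rows, for illustration only.
toy = pd.DataFrame({
    "CountryNumeric": ["India"] * 4 + ["Ghana"] * 2,
    "q1AgeBeginCoding": [
        "16 - 20 years old", "21 - 25 years old", "11 - 15 years old", "16 - 20 years old",
        "11 - 15 years old", "16 - 20 years old",
    ],
})

result = early_starter_share(toy)
print(result)
```

The `early_pct` column could then replace the respondent counts as the `z` values of the choropleth above, with the index as `locations`.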
df_men_c = [0,0,0]
df_women_c = [0,0,0]
count = 0
for i in focus_country.head(3).index :
    # Subset each gender to the current country
    df_men_c[count] = df_men[df_men['CountryNumeric'] == i]
    df_women_c[count] = df_women[df_women['CountryNumeric'] == i]
    print('N° of Male respondents for '+ i + ' is : '+ str(df_men_c[count].shape[0]))
    print('N° of Female respondents for '+ i + ' is : '+ str(df_women_c[count].shape[0]))
    # Share (in %) of each "age began coding" bracket within each subset
    men_pct = df_men_c[count]['q1AgeBeginCoding'].value_counts(normalize=True) * 100
    women_pct = df_women_c[count]['q1AgeBeginCoding'].value_counts(normalize=True) * 100
    trace1 = go.Bar(
        x=men_pct.index.tolist(),
        y=men_pct.tolist(),
        name='Men Respondents in '+i
    )
    trace2 = go.Bar(
        x=women_pct.index.tolist(),
        y=women_pct.tolist(),
        name='Female Respondents in '+i
    )
    data = [trace1, trace2]
    layout = go.Layout(
        barmode='group'
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='grouped-bar')
    count = count + 1
N° of Male respondents for Ghana is : 3510
N° of Female respondents for Ghana is : 892
N° of Male respondents for India is : 3167
N° of Female respondents for India is : 567
N° of Male respondents for United States is : 2640
N° of Female respondents for United States is : 546
We observe that most people learn to code between 16 and 20 years old. However, we also notice that in India the 2nd most represented group of beginners is 21-25 years old! That is not the case in Ghana and the USA, where the 2nd most represented group seems to be 11-15 years old. Girls, however, are underrepresented in the USA in the 11-15 years old category. Maybe the USA and India should make an effort to get them coding earlier?
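The "2nd most represented group" reading can also be checked without looking at the charts, by ranking the brackets for a given country. A minimal sketch follows; the helper name `top_brackets` and the toy rows are invented, and the bracket labels are assumed.

```python
import pandas as pd


def top_brackets(df: pd.DataFrame, country: str, n: int = 2) -> list:
    """The n most frequent 'age began coding' brackets for one country, most common first."""
    subset = df.loc[df["CountryNumeric"] == country, "q1AgeBeginCoding"]
    return subset.value_counts().head(n).index.tolist()


# Invented rows, for illustration only.
toy = pd.DataFrame({
    "CountryNumeric": ["India"] * 6,
    "q1AgeBeginCoding": (
        ["16 - 20 years old"] * 3 + ["21 - 25 years old"] * 2 + ["11 - 15 years old"]
    ),
})

ranking = top_brackets(toy, "India")
print(ranking)
```

Running it on `df_women` and `df_men` separately for each of the top three countries would show whether the second bracket differs by gender.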