HackerRank is a programming community platform for coders! It holds competitions and programming challenges that help you hone your coding skills in various languages (including Java, C++, PHP, Python, SQL, and JavaScript). Much like Kaggle, which focuses on data scientists and machine learning engineers, HackerRank is a good way to practice and to show your skills to potential employers. It is part of the growing gamification trend within competitive computer programming. So what insights about women in tech does the data from the HackerRank survey reveal?

As a 2017 graduate in Computer Science and Data Science, and a woman in tech myself, I am curious to see which trends we'll uncover :) Plus, I also wanted to gain more experience in data viz with Python. ^^

**RECAP** The data set we are releasing here is the full dataset of 25K responses from the HackerRank developer survey, which includes both students and professionals.

**Methodology for the survey**

A total of 25,090 professional and student developers completed our 10-minute online survey.
The survey was live from October 16 through November 1, 2017.
The survey was hosted by SurveyMonkey and we recruited respondents via email from our community of over 3.4 million members and through social media sites.
We removed responses that were incomplete as well as obvious spam submissions.
Not every question was shown to every respondent, as some questions were specifically for those involved in hiring. The codebook (HackerRank-Developer-Survey-2018-Codebook.csv) highlights under what conditions some questions were shown.
The Women In Tech 2018 report is based only on the 14K responses from professionals.
Respondents who identified as students (q8Student=1; N=10351) were excluded from this report.
Respondents who identify as “non-binary” (q3Gender=3; N=76) were excluded from the male-female comparisons.
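The two exclusion rules above can be sketched in pandas. This is only a minimal illustration on a made-up toy frame, not the report's actual pipeline: the column names `q8Student` and `q3Gender` and the codes 1 and 3 come from the methodology above, while the helper names and the toy values are my own assumptions.

```python
import pandas as pd


def professionals_only(numeric_df: pd.DataFrame) -> pd.DataFrame:
    """Drop respondents flagged as students (q8Student == 1)."""
    # Using != 1 keeps both 0 and missing values, since the code used for
    # non-students is not stated in the methodology (assumption).
    return numeric_df[numeric_df["q8Student"] != 1]


def binary_gender_only(numeric_df: pd.DataFrame) -> pd.DataFrame:
    """Drop non-binary respondents (q3Gender == 3) for male-female comparisons."""
    return numeric_df[numeric_df["q3Gender"] != 3]


# Toy stand-in for the numeric survey file (all values are made up).
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4, 5],
    "q8Student": [1, 0, 0, 1, 0],
    "q3Gender": [1, 2, 3, 2, 1],
})

report_sample = binary_gender_only(professionals_only(toy))
print(report_sample["RespondentID"].tolist())
```

On the real data, the same two filters would presumably be applied to the numeric file rather than to a toy frame.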

Women in Tech

We know that women in tech are a minority, but how has the situation evolved over the past years? More and more countries are making an effort to bring women into tech; has the situation improved? Let's dig deeper with this survey dataset!

Summary

Q1 - Which languages are the most popular ?
Q2 - Age distribution ?
Q3 - At which age do they begin coding, differences between genders ?
Q4 - Countries of Respondents ?
Q5 - Top countries characteristics - age began coding ?

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
import plotly
import  plotly.offline as py
py.init_notebook_mode(connected=True)
#import plotly.plotly as py
import plotly.graph_objs as go

df=pd.read_csv('HackerRank-Developer-Survey-2018-Values.csv', parse_dates=['StartDate','EndDate'])
df_n = pd.read_csv('HackerRank-Developer-Survey-2018-Numeric.csv', parse_dates=['StartDate','EndDate'])
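df_women = df[df.q3Gender == 'Female']
# Note: this keeps every row that is not 'Female', so it also includes
# non-binary respondents and rows with a missing gender
df_men = df[df.q3Gender != 'Female']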

C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2717: DtypeWarning:

Columns (10,19,137,138) have mixed types. Specify dtype option on import or set low_memory=False.

C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2717: DtypeWarning:

Columns (10,19,137,138,250) have mixed types. Specify dtype option on import or set low_memory=False.
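The warnings above suggest two fixes themselves: specify dtypes or set `low_memory=False`. Here is a small sketch of the second option on an in-memory toy CSV (the `load_survey` helper and the toy rows are hypothetical; only the `StartDate`/`EndDate` column names come from the notebook).

```python
import io

import pandas as pd


def load_survey(source):
    """Read a survey CSV in a single pass instead of in chunks.

    With low_memory=False pandas infers each column's dtype from the whole
    file at once, which avoids the mixed-type DtypeWarning.
    """
    return pd.read_csv(
        source,
        parse_dates=["StartDate", "EndDate"],
        low_memory=False,
    )


# Toy CSV standing in for the real survey file (values are made up).
toy_csv = io.StringIO(
    "RespondentID,StartDate,EndDate,q3Gender\n"
    "1,2017-10-19 11:51:00,2017-10-20 12:05:00,Female\n"
    "2,2017-10-21 09:00:00,2017-10-21 09:12:00,Male\n"
)
toy_df = load_survey(toy_csv)
print(toy_df.dtypes)
```

The trade-off is memory: the whole file is parsed in one go, which should be fine for a dataset of about 25K rows.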

df.shape

(25093, 251)

df = df.dropna(axis=0, how='all')
df.shape

(25092, 251)

#c = 0 
#for i in df.columns : 
#    print(i + " "+ str(c))
#    c+= 1

df.head(1)

	RespondentID	StartDate	EndDate	CountryNumeric	q1AgeBeginCoding	q2Age	q3Gender	q4Education	q0004_other	q5DegreeFocus	...	q30LearnCodeOther	q0030_other	q31Level3	q32RecommendHackerRank	q0032_other	q33HackerRankChallforJob	q34PositiveExp	q34IdealLengHackerRankTest	q0035_other	q36Level4
0	6.464454e+09	2017-10-19 11:51:00	2017-10-20 12:05:00	South Korea	16 - 20 years old	18 - 24 years old	Female	Some college	NaN	Computer Science	...	Other (please specify)	datacamp	num%2 == 0	Yes	NaN	No	NaN	#NULL!	NaN	Queue

1 rows × 251 columns

Let's explore which languages are the most popular among the respondents, broken down by gender!

I will go back to think about this section later...

# Copy the language columns so that adding 'Gender' modifies an independent
# DataFrame and does not raise a SettingWithCopyWarning
prog = df[df.columns[139:163]].copy()
prog['Gender'] = df['q3Gender']
prog = prog.dropna(axis=0, how='all')
prog.columns

Index(['q25LangC', 'q25LangCPlusPlus', 'q25LangJava', 'q25LangPython',
       'q25LangRuby', 'q25LangJavascript', 'q25LangCSharp', 'q25LangGo',
       'q25Scala', 'q25LangPerl', 'q25LangSwift', 'q25LangPascal',
       'q25LangClojure', 'q25LangPHP', 'q25LangHaskell', 'q25LangLua',
       'q25LangR', 'q25LangRust', 'q25LangTypescript', 'q25LangKotlin',
       'q25LangJulia', 'q25LangErlang', 'q25LangOcaml', 'q25LangOther',
       'Gender'],
      dtype='object')

prog[0:5]

	q25LangC	q25LangCPlusPlus	q25LangJava	q25LangPython	q25LangRuby	q25LangJavascript	q25LangCSharp	q25LangGo	q25Scala	q25LangPerl	...	q25LangLua	q25LangR	q25LangRust	q25LangTypescript	q25LangKotlin	q25LangJulia	q25LangErlang	q25LangOcaml	q25LangOther	Gender
0	Will Learn	Will Learn	Know	Will Learn	Will Learn	Will Learn	Will Learn	Will Learn	Will Learn	Will Learn	...	Will Learn	Know	Will Learn	Will Learn	Will Learn	Will Learn	Will Learn	Will Learn	NaN	Female
1	NaN	NaN	Know	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Will Learn	NaN	NaN	NaN	NaN	Male
2	Will Learn	Will Learn	Will Learn	Know	Will Learn	Know	Will Learn	Will Learn	Will Learn	Will Learn	...	Will Learn	Will Learn	Will Learn	Will Learn	Will Learn	Will Learn	Will Learn	Will Learn	NaN	Female
3	NaN	Know	Will Learn	Will Learn	Know	Will Learn	Know	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Male
4	NaN	NaN	NaN	NaN	NaN	Know	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Female

5 rows × 25 columns

for i in prog.columns[:-1] :
    print(i + ": "+str(prog[i].isnull().sum()))

q25LangC: 5904
q25LangCPlusPlus: 6218
q25LangJava: 3847
q25LangPython: 3440
q25LangRuby: 14793
q25LangJavascript: 4695
q25LangCSharp: 12500
q25LangGo: 14665
q25Scala: 16921
q25LangPerl: 18456
q25LangSwift: 16734
q25LangPascal: 19084
q25LangClojure: 19958
q25LangPHP: 12663
q25LangHaskell: 18584
q25LangLua: 19888
q25LangR: 15862
q25LangRust: 19340
q25LangTypescript: 16449
q25LangKotlin: 17186
q25LangJulia: 20517
q25LangErlang: 19933
q25LangOcaml: 20503
q25LangOther: 24011

colors = ["blue", "orange", "greyish", "faded green", "dusty purple"]
fig, axes = plt.subplots(figsize=(20, 20), ncols=5, nrows=5)
# One count plot per language column, filling the 5x5 grid row by row
for ax, col in zip(axes.flat, prog.columns[:-1]):
    sns.countplot(x=col, hue="Gender", data=prog, palette=sns.xkcd_palette(colors), ax=ax)
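The grid of count plots can also be condensed into a single table. The sketch below computes, for each language column, the percentage of respondents of each gender who answered "Know". It runs on a tiny made-up frame, the helper name is hypothetical, and treating a missing answer as "does not know" is my own assumption.

```python
import pandas as pd


def know_share_by_gender(prog: pd.DataFrame) -> pd.DataFrame:
    """Percentage of respondents per gender who answered 'Know', per language.

    Missing answers are counted as not knowing the language (assumption).
    """
    lang_cols = [c for c in prog.columns if c != "Gender"]
    knows = prog[lang_cols].eq("Know")  # True where the answer is 'Know'
    # Mean of a boolean column is the share of True values in each group.
    return knows.groupby(prog["Gender"]).mean().T.mul(100).round(1)


# Toy stand-in for the `prog` frame (values are made up).
toy_prog = pd.DataFrame({
    "q25LangJava": ["Know", "Know", "Will Learn", None],
    "q25LangPython": ["Will Learn", "Know", "Know", "Know"],
    "Gender": ["Female", "Male", "Female", "Male"],
})

shares = know_share_by_gender(toy_prog)
print(shares.sort_values("Female", ascending=False))
```

Sorting the result by one gender column gives a quick ranking of languages for that group.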

To be continued

Let's see how many women there are and compare the age distribution of men and women. The q1AgeBeginCoding column might also be interesting.

trace1 = go.Bar(
    x=df_men['q2Age'].value_counts().index.tolist(),
    y=np.multiply(np.divide(df_men['q2Age'].value_counts().tolist(),np.sum(df_men['q2Age'].value_counts().tolist())).tolist(),100).tolist(),
    name='Men Respondents'
)
trace2 = go.Bar(
    x=df_women['q2Age'].value_counts().index.tolist(),
    y=np.multiply(np.divide(df_women['q2Age'].value_counts().tolist(),np.sum(df_women['q2Age'].value_counts().tolist())).tolist(),100).tolist(),
    name='Female Respondents'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='grouped-bar')

trace1 = go.Bar(
    x=df_men['q1AgeBeginCoding'].value_counts().index.tolist(),
    y=np.multiply(np.divide(df_men['q1AgeBeginCoding'].value_counts().tolist(),np.sum(df_men['q1AgeBeginCoding'].value_counts().tolist())).tolist(),100).tolist(),
    name='Men Respondents'
)
trace2 = go.Bar(
    x=df_women['q1AgeBeginCoding'].value_counts().index.tolist(),
    y=np.multiply(np.divide(df_women['q1AgeBeginCoding'].value_counts().tolist(),np.sum(df_women['q1AgeBeginCoding'].value_counts().tolist())).tolist(),100).tolist(),
    name='Female Respondents'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='grouped-bar')

We can see that women tend to start coding later than men, especially in the "11-15 years old" beginner category (22% of men versus 13.8% of women). More than half of the women started coding between 16 and 20 years old.
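The percentages behind these charts are just each category's count divided by the total number of answers. A more compact way to get the same numbers is `value_counts(normalize=True)`. The sketch below uses made-up toy answers and a hypothetical helper name; the age labels only imitate the format seen in the data.

```python
import pandas as pd


def percent_distribution(series: pd.Series) -> pd.Series:
    """Share of each category in percent, ignoring missing answers."""
    return series.value_counts(normalize=True).mul(100).round(1)


# Toy answers (made up), mimicking the q1AgeBeginCoding labels.
toy_men = pd.Series([
    "11 - 15 years old", "11 - 15 years old",
    "16 - 20 years old", "16 - 20 years old",
])
toy_women = pd.Series([
    "16 - 20 years old", "16 - 20 years old", "16 - 20 years old",
    "11 - 15 years old", None,
])

comparison = pd.DataFrame({
    "Men": percent_distribution(toy_men),
    "Women": percent_distribution(toy_women),
}).fillna(0)
print(comparison)
```

The index and values of each column could then be passed to `go.Bar` as `x` and `y`.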

#df['time']=(df['EndDate']-df['StartDate']).astype('timedelta64[m]')

Let's draw a world map to see where the majority of our respondents come from.

focus_country = df['CountryNumeric'].value_counts().to_frame()
print("our TOP 10 country respondents is :") 
print(focus_country.head(10).index)

our TOP 10 country respondents is :
Index(['Ghana', 'India', 'United States', 'Sudan', 'Malaysia', 'Brazil',
       'Russian Federation', 'United Kingdom', 'Canada', 'Indonesia'],
      dtype='object')

data = [ dict(
        type = 'choropleth',
        locations = focus_country.index,
        locationmode = 'country names',
        z = focus_country['CountryNumeric'],
        text = focus_country['CountryNumeric'],
        colorscale = [[0,"rgb(5, 10, 172)"],[0.35,"rgb(40, 60, 190)"],[0.5,"rgb(70, 100, 245)"],\
            [0.6,"rgb(90, 120, 245)"],[0.7,"rgb(106, 137, 247)"],[1,"rgb(220, 220, 220)"]],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 1
            ) ),
        colorbar = dict(
            autotick = False,
            tickprefix = '',
            title = 'Respondents'),
      ) ]

layout = dict(
    title = 'Number of respondents by country',
    geo = dict(
        showframe = True,
        showcoastlines = True,
        projection = dict(
            type = 'Mercator'
        )
    )
)

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='d3-world-map' )

Source here : https://plot.ly/python/choropleth-maps/

It's surprising to see Ghana in first place. A map of the age at which respondents began coding, per country, would be useful to see whether every country needs to make an effort. I will also explore the careers, school degrees, and specialties of the respondents. #To follow
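As a first step toward such a map, here is a rough sketch that computes, per country, the percentage of respondents who began coding in a given age bucket. The helper name, the `min_respondents` parameter, and the toy rows are hypothetical; `CountryNumeric` and `q1AgeBeginCoding` are the columns used elsewhere in this notebook.

```python
import pandas as pd


def share_started_in(df: pd.DataFrame, age_bucket: str, min_respondents: int = 1) -> pd.Series:
    """Percent of respondents per country whose q1AgeBeginCoding equals age_bucket.

    Countries with fewer than min_respondents answers are dropped so that
    tiny samples do not dominate the map.
    """
    answered = df.dropna(subset=["q1AgeBeginCoding"])
    grouped = answered["q1AgeBeginCoding"].eq(age_bucket).groupby(answered["CountryNumeric"])
    share = grouped.mean().mul(100)
    return share[grouped.size() >= min_respondents].round(1)


# Toy rows (made up) standing in for the survey frame.
toy = pd.DataFrame({
    "CountryNumeric": ["Ghana", "Ghana", "Ghana", "India", "India"],
    "q1AgeBeginCoding": [
        "11 - 15 years old", "16 - 20 years old", None,
        "16 - 20 years old", "16 - 20 years old",
    ],
})

early = share_started_in(toy, "11 - 15 years old")
print(early)
```

The index of the resulting Series could serve as `locations` and its values as `z` in a choropleth trace.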

Let's see the age at which respondents from the top countries learned to code.

df_men_c = [0,0,0]
df_women_c = [0,0,0]
count = 0
for i in focus_country.head(3).index : 
    df_men_c[count] = df_men[df_men['CountryNumeric'] == i]
    df_women_c[count] = df_women[df_women['CountryNumeric'] == i]
    print('N° of Male respondents for '+ i + ' is : '+ str(df_men_c[count].shape[0]))
    print('N° of Female respondents for '+ i + ' is : '+ str(df_women_c[count].shape[0]))
    
    trace1 = go.Bar( 
    x=df_men_c[count]['q1AgeBeginCoding'].value_counts().index.tolist(),
    y=np.multiply(np.divide(df_men_c[count]['q1AgeBeginCoding'].value_counts().tolist(),np.sum(df_men_c[count]['q1AgeBeginCoding'].value_counts().tolist())).tolist(),100).tolist(),
    name='Men Respondents in '+i
    )
    trace2 = go.Bar(
    x=df_women_c[count]['q1AgeBeginCoding'].value_counts().index.tolist(),
    y=np.multiply(np.divide(df_women_c[count]['q1AgeBeginCoding'].value_counts().tolist(),np.sum(df_women_c[count]['q1AgeBeginCoding'].value_counts().tolist())).tolist(),100).tolist(),
    name='Female Respondents in '+i
    )

    data = [trace1, trace2]
    layout = go.Layout(
        barmode='group'
    )

    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='grouped-bar')
    count = count + 1

N° of Male respondents for Ghana is : 3510
N° of Female respondents for Ghana is : 892

N° of Male respondents for India is : 3167
N° of Female respondents for India is : 567

N° of Male respondents for United States is : 2640
N° of Female respondents for United States is : 546

We observe that most people learn to code between 16 and 20 years old. However, in India the second most represented group of beginners is 21-25 years old! That is not the case in Ghana and the USA, where the second most represented group seems to be 11-15 years old. Girls are nevertheless underrepresented in the 11-15 years old category in the USA. Maybe the USA and India should make an effort to help them learn to code earlier?
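To compare these groups without reading three separate charts, the same percentages can be put into one table. The sketch below uses `pd.crosstab` with row-wise normalization on made-up toy rows; the helper name is hypothetical, and the column names are the ones used in this notebook.

```python
import pandas as pd


def age_began_by_country_gender(df: pd.DataFrame, countries) -> pd.DataFrame:
    """Row-wise percentage of q1AgeBeginCoding per (country, gender) pair."""
    subset = df[df["CountryNumeric"].isin(countries)]
    table = pd.crosstab(
        index=[subset["CountryNumeric"], subset["q3Gender"]],
        columns=subset["q1AgeBeginCoding"],
        normalize="index",  # each row sums to 1 before scaling to percent
    )
    return table.mul(100).round(1)


# Toy rows (made up) standing in for the survey frame.
toy = pd.DataFrame({
    "CountryNumeric": ["India", "India", "India", "India", "Ghana", "United States"],
    "q3Gender": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "q1AgeBeginCoding": [
        "16 - 20 years old", "21 - 25 years old",
        "16 - 20 years old", "16 - 20 years old",
        "11 - 15 years old", "11 - 15 years old",
    ],
})

table = age_began_by_country_gender(toy, ["Ghana", "India"])
print(table)
```

Each row is one country and gender pair, so differences such as the share of 21-25 beginners in India can be read directly across rows.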

jasminyas / hackerrank_developer_survey_2018

Next, let's see whether people who started to code kept going. To be continued :) Don't hesitate to comment and upvote if you liked this kernel!
