heisid / hinativescrap

Scraping HiNative users' data

Python 100.00%

hinativescrap

Scraping HiNative Users Data

Usage

I provide my results in the file results_clean.csv, which contains:

Username (48979 users)
Native languages
Languages of interest

There's also results.csv, with 50000 entries, including users that no longer exist (possibly unconfirmed registrations or deleted accounts). It took about 2 hours to collect the data.
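
The step from results.csv to results_clean.csv can be sketched as dropping rows that have no username. This is a minimal sketch, not the repository's cleaning.py. The column names `username`, `native_languages`, and `languages_of_interest` are assumptions (the real CSV headers are not shown here), and the sample data is made up.

```python
import csv
import io


def drop_empty_rows(rows, key="username"):
    """Keep only rows whose `key` field is non-empty after stripping whitespace."""
    return [row for row in rows if (row.get(key) or "").strip()]


# Tiny in-memory stand-in for results.csv (hypothetical column names and data).
sample = io.StringIO(
    "username,native_languages,languages_of_interest\n"
    "alice,English (US),Japanese\n"
    ",,\n"
    "bob,Russian,English (US)\n"
)

rows = list(csv.DictReader(sample))
clean = drop_empty_rows(rows)
print(len(rows), "rows scraped,", len(clean), "rows kept")
```

To try it on a real file, replace the `io.StringIO` sample with `open("results.csv", newline="")` and adjust `key` to the actual header name.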

In case you want to try scraping yourself:

Install Python 3, venv, and pip
Create virtual environment

python -m venv env
source env/bin/activate

Install Scrapy

pip install scrapy

Play with the range of user IDs in hinative/spiders/basic.py (I don't know the correct term for this; I mean the unique integer in each user's profile URL). Look at the variable user_range. Warning: don't be greedy. I once crawled 100,000 pages and got my IP blacklisted.
Run it!

scrapy crawl basic -o filename.csv

To remove empty rows from the data, run cleaning.py. To reorder the columns, run reordering.py.
You can modify items.py and basic.py to collect more information, such as the countries users know well.
To make further analysis easier, install pandas.

pip install pandas
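
The user ID range from the steps above can be sketched without Scrapy as a plain generator of profile URLs. This is an illustration, not the contents of hinative/spiders/basic.py: the URL pattern `https://hinative.com/profiles/<id>`, the function name `profile_urls`, and the assumption that `user_range` is a Python `range` are all guesses. The dictionary at the bottom uses standard Scrapy setting names for slowing a crawl down, which may help avoid the blacklisting mentioned above; the values are arbitrary.

```python
# Assumed URL pattern; check a real profile URL in your browser before relying on it.
PROFILE_URL = "https://hinative.com/profiles/{user_id}"


def profile_urls(user_range):
    """Yield one profile URL per user ID in user_range."""
    for user_id in user_range:
        yield PROFILE_URL.format(user_id=user_id)


# Hypothetical small range: start modestly rather than crawling 100,000 pages.
user_range = range(1, 6)
start_urls = list(profile_urls(user_range))
print(start_urls[0], "...", start_urls[-1])

# Standard Scrapy settings that throttle requests (values are illustrative).
# In a Scrapy project these would go in settings.py or a spider's custom_settings.
custom_settings = {
    "DOWNLOAD_DELAY": 1.0,                # seconds to wait between requests
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server response times
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
}
```

Keeping the range small on the first run makes it easy to confirm the spider parses profiles correctly before committing to a long crawl.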

Counting

I counted the data with counting.py, and the results are listed in rank_results.txt. (I was mistaken when I said there were more Korean natives than Arabic ones.) From 48979 samples, I got:

Native languages

Native Language	Number of Users	Percentage
English (US)	8483	17.320%
Russian	6596	13.467%
Arabic	6318	12.899%
Polish	3764	7.685%
Portuguese (Brazil)	3509	7.164%

Languages of Interest

Language of Interest	Number of Users	Percentage
English (US)	28619	58.431%
Japanese	11771	24.033%
Korean	9996	20.409%
English (UK)	6457	13.183%
French (France)	4642	9.478%
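
Each percentage above is a count divided by the 48979 users in the sample. The arithmetic can be sketched as follows; this is not the repository's counting.py, and the helper name `percentage` is made up for illustration. Since a user can list several languages, the percentages within one table need not sum to 100%.

```python
from collections import Counter

TOTAL_USERS = 48979  # sample size stated above


def percentage(count, total=TOTAL_USERS):
    """Return count as a percentage of total, rounded to three decimals."""
    return round(100 * count / total, 3)


# Counts taken from the native-language table above.
native = Counter({"English (US)": 8483, "Russian": 6596, "Arabic": 6318})

# Print rows in descending order of count, in the same layout as the tables.
for language, count in native.most_common():
    print(f"{language}\t{count}\t{percentage(count):.3f}%")
```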
