hinativescrap

Scraping HiNative User Data

Usage

My results are in the file results_clean.csv, which contains:

  • Username (48979 users)
  • Native languages
  • Languages of interest

There's also results.csv, with 50000 entries, including users that no longer exist (possibly unconfirmed registrations or deleted accounts). Collecting the data took about 2 hours.
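The gap between the two files (50000 entries versus 48979 users) would be the rows left empty by missing users. Here is a minimal sketch of how such rows could be filtered out using only the standard library. It is not the actual cleaning.py; the column names (username, native_languages, languages_of_interest) and the helper name drop_empty_rows are assumptions made for illustration.

```python
import csv
import io


def drop_empty_rows(src, dst, key="username"):
    """Copy CSV rows from src to dst, skipping rows whose `key` column is blank.

    Returns (kept, dropped). The column name "username" is an assumption,
    not necessarily what the real results.csv uses.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = dropped = 0
    for row in reader:
        if (row.get(key) or "").strip():
            writer.writerow(row)
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Tiny in-memory demo; real use would pass open file handles instead.
raw = io.StringIO(
    "username,native_languages,languages_of_interest\n"
    "alice,English (US),Japanese\n"
    ",,\n"
    "bob,Russian,English (US)\n"
)
clean = io.StringIO()
kept, dropped = drop_empty_rows(raw, clean)
print(kept, dropped)
```

With real files, open results.csv for reading and results_clean.csv for writing (both with newline="") and pass the two handles in.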

In case you want to try scraping yourself:

  1. Install Python 3, venv, and pip.
  2. Create and activate a virtual environment:
     python -m venv env
     source env/bin/activate
  3. Install Scrapy:
     pip install scrapy
  4. Adjust the range of user IDs (the unique integer in each user's profile URL) in hinative/spiders/basic.py; see the variable user_range. Warning: don't be greedy. I once crawled 100,000 pages and got my IP blacklisted.
  5. Run it:
     scrapy crawl basic -o filename.csv
  6. To remove empty rows from the data, run cleaning.py. To reorder the columns, run reordering.py.
  7. You can modify items.py and basic.py to collect more information, such as the countries users know well.
  8. To make further analysis easier, install pandas:
     pip install pandas
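Given the blacklisting warning above, it may help to slow the crawler down and to work through the ID range in small batches. The sketch below shows both ideas. The setting names are standard Scrapy settings, but the values are guesses rather than tested limits, and placing them in hinative/settings.py assumes the usual Scrapy project layout. The helper id_batches is hypothetical and not part of this repository.

```python
# Fragment for hinative/settings.py (assumed location; values are guesses).
ROBOTSTXT_OBEY = True                 # respect the site's robots.txt
DOWNLOAD_DELAY = 1.0                  # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # one request at a time per domain
AUTOTHROTTLE_ENABLED = True           # adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0


def id_batches(start, stop, size):
    """Yield range objects covering [start, stop) in chunks of at most `size` IDs.

    Hypothetical helper: crawl one batch per run instead of one huge range.
    """
    for lo in range(start, stop, size):
        yield range(lo, min(lo + size, stop))


# Example: split IDs 1..50000 into batches of 10000.
batches = list(id_batches(1, 50001, 10000))
print(len(batches), batches[0], batches[-1])
```

One batch could then be assigned to user_range per run, assuming that variable accepts a range. Throttling lowers the load on the site but does not guarantee that an IP will not be blocked.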

Counting

I counted the data with counting.py, and the results are listed in rank_results.txt. (I was wrong when I said earlier that there were more Korean natives than Arabic natives.) From 48979 samples, I got:

Native languages

Native Language     | Number of Users | Percentage
English (US)        | 8483            | 17.320%
Russian             | 6596            | 13.467%
Arabic              | 6318            | 12.899%
Polish              | 3764            | 7.685%
Portuguese (Brazil) | 3509            | 7.164%

Languages of Interest

Language of Interest | Number of Users | Percentage
English (US)         | 28619           | 58.431%
Japanese             | 11771           | 24.033%
Korean               | 9996            | 20.409%
English (UK)         | 6457            | 13.183%
French (France)      | 4642            | 9.478%
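The percentages are relative to the number of users, not to the number of language mentions. Because a user can list several languages, the "Languages of Interest" column can add up to more than 100%. Here is a minimal sketch of that kind of count. It is not the actual counting.py; the function name rank_languages and the sample rows are made up for illustration.

```python
from collections import Counter


def rank_languages(rows, total=None):
    """Return [(language, count, percent)] sorted by count, descending.

    Each row is the list of languages for one user. Percentages are relative
    to the number of users, so they can sum to more than 100%.
    """
    # set() ensures a language repeated within one user's row is counted once.
    counts = Counter(lang for langs in rows for lang in set(langs))
    total = total or len(rows)
    return [(lang, n, 100 * n / total) for lang, n in counts.most_common()]


# Made-up sample of four users.
sample = [
    ["English (US)", "Japanese"],
    ["English (US)"],
    ["Korean", "English (US)"],
    ["Japanese"],
]
for lang, n, pct in rank_languages(sample):
    print(f"{lang:<14} {n:>3} {pct:7.3f}%")

# Reproducing one figure from the tables above: 8483 of 48979 users.
print(f"{100 * 8483 / 48979:.3f}%")  # 17.320%
```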
