pylinkedin

Introduction

pylinkedin is a Python package for scraping the details of public LinkedIn profiles. It can also be used as a parser to transform saved LinkedIn profile HTML into structured JSON.

Some precautions you should take if you want to scrape LinkedIn with Python:

  • Replace the default python requests user-agent with a browser user-agent such as "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1".
  • LinkedIn enforces strong anti-scraping policies by default. It quickly blacklists IPs that make unauthenticated requests, answering them with the non-standard HTTP status code 999. In particular, LinkedIn has banned most IPs belonging to cloud providers (AWS, DigitalOcean, ...).
  • You can use proxies if you want to make concurrent requests, but they may also fail often.
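The first two precautions can be illustrated with a minimal sketch. It uses only the standard library's urllib rather than pylinkedin or requests, and the names build_request, is_blocked and fetch are hypothetical helpers introduced here for illustration, not part of the package.

```python
import urllib.error
import urllib.request

# Browser user-agent string quoted in the precautions above.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
              "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1")

# Non-standard status code LinkedIn returns to blacklisted clients.
LINKEDIN_BLOCKED_STATUS = 999


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def is_blocked(status_code):
    """Tell whether a status code is LinkedIn's anti-scraping response."""
    return status_code == LINKEDIN_BLOCKED_STATUS


def fetch(url, timeout=10):
    """Return (status, body); body is None when the server rejects the request."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for any non-2xx status, including 999.
        return err.code, None
```

A caller would check is_blocked(status) after fetch(url) and stop or switch IP instead of retrying from the same address.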

Installation

Install with pip

Run pip install git+https://github.com/ericfourrier/scrape-linkedin.git

Install from source

git clone https://github.com/ericfourrier/scrape-linkedin.git

Run python setup.py install

Tests

The tests run against an HTML file saved from a LinkedIn profile. The main reason is that Travis uses AWS machines, whose IPs are banned by LinkedIn.

As a result, a passing test suite is not a good indicator that the package will work for you: your IP may be banned, or LinkedIn's HTML source may have changed.

You can still run the test suite at the root of the package with pytest: py.test test.py.

Using this package

Command line

pylinkedin comes with a simple command-line tool, pylinkedin.

Options:

  • -u --url : URL of the profile you want to scrape
  • -a --attribute : Display only a specific attribute (everything is displayed by default)
  • -f --file_path : Path to the HTML file of the profile you want to parse
  • -h --help : Show this screen.

Examples:

  • Get Jeff Weiner's profile as JSON: pylinkedin -u https://www.linkedin.com/in/jeffweiner08
  • Get only Jeff Weiner's skills: pylinkedin -u https://www.linkedin.com/in/jeffweiner08 -a skills
  • Parse a profile stored as HTML and get JSON: pylinkedin -f /path/file.html
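The command line can also be driven from a script. The sketch below builds the argument list from the options listed above and runs it with subprocess. build_command and run_pylinkedin are hypothetical helpers, and the sketch assumes that the tool is on the PATH and prints JSON to standard output, which this document only implies.

```python
import json
import subprocess


def build_command(url=None, file_path=None, attribute=None):
    """Build the pylinkedin argument list from the documented options."""
    cmd = ["pylinkedin"]
    if url is not None:
        cmd += ["-u", url]
    if file_path is not None:
        cmd += ["-f", file_path]
    if attribute is not None:
        cmd += ["-a", attribute]
    return cmd


def run_pylinkedin(**options):
    """Run the CLI and decode its output (assumption: stdout is JSON)."""
    result = subprocess.run(build_command(**options),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

For example, build_command(file_path="/path/file.html") yields the same command as the last example above.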

Python Package

The package relies on two classes:

CustomRequest is a way to customise your HTTP requests by specifying a list of user-agents or proxies.

from pylinkedin.utils import CustomRequest
c = CustomRequest() # default with rotating proxies
c = CustomRequest(rotating_ua=False) # without rotating user-agent
c = CustomRequest(list_proxies=[{'https': 'http://186.233.94.106:8080',
                                 'http': 'http://186.233.94.106:8080'}]) # with an explicit list of proxies
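How CustomRequest rotates user-agents or proxies internally is not described here. As a purely illustrative sketch of what "rotating" means, the hypothetical RotatingPool class below hands out entries round-robin; it is not the package's implementation, which may for instance pick entries at random.

```python
import itertools


class RotatingPool:
    """Hypothetical round-robin pool of user-agents or proxy dicts."""

    def __init__(self, items):
        items = list(items)
        if not items:
            raise ValueError("the pool needs at least one entry")
        self._cycle = itertools.cycle(items)

    def next(self):
        """Return the next entry, wrapping around at the end of the list."""
        return next(self._cycle)
```

Each outgoing request would call next() once, so consecutive requests go out with different user-agents or through different proxies.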

LinkedinItem is the main class. You can instantiate it with the URL of a public profile, using the url parameter, or with the HTML contents of the profile page, using html_string. See test.py for an example of using a saved HTML file as input for the scraper.

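from pylinkedin.scraper import LinkedinItem
# from a public profile URL
l = LinkedinItem(url='https://www.linkedin.com/in/kennethreitz')
# or from the HTML of a profile page, already loaded into the string profile_string
l = LinkedinItem(html_string=profile_string)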

You can customize the requests made by LinkedinItem by passing it a CustomRequest instance:

from pylinkedin.utils import CustomRequest
from pylinkedin.scraper import LinkedinItem
c = CustomRequest(rotating_ua=True)
url_to_scrape = "https://www.linkedin.com/in/jeffweiner08"
l = LinkedinItem(url=url_to_scrape, crequest=c) # passing requests with rotating user-agent

To use the html_string, make sure to browse to the public version of the profile page, as the private version will not work. The private version is the one showing the edit controls next to each section.
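A minimal sketch of the html_string route, assuming the public profile page was saved to disk from a browser. read_profile_html and parse_saved_profile are hypothetical helper names, and the UTF-8 encoding is an assumption about how the file was saved.

```python
from pathlib import Path


def read_profile_html(path, encoding="utf-8"):
    """Read a saved public profile page into a string (encoding is assumed)."""
    return Path(path).read_text(encoding=encoding)


def parse_saved_profile(path):
    """Parse a saved profile page and return all scraped fields."""
    # Imported here so the file-reading helper works without pylinkedin installed.
    from pylinkedin.scraper import LinkedinItem

    item = LinkedinItem(html_string=read_profile_html(path))
    return item.to_dict()
```

Calling parse_saved_profile("/path/file.html") then needs no network access, so a banned IP does not affect it.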

LinkedinItem exposes the scraped information through the following attributes:

l.name # to get the name
l.skills # to get the skills
l.publications # to get the publications
...
# the most important
l.to_dict() # to get all the info

Exhaustive list of the fields scraped

[volunteerings, last_name, number_recommendations, number_connections, current_location, honors, first_name, current_title, test_scores, current_industry, languages, similar_profiles, interests, profile_img_url, current_education, educations, experiences, groups, organizations, certifications, name, skills, websites, summary, project, courses, publications, recommendations]
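Once a profile has been turned into a dictionary, a subset of these fields can be kept and serialized. The sketch below assumes that to_dict() returns a plain, JSON-serializable dict keyed by the field names above; select_fields and to_json are hypothetical helpers, not part of the package.

```python
import json


def select_fields(profile, fields):
    """Keep only the requested fields that are present in the profile dict."""
    return {name: profile[name] for name in fields if name in profile}


def to_json(profile, fields=None):
    """Serialize the whole profile, or only the selected fields, as JSON text."""
    data = profile if fields is None else select_fields(profile, fields)
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False)
```

For instance, to_json(l.to_dict(), fields=["name", "skills"]) would produce a small JSON document holding only those two fields.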

Issues

Package is not actively maintained.

You can post bugs and issues here.
