eracle / linkedin

LinkedIn scraper using Selenium WebDriver, headless Chromium, Docker, and Scrapy

License: Other

Python 93.89% Shell 1.55% Makefile 2.24% Dockerfile 2.33%
scrapy selenium-webdriver bot scraper scraping linkedin chromium-browser docker docker-compose

linkedin's Introduction

Sponsor:

Proxycurl APIs enrich people and company profiles with structured data

Scrape public LinkedIn people and company profile data at scale with Proxycurl APIs.

  • Scraping public profiles is battle-tested in court (the hiQ vs. LinkedIn case)
  • GDPR, CCPA, SOC2 compliant
  • High rate limit - 300 requests/minute
  • Fast - APIs respond in ~2s
  • Fresh data - 88% of data is scraped in real time; the remaining 12% is no older than 29 days
  • High accuracy
  • Tons of data points returned per profile

Built for developers, by developers.

LinkedIn Data Scraper

Built with Python 3 and Selenium

LinkedIn Data Scraper is a powerful open-source tool designed to extract valuable data from LinkedIn. It leverages technologies such as Scrapy, Selenium WebDriver, Chromium, Docker, and Python3 to navigate LinkedIn profiles and gather insightful information.

Features

Profile Data Extraction

The tool is designed to visit LinkedIn user pages and extract valuable data, including phone numbers, emails, education, work experience, and much more. The data is written to a CSV file, making it easy to use for further analysis or as input for LinkedIn automation software like lemlist.
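As a rough illustration of the CSV step, here is a minimal sketch of writing extracted profile fields to a CSV file. The field names are assumptions for the example, not the project's actual column set.

import csv

# Example column set only; the real spiders may emit different fields.
PROFILE_FIELDS = ["name", "headline", "email", "phone", "education", "experience"]

def write_profiles_csv(profiles, path="profiles.csv"):
    """Write a list of profile dicts to a CSV file, one row per profile."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=PROFILE_FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(profiles)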

Company Data Extraction

The tool can also gather information about all users working for a specific company on LinkedIn. It navigates to the company's LinkedIn page, clicks on the "See all employees" button, and collects user-related data.
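To make that navigation flow concrete, here is a hedged Selenium sketch of visiting a company page and clicking through to its employee list. It assumes an already logged-in WebDriver session; the link text and example URLs are assumptions, not the project's exact selectors.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def open_company_employees(driver, company_url):
    """Open a company page and click the link to its employee list."""
    driver.get(company_url)
    link = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "employees"))
    )
    link.click()

# Hypothetical usage against the project's Selenium hub:
# driver = webdriver.Remote("http://selenium:4444/wd/hub", options=webdriver.ChromeOptions())
# open_company_employees(driver, "https://www.linkedin.com/company/example")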

Name-Based Data Extraction

The tool also offers a feature that lets you extract data based on a specific name. By adding a person's name to the names.txt file, the tool navigates to the LinkedIn profiles associated with that name and extracts the relevant data. This can be incredibly useful for targeted research or networking. To use this feature, run the make byname command and enter the name when prompted.
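As a rough sketch of how names could drive the crawl, the snippet below reads names.txt (one name per line) and builds LinkedIn people-search URLs. The URL format is an assumption for illustration, not necessarily what the byname spider uses internally.

from urllib.parse import quote

def search_urls_from_file(path="names.txt"):
    """Read one name per line and return a people-search URL for each."""
    with open(path, encoding="utf-8") as f:
        names = [line.strip() for line in f if line.strip()]
    return [
        f"https://www.linkedin.com/search/results/people/?keywords={quote(name)}"
        for name in names
    ]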

Installation and Setup

You will need the following:

  • Docker
  • Docker Compose
  • A VNC viewer (e.g., Vinagre for Ubuntu)

Steps

  1. Prepare your environment: Install Docker and Docker Compose from the official website. If you don't have a VNC viewer, install one. For Ubuntu, you can use Vinagre:

sudo apt-get update
sudo apt-get install vinagre

  2. Set up your LinkedIn login and password: Copy conf_template.py to conf.py and fill in your LinkedIn credentials (a minimal sketch of a possible conf.py follows these steps).

  3. Build and run the containers with Docker Compose: Open your terminal, navigate to the project folder, and type one of:

make companies
make random
make byname

  4. Monitor the browser's activity: Open Vinagre and connect to localhost:5900. The password is secret. Alternatively, you can use the command:

make view

  5. Stop the scraper: To stop the scraper, use the command:

make down
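For step 2, here is a minimal sketch of what conf.py might contain. The variable names are assumptions; keep whatever names conf_template.py actually defines, and never commit real credentials.

# conf.py - illustrative sketch only; copy conf_template.py and keep its variable names.
EMAIL = "your-linkedin-email@example.com"    # assumed name, check conf_template.py
PASSWORD = "your-linkedin-password"          # assumed name, check conf_template.py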

Testing

make test

Legal Disclaimer

This code is not affiliated with, authorized, maintained, sponsored, or endorsed by LinkedIn or any of its affiliates or subsidiaries. This is an independent and unofficial project. Use at your own risk.

This project violates LinkedIn's User Agreement Section 8.2. As a result, LinkedIn may temporarily or permanently ban your account. We are not responsible for any actions taken by LinkedIn in response to the use of this tool.


linkedin's People

Contributors

eminetto, eracle, josephlimtech, pyup-bot


linkedin's Issues

Scrape by profile urls

Appreciate the tool you've built, but it's not clear from the documentation whether we can also scrape by a specific LinkedIn URL if we've already collected a list of people and just want to scrape all their contact/company data.

make companies execution error

After running make companies, the following error occurred:

scrapy_companies_1  | 2023-09-18 22:13:43 [scrapy.middleware] INFO: Enabled item pipelines:
scrapy_companies_1  | []
scrapy_companies_1  | INFO:scrapy.core.engine:Spider opened
scrapy_companies_1  | 2023-09-18 22:13:43 [scrapy.core.engine] INFO: Spider opened
scrapy_companies_1  | INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
scrapy_companies_1  | 2023-09-18 22:13:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
scrapy_companies_1  | Traceback (most recent call last):
scrapy_companies_1  |   File "/usr/local/bin/scrapy", line 8, in <module>
scrapy_companies_1  |     sys.exit(execute())
scrapy_companies_1  |   File "/usr/local/lib/python3.8/site-packages/scrapy/cmdline.py", line 158, in execute
scrapy_companies_1  |     _run_print_help(parser, _run_command, cmd, args, opts)
scrapy_companies_1  |   File "/usr/local/lib/python3.8/site-packages/scrapy/cmdline.py", line 111, in _run_print_help
scrapy_companies_1  |     func(*a, **kw)
scrapy_companies_1  |   File "/usr/local/lib/python3.8/site-packages/scrapy/cmdline.py", line 166, in _run_command
scrapy_companies_1  |     cmd.run(args, opts)
scrapy_companies_1  |   File "/usr/local/lib/python3.8/site-packages/scrapy/commands/crawl.py", line 30, in run
scrapy_companies_1  |     self.crawler_process.start()
scrapy_companies_1  |   File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 383, in start
scrapy_companies_1  |     install_shutdown_handlers(self._signal_shutdown)
scrapy_companies_1  |   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/ossignal.py", line 19, in install_shutdown_handlers
scrapy_companies_1  |     reactor._handleSignals()
scrapy_companies_1  | AttributeError: 'EPollReactor' object has no attribute '_handleSignals'

I'm not exactly sure what the issue is, but it may be related to Scrapy dependency issues - https://stackoverflow.com/questions/76995567/error-when-crawl-data-epollreactor-object-has-no-attribute-handlesignals

DJANGO_SETTINGS_MODULE

[root@iZj6cay283146nzk0n83czZ scrapyLinkedIn]# scrapy crawl linkedin
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.4/site-packages/scrapy/cmdline.py", line 141, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/usr/local/lib/python3.4/site-packages/scrapy/crawler.py", line 238, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/usr/local/lib/python3.4/site-packages/scrapy/crawler.py", line 129, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/usr/local/lib/python3.4/site-packages/scrapy/crawler.py", line 325, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/usr/local/lib/python3.4/site-packages/scrapy/spiderloader.py", line 33, in from_settings
    return cls(settings)
  File "/usr/local/lib/python3.4/site-packages/scrapy/spiderloader.py", line 20, in __init__
    self._load_all_spiders()
  File "/usr/local/lib/python3.4/site-packages/scrapy/spiderloader.py", line 29, in _load_all_spiders
    self._load_spiders(module)
  File "/usr/local/lib/python3.4/site-packages/scrapy/spiderloader.py", line 23, in _load_spiders
    for spcls in iter_spider_classes(module):
  File "/usr/local/lib/python3.4/site-packages/scrapy/utils/spider.py", line 25, in iter_spider_classes
    if inspect.isclass(obj) and
  File "/usr/local/lib/python3.4/inspect.py", line 83, in isclass
    return isinstance(object, type)
  File "/usr/local/lib/python3.4/site-packages/django/utils/functional.py", line 226, in inner
    self._setup()
  File "/usr/local/lib/python3.4/site-packages/django/conf/__init__.py", line 42, in _setup
    % (desc, ENVIRONMENT_VARIABLE))
django.core.exceptions.ImproperlyConfigured: Requested settings, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.
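The exception itself points at the fix: Django's settings must be configured before Scrapy loads spiders that import Django models. A minimal sketch of the usual pattern, where the settings module path is a placeholder:

import os

import django

# "yourproject.settings" is a placeholder; use your own Django settings module.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "yourproject.settings")
django.setup()

This would typically go at the top of the Scrapy settings module, or wherever the process starts, before any spider imports Django models.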

Adapting for comment scraping?

Hello. I have a single post with over 10,000 comments, and I need to scrape every one of them. How would I go about adapting this code to do that?
Thank you for your help,
Tyler S

use travis ci or other ci tool for github

It would be cool to have a CI configuration that checks for each PR that the tests pass.
By the way, right now launching the tests requires entering a LinkedIn username and password. A better approach, such as testing against saved HTML templates, would be welcome; in that case copyright issues should be taken into account.

Build fail

Just checked out the repo, so this is a fresh setup:

# make companies
docker-compose build
[+] Building 5.2s (4/4) FINISHED                                                                              docker-container:autogpt
 => CANCELED [scrapy_random internal] booting buildkit                                                                            5.2s
 => => pulling image moby/buildkit:buildx-stable-1                                                                                5.1s
 => => creating container buildx_buildkit_autogpt0                                                                                0.0s
 => ERROR [scrapy_test internal] booting buildkit                                                                                 5.2s
 => => pulling image moby/buildkit:buildx-stable-1                                                                                5.1s
 => => creating container buildx_buildkit_autogpt0                                                                                0.0s
 => CANCELED [scrapy_companies internal] booting buildkit                                                                         5.2s
 => => pulling image moby/buildkit:buildx-stable-1                                                                                5.1s
 => => creating container buildx_buildkit_autogpt0                                                                                0.0s
 => CANCELED [scrapy_byname internal] booting buildkit                                                                            5.2s
 => => pulling image moby/buildkit:buildx-stable-1                                                                                5.1s
 => => creating container buildx_buildkit_autogpt0                                                                                0.0s
------
 > [scrapy_test internal] booting buildkit:
------
Error response from daemon: Conflict. The container name "/buildx_buildkit_autogpt0" is already in use by container "c38d40c96d8bedc15eb11c9d499f07cf80fa347ba7bfd3cc5e5967136aacf754". You have to remove (or rename) that container to be able to reuse that name.
make: *** [build] Error 17

search.py -- Exception Occurred: # extract_profile_id_from_url

Hi, I get errors if the profile has limited visibility.
I tried to solve it by adding a check for whether the profile has limited visibility and skipping it in those cases.

linkedin/spiders/search.py

        # Profiles out of your network have limited visibility.
        profile_out = '//*[contains(@class,"actor-name") and contains(text(),"LinkedIn Member")]'
        profile_limited_visibility = get_by_xpath_or_none(driver=driver,
                                                          xpath=profile_out,
                                                          wait_timeout=NO_RESULT_WAIT_TIMEOUT,
                                                          logs=False)

        if no_result_response is not None:
            # No results message shown: stop crawling this company.
            driver.close()
            return
        else:
            users = extracts_linkedin_users(driver, api_client=self.api_client)
            for user in users:
                # get_by_xpath_or_none returns None when the "LinkedIn Member"
                # placeholder is absent, i.e. the profile is fully visible.
                if profile_limited_visibility is None:
                    if stop_criteria is not None:
                        if stop_criteria(user, stop_criteria_args):
                            # If the stop criteria is matched, stop the crawl, including the next pages.
                            driver.close()
                            return

                    yield user


This is the error:

Exception Occurred:
XPATH:.//*[@class="search-result__result-link ember-view"]
Error:Message:

Exception Occurred:
XPATH:.//*[@class="name actor-name"]
Error:Message:

ERROR:scrapy.core.scraper:Spider error processing <GET https://www.linkedin.com/search/results/people/?facetCurrentCompany=%5B%2211885%22%5D&page=3> (referer: https://www.linkedin.com/search/results/people/?facetCurrentCompany=%5B%2211885%22%5D&page=2)
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/scrapy/utils/defer.py", line 117, in iter_errback
    yield next(it)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 345, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 345, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 338, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/app/projects/linkedin/spiders/search.py", line 55, in parser_search_results_page
    for user in users:
  File "/app/projects/linkedin/spiders/search.py", line 110, in extracts_linkedin_users
    profile_id = link.split('/')[-2]
AttributeError: 'NoneType' object has no attribute 'split'
2020-05-30 19:58:23 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.linkedin.com/search/results/people/?facetCurrentCompany=%5B%2211885%22%5D&page=3> (referer: https://www.linkedin.com/search/results/people/?facetCurrentCompany=%5B%2211885%22%5D&page=2)
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/scrapy/utils/defer.py", line 117, in iter_errback
    yield next(it)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 345, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 345, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 338, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/app/projects/linkedin/spiders/search.py", line 55, in parser_search_results_page
    for user in users:
  File "/app/projects/linkedin/spiders/search.py", line 110, in extracts_linkedin_users
    profile_id = link.split('/')[-2]
AttributeError: 'NoneType' object has no attribute 'split'


Does somebody know how to fix it?
Thanks in advance,
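For what it's worth, the traceback shows link.split('/') being called on None, i.e. no profile link could be extracted for that search result (typically the limited-visibility "LinkedIn Member" entries). A minimal defensive sketch, not the project's actual fix:

def extract_profile_id(link):
    """Return the profile id from a search-result link, or None when the
    link is missing (e.g. limited-visibility "LinkedIn Member" results)."""
    if not link:
        return None
    # Mirrors the original link.split('/')[-2], which assumes a trailing slash.
    return link.split('/')[-2]

Callers would then skip results where this returns None instead of crashing the whole page parse.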

update cryptography dependency

Here's the following message github is showing me

 We found a potential security vulnerability in one of your dependencies.

A dependency defined in requirements.txt has known security vulnerabilities and should be updated.

Only the owner of this repository can see this message.
Learn more about vulnerability alerts 

LinkedIn data structure change

Hi - I have limited programming knowledge but read up enough to make good use of your code (thanks!).

LinkedIn has overhauled how it presents position information in a person's job history. Chiefly, if a person has held more than one role at a company, LinkedIn now collapses those positions under a single company header, instead of listing every such position separately. The output now contains only the top role in this new 'nest' of roles, and the Company field yields 'null'. I'll spend the weekend looking at the code, but unfortunately I lack the raw programming skill to do much about it (for now).

Wanted to bring this to your attention.

Cheers
Kaan

User error?

Firstly, I love that you've made this. However, I'm having some trouble getting it to work properly and I think maybe it's just a user/documentation error. I don't quite get the:

make companies
or
make random
or
make byname

Like, which one is for what?
If I try make companies, I get the following in the log. If I connect to the VNC, I do the security check and it passes. Then the scraper logs in, loads my LinkedIn page and then just sits there doing nothing, which eventually results in the scraper exiting.

Successfully built 4d01d326b5743b603e198a3c558391123e315f4965f96722a5d4b4703b967ab7
docker-compose up scrapy_companies
selenium is up-to-date
Starting linkedin_scrapy_companies_1 ... done
Attaching to linkedin_scrapy_companies_1
scrapy_companies_1  | --2023-10-10 18:38:35--  http://selenium:4444/wd/hub
scrapy_companies_1  | Resolving selenium (selenium)... 172.21.0.2
scrapy_companies_1  | Connecting to selenium (selenium)|172.21.0.2|:4444... connected.
scrapy_companies_1  | HTTP request sent, awaiting response... 302 Found
scrapy_companies_1  | Location: http://selenium:4444/wd/hub/static/resource/hub.html [following]
scrapy_companies_1  | --2023-10-10 18:38:35--  http://selenium:4444/wd/hub/static/resource/hub.html
scrapy_companies_1  | Reusing existing connection to selenium:4444.
scrapy_companies_1  | HTTP request sent, awaiting response... 200 OK
scrapy_companies_1  | Length: 160 [text/html]
scrapy_companies_1  | Saving to: ‘STDOUT’
scrapy_companies_1  |
scrapy_companies_1  |      0K                                                       100%<!DOCTYPE html>
scrapy_companies_1  | <title>WebDriver Hub</title>
scrapy_companies_1  | <link rel="stylesheet" href="style.css">
scrapy_companies_1  | <script src="client.js"></script>
scrapy_companies_1  | <body>
scrapy_companies_1  | <script>init();</script>
scrapy_companies_1  | </body>
scrapy_companies_1  |  29.7M=0s
scrapy_companies_1  |
scrapy_companies_1  | 2023-10-10 18:38:35 (29.7 MB/s) - written to stdout [160/160]
scrapy_companies_1  |
scrapy_companies_1  | Selenium is up - executing command
scrapy_companies_1  | INFO:root:***** SECURITY CHECK IN PROGRESS *****
scrapy_companies_1  | INFO:root:Please perform the security check on selenium, you have 30 seconds...
scrapy_companies_1  | INFO:root:***** SECURITY CHECK COMPLETED *****
linkedin_scrapy_companies_1 exited with code 0

improve portability with a container solution

Freely inspired by Instapy, it would be great to use a better solution for deployment.
Currently the code base consists mainly of non-Python code (it contains the Chromium binaries for Ubuntu 16.04!). It would be great to use Docker or another container solution.

Page not found on LinkedIn

Thanks for your amazing job! I'm trying to use your scraper, but it doesn't work... it redirects to a 404 page. Can you help me?

scrapy crawl companies -a selenium_hostname=localhost -o output.csv
INFO:scrapy.utils.log:Scrapy 2.0.1 started (bot: scrapybot)
2020-04-19 12:10:18 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
INFO:scrapy.utils.log:Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.9, Platform Linux-5.0.0-23-generic-x86_64-with-Ubuntu-18.04-bionic
2020-04-19 12:10:18 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.9, Platform Linux-5.0.0-23-generic-x86_64-with-Ubuntu-18.04-bionic
DEBUG:scrapy.utils.log:Using reactor: twisted.internet.epollreactor.EPollReactor
2020-04-19 12:10:18 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
INFO:scrapy.crawler:Overridden settings:
{'AUTOTHROTTLE_DEBUG': True,
'COOKIES_ENABLED': False,
'DEPTH_PRIORITY': -1,
'DOWNLOAD_DELAY': 0.25,
'FEED_FORMAT': 'csv',
'FEED_URI': 'output.csv',
'NEWSPIDER_MODULE': 'linkedin.spiders',
'SPIDER_MODULES': ['linkedin.spiders'],
'TELNETCONSOLE_ENABLED': False}
2020-04-19 12:10:18 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_DEBUG': True,
'COOKIES_ENABLED': False,
'DEPTH_PRIORITY': -1,
'DOWNLOAD_DELAY': 0.25,
'FEED_FORMAT': 'csv',
'FEED_URI': 'output.csv',
'NEWSPIDER_MODULE': 'linkedin.spiders',
'SPIDER_MODULES': ['linkedin.spiders'],
'TELNETCONSOLE_ENABLED': False}
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-04-19 12:10:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
DEBUG:linkedin_api.client:Attempting to use cached cookies
2020-04-19 12:10:18 [linkedin_api.client] DEBUG: Attempting to use cached cookies
Initializing chromium, remote url: http://localhost:4444/wd/hub
^CINFO:scrapy.crawler:Received SIGINT, shutting down gracefully. Send again to force
2020-04-19 12:10:18 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
^CINFO:scrapy.crawler:Received SIGINT twice, forcing unclean shutdown
2020-04-19 12:10:19 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
^CSearching for the Login btn
Searching for the password btn
Unhandled error in Deferred:
CRITICAL:twisted:Unhandled error in Deferred:
2020-04-19 12:10:22 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 177, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 181, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- ---
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 88, in crawl
self.spider = self._create_spider(*args, **kwargs)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 100, in _create_spider
return self.spidercls.from_crawler(self, *args, **kwargs)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/spiders/init.py", line 49, in from_crawler
spider = cls(args, **kwargs)
File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 39, in init
self.cookies = login(driver)
File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 126, in login
get_by_xpath(driver, '//
[@id="password"]').send_keys(PASSWORD)
File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 87, in get_by_xpath
(By.XPATH, xpath)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 71, in until
value = method(self._driver)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 64, in call
return _find_element(driver, self.locator)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 415, in _find_element
raise e
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 411, in _find_element
return driver.find_element(*by)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
'value': value})['value']
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
(Session info: chrome=81.0.4044.92)

CRITICAL:twisted:
Traceback (most recent call last):
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 88, in crawl
self.spider = self._create_spider(*args, **kwargs)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 100, in _create_spider
return self.spidercls.from_crawler(self, *args, **kwargs)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/spiders/init.py", line 49, in from_crawler
spider = cls(args, **kwargs)
File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 39, in init
self.cookies = login(driver)
File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 126, in login
get_by_xpath(driver, '//
[@id="password"]').send_keys(PASSWORD)
File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 87, in get_by_xpath
(By.XPATH, xpath)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 71, in until
value = method(self._driver)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 64, in call
return _find_element(driver, self.locator)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 415, in _find_element
raise e
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 411, in _find_element
return driver.find_element(*by)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
'value': value})['value']
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
(Session info: chrome=81.0.4044.92)

2020-04-19 12:10:22 [twisted] CRITICAL:
Traceback (most recent call last):
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 88, in crawl
self.spider = self._create_spider(*args, **kwargs)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 100, in _create_spider
return self.spidercls.from_crawler(self, *args, **kwargs)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/spiders/init.py", line 49, in from_crawler
spider = cls(args, **kwargs)
File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 39, in init
self.cookies = login(driver)
File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 126, in login
get_by_xpath(driver, '//
[@id="password"]').send_keys(PASSWORD)
File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 87, in get_by_xpath
(By.XPATH, xpath)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 71, in until
value = method(self._driver)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 64, in call
return _find_element(driver, self.locator)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 415, in _find_element
raise e
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 411, in _find_element
return driver.find_element(*by)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
'value': value})['value']
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
(Session info: chrome=81.0.4044.92)

(.venv) root@glassback-virtual-machine:/home/glassback/linkedin# sudo scrapy crawl companies -a selenium_hostname=localhost -o output.csv
sudo: scrapy: command not found
(.venv) root@glassback-virtual-machine:/home/glassback/linkedin# scrapy crawl companies -a selenium_hostname=localhost -o output.csv
INFO:scrapy.utils.log:Scrapy 2.0.1 started (bot: scrapybot)
2020-04-19 12:10:43 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
INFO:scrapy.utils.log:Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.9, Platform Linux-5.0.0-23-generic-x86_64-with-Ubuntu-18.04-bionic
2020-04-19 12:10:43 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.9, Platform Linux-5.0.0-23-generic-x86_64-with-Ubuntu-18.04-bionic
DEBUG:scrapy.utils.log:Using reactor: twisted.internet.epollreactor.EPollReactor
2020-04-19 12:10:43 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
INFO:scrapy.crawler:Overridden settings:
{'AUTOTHROTTLE_DEBUG': True,
'COOKIES_ENABLED': False,
'DEPTH_PRIORITY': -1,
'DOWNLOAD_DELAY': 0.25,
'FEED_FORMAT': 'csv',
'FEED_URI': 'output.csv',
'NEWSPIDER_MODULE': 'linkedin.spiders',
'SPIDER_MODULES': ['linkedin.spiders'],
'TELNETCONSOLE_ENABLED': False}
2020-04-19 12:10:43 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_DEBUG': True,
'COOKIES_ENABLED': False,
'DEPTH_PRIORITY': -1,
'DOWNLOAD_DELAY': 0.25,
'FEED_FORMAT': 'csv',
'FEED_URI': 'output.csv',
'NEWSPIDER_MODULE': 'linkedin.spiders',
'SPIDER_MODULES': ['linkedin.spiders'],
'TELNETCONSOLE_ENABLED': False}
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-04-19 12:10:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
DEBUG:linkedin_api.client:Attempting to use cached cookies
2020-04-19 12:10:43 [linkedin_api.client] DEBUG: Attempting to use cached cookies
Initializing chromium, remote url: http://localhost:4444/wd/hub
Searching for the Login btn
Searching for the password btn
Searching for the submit
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.stats.DownloaderStats',
'linkedin.middlewares.SeleniumDownloaderMiddleware']
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.stats.DownloaderStats',
'linkedin.middlewares.SeleniumDownloaderMiddleware']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
2020-04-19 12:11:05 [scrapy.core.engine] INFO: Spider opened
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-19 12:11:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Initializing chromium, remote url: http://localhost:4444/wd/hub
ERROR:scrapy.core.scraper:Error downloading <GET https://www.linkedin.com/company/twitter>
Traceback (most recent call last):
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 36, in process_request
response = yield deferred_from_coro(method(request=request, spider=spider))
File "/home/glassback/linkedin/linkedin/middlewares.py", line 12, in process_request
driver = init_chromium(spider.selenium_hostname, cookies)
File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 109, in init_chromium
driver.add_cookie(cookie)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 894, in add_cookie
self.execute(Command.ADD_COOKIE, {'cookie': cookie_dict})
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: invalid 'expiry'
(Session info: chrome=81.0.4044.92)

2020-04-19 12:11:08 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.linkedin.com/company/twitter>
Traceback (most recent call last):
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 36, in process_request
response = yield deferred_from_coro(method(request=request, spider=spider))
File "/home/glassback/linkedin/linkedin/middlewares.py", line 12, in process_request
driver = init_chromium(spider.selenium_hostname, cookies)
File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 109, in init_chromium
driver.add_cookie(cookie)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 894, in add_cookie
self.execute(Command.ADD_COOKIE, {'cookie': cookie_dict})
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: invalid 'expiry'
(Session info: chrome=81.0.4044.92)

INFO:scrapy.core.engine:Closing spider (finished)
2020-04-19 12:11:08 [scrapy.core.engine] INFO: Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/selenium.common.exceptions.InvalidArgumentException': 1,
'downloader/request_bytes': 57,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'elapsed_time_seconds': 2.54757,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 4, 19, 19, 11, 8, 390260),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'memusage/max': 59977728,
'memusage/startup': 59977728,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 4, 19, 19, 11, 5, 842690)}
2020-04-19 12:11:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/selenium.common.exceptions.InvalidArgumentException': 1,
'downloader/request_bytes': 57,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'elapsed_time_seconds': 2.54757,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 4, 19, 19, 11, 8, 390260),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'memusage/max': 59977728,
'memusage/startup': 59977728,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 4, 19, 19, 11, 5, 842690)}
INFO:scrapy.core.engine:Spider closed (finished)
2020-04-19 12:11:08 [scrapy.core.engine] INFO: Spider closed (finished)
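The root cause in this log is the InvalidArgumentException raised by add_cookie: ChromeDriver rejects cookies whose expiry value is not an integer. A hedged workaround sketch (the helper name is made up, and the cookie list is assumed to come from the project's login code):

def add_cookies_safely(driver, cookies):
    """Replay cookies into a Selenium session, coercing 'expiry' to int so
    ChromeDriver does not reject them with "invalid 'expiry'"."""
    for cookie in cookies:
        cookie = dict(cookie)
        if cookie.get("expiry") is not None:
            cookie["expiry"] = int(cookie["expiry"])
        driver.add_cookie(cookie)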
