Thanks for your amazing work! I'm trying to use your scraper, but it doesn't work: it redirects to a 404 page. Can you help me?
```
scrapy crawl companies -a selenium_hostname=localhost -o output.csv
2020-04-19 12:10:18 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
2020-04-19 12:10:18 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.9, Platform Linux-5.0.0-23-generic-x86_64-with-Ubuntu-18.04-bionic
2020-04-19 12:10:18 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-04-19 12:10:18 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_DEBUG': True,
 'COOKIES_ENABLED': False,
 'DEPTH_PRIORITY': -1,
 'DOWNLOAD_DELAY': 0.25,
 'FEED_FORMAT': 'csv',
 'FEED_URI': 'output.csv',
 'NEWSPIDER_MODULE': 'linkedin.spiders',
 'SPIDER_MODULES': ['linkedin.spiders'],
 'TELNETCONSOLE_ENABLED': False}
2020-04-19 12:10:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-04-19 12:10:18 [linkedin_api.client] DEBUG: Attempting to use cached cookies
Initializing chromium, remote url: http://localhost:4444/wd/hub
^C2020-04-19 12:10:18 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
^C2020-04-19 12:10:19 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
^CSearching for the Login btn
Searching for the password btn
2020-04-19 12:10:22 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 177, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 181, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 88, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/crawler.py", line 100, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/spiders/__init__.py", line 49, in from_crawler
    spider = cls(*args, **kwargs)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 39, in __init__
    self.cookies = login(driver)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 126, in login
    get_by_xpath(driver, '//*[@id="password"]').send_keys(PASSWORD)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 87, in get_by_xpath
    (By.XPATH, xpath)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 71, in until
    value = method(self._driver)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 64, in __call__
    return _find_element(driver, self.locator)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 415, in _find_element
    raise e
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/support/expected_conditions.py", line 411, in _find_element
    return driver.find_element(*by)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 978, in find_element
    'value': value})['value']
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=81.0.4044.92)
(.venv) root@glassback-virtual-machine:/home/glassback/linkedin# sudo scrapy crawl companies -a selenium_hostname=localhost -o output.csv
sudo: scrapy: command not found
(.venv) root@glassback-virtual-machine:/home/glassback/linkedin# scrapy crawl companies -a selenium_hostname=localhost -o output.csv
2020-04-19 12:10:43 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
2020-04-19 12:10:43 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.9, Platform Linux-5.0.0-23-generic-x86_64-with-Ubuntu-18.04-bionic
2020-04-19 12:10:43 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-04-19 12:10:43 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_DEBUG': True,
 'COOKIES_ENABLED': False,
 'DEPTH_PRIORITY': -1,
 'DOWNLOAD_DELAY': 0.25,
 'FEED_FORMAT': 'csv',
 'FEED_URI': 'output.csv',
 'NEWSPIDER_MODULE': 'linkedin.spiders',
 'SPIDER_MODULES': ['linkedin.spiders'],
 'TELNETCONSOLE_ENABLED': False}
2020-04-19 12:10:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-04-19 12:10:43 [linkedin_api.client] DEBUG: Attempting to use cached cookies
Initializing chromium, remote url: http://localhost:4444/wd/hub
Searching for the Login btn
Searching for the password btn
Searching for the submit
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.stats.DownloaderStats',
 'linkedin.middlewares.SeleniumDownloaderMiddleware']
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-19 12:11:05 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-04-19 12:11:05 [scrapy.core.engine] INFO: Spider opened
2020-04-19 12:11:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Initializing chromium, remote url: http://localhost:4444/wd/hub
2020-04-19 12:11:08 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.linkedin.com/company/twitter>
Traceback (most recent call last):
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 36, in process_request
    response = yield deferred_from_coro(method(request=request, spider=spider))
  File "/home/glassback/linkedin/linkedin/middlewares.py", line 12, in process_request
    driver = init_chromium(spider.selenium_hostname, cookies)
  File "/home/glassback/linkedin/linkedin/spiders/selenium.py", line 109, in init_chromium
    driver.add_cookie(cookie)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 894, in add_cookie
    self.execute(Command.ADD_COOKIE, {'cookie': cookie_dict})
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/glassback/linkedin/.venv/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: invalid 'expiry'
  (Session info: chrome=81.0.4044.92)
2020-04-19 12:11:08 [scrapy.core.engine] INFO: Closing spider (finished)
2020-04-19 12:11:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/selenium.common.exceptions.InvalidArgumentException': 1,
 'downloader/request_bytes': 57,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'elapsed_time_seconds': 2.54757,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 4, 19, 19, 11, 8, 390260),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 8,
 'memusage/max': 59977728,
 'memusage/startup': 59977728,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 4, 19, 19, 11, 5, 842690)}
2020-04-19 12:11:08 [scrapy.core.engine] INFO: Spider closed (finished)
```
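
If it helps: the second run dies in `init_chromium` on `driver.add_cookie(cookie)` with `invalid argument: invalid 'expiry'`. My guess is that the cached cookies carry a float timestamp in `expiry`, while ChromeDriver only accepts an integer there. Here's a minimal sketch of the kind of workaround I have in mind (just my guess; `sanitize_cookie` is a name I made up, not something from this repo):

```python
# Sketch of a workaround, assuming the failure is a float 'expiry':
# ChromeDriver rejects cookies whose 'expiry' is not an integer number
# of seconds since the epoch, raising InvalidArgumentException.

def sanitize_cookie(cookie):
    """Return a copy of the cookie dict with 'expiry' coerced to int."""
    clean = dict(cookie)
    if "expiry" in clean:
        # Cached cookies may store a float timestamp; Chrome wants an int.
        clean["expiry"] = int(clean["expiry"])
    return clean

# Example with a cookie shaped like the cached LinkedIn ones:
cookie = {"name": "li_at", "value": "token", "expiry": 1618859468.25}
print(sanitize_cookie(cookie)["expiry"])  # 1618859468
```

If that's the right diagnosis, calling something like this on each cookie before `driver.add_cookie(cookie)` in `init_chromium` might avoid the exception, but I'm not sure it's the whole story given the 404 redirect I see in the browser.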