sample-projects: Introduction

Scrapinghub command line client


shub is the Scrapinghub command line client. It allows you to deploy projects or dependencies, schedule spiders, and retrieve scraped data or logs without leaving the command line.
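A typical session looks like this (the command names are from shub's CLI; the project, spider, and job IDs are placeholders):

```shell
shub login                    # store your Scrapinghub API key locally
shub deploy 12345             # deploy the project in the current directory
shub schedule 12345/myspider  # schedule a spider run
shub items 12345/1/1          # retrieve scraped items from job 12345/1/1
shub log 12345/1/1            # retrieve that job's log
```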

Requirements

  • Python >= 3.6

Installation

If you have pip installed on your system, you can install shub from the Python Package Index:

pip install shub

Please note that if you are using Python < 3.6, you should pin shub to 2.13.0 or lower.
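On an older interpreter, the pin can be expressed directly in the install command (2.13.0 being the last release supporting Python < 3.6, per the note above):

```shell
# Python < 3.6: stay on the last compatible release
pip install "shub<=2.13.0"

# Python >= 3.6: install or upgrade normally
pip install --upgrade shub
```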

We also supply stand-alone binaries. You can find them in our latest GitHub release.

Documentation

Documentation is available online via Read the Docs: https://shub.readthedocs.io/, or in the docs directory.

sample-projects: People

Contributors

eliasdorneles, mukthy, stummjr, thrivenipatil


sample-projects: Issues

Unauthorized Crawlera Header: "x-crawlera-session" Error

Trying to run the example with my API key returns the following response:

<html><head></head>
    <body><pre style="word-wrap: break-word; white-space: pre-wrap;">
       Unauthorized Crawlera Header: "x-crawlera-session"
    </pre>
  </body>
</html>
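"Unauthorized Crawlera Header" means the request carried a Crawlera control header ("x-crawlera-session" here) that the account's plan is not entitled to use. One workaround is to strip such headers before sending; a minimal sketch (the helper name and the blocked-header set are illustrative, not part of the Crawlera API):

```python
# Headers reported as unauthorized for this account (illustrative set)
UNAUTHORIZED = {"x-crawlera-session"}

def strip_unauthorized(headers):
    """Return a copy of `headers` without the blocked Crawlera control headers."""
    return {k: v for k, v in headers.items() if k.lower() not in UNAUTHORIZED}

# The session header is dropped; ordinary headers pass through untouched.
clean = strip_unauthorized({"X-Crawlera-Session": "create", "User-Agent": "bot"})
```

Alternatively, upgrading to a plan that includes sessions makes the header acceptable as-is.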

How to use Splash on my side?

I want to use my own Splash instance on my server together with the Crawlera service (I am on the C10 plan).
When I try this example, the response is "Website crawl ban", which means Crawlera is not handling it. But if I use only Crawlera, everything works well.
I also tried this example after deleting request:set_header("X-Crawlera-UA", "desktop").
The result is the same. Has something changed in the Crawlera API? Should I use the C50 plan, or is something wrong in my code?
Thanks!

requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://storage.scrapinghub.com/collections/233792/s/nikoncoolpix

Following the Scrapy Price Monitor tutorial, I encountered an error after successfully deploying the project to Scrapy Cloud. Running, for example, the amazon.com spider job, it completes with 0 items and 5 errors (one for each product name). In the job log I get (for 'product_name': 'nikoncoolpix'):
[scrapy.core.scraper] Error processing {'retailer': 'amazon.com', 'product_name': 'nikoncoolpix', 'when': '2017/09/07 03:57:21', 'price': 256.95, 'title': 'Nikon COOLPIX B500 Digital Camera (Red)', 'url': 'https://www.amazon.com/Nikon-COOLPIX-B500-Digital-Camera/dp/B01C3LEE9G'}

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/app/__main__.egg/price_monitor/pipelines.py", line 20, in process_item
  File "/usr/local/lib/python3.5/site-packages/scrapinghub/hubstorage/collectionsrt.py", line 152, in set
    return self._collections.set(self.coltype, self.colname, *args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/scrapinghub/hubstorage/collectionsrt.py", line 56, in set
    return self.apipost((_type, _name), is_idempotent=True, jl=_values)
  File "/usr/local/lib/python3.5/site-packages/scrapinghub/hubstorage/resourcetype.py", line 74, in apipost
    return self.apirequest(_path, method='POST', **kwargs)
  File "/usr/local/lib/python3.5/site-packages/scrapinghub/hubstorage/resourcetype.py", line 71, in apirequest
    return jldecode(self._iter_lines(_path, **kwargs))
  File "/usr/local/lib/python3.5/site-packages/scrapinghub/hubstorage/resourcetype.py", line 60, in _iter_lines
    r = self.client.request(**kwargs)
  File "/usr/local/lib/python3.5/site-packages/scrapinghub/hubstorage/client.py", line 107, in request
    return self.retrier.call(invoke_request)
  File "/usr/local/lib/python3.5/site-packages/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/usr/local/lib/python3.5/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python3.5/site-packages/six.py", line 686, in reraise
    raise value
  File "/usr/local/lib/python3.5/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.5/site-packages/scrapinghub/hubstorage/client.py", line 100, in invoke_request
    r.raise_for_status()
  File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://storage.scrapinghub.com/collections/233792/s/nikoncoolpix

The same error occurs when running the spider in a local environment. I would really appreciate any help.

System specifications:

  • OS: Windows 10
  • Python 3.6.1
  • Scrapy 1.4.0
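A 401 from storage.scrapinghub.com almost always means the request reached the API without valid credentials: the storage endpoints authenticate with HTTP Basic auth, using the API key as the username and an empty password. A sketch of the failing condition (the helper is hypothetical; the URL pattern is taken from the traceback):

```python
import requests

def build_collection_request(project_id, store, api_key=None):
    # Prepare (but do not send) a GET to the collections endpoint,
    # attaching Basic auth only when an API key is supplied.
    url = f"https://storage.scrapinghub.com/collections/{project_id}/s/{store}"
    auth = (api_key, "") if api_key else None
    return requests.Request("GET", url, auth=auth).prepare()

# Without a key there is no Authorization header -- exactly the condition
# that produces "401 Client Error: Unauthorized" in the traceback above.
anonymous = build_collection_request(233792, "nikoncoolpix")
authed = build_collection_request(233792, "nikoncoolpix", api_key="YOUR_API_KEY")
```

If the spider fails both locally and on Scrapy Cloud, check that the key available to the collections pipeline is valid and belongs to the project in the URL (233792 here).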

scrapy_price_monitor (requirements_error)

I followed the tutorial to run the "scrapy_price_monitor" project from this link:
https://github.com/scrapinghub/sample-projects/tree/master/scrapy_price_monitor#installing-and-running
When I got to step 7, "shub deploy <your_project_id_here>", Scrapy Cloud threw the error below:

##############START################

shub deploy 482071
Packing version 23fadd3-master
Deploying to Scrapy Cloud project "482071"
Deploy log last 30 lines:
Removing intermediate container 8a32420b04ae
---> bc794867e1a9
Step 12/12 : ENV PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
---> [Warning] Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
---> Running in f21786580c1f
Removing intermediate container f21786580c1f
---> 673538660e3a
Successfully built 673538660e3a
Successfully tagged i.scrapinghub.com/kumo_project/482071:1
Step 1/3 : FROM alpine:3.5
---> f80194ae2e0c
Step 2/3 : ADD kumo-entrypoint /kumo-entrypoint
---> Using cache
---> b6085fc56e21
Step 3/3 : RUN chmod +x /kumo-entrypoint
---> Using cache
---> 1bbe2a121e2b
Successfully built 1bbe2a121e2b
Successfully tagged kumo-entrypoint:latest
Entrypoint container is created successfully

Checking python dependencies
Collecting pip<20.0,>=9.0.3
Downloading https://files.pythonhosted.org/packages/00/b6/9cfa56b4081ad13874b0c6f96af8ce16cfbc1cb06bedf8e9164ce5551ec1/pip-19.3.1-py2.py3-none-any.whl (1.4MB)
Installing collected packages: pip
Successfully installed pip-19.3.1
requests 2.25.0 has requirement idna<3,>=2.5, but you have idna 2.1.
Warning: Pip checks failed, please fix the conflicts.
{"message": "Dependencies check exit code: 1", "details": "Pip checks failed, please fix the conflicts", "error": "requirements_error"}

{"status": "error", "message": "Requirements error"}
Deploy log location: /tmp/shub_deploy_3xl5dnaq.log

##############END################

Could you please help me fix it?

Thanks a lot!

Harry
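For the record, the deploy log already names the fix: requests 2.25.0 declares `idna<3,>=2.5`, while the project pins idna 2.1. The check pip performs can be reproduced with the `packaging` library (the specifier string is taken from the log):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# requests 2.25.0's constraint on idna, as reported in the deploy log
requests_idna_spec = SpecifierSet(">=2.5,<3")

print(Version("2.1") in requests_idna_spec)   # the pinned version fails the check
print(Version("2.10") in requests_idna_spec)  # any 2.5-2.x release passes
```

Loosening the pin in requirements.txt to `idna>=2.5,<3` (or any concrete version in that range) should clear the requirements_error.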
