
scrapyscript's Introduction


Scrapyscript

Embed Scrapy jobs directly in your code

What is Scrapyscript?

Scrapyscript is a Python library you can use to run Scrapy spiders directly from your code. Scrapy is a great framework to use for scraping projects, but sometimes you don't need the whole framework, and just want to run a small spider from a script or a Celery job. That's where Scrapyscript comes in.

With Scrapyscript, you can:

  • wrap regular Scrapy Spiders in a Job
  • load the Job(s) in a Processor
  • call processor.run() to execute them

... returning all results when the last job completes.

Let's see an example.

import scrapy
from scrapyscript import Job, Processor

processor = Processor(settings=None)

class PythonSpider(scrapy.spiders.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        data = response.xpath("//title/text()").extract_first()
        return {'title': data}

job = Job(PythonSpider, url="http://www.python.org")
results = processor.run(job)

print(results)
[{ "title": "Welcome to Python.org" }]

See the examples directory for more, including a complete Celery example.
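The general shape of the Celery integration is to run the Processor inside a task, since processor.run() blocks until the spiders finish. The bundled example is the authoritative reference; the following is only a rough sketch, where the broker URL, module names, and the spiders import are assumptions.

from celery import Celery

from scrapyscript import Job, Processor
from spiders import PythonSpider  # hypothetical module holding the spider defined above

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumed broker URL


@app.task
def crawl(url):
    # Build a fresh Job and Processor per call; run() blocks until the
    # crawl finishes and returns the consolidated list of scraped dicts.
    job = Job(PythonSpider, url=url)
    return Processor(settings=None).run(job)

A worker started with celery -A tasks worker could then be driven with crawl.delay("http://www.python.org").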

Install

pip install scrapyscript

Requirements

  • Linux or macOS
  • Python 3.8+
  • Scrapy 2.5+

API

Job (spider, *args, **kwargs)

A single request to call a spider, optionally passing in *args or **kwargs, which will be passed through to the spider constructor at runtime.

# url will be available as self.url inside MySpider at runtime
myjob = Job(MySpider, url='http://www.github.com')

Processor (settings=None)

Create a multiprocessing reactor for running spiders. Optionally provide a scrapy.settings.Settings object to configure the Scrapy runtime.

settings = scrapy.settings.Settings(values={'LOG_LEVEL': 'WARNING'})
processor = Processor(settings=settings)

Processor.run(jobs)

Start the Scrapy engine, and execute one or more jobs. Blocks and returns consolidated results in a single list. jobs can be a single instance of Job, or a list.

results = processor.run(myjob)

or

results = processor.run([myjob1, myjob2, ...])

A word about Spider outputs

As per the Scrapy docs, a Spider must return an iterable of Request and/or dict or Item objects.

Requests will be consumed by Scrapy inside the Job. dict or scrapy.Item objects will be queued and output together when all spiders are finished.

Due to the way billiard handles communication between processes, each dict or Item must be pickle-able using pickle protocol 0. It's generally best to output dict objects from your Spider.
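As an illustration (a minimal sketch only; the spider name, URL attribute, and XPath are placeholders), a spider that yields plain dicts keeps its output trivially picklable:

import scrapy

class LinkSpider(scrapy.spiders.Spider):
    name = "linkspider"

    def start_requests(self):
        # self.url is supplied through Job(LinkSpider, url=...) at runtime
        yield scrapy.Request(self.url)

    def parse(self, response):
        # Plain dicts of strings survive pickle protocol 0, so they pass
        # cleanly back through billiard to the consolidated results list.
        for href in response.xpath("//a/@href").getall():
            yield {"link": href}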

Contributing

Updates, additional features or bug fixes are always welcome.

Setup

Tests

  • make test or make tox

Version History

See CHANGELOG.md

License

The MIT License (MIT). See LICENCE file for details.

scrapyscript's People

Contributors

bmartel, jschnurr, mrge, vidakdk


scrapyscript's Issues

Cannot pass *args and **kwargs to spiders

New issue based on a comment from @JoeJasinski:

(As an aside, I also overrode the __init__() method in my spider to accept arguments, but noticed that I have to pass the custom arguments both to MySpider and as the payload; maybe I should open a different ticket for that?)

README example blows up

$ pip freeze
appdirs==1.4.1
attrs==16.3.0
Automat==0.5.0
billiard==3.3.0.23
cffi==1.9.1
constantly==15.1.0
cryptography==1.7.2
cssselect==1.0.1
enum34==1.1.6
idna==2.2
incremental==16.10.1
ipaddress==1.0.18
lxml==3.7.2
packaging==16.8
parsel==1.1.0
pkg-resources==0.0.0
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycparser==2.17
PyDispatcher==2.0.5
pyOpenSSL==16.2.0
pyparsing==2.1.10
queuelib==1.4.2
Scrapy==1.3.0
scrapyscript==0.0.6
service-identity==16.0.0
six==1.10.0
Twisted==17.1.0
w3lib==1.16.0
zope.interface==4.3.3
$  cat testit.py
from scrapyscript import Job, Processor
from scrapy.spiders import Spider

class PythonSpider(Spider):
    name = 'myspider'
    start_urls = ['http://www.python.org']

    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        return {'title': title}

job = Job(PythonSpider())
Processor().run(job)
$ python testit.py
...
2017-02-23 08:02:44 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.python.org>
Traceback (most recent call last):
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
    return handler.download_request(request, spider)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 61, in download_request
    return agent.download_request(request)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 286, in download_request
    method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1631, in request
    parsedURI.originForm)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1408, in _requestWithEndpoint
    d = self._pool.getConnection(key, endpoint)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1294, in getConnection
    return self._newConnection(key, endpoint)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/twisted/web/client.py", line 1306, in _newConnection
    return endpoint.connect(factory)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/twisted/internet/endpoints.py", line 788, in connect
    EndpointReceiver, self._hostText, portNumber=self._port
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/twisted/internet/_resolver.py", line 174, in resolveHostName
    onAddress = self._simpleResolver.getHostByName(hostName)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/scrapy/resolver.py", line 21, in getHostByName
    d = super(CachingThreadedResolver, self).getHostByName(name, timeout)
  File "/home/exarkun/Environments/scrapy/local/lib/python2.7/site-packages/twisted/internet/base.py", line 276, in getHostByName
    timeoutDelay = sum(timeout)
TypeError: 'float' object is not iterable
...
$

Issue with "return self.results.get()" in "Processor().run()" causing processing to hang forever

Thank you for writing scrapyscript; it's been very helpful!

However, I have a script that looks something like the one below, written in Python 3.5. I noticed that when I run Processor().run() as shown, the spider runs to completion, but hangs after the scrape is done (which causes my Celery 4 jobs to hang and never finish).

I verified that in the scrapyscript code, in the run() method, processing makes it through p.start(), p.join(), p.terminate(), but hangs on the return statement. I noticed that the return statement is looking up results in a Queue. If I comment out the return statement (I personally don't care about the returned results), the processing finishes.

from scrapy.utils.project import get_project_settings

from scrapyscript import Job, Processor
from myproject.spiders.myspider import MySpider

scraper_args = dict(arg1="1", arg2="2")

config = get_project_settings()
spider = MySpider(**scraper_args)
# The api for scrapyscript requires us to pass in the
# scraper_args a second time. The constructor for the spider
# is called twice: once above and once again in the Job.
job = Job(spider, payload=scraper_args)
Processor(settings=config).run(job)

Impacted area:
https://github.com/jschnurr/scrapyscript/blob/master/scrapyscript.py#L118

(As an aside, I also overrode the __init__() method in my spider to accept arguments, but noticed that I have to pass the custom arguments both to MySpider and as the payload; maybe I should open a different ticket for that?)

Error upgrade scrapy 1.6

I upgraded Scrapy to 1.6 and I get the error below. Is it possible to relax this requirement? I don't know whether it will cause errors when running my spiders.

ERROR: scrapyscript 1.0.0 has requirement Scrapy==1.4.0, but you'll have scrapy 1.6.0 which is incompatible.

Thanks!

set CrawlSpider settings from Job()

Hi,

I am trying to use a CrawlSpider with Celery. The settings for the CrawlSpider are stored in the database, but whatever I try, I cannot get the settings passed to the CrawlSpider class.

My spider

class FindProductSpider(CrawlSpider):
    name = 'FindProductSpider'
    allowed_domains = ['']
    start_urls = ['']
    webshopid = ''
    rule = ''

    rules = [Rule(LinkExtractor(allow=rule), callback='parse_item', follow=True)]

    def parse_item(self, response):
        p = Product(url=response.url, from_webshop_id=self.webshopid)
        p.save()

and my celery task:

@shared_task()
def getproducts():
    webshops = Webshop.objects.all()

    for webshop in webshops:
        job = Job(FindProductSpider,
                  start_urls=[webshop.spider_start_url],
                  allowed_domain=[webshop.spider_allowed_domain],
                  rule=webshop.spider_allow_regex,
                  webshopid=webshop.id
                  )
        processor = Processor(settings=settings)
        data = processor.run([job])

But when I print the settings, they remain empty.
Some help would be super nice.

loading settings file

Hi. I'm trying to figure out the best way to load a Settings object into the scrapyscript module. I'm building this into an app within Django, and the tutorial didn't cover how to load existing project settings instead of Scrapy's global defaults.

This is my current directory layout, with the scripting happening after the crawlers in crawlers.py. I'm following the example code provided in the README and I'm not sure how to proceed.

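For what it's worth, one pattern that should cover this case (a sketch only, assuming the script runs somewhere scrapy.cfg or SCRAPY_SETTINGS_MODULE can be found, and that the spider class lives in crawlers.py) is to pass get_project_settings() to the Processor:

from scrapy.utils.project import get_project_settings

from scrapyscript import Job, Processor
from crawlers import MySpider  # hypothetical: a spider class defined in crawlers.py

# get_project_settings() loads the project's settings module (located via
# scrapy.cfg or the SCRAPY_SETTINGS_MODULE environment variable) rather
# than Scrapy's global defaults.
settings = get_project_settings()

job = Job(MySpider, url="http://example.com")
results = Processor(settings=settings).run(job)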

Allow billiard version to float

Billiard is currently locked to "3.6.3.0". This makes it impossible to install scrapyscript if you have dependencies requesting a different range. This should be allowed to float within reason (e.g. ^3.6.3.0).

process not complete after spider close

In my case the process never reaches the last line.

data = processor.run([githubJob])

print(json.dumps(data, indent=4))

It never reaches the print line, so I am not able to start a new process because the process never completes, and the task is also not displayed as a successful task.

Example for Celery use

Is there an example of how to use scrapyscript within a Celery task somewhere? I tried the blog, but it seems part II never came out :)

It is printing all logs

I have enabled only error-level logging using the command below:
celery -A scraper_api worker -l error

When I run scraping as a Celery task, with only error logging enabled and logging disabled in the Scrapy settings, it still prints all logs.


AttributeError: Can't get attribute 'PythonSpider' on <module '__main__' (built-in)>

Hey all, this is exactly what I was looking for, but running into a few problems trying to test it out on Windows. Using the following I get the error above:

import scrapy
from scrapyscript import Job, Processor

processor = Processor(settings=None)


class PythonSpider(scrapy.spiders.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        data = response.xpath("//title/text()").extract_first()
        return {'title': data}


job = Job(PythonSpider, url="http://www.python.org")
results = processor.run(job)

print(results)

When I move the Spider into a separate file and import that in, it seems to run without an error, but the results print as an empty array.

import scrapy
from scrapyscript import Job, Processor

from PythonSpider import PythonSpider

settings = scrapy.settings.Settings(values={'LOG_LEVEL': 'WARNING'})
processor = Processor(settings=settings)


job = Job(PythonSpider, url="http://www.python.org")
results = processor.run(job)

print(results)

twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

I can't run many jobs.
Please tell me what I need to do.

My code:

import scrapy
from scrapyscript import Job, Processor

settings = scrapy.settings.Settings(values={"LOG_LEVEL": "WARNING"})
processor = Processor(settings=None)


class PythonSpider(scrapy.spiders.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        return {"title": 0}

jobs = [Job(PythonSpider, url="http://www.python.org") for i in range(50)]

results = Processor().run(jobs)

print(results)

Result:

2022-05-29 15:37:39 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-05-29 15:37:39 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.4 (main, May 25 2022, 00:14:12) [GCC 11.2.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Linux-5.15.0-1008-raspi-aarch64-with-glibc2.35
2022-05-29 15:37:39 [scrapy.crawler] INFO: Overridden settings:
{}
2022-05-29 15:37:39 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-05-29 15:37:40 [scrapy.extensions.telnet] INFO: Telnet Password: b16d3b0a35179414
2022-05-29 15:37:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-05-29 15:37:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-05-29 15:37:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-05-29 15:37:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-05-29 15:37:40 [scrapy.core.engine] INFO: Spider opened
2022-05-29 15:37:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-05-29 15:37:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-05-29 15:37:40 [scrapy.crawler] INFO: Overridden settings:
{}
Process Process-1:
Traceback (most recent call last):
File "/home/hamashou/.local/share/virtualenvs/test-5h1bg2cX/lib/python3.10/site-packages/billiard/process.py", line 327, in _bootstrap
self.run()
File "/home/hamashou/.local/share/virtualenvs/test-5h1bg2cX/lib/python3.10/site-packages/billiard/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/hamashou/.local/share/virtualenvs/test-5h1bg2cX/lib/python3.10/site-packages/scrapyscript/init.py", line 69, in _crawl
self.crawler.crawl(req.spider, *req.args, **req.kwargs)
File "/home/hamashou/.local/share/virtualenvs/test-5h1bg2cX/lib/python3.10/site-packages/scrapy/crawler.py", line 205, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/home/hamashou/.local/share/virtualenvs/test-5h1bg2cX/lib/python3.10/site-packages/scrapy/crawler.py", line 238, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/home/hamashou/.local/share/virtualenvs/test-5h1bg2cX/lib/python3.10/site-packages/scrapy/crawler.py", line 313, in _create_crawler
return Crawler(spidercls, self.settings, init_reactor=True)
File "/home/hamashou/.local/share/virtualenvs/test-5h1bg2cX/lib/python3.10/site-packages/scrapy/crawler.py", line 82, in init
default.install()
File "/home/hamashou/.local/share/virtualenvs/test-5h1bg2cX/lib/python3.10/site-packages/twisted/internet/epollreactor.py", line 256, in install
installReactor(p)
File "/home/hamashou/.local/share/virtualenvs/test-5h1bg2cX/lib/python3.10/site-packages/twisted/internet/main.py", line 32, in installReactor
raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed
