WARNING: this project is almost a copy of scrapy-splash, but it works with https://github.com/prerender/prerender
Scrapy & JavaScript integration through Prerender
=================================================

Installation
============
Install scrapy-prerender:
$ python setup.py install
Add the Prerender server address to settings.py of your Scrapy project like this:

PRERENDER_URL = 'http://192.168.59.103:8050'
Enable the Prerender middleware by adding it to DOWNLOADER_MIDDLEWARES in your settings.py file and changing HttpCompressionMiddleware priority:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_prerender.PrerenderCookiesMiddleware': 723,
    'scrapy_prerender.PrerenderMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
Order 723 is just before HttpProxyMiddleware (750) in the default Scrapy settings.
The HttpCompressionMiddleware priority should be changed in order to allow advanced response processing; see scrapy/scrapy#1895 for details.
Enable PrerenderDeduplicateArgsMiddleware by adding it to SPIDER_MIDDLEWARES in your settings.py:

SPIDER_MIDDLEWARES = {
    'scrapy_prerender.PrerenderDeduplicateArgsMiddleware': 100,
}
This middleware is needed to support the cache_args feature; it saves disk space by not storing duplicate Prerender arguments multiple times in the disk request queue. If Prerender 2.1+ is used the middleware also saves network traffic by not sending duplicate arguments to the Prerender server multiple times.

Set a custom DUPEFILTER_CLASS:

DUPEFILTER_CLASS = 'scrapy_prerender.PrerenderAwareDupeFilter'
If you use Scrapy HTTP cache then a custom cache storage backend is required. scrapy-prerender provides a subclass of scrapy.contrib.httpcache.FilesystemCacheStorage:

HTTPCACHE_STORAGE = 'scrapy_prerender.PrerenderAwareFSCacheStorage'
If you use another cache storage backend then it is necessary to subclass it and replace all scrapy.util.request.request_fingerprint calls with scrapy_prerender.prerender_request_fingerprint.
Note

Steps (4) and (5) are necessary because Scrapy doesn't provide a way to override the request fingerprint calculation algorithm globally; this could change in the future.
There are also some additional options available. Put them into your settings.py
if you want to change the defaults:
- PRERENDER_COOKIES_DEBUG is False by default. Set it to True to enable debugging of cookies in the PrerenderCookiesMiddleware. This option is similar to COOKIES_DEBUG for the built-in Scrapy cookies middleware: it logs sent and received cookies for all requests.
- PRERENDER_LOG_400 is True by default - it instructs scrapy-prerender to log all 400 errors from Prerender. They are important because they show errors that occurred while executing the Prerender script. Set it to False to disable this logging.
- PRERENDER_SLOT_POLICY is scrapy_prerender.SlotPolicy.PER_DOMAIN (as object, not just a string) by default. It specifies how concurrency & politeness are maintained for Prerender requests, and sets the default value for the slot_policy argument of PrerenderRequest, which is described below.
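For reference, a settings.py fragment that spells these options out explicitly might look like this (a minimal sketch; the values shown are the documented defaults):

# settings.py -- documented defaults, spelled out explicitly
import scrapy_prerender

PRERENDER_URL = 'http://192.168.59.103:8050'
PRERENDER_COOKIES_DEBUG = False   # set to True to log sent/received cookies
PRERENDER_LOG_400 = True          # log HTTP 400 responses from Prerender
PRERENDER_SLOT_POLICY = scrapy_prerender.SlotPolicy.PER_DOMAIN  # an object, not a string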
The easiest way to render requests with Prerender is to use scrapy_prerender.PrerenderRequest:
yield PrerenderRequest(url, self.parse_result,
    args={
        # optional; parameters passed to Prerender HTTP API
        'wait': 0.5,
        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json',  # optional; default is render
    prerender_url='<url>',   # optional; overrides PRERENDER_URL
    slot_policy=scrapy_prerender.SlotPolicy.PER_DOMAIN,  # optional
)
Alternatively, you can use a regular scrapy.Request and the 'prerender' Request meta key:
yield scrapy.Request(url, self.parse_result, meta={
    'prerender': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'prerender_url': '<url>',   # optional; overrides PRERENDER_URL
        'slot_policy': scrapy_prerender.SlotPolicy.PER_DOMAIN,
        'prerender_headers': {},        # optional; a dict with headers sent to Prerender
        'dont_process_response': True,  # optional, default is False
        'dont_send_headers': True,      # optional, default is False
        'magic_response': False,        # optional, default is True
    }
})
Use the request.meta['prerender'] API in middlewares or when scrapy.Request subclasses are used (there is also PrerenderFormRequest, described below). For example, meta['prerender'] makes it possible to create a middleware which enables Prerender for all outgoing requests by default, as sketched below.
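A minimal sketch of such a middleware (the class name and the wait value are hypothetical, not part of scrapy-prerender):

class PrerenderForAllMiddleware:
    """Downloader middleware that enables Prerender for every request
    that doesn't already carry a 'prerender' meta key (hypothetical sketch)."""

    def process_request(self, request, spider):
        # leave requests that already configure Prerender untouched
        request.meta.setdefault('prerender', {
            'endpoint': 'render',
            'args': {'wait': 0.5},
        })
        # returning None lets Scrapy continue processing the request

It would need to be registered in DOWNLOADER_MIDDLEWARES with a priority lower than PrerenderMiddleware (725) so that the meta key is in place before scrapy-prerender processes the request.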
PrerenderRequest is a convenient utility to fill request.meta['prerender']; it should be easier to use in most cases. For each request.meta['prerender'] key there is a corresponding PrerenderRequest keyword argument: for example, to set meta['prerender']['args'] use PrerenderRequest(..., args=myargs).
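For example, the two requests below configure the same args (a sketch; note that PrerenderRequest also fills in other defaults such as endpoint and magic_response):

args = {'wait': 0.5}

# via the helper:
yield PrerenderRequest(url, self.parse_result, args=args)

# roughly equivalent, via the raw meta API:
yield scrapy.Request(url, self.parse_result,
                     meta={'prerender': {'args': args}})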
meta['prerender']['args'] contains arguments sent to Prerender. scrapy-prerender adds some default keys/values to args:

- 'url' is set to request.url;
- 'http_method' is set to 'POST' for POST requests;
- 'body' is set to request.body for POST requests.

You can override default values by setting them explicitly.
Note that by default Scrapy escapes URL fragments using the AJAX escaping scheme. If you want to pass a URL with a fragment to Prerender then set url in the args dict manually. This is handled automatically if you use PrerenderRequest, but you need to keep it in mind if you use the raw meta['prerender'] API (see the example at the end of this section).

Prerender 1.8+ is required to handle POST requests; in earlier Prerender versions the 'http_method' and 'body' arguments are ignored. If you work with the /execute endpoint and want to support POST requests, you have to handle the http_method and body arguments in your Lua script manually.

meta['prerender']['cache_args'] is a list of argument names to cache on the Prerender side. These arguments are sent to Prerender only once, then cached values are used; this saves network traffic and decreases request queue disk memory usage. Use cache_args only for large arguments which don't change with each request; lua_source is a good candidate (if you don't use string formatting to build it). Prerender 2.1+ is required for this feature to work.

meta['prerender']['endpoint'] is the Prerender endpoint to use. PrerenderRequest uses render by default. If you're using a raw scrapy.Request then render.json is the default (for historical reasons). It is better to always pass the endpoint explicitly.

See the Prerender HTTP API docs for a full list of available endpoints and parameters.
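To illustrate the URL-fragment caveat above with the raw meta API (a sketch; the URL is illustrative):

url = 'http://example.com/page#fragment'
yield scrapy.Request(url, self.parse_result, meta={
    'prerender': {
        # set 'url' explicitly so the fragment reaches Prerender intact
        'args': {'url': url, 'wait': 0.5},
    }
})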
meta['prerender']['prerender_url'] overrides the Prerender URL set in settings.py.

meta['prerender']['prerender_headers'] allows you to add or change headers which are sent to the Prerender server. Note that this option is not for setting headers which are sent to the remote website.

meta['prerender']['slot_policy'] customizes how concurrency & politeness are maintained for Prerender requests. Currently there are 3 policies available:

- scrapy_prerender.SlotPolicy.PER_DOMAIN (default) - send Prerender requests to downloader slots based on the URL being rendered. It is useful if you want to maintain per-domain politeness & concurrency settings.
- scrapy_prerender.SlotPolicy.SINGLE_SLOT - send all Prerender requests to a single downloader slot. It is useful if you want to throttle requests to Prerender (see the example below).
- scrapy_prerender.SlotPolicy.SCRAPY_DEFAULT - don't do anything with slots. It is similar to the SINGLE_SLOT policy, but can be different if you access other services on the same address as Prerender.
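For example, to funnel all Prerender traffic through a single downloader slot (a small sketch):

yield PrerenderRequest(url, self.parse_result,
                       slot_policy=scrapy_prerender.SlotPolicy.SINGLE_SLOT)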
meta['prerender']['dont_process_response'] - when set to True, PrerenderMiddleware won't change the response to a custom scrapy.Response subclass. By default for Prerender requests one of PrerenderResponse, PrerenderTextResponse or PrerenderJsonResponse is passed to the callback.

meta['prerender']['dont_send_headers']: by default scrapy-prerender passes request headers to Prerender in the 'headers' JSON POST field. For all render.xxx endpoints it means Scrapy header options are respected by default (http://prerender.readthedocs.org/en/stable/api.html#arg-headers). In Lua scripts you can use the headers argument of prerender:go to apply the passed headers: prerender:go{url, headers=prerender.args.headers}.

Set 'dont_send_headers' to True if you don't want to pass headers to Prerender.

meta['prerender']['http_status_from_error_code'] - set response.status to the HTTP error code when assert(prerender:go(..)) fails; it requires meta['prerender']['magic_response']=True. The http_status_from_error_code option is False by default if you use the raw meta API; PrerenderRequest sets it to True by default.

meta['prerender']['magic_response'] - when set to True and a JSON response is received from Prerender, several attributes of the response (headers, body, url, status code) are filled using data returned in JSON:

- response.headers are filled from the 'headers' key;
- response.url is set to the value of the 'url' key;
- response.body is set to the value of the 'html' key, or to the base64-decoded value of the 'body' key;
- response.status is set to the value of the 'http_status' key. When meta['prerender']['http_status_from_error_code'] is True and assert(prerender:go(..)) fails with an HTTP error, response.status is also set to the HTTP error code.
The original URL, status and headers are available as response.real_url, response.prerender_response_status and response.prerender_response_headers.

This option is set to True by default if you use PrerenderRequest.

render.json and execute endpoints may not have all the necessary keys/values in the response. For non-JSON endpoints, only url is filled, regardless of the magic_response setting.
Use scrapy_prerender.PrerenderFormRequest if you want to make a FormRequest via Prerender. It accepts the same arguments as PrerenderRequest, and also formdata, like FormRequest from Scrapy:

>>> PrerenderFormRequest('http://example.com', formdata={'foo': 'bar'})
<POST http://example.com>

PrerenderFormRequest.from_response is also supported, and works as described in the Scrapy documentation (see the example below).
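For instance, submitting a login form found in a previously fetched page might look like this (a sketch; the form field names and credentials are placeholders):

def parse(self, response):
    # fill form fields from the page, then submit the form through Prerender
    yield PrerenderFormRequest.from_response(
        response,
        formdata={'user': 'john', 'pass': 'secret'},  # placeholder values
        callback=self.after_login,
    )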
scrapy-prerender returns Response subclasses for Prerender requests:
- PrerenderResponse is returned for binary Prerender responses - e.g. for /render.png responses;
- PrerenderTextResponse is returned when the result is text - e.g. for /render responses;
- PrerenderJsonResponse is returned when the result is a JSON object - e.g. for /render.json responses or /execute responses when script returns a Lua table.
To use standard Response classes, set meta['prerender']['dont_process_response']=True or pass the dont_process_response=True argument to PrerenderRequest.

All these responses set response.url to the URL of the original request (i.e. to the URL of the website you want to render), not to the URL of the requested Prerender endpoint. The "true" URL is still available as response.real_url.
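For example, inside a callback (a tiny illustrative sketch):

def parse_result(self, response):
    self.logger.info('rendered page: %s', response.url)       # original website URL
    self.logger.info('Prerender endpoint: %s', response.real_url)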
PrerenderJsonResponse provides extra features:

- The response.data attribute contains response data decoded from JSON; you can access it like response.data['html'].
- If Prerender session handling is configured, you can access current cookies as response.cookiejar; it is a CookieJar instance.
- If Scrapy-Prerender response magic is enabled in the request (default), several response attributes (headers, body, url, status code) are set automatically from the original response body:
  - response.headers are filled from the 'headers' key;
  - response.url is set to the value of the 'url' key;
  - response.body is set to the value of the 'html' key, or to the base64-decoded value of the 'body' key;
  - response.status is set from the value of the 'http_status' key.

When response.body is updated in PrerenderJsonResponse (either from the 'html' or from the 'body' key), the familiar response.css and response.xpath methods are available.
To turn off special handling of JSON result keys, either set meta['prerender']['magic_response']=False or pass the magic_response=False argument to PrerenderRequest.
Prerender itself is stateless - each request starts from a clean state. In order to support sessions the following is required:

1. the client (Scrapy) must send current cookies to Prerender;
2. the Prerender script should make requests using these cookies and update them from HTTP response headers or JavaScript code;
3. updated cookies should be sent back to the client;
4. the client should merge current cookies with the updated cookies.
For (2) and (3) Prerender provides the prerender:get_cookies() and prerender:init_cookies() methods which can be used in Prerender Lua scripts.

scrapy-prerender provides helpers for (1) and (4): to send current cookies in the 'cookies' field and merge cookies back from the 'cookies' response field, set request.meta['prerender']['session_id'] to the session identifier. If you only want a single session, use the same session_id for all requests; any value like '1' or 'foo' is fine.

For scrapy-prerender session handling to work you must use the /execute endpoint and a Lua script which accepts a 'cookies' argument and returns a 'cookies' field in the result:
function main(prerender)
    prerender:init_cookies(prerender.args.cookies)
    -- ... your script
    return {
        cookies = prerender:get_cookies(),
        -- ... other results, e.g. html
    }
end
PrerenderRequest sets session_id automatically for the /execute endpoint, i.e. cookie handling is enabled by default if you use PrerenderRequest, the /execute endpoint and a compatible Lua rendering script.
If you want to start from the same set of cookies but then 'fork' sessions, set request.meta['prerender']['new_session_id'] in addition to session_id. Request cookies will be fetched from the cookiejar session_id, but response cookies will be merged back into the new_session_id cookiejar.
The standard Scrapy cookies argument can be used with PrerenderRequest to add cookies to the current Prerender cookiejar.
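Putting it together, a request that shares a single cookie session across the crawl might look like this (a sketch using the raw meta API; script is the Lua snippet shown above):

yield scrapy.Request(url, self.parse_result, meta={
    'prerender': {
        'endpoint': 'execute',
        'args': {'lua_source': script},
        'session_id': '1',  # same value for every request => one shared session
    }
})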
Get HTML contents:
import scrapy
from scrapy_prerender import PrerenderRequest


class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield PrerenderRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body is a result of render call; it
        # contains HTML processed by a browser.
        # ...
Get HTML contents and a screenshot:
import json
import base64

import scrapy
from scrapy_prerender import PrerenderRequest


class MySpider(scrapy.Spider):
    # ...
        prerender_args = {
            'html': 1,
            'png': 1,
            'width': 600,
            'render_all': 1,
        }
        yield PrerenderRequest(url, self.parse_result, endpoint='render.json',
                               args=prerender_args)

    # ...
    def parse_result(self, response):
        # magic responses are turned ON by default,
        # so the result under 'html' key is available as response.body
        html = response.body

        # you can also query the html result as usual
        title = response.css('title').extract_first()

        # full decoded JSON data is available as response.data:
        png_bytes = base64.b64decode(response.data['png'])
        # ...
Run a simple Prerender Lua Script:
import json
import base64

import scrapy
from scrapy_prerender import PrerenderRequest


class MySpider(scrapy.Spider):
    # ...
    script = """
    function main(prerender)
        assert(prerender:go(prerender.args.url))
        return prerender:evaljs("document.title")
    end
    """

    # ...
        yield PrerenderRequest(url, self.parse_result, endpoint='execute',
                               args={'lua_source': script})

    # ...
    def parse_result(self, response):
        doc_title = response.body_as_unicode()
        # ...
More complex Prerender Lua Script example - get a screenshot of an HTML element by its CSS selector (it requires Prerender 2.1+). Note how arguments are passed to the script:
import json
import base64

import scrapy
from scrapy_prerender import PrerenderRequest

script = """
-- Arguments:
-- * url - URL to render;
-- * css - CSS selector to render;
-- * pad - screenshot padding size.

-- this function adds padding around region
function pad(r, pad)
    return {r[1]-pad, r[2]-pad, r[3]+pad, r[4]+pad}
end

-- main script
function main(prerender)

    -- this function returns element bounding box
    local get_bbox = prerender:jsfunc([[
        function(css) {
            var el = document.querySelector(css);
            var r = el.getBoundingClientRect();
            return [r.left, r.top, r.right, r.bottom];
        }
    ]])

    assert(prerender:go(prerender.args.url))
    assert(prerender:wait(0.5))

    -- don't crop image by a viewport
    prerender:set_viewport_full()

    local region = pad(get_bbox(prerender.args.css), prerender.args.pad)
    return prerender:png{region=region}
end
"""


class MySpider(scrapy.Spider):
    # ...
        yield PrerenderRequest(url, self.parse_element_screenshot,
                               endpoint='execute',
                               args={
                                   'lua_source': script,
                                   'pad': 32,
                                   'css': 'a.title'
                               })
    # ...

    def parse_element_screenshot(self, response):
        image_data = response.body  # binary image data in PNG format
        # ...
Use a Lua script to get an HTML response with cookies, headers, body and method set to correct values; the lua_source argument value is cached on the Prerender server and is not sent with each request (it requires Prerender 2.1+):
import scrapy
from scrapy_prerender import PrerenderRequest

script = """
function main(prerender)
    prerender:init_cookies(prerender.args.cookies)
    assert(prerender:go{
        prerender.args.url,
        headers=prerender.args.headers,
        http_method=prerender.args.http_method,
        body=prerender.args.body,
    })
    assert(prerender:wait(0.5))

    local entries = prerender:history()
    local last_response = entries[#entries].response
    return {
        url = prerender:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = prerender:get_cookies(),
        html = prerender:html(),
    }
end
"""


class MySpider(scrapy.Spider):
    # ...
        yield PrerenderRequest(url, self.parse_result,
                               endpoint='execute',
                               cache_args=['lua_source'],
                               args={'lua_source': script},
                               headers={'X-My-Header': 'value'})

    def parse_result(self, response):
        # here response.body contains result HTML;
        # response.headers are filled with headers from last
        # web page loaded to Prerender;
        # cookies from all responses and from JavaScript are collected
        # and put into Set-Cookie response header, so that Scrapy
        # can remember them.
If you need HTTP Basic Authentication to access Prerender, use Scrapy's HttpAuthMiddleware. Another option is meta['prerender']['prerender_headers']: it lets you set custom headers which are sent to the Prerender server; add an Authorization header to prerender_headers if HttpAuthMiddleware doesn't fit for some reason, as sketched below.
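A minimal sketch of the prerender_headers approach (the credentials are placeholders):

import base64

# placeholder credentials -- replace with your own
auth = base64.b64encode(b'user:password').decode('ascii')

yield PrerenderRequest(url, self.parse_result,
                       prerender_headers={'Authorization': 'Basic ' + auth})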
Why not use the Prerender HTTP API directly?
=============================================
The obvious alternative to scrapy-prerender would be to send requests directly to the Prerender HTTP API. Take a look at the example below and make sure to read the observations after it:
import json

import scrapy
from scrapy.http.headers import Headers

RENDER_HTML_URL = "http://127.0.0.1:8050/render"


class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            body = json.dumps({"url": url, "wait": 0.5}, sort_keys=True)
            headers = Headers({'Content-Type': 'application/json'})
            yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                                 body=body, headers=headers)

    def parse(self, response):
        # response.body is a result of render call; it
        # contains HTML processed by a browser.
        # ...
It works and is easy enough, but there are some issues that you should be aware of:
- There is a bit of boilerplate.
- As seen by Scrapy, we're sending requests to RENDER_HTML_URL instead of the target URLs. It affects concurrency and politeness settings: CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, etc. could behave in unexpected ways since delays and concurrency settings are no longer per-domain.
- As seen by Scrapy, response.url is the URL of the Prerender server. scrapy-prerender fixes it to be the URL of the requested page. The "real" URL is still available as response.real_url. scrapy-prerender also makes it possible to handle response.status and response.headers transparently on the Scrapy side.
- Some options depend on each other - for example, if you use the timeout Prerender option then you may want to set the download_timeout scrapy.Request meta key as well.
- It is easy to get it subtly wrong - e.g. if you don't use the sort_keys=True argument when preparing the JSON body then the binary POST body content could vary even if all keys and values are the same, which means the dupefilter and cache will work incorrectly.
- The default Scrapy duplication filter doesn't take Prerender specifics into account. For example, if a URL is sent in a JSON POST request body, Scrapy will compute the request fingerprint without canonicalizing this URL.
- Prerender Bad Request (HTTP 400) errors are hard to debug because by default response content is not displayed by Scrapy. PrerenderMiddleware logs the content of HTTP 400 Prerender responses by default (it can be turned off by setting the PRERENDER_LOG_400 = False option).
- Cookie handling is tedious to implement, and you can't use Scrapy's built-in cookie middleware to handle cookies when working with Prerender.
- Large Prerender arguments which don't change with every request (e.g. lua_source) may take a lot of space when saved to Scrapy disk request queues. scrapy-prerender provides a way to store such static parameters only once.
- Prerender 2.1+ provides a way to save network traffic by caching large static arguments on the server, but it requires client support: the client should send proper save_args and load_args values and handle HTTP 498 responses.
scrapy-prerender utilities handle such edge cases and reduce the boilerplate.
- For problems with rendering pages read the "Prerender FAQ" page.
- For Scrapy-related bugs take a look at the "reporting Scrapy bugs" page.

The best way to get any other help is to ask a question on Stack Overflow.
Source code and bug tracker are on GitHub: https://github.com/scrapy-plugins/scrapy-prerender

To run tests, install the "tox" Python package and then run the tox command from the source checkout.

To run integration tests, start Prerender and set the PRERENDER_URL env variable to the Prerender address before running tox:
docker run -d --rm -p8050:8050 scrapinghub/prerender:3.0
PRERENDER_URL=http://127.0.0.1:8050 tox -e py36