scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API

License: BSD 3-Clause "New" or "Revised" License

Python 92.29% Lua 3.58% JavaScript 1.29% Shell 0.88% Jupyter Notebook 0.70% CSS 0.25% Dockerfile 0.77% Qt Script 0.24%

splash's Introduction

Splash - A JavaScript rendering service

Build Status

Coverage report

Join the chat at https://gitter.im/scrapinghub/splash

Splash is a JavaScript rendering service with an HTTP API: a lightweight, scriptable browser implemented in Python 3 using Twisted and Qt5.

It's fast, lightweight and stateless, which makes it easy to distribute.

Documentation

Documentation is available here: https://splash.readthedocs.io/

Using Splash with Scrapy

To use Splash with Scrapy, please refer to the scrapy-splash library.

Support

Open-source support is provided here on GitHub. Please create a "question" issue.

Commercial support is also available from Scrapinghub.

splash's People

Contributors

ahivert, andresp99999, arturgaspar, chekunkov, dangra, dvdbng, gallaecio, imduffy15, immerrr, ivanprado, jp111, kmike, laerte, lagenar, laurentsenta, lucywang000, mehaase, mike1808, pablohoffman, pawelmhm, pyexplorer, redapple, sardok, shaylevi2, sibiryakov, sortafreel, starrify, sunu, whalebot-helmsman, zscholl

splash's Issues

Add authentication to splash server

Would authentication be a welcome feature, or do you suggest it's handled elsewhere? I did some quick poking around and it doesn't seem too difficult to add. I'm still evaluating splash, but I'm happy to do it if I move forward with it!
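If it were added to Splash itself, the core of it could be as small as a credential check on incoming requests. A minimal sketch, assuming HTTP Basic auth and a dict-like mapping of request headers (`check_basic_auth` is a hypothetical helper, not an existing Splash API; a real implementation would hook into the Twisted resources):

```python
import base64

def check_basic_auth(headers, expected_user, expected_password):
    """Return True if the request carries valid HTTP Basic credentials.

    `headers` is a dict-like mapping of request header names to values,
    as a Twisted resource might expose them.
    """
    auth = headers.get("Authorization", "")
    if not auth.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth[len("Basic "):]).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False
    user, sep, password = decoded.partition(":")
    return sep == ":" and user == expected_user and password == expected_password
```

A request without credentials would then get a 401 response instead of being rendered.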

Add an option to limit network access based on URL regexes

Some sites are quite bloated, and it takes a long time to fully load them including all ads and tracking code; sometimes pages don't even finish loading within the timeout.

What do you think about adding "network access profiles" similar to "proxy profiles" with a whitelist and blacklist for network access?

Implementation should be quite straightforward; it'll probably involve extending SplashQNetworkAccessManager and making it pluggable.
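The per-request decision such a profile would make can be sketched in pure Python (`allowed` is a hypothetical helper; a real implementation would live in the SplashQNetworkAccessManager subclass mentioned above):

```python
import re

def allowed(url, whitelist=(), blacklist=()):
    """Decide whether a request to `url` should be allowed.

    The blacklist wins over the whitelist; an empty whitelist allows
    everything the blacklist does not reject. Patterns are regexes
    matched with re.search.
    """
    if any(re.search(pattern, url) for pattern in blacklist):
        return False
    if whitelist and not any(re.search(pattern, url) for pattern in whitelist):
        return False
    return True
```

Blocked requests would simply never be issued, which is what cuts down the load time on ad-heavy pages.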

Wait for some time after window.onload by default

What do you think about setting defaults.WAIT_TIME to something like 0.5?
This has two advantages:

  1. it helps with #14;
  2. it more closely matches how a browser behaves for a user: some dynamically-generated content (like lazy-loaded iframes) will become available by default.

I ran python -m splash.tests.stress and a 0.5s wait didn't result in a slowdown, as expected.

This change could increase the memory required by splash because more requests stay in memory at the same time; it will also require more processing power to execute JavaScript for 0.5s on real-world webpages.

Add an option to render full webpage as png image

This feature was removed here: #5

I was able to reproduce this issue, but setting a non-zero "wait" parameter (implemented in #13) fixed it for me.

What do you think about the following plan?

  1. render full pages when vwidth/vheight are not set;
  2. add small "wait" timeout by default;
  3. if contentsSize() still fails (i.e. returns a zero QSize()), fall back to some default value like 1024x768;
  4. rename the vwidth and vheight parameters to a single "viewport" parameter that accepts values like "1024x768" - vwidth without vheight (and vice versa) could be hard to support if (1)-(3) are implemented.
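Step (3)'s fallback and step (4)'s combined parameter could share one small parser. A sketch, assuming a "WIDTHxHEIGHT" string format (`parse_viewport` is a hypothetical name, not existing Splash API):

```python
def parse_viewport(value, default=(1024, 768)):
    """Parse a "WIDTHxHEIGHT" viewport string, e.g. "1024x768".

    Falls back to `default` when the value is missing or either
    dimension is zero - mirroring step (3), where a zero
    contentsSize() should fall back to a sane default.
    """
    if not value:
        return default
    try:
        width, height = (int(part) for part in value.split("x"))
    except ValueError:
        raise ValueError("viewport must look like '1024x768', got %r" % value)
    if width <= 0 or height <= 0:
        return default
    return width, height
```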

tests: use ports from the ephemeral port range instead of hardcoded ports

Currently tests can fail or behave incorrectly when a splash instance is already running. It would be better to use temporary ports for the splash server, mock server and proxy server.

>>> import socket
>>> s = socket.socket()
>>> s.bind(("", 0))
>>> s.getsockname()
('0.0.0.0', 54485)

For the proxy server it may require more code, because its port is in a config file stored in VCS; it may be OK not to fix this issue for the proxy server.
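Wrapped as a helper for the test suite, the snippet above might look like this (hypothetical function; note the small inherent race between closing the socket and the test server binding the returned port, which is rarely a problem in practice):

```python
import socket

def get_ephemeral_port():
    """Ask the OS for a free TCP port from the ephemeral range.

    Binding to port 0 makes the kernel pick a currently-free port;
    the port is released again when the socket is closed.
    """
    s = socket.socket()
    try:
        s.bind(("", 0))  # port 0 -> kernel picks a free port
        return s.getsockname()[1]
    finally:
        s.close()
```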

Error 5: the operation was canceled via calls to abort() or close() before it was finished

I installed splash on a bare Ubuntu (EC2) instance. After getting everything up and running, I noticed I was receiving this error on every request:

Error 5: the operation was canceled via calls to abort() or close() before it was finished.

The error is documented as QNetworkReply::OperationCanceledError.

I'm invoking splash with this command:

curl http://localhost:8050/render.html?url=http://www.getsidewalk.com

Full log below:

2014-01-10 21:39:49+0000 [network] Error 203: the remote content was not found at the server (similar to HTTP error 404) (http://www.getsidewalk.com/assets/layouts/default/wing-left-eaa1fe4bec1a41ce80b911bff557e710.png)
2014-01-10 21:39:49+0000 [network] Error 203: the remote content was not found at the server (similar to HTTP error 404) (http://www.getsidewalk.com/assets/layouts/default/wing-right-baa18b10d7709bc86454d66ecaded0d6.png)
2014-01-10 21:39:49+0000 [stats] {"maxrss": 63352, "load": [0.0, 0.01, 0.05], "fds": 27, "qsize": 0, "rendertime": 0.6076819896697998, "active": 0, "path": "/render.html", "args": {"url": ["http://www.getsidewalk.com"]}, "_id": 29090232}
2014-01-10 21:39:49+0000 [-] 127.0.0.1 - - [10/Jan/2014:21:39:49 +0000] "GET /render.html?url=http://www.getsidewalk.com HTTP/1.1" 200 3910 "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
2014-01-10 21:39:49+0000 [network] Error 5: the operation was canceled via calls to abort() or close() before it was finished. (http://www.google-analytics.com/collect?v=1&_v=j15&a=18518730&t=pageview&_s=1&dl=http://www.getsidewalk.com/&ul=en-us&de=UTF-8&dt=Sidewalk&sd=8-bit&sr=640x480&vp=1024x768&je=0&_u=ME~&cid=732692695.1389389989&tid=UA-41559438-1&z=1586397198)

Note that I do not get this error when running it on OS X.

Any idea how to fix this?

iframes support

We need to support retrieving the content of iframes. Currently, render.html only returns the main frame content.

Limit concurrent renders

We need to limit the number of renders that run concurrently, to avoid overloading the servers when visiting a page with many thumbnails on the panel.

Use a single instance of QNetworkDiskCache

We need to use a single instance of QNetworkDiskCache to avoid reloading the cache on each request.

My simple fix in 11a9b8d didn't work because the network manager takes ownership of the QNetworkDiskCache instance and deletes it when the manager is destroyed.

I suspect the solution involves having a single network manager instance, which may require rethinking or rewriting the allowed_domains functionality.

Ideas welcome!

Document installation steps

I'm trying to get the debian package installed from the GitHub repo. After quite a bit of googling, it seems I have gotten the dpkg install to work, and yet splash is still not recognized. Here are the steps I took:

  1. Cloned the splash repo
  2. sudo apt-get install devscripts
  3. sudo dpkg-buildpackage -b (from the repo root)
  4. sudo dpkg -i splash_1.0_all.deb (note that dependencies are not installed)
  5. sudo apt-get -f install
  6. sudo apt-get install equivs
  7. mk-build-deps
  8. sudo dpkg -i splash-build-deps_1.0_all.deb

I was under the impression that this would install splash so I can start it as an upstart job, but "start splash" does nothing.

I see the upstart config in /etc/init/splash.conf.dpkg-new, but it is not recognized by initctl list, even after reloading the configuration with sudo initctl reload-configuration.
I then run init-checkconf /etc/init/splash.conf.dpkg-new and get:
ERROR: file must end in .conf
After renaming the file to end in .conf and reloading the configuration again with sudo initctl reload-configuration,
it shows up in initctl list as splash stop/waiting.

I start splash

$ sudo start splash
splash start/running, process 32079

Looks successful, but not according to the logs!

chown: cannot access `/var/log/splash': No such file or directory
chown: cannot access `/etc/splash/proxy-profiles': No such file or directory
chown: cannot access `/var/cache/splash': No such file or directory
chown: cannot access `/etc/splash/js-profiles': No such file or directory
chown: cannot access `/var/log/splash': No such file or directory
chown: cannot access `/etc/splash/proxy-profiles': No such file or directory
chown: cannot access `/var/cache/splash': No such file or directory
chown: cannot access `/etc/splash/js-profiles': No such file or directory
...

So I manually created the directories, started splash again, and it worked. I'd be happy to update the README with install instructions, but maybe I'm just not knowledgeable about the idiosyncrasies of installing deb packages from source. Can you please comment and let me know:

  1. Is there currently a much easier way, or did I in fact surface a cumbersome installation process? If so, I'd love to know.
  2. If not, do all the steps above make sense? And is there an issue with a preinstall script or something else that is supposed to create those directories?

Again, if the current installation process involves all this work, I can boil it down to a few steps and update the README.

When rendering some pages, it works only once, and then it is not possible to retrieve anything else from the same domain

When rendering some pages, it works only once, and then it is not possible to retrieve anything else from the same domain, unless I restart the server.

curl "http://33.33.33.10:8050/render.png?url=http://panel.scrapinghub.com/" > render.png

The first time is OK; the second time the response is zero-length, and the splash2 process output shows the following:

[33.33.33.10] out: QPainter::begin: Paint device returned engine == 0, type: 3
[33.33.33.10] out: QPainter::setRenderHint: Painter must be active to set rendering hints
[33.33.33.10] out: QPainter::setBrush: Painter not active
[33.33.33.10] out: QPainter::pen: Painter not active
[33.33.33.10] out: QPainter::setPen: Painter not active
[33.33.33.10] out: QPainter::end: Painter not active, aborted

In case it helps give a clue: I first hit the problem using splash2 to render HTML code retrieved from my VM HS server, and I worked around it by passing the baseurl parameter with the base URL of the retrieved HTML. So I uploaded an example for you to test with storage.scrapinghub.com:8002. However, in that case I could not reproduce the bug on that server; it only happens with my VM version. Anyway, this is the test page for rendering:

http://storage.scrapinghub.com:8002/collections/645/cs/Pages/075b2960be370059076be43cfd65a11a6ea62cc3/body?apikey=&format=html

which, rendered with splash2 using curl:

curl --get --data-urlencode "url=http://storage.scrapinghub.com:8002/collections/645/cs/Pages/075b2960be370059076be43cfd65a11a6ea62cc3/body?apikey=&format=html" --data-urlencode "baseurl=http://www.icone.com/" http://33.33.33.10:8050/render.png > render.png

and adding baseurl:

curl --get --data-urlencode "url=http://33.33.33.11:8002/collections/9/cs/Pages/075b2960be370059076be43cfd65a11a6ea62cc3/body?apikey=ffffffffffffffffffffffffffffffff&format=html" --data-urlencode "baseurl=http://www.icone.com/" http://33.33.33.10:8050/render.png

Add support for pluggable proxy handling rules

I started implementing proxy support here: https://github.com/kmike/splash/tree/proxy-support. The idea was to have

  1. a config for a single proxy server;
  2. a blacklist of regexes: if a URL matches a regex from the blacklist, it goes through the default proxy server (usually none);
  3. a whitelist of regexes: if a URL matches a regex from the whitelist, it goes through the proxy, and it doesn't otherwise.

but it strikes me that this is too opinionated and project-specific (e.g. no multiple proxies to choose from, the blacklist/whitelist scheme).

Also, this requires some project-specific configuration: there should be a way to specify proxy parameters, the blacklist and the whitelist. It is hard to do via the command line; to be consistent and easy to use, this would probably require adding "settings.py" support to splash. Simple INI files are not a good fit because the blacklist/whitelist are lists.

What do you think about not adding settings.py support to splash, and instead refactoring the code to make it possible to use custom QNetworkProxyFactory subclasses? This changes the philosophy a bit - instead of just installing and using splash, the user is supposed to customize it for project needs by writing some code.
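Whatever form the pluggable factory takes, its per-request decision could reduce to something like this pure-Python sketch (`choose_proxy` is a hypothetical helper; a real version would subclass QNetworkProxyFactory and return QNetworkProxy objects):

```python
import re

def choose_proxy(url, proxy, whitelist=(), blacklist=()):
    """Return `proxy` if the request for `url` should be proxied, else None.

    Follows the rules sketched above: blacklisted URLs bypass the
    proxy; when a whitelist is given, only matching URLs are proxied.
    `proxy` can be any representation, e.g. a (host, port) tuple.
    """
    if any(re.search(pattern, url) for pattern in blacklist):
        return None
    if whitelist and not any(re.search(pattern, url) for pattern in whitelist):
        return None
    return proxy
```

A custom subclass shipped as project code would then own these rules, rather than splash trying to express them in a settings file.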

make tests work on dev.scrapinghub.com

"nosetests" fails for some reason on dev.scrapinghub.com, even though the service works fine there.

Need to investigate why; I suspect processes spawned from the tests are not being managed properly.

Make splash work as a proxy

The idea is by @nramirezuy. It would allow using HTTP methods other than GET, accessing response headers (and e.g. preserving cookies between requests), etc.

loadFinished triggered twice for some pages

This happens with amazon.com, sometimes. I haven't yet figured out exactly what causes loadFinished to be triggered twice, but it's related to the sign-in box that appears when you go to amazon.com.

Unexpected result when combining POST with gzip encoding.

When using splash as a proxy and making a POST request with a gzip Accept-Encoding header, it seems the content does not get decoded before the screenshot is taken. This does not happen when the method is GET or the encoding is deflate. The render.html method is affected too.

For example:

  • POST request with deflate encoding

    curl -x localhost:8051 -X POST -H 'Accept-Encoding: deflate'  -H 'X-Splash-render: png' -H 'X-Splash-wait: 1' http://www.facebook.com
    

    Result as expected.

  • POST request with gzip encoding

    curl -x localhost:8051 -X POST -H 'Accept-Encoding: gzip'  -H 'X-Splash-render: png' -H 'X-Splash-wait: 1' http://www.facebook.com
    

    Unexpected result.
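The missing step appears to be decompressing the body before rendering. A sketch of what that decoding could look like, assuming the raw body bytes and the response's Content-Encoding header are available (`decode_body` is a hypothetical helper; note that some servers send raw deflate streams, which would need zlib.decompress with wbits=-15 instead):

```python
import gzip
import zlib

def decode_body(body, content_encoding):
    """Decode a response body according to its Content-Encoding header.

    Both gzip and (zlib-wrapped) deflate bodies are decompressed;
    anything else is passed through unchanged.
    """
    encoding = (content_encoding or "identity").lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        return zlib.decompress(body)
    return body
```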

Some sites don't render properly

When I render bestbuy.ca and some other sites, the HTML rendered is not what I expected: elements sprawl and overlap, and it doesn't resemble the screenshot at all.

Give it a try:

render.html?url=http://www.bestbuy.ca

Is it possible to fix this somehow? What is causing such malformed HTML?

run javascript on each get request

I have a JavaScript snippet that inserts a <base> tag inside the <head>.

I want it to run on every page that I request.

For example, if there's an image with a relative URL, it won't get rendered; running this JavaScript fixes that.

Right now this requires a POST request; however, I need it to run on each GET request, e.g. when I request render.html?js=/etc/js-profile/new&url=http://a.com

HtmlProxyRenderTest.test_blacklist can fail

I can't reproduce it myself, but this failure happened:

======================================================================
FAIL: test_blacklist (splash.tests.test_proxy.HtmlProxyRenderTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/buildbot/slave/builders/splash/build/splash/tests/test_proxy.py", line 72, in test_blacklist
    self.assertProxied(frame['html'])
  File "/var/lib/buildbot/slave/builders/splash/build/splash/tests/test_proxy.py", line 94, in assertProxied
    assert 'PROXY_USED' in html
AssertionError: 
-------------------- >> begin captured logging << --------------------
requests.packages.urllib3.connectionpool: INFO: Starting new HTTP connection (1): localhost
requests.packages.urllib3.connectionpool: DEBUG: "GET /render.json?url=http%3A%2F%2Flocalhost%3A8998%2Fiframes&html=1&iframes=1&proxy=test HTTP/1.1" 200 None
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------

Add undocumented parameters to the README

Several parameters are not documented in the README; they can be seen by running the help option from the command line.

Usage: server.py [options]

Options:
  -h, --help            show this help message and exit
  -f LOGFILE, --logfile=LOGFILE
                        log file
  -m MAXRSS, --maxrss=MAXRSS
                        exit if max RSS reaches this value (in KB) (default:
                        0)
  -p PORT, --port=PORT  port to listen to (default: 8050)
  -s SLOTS, --slots=SLOTS
                        number of render slots (default: 50)
  --cache               enable local cache (active by default)
  --no-cache            disable local cache
  -c CACHE_PATH, --cache-path=CACHE_PATH
                        local cache folder
  --cache-size=CACHE_SIZE
                        maximum cache size in Kb (default: 51200)
  --proxy-profiles-path=PROXY_PROFILES_PATH
                        path to a folder with proxy profiles

Problem executing JS code

Hi,

I'm facing some problems executing JS code on a specific website:

➜  ~  curl -X POST -H 'content-type: application/javascript' -d 'showModalDimmer(); dojo.publish("showResultsForPageNumber",[{pageNumber:"2",pageSize:"12", linkId:"WC_SearchBasedNavigationResults_pagination_link_right_categoryResults"}]);' 'http://localhost:8050/render.html?url=http://www.hhgregg.com/appliances-home/washers&timeout=60&wait=0.5'

<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "/appliances-home/washers", is invalid.<p>
Reference #9.9b043bbb.1394120505.1585c2c6

</p></body></html>%

This JS code is executed when I click the "Next" button.

However, it works with a simpler JS code:

➜  ~  curl -X POST -H 'content-type: application/javascript' -d 'document.write("hello");' 'http://localhost:8050/render.html?url=http://www.hhgregg.com/appliances-home/washers&timeout=60&wait=0.5'

<html><head></head><body>hello</body></html>%

Also, splash doesn't seem to render the page properly when I view it in the browser.

Any thoughts?

JavaScript code gets wrong coordinates for page elements when the viewport is full

I have a splash call to render.json that gets a full screenshot of the page (viewport=full) and, at the same time, uses a JavaScript function to get the position of certain elements on the page.

However, when I compare the obtained coordinates with the generated screenshot, they don't match.

I think the problem is that the JavaScript code is executed before the viewport is applied, so the JavaScript gets the element positions with the default viewport while the screenshot is generated with the full viewport.

exceptions.RuntimeError warnings while running tests

Twisted==11.1.0, qt4reactor==1.0

test_whitelist (splash.tests.test_proxy.BlackWhiteProxyFactoryTest) ... ok
test_blacklist (splash.tests.test_proxy.HtmlProxyRenderTest) ... Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 84, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 69, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 586, in _doReadOrWrite
    why = selectable.doRead()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 199, in doRead
    rval = self.protocol.dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 572, in dataReceived
    return self.rawDataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 503, in rawDataReceived
    self.handleResponseEnd()
  File "/usr/lib/python2.7/dist-packages/twisted/web/proxy.py", line 88, in handleResponseEnd
    self.father.finish()
  File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 866, in finish
    "Request.finish called on a request after its connection was lost; "
exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.
[the same Unhandled Error traceback is repeated two more times]
ok
test_insecure (splash.tests.test_proxy.HtmlProxyRenderTest) ... ok
test_no_proxy_settings (splash.tests.test_proxy.HtmlProxyRenderTest) ... ok
test_nonexisting (splash.tests.test_proxy.HtmlProxyRenderTest) ... ok
test_proxy_works (splash.tests.test_proxy.HtmlProxyRenderTest) ... Unhandled Error
[same Request.finish traceback as above]
ok
test_basic (splash.tests.test_render.IframesRenderTest) ... ok

server cannot be started with the sip 4.15.5 and PyQt 4.10.4 combination

The setup is CentOS 5 with Python 2.7.

$ rpm -qa | grep qt
qt4-devel-4.7.1-0
qt4-4.7.1-0

Executing the server with xvfb-run:

$ /usr/local/bin/xvfb-run -a -s "-screen 0 640x480x8" python -m splash.server
2014-03-18 22:00:05+0800 [-] Log opened.
2014-03-18 22:00:05+0800 [-] Open files limit: 1024000
2014-03-18 22:00:05+0800 [-] Can't bump open files limit
2014-03-18 22:00:05+0800 [-] Traceback (most recent call last):
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/runpy.py", line 162, in _run_module_as_main
2014-03-18 22:00:05+0800 [-]     "__main__", fname, loader, pkg_name)
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
2014-03-18 22:00:05+0800 [-]     exec code in run_globals
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/site-packages/splash-1.0-py2.7.egg/splash/server.py", line 233, in <module>
2014-03-18 22:00:05+0800 [-]     main()
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/site-packages/splash-1.0-py2.7.egg/splash/server.py", line 224, in main
2014-03-18 22:00:05+0800 [-]     proxy_portnum=opts.proxy_portnum)
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/site-packages/splash-1.0-py2.7.egg/splash/server.py", line 144, in default_splash_server
2014-03-18 22:00:05+0800 [-]     manager = network_manager.FilteringQNetworkAccessManager()
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/site-packages/splash-1.0-py2.7.egg/splash/network_manager.py", line 105, in __init__
2014-03-18 22:00:05+0800 [-]     super(FilteringQNetworkAccessManager, self).__init__()
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/site-packages/splash-1.0-py2.7.egg/splash/network_manager.py", line 48, in __init__
2014-03-18 22:00:05+0800 [-]     self.sslErrors.connect(self._sslErrors)
2014-03-18 22:00:05+0800 [-] TypeError: pyqtSignal must be bound to a QObject, not 'FilteringQNetworkAccessManager'

However, examining from within IPython:

In [8]: from PyQt4.QtNetwork import QNetworkAccessManager, QNetworkProxyQuery, QNetworkReply
In [9]: import inspect                                                                                                                                                              
In [10]: inspect.getmro(QNetworkAccessManager)                                                                                                                                      
Out[10]: 
(PyQt4.QtNetwork.QNetworkAccessManager,
 PyQt4.QtCore.QObject,
 sip.wrapper,
 sip.simplewrapper,
 object)

I have no clue what's wrong; this is the latest splash master.

no way to get the HTTP status code when working as a proxy

E.g. when running curl -x http://localhost:8051 -H "X-Splash-render: json" -H "X-Splash-html: 0" http://steinmetz-maxwald.at/materialien/, ideally splash should send the 404 status code back, but currently splash always returns 200 for the URL above.

S3 cache

Cache responses on S3 for faster retrieval.

  1. an index of all cached URLs, which allows optimal handling without needing to check S3 for whether a URL is in the cache;
  2. handling of expiration at the S3 level for storage optimization and thumbnail renewal, and consequently also at the index;
  3. allow passing splash a list of GET parameters to exclude from key generation for the index (for example, to avoid each user's apikey in the HS URL being included in the key).
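Point (3) amounts to canonicalizing the URL before hashing it into the index key. A sketch, with a hypothetical `cache_key` helper:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def cache_key(url, ignored_params=()):
    """Build a stable cache key for `url`.

    The given GET parameters (e.g. per-user apikey values) are dropped
    and the rest are sorted, so otherwise-identical URLs share one
    cache entry regardless of parameter order.
    """
    scheme, netloc, path, query, _fragment = urlsplit(url)
    ignored = set(ignored_params)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k not in ignored]
    canonical = urlunsplit((scheme, netloc, path, urlencode(sorted(kept)), ""))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```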

Add ability to change the user agent, like changing proxies

Add a setting in *.ini for user agents.

[useragents]
chrome:'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36'
firefox:'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'

http://localhost:8050/render.html?url=http://mywebsite.com/page-with-javascript.html&proxy=mywebsite&UA=chrome

An example of changing the user agent on a QNetworkAccessManager:

QNetworkAccessManager *mgr = new QNetworkAccessManager();
// ...
QNetworkRequest req;
req.setUrl(QUrl("enter url"));
req.setRawHeader("User-Agent", "Mozilla Firefox");
mgr->get(req);

The addition would have two parts: a way to parse the .INI config file, and setting the raw header for Qt.
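The parsing half could be plain configparser. A sketch, using `=` instead of the quoted values shown above (ConfigParser strips whitespace but would keep literal quotes as part of the value; `load_user_agents` and `resolve_user_agent` are hypothetical helpers):

```python
import configparser

CONFIG_TEXT = """
[useragents]
chrome = Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36
firefox = Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0
"""

def load_user_agents(text):
    """Parse a [useragents] INI section into a name -> UA string dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser.items("useragents"))

def resolve_user_agent(agents, name, default="Splash"):
    """Look up the UA= request parameter, falling back to a default."""
    return agents.get(name, default)
```

The resolved string would then be passed to setRawHeader("User-Agent", ...) on the Qt side, as in the C++ example above.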
