scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API

License: BSD 3-Clause "New" or "Revised" License

Python 92.29% Lua 3.58% JavaScript 1.29% Shell 0.88% Jupyter Notebook 0.70% CSS 0.25% Dockerfile 0.77% Qt Script 0.24%

splash's Introduction

Splash - A JavaScript rendering service

Build Status

Coverage report

Join the chat at https://gitter.im/scrapinghub/splash

Splash is a JavaScript rendering service with an HTTP API: a lightweight, scriptable browser implemented in Python 3 using Twisted and Qt5.

It's fast, lightweight and stateless, which makes it easy to distribute.

Documentation

Documentation is available here: https://splash.readthedocs.io/

Using Splash with Scrapy

To use Splash with Scrapy, please refer to the scrapy-splash library.

Support

Open-source support is provided here on GitHub. Please create a "question" issue.

Commercial support is also available from Scrapinghub.

splash's People

Contributors

ahivert, andresp99999, arturgaspar, chekunkov, dangra, dvdbng, gallaecio, imduffy15, immerrr, ivanprado, jp111, kmike, laerte, lagenar, laurentsenta, lucywang000, mehaase, mike1808, pablohoffman, pawelmhm, pyexplorer, redapple, sardok, shaylevi2, sibiryakov, sortafreel, starrify, sunu, whalebot-helmsman, zscholl

splash's Issues

Add authentication to splash server

Would authentication be a welcome feature, or do you suggest it's handled elsewhere? I did some quick poking around and it doesn't seem too difficult to add. I'm still evaluating splash, but I'm happy to do it if I move forward with it!
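If it were added to Splash itself, the core of it could be as small as a credential check on incoming requests. A minimal sketch, assuming HTTP Basic auth and a dict-like mapping of request headers (`check_basic_auth` is a hypothetical helper, not an existing Splash API; a real implementation would hook into the Twisted resources):

```python
import base64

def check_basic_auth(headers, expected_user, expected_password):
    """Return True if the request carries valid HTTP Basic credentials.

    `headers` is a dict-like mapping of request header names to values,
    as a Twisted resource might expose them.
    """
    auth = headers.get("Authorization", "")
    if not auth.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth[len("Basic "):]).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False
    user, sep, password = decoded.partition(":")
    return sep == ":" and user == expected_user and password == expected_password
```

A request without credentials would then get a 401 response instead of being rendered.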

Add an option to limit network access based on URL regexes

Some sites are quite bloated, and it takes a long time to fully load them including all ads and tracking code; sometimes pages don't even finish loading within the timeout.

What do you think about adding "network access profiles" similar to "proxy profiles" with a whitelist and blacklist for network access?

Implementation should be quite straightforward; it'll probably involve extending SplashQNetworkAccessManager and making it pluggable.
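The per-request decision such a profile would make can be sketched in pure Python (`allowed` is a hypothetical helper; a real implementation would live in the SplashQNetworkAccessManager subclass mentioned above):

```python
import re

def allowed(url, whitelist=(), blacklist=()):
    """Decide whether a request to `url` should be allowed.

    The blacklist wins over the whitelist; an empty whitelist allows
    everything the blacklist does not reject. Patterns are regexes
    matched with re.search.
    """
    if any(re.search(pattern, url) for pattern in blacklist):
        return False
    if whitelist and not any(re.search(pattern, url) for pattern in whitelist):
        return False
    return True
```

Blocked requests would simply never be issued, which is what cuts down the load time on ad-heavy pages.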

Wait for some time after window.onload by default

What do you think about setting defaults.WAIT_TIME to something like 0.5?
This has two advantages:

  1. it helps with #14;
  2. it more closely matches how a browser behaves for a user: some dynamically-generated content (like lazy-loaded iframes) will become available by default.

I ran python -m splash.tests.stress and a 0.5s wait didn't result in a slowdown, as expected.

This change could increase the memory required by splash because more requests stay in memory at the same time; it will also require more processing power to execute JavaScript for 0.5s on real-world webpages.

Add an option to render full webpage as png image

This feature was removed here: #5

I was able to reproduce this issue, but setting a non-zero "wait" parameter (implemented in #13) fixed it for me.

What do you think about the following plan?

  1. render full pages when vwidth/vheight are not set;
  2. add small "wait" timeout by default;
  3. if contentsSize() still fails (i.e. returns a zero QSize()), fall back to some default value like 1024x768;
  4. rename the vwidth and vheight parameters to a single "viewport" parameter that accepts values like "1024x768" - vwidth without vheight (and vice versa) could be hard to support if (1)-(3) are implemented.
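Step (3)'s fallback and step (4)'s combined parameter could share one small parser. A sketch, assuming a "WIDTHxHEIGHT" string format (`parse_viewport` is a hypothetical name, not existing Splash API):

```python
def parse_viewport(value, default=(1024, 768)):
    """Parse a "WIDTHxHEIGHT" viewport string, e.g. "1024x768".

    Falls back to `default` when the value is missing or either
    dimension is zero - mirroring step (3), where a zero
    contentsSize() should fall back to a sane default.
    """
    if not value:
        return default
    try:
        width, height = (int(part) for part in value.split("x"))
    except ValueError:
        raise ValueError("viewport must look like '1024x768', got %r" % value)
    if width <= 0 or height <= 0:
        return default
    return width, height
```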

tests: use ports from the ephemeral port range instead of hardcoded ports

Currently tests can fail or behave incorrectly when a splash instance is already running. It would be better to use temporary ports for the splash server, mock server and proxy server.

>>> import socket
>>> s = socket.socket()
>>> s.bind(("", 0))
>>> s.getsockname()
('0.0.0.0', 54485)

For the proxy server it may require more code, because its port is in a config file stored in VCS; it may be OK not to fix this issue for the proxy server.
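Wrapped as a helper for the test suite, the snippet above might look like this (hypothetical function; note the small inherent race between closing the socket and the test server binding the returned port, which is rarely a problem in practice):

```python
import socket

def get_ephemeral_port():
    """Ask the OS for a free TCP port from the ephemeral range.

    Binding to port 0 makes the kernel pick a currently-free port;
    the port is released again when the socket is closed.
    """
    s = socket.socket()
    try:
        s.bind(("", 0))  # port 0 -> kernel picks a free port
        return s.getsockname()[1]
    finally:
        s.close()
```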

Error 5: the operation was canceled via calls to abort() or close() before it was finished

I installed splash on a bare Ubuntu (EC2) instance. After getting everything up and running, I noticed I was receiving this error on every request:

Error 5: the operation was canceled via calls to abort() or close() before it was finished.

The error is documented as QNetworkReply::OperationCanceledError.

I'm invoking splash with this command:

curl http://localhost:8050/render.html?url=http://www.getsidewalk.com

Full log below:

2014-01-10 21:39:49+0000 [network] Error 203: the remote content was not found at the server (similar to HTTP error 404) (http://www.getsidewalk.com/assets/layouts/default/wing-left-eaa1fe4bec1a41ce80b911bff557e710.png)
2014-01-10 21:39:49+0000 [network] Error 203: the remote content was not found at the server (similar to HTTP error 404) (http://www.getsidewalk.com/assets/layouts/default/wing-right-baa18b10d7709bc86454d66ecaded0d6.png)
2014-01-10 21:39:49+0000 [stats] {"maxrss": 63352, "load": [0.0, 0.01, 0.05], "fds": 27, "qsize": 0, "rendertime": 0.6076819896697998, "active": 0, "path": "/render.html", "args": {"url": ["http://www.getsidewalk.com"]}, "_id": 29090232}
2014-01-10 21:39:49+0000 [-] 127.0.0.1 - - [10/Jan/2014:21:39:49 +0000] "GET /render.html?url=http://www.getsidewalk.com HTTP/1.1" 200 3910 "-" "curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3"
2014-01-10 21:39:49+0000 [network] Error 5: the operation was canceled via calls to abort() or close() before it was finished. (http://www.google-analytics.com/collect?v=1&_v=j15&a=18518730&t=pageview&_s=1&dl=http://www.getsidewalk.com/&ul=en-us&de=UTF-8&dt=Sidewalk&sd=8-bit&sr=640x480&vp=1024x768&je=0&_u=ME~&cid=732692695.1389389989&tid=UA-41559438-1&z=1586397198)

Note that I do not get this error when running it on OS X.

Any idea how to fix this?

iframes support

We need to support retrieving the content of iframes. Currently, render.html only returns the main frame content.

Limit concurrent renders

We need to limit the number of renders that run concurrently, to avoid overloading the servers when visiting a page with many thumbnails on the panel.

Use a single instance of QNetworkDiskCache

We need to use a single instance of QNetworkDiskCache to avoid reloading the cache on each request.

My simple fix in 11a9b8d didn't work because the network manager takes ownership of the QNetworkDiskCache instance and deletes it when the manager is destroyed.

I suspect the solution involves having a single network manager instance, which may require rethinking or rewriting the allowed_domains functionality.

Ideas welcome!

Document installation steps

I'm trying to get the debian package installed from the GitHub repo. After quite a bit of googling, it seems I have gotten the dpkg install to work, and yet splash is still not recognized. Here are the steps I took:

  1. Cloned the splash repo
  2. sudo apt-get install devscripts
  3. sudo dpkg-buildpackage -b (from the repo root)
  4. sudo dpkg -i splash_1.0_all.deb (note that dependencies are not installed)
  5. sudo apt-get -f install
  6. sudo apt-get install equivs
  7. mk-build-deps
  8. sudo dpkg -i splash-build-deps_1.0_all.deb

I was under the impression that this would install splash so I can start it as an upstart job, but "start splash" does nothing.

I see the upstart config in /etc/init/splash.conf.dpkg-new, but it is not recognized by initctl list, even after reloading the configuration with sudo initctl reload-configuration.
I then run init-checkconf /etc/init/splash.conf.dpkg-new and get:
ERROR: file must end in .conf
After renaming the file to end in .conf and reloading the configuration again with sudo initctl reload-configuration,
it shows up in initctl list as splash stop/waiting.

I start splash

$ sudo start splash
splash start/running, process 32079

Looks successful, but not according to the logs!

chown: cannot access `/var/log/splash': No such file or directory
chown: cannot access `/etc/splash/proxy-profiles': No such file or directory
chown: cannot access `/var/cache/splash': No such file or directory
chown: cannot access `/etc/splash/js-profiles': No such file or directory
chown: cannot access `/var/log/splash': No such file or directory
chown: cannot access `/etc/splash/proxy-profiles': No such file or directory
chown: cannot access `/var/cache/splash': No such file or directory
chown: cannot access `/etc/splash/js-profiles': No such file or directory
...

So I manually created the directories, started splash again, and it worked. I'd be happy to update the README with install instructions, but maybe I'm just not knowledgeable about the idiosyncrasies of installing deb packages from source. Can you please comment and let me know:

  1. Is there currently a much easier way, or did I in fact surface a cumbersome installation process? If so, I'd love to know.
  2. If not, do all the steps above make sense? And is there an issue with a preinstall script or something else that is supposed to create those directories?

Again, if the current installation process involves all this work, I can boil it down to a few steps and update the README.

When rendering some pages, it works only once, and then it is not possible to retrieve anything else from the same domain

When rendering some pages, it works only once, and then it is not possible to retrieve anything else from the same domain, unless I restart the server.

curl "http://33.33.33.10:8050/render.png?url=http://panel.scrapinghub.com/" > render.png

The first time is OK; the second time the response is zero-length, and the splash2 process output shows the following:

[33.33.33.10] out: QPainter::begin: Paint device returned engine == 0, type: 3
[33.33.33.10] out: QPainter::setRenderHint: Painter must be active to set rendering hints
[33.33.33.10] out: QPainter::setBrush: Painter not active
[33.33.33.10] out: QPainter::pen: Painter not active
[33.33.33.10] out: QPainter::setPen: Painter not active
[33.33.33.10] out: QPainter::end: Painter not active, aborted

In case it helps give a clue: I first hit the problem using splash2 to render HTML code retrieved from my VM HS server, and I worked around it by passing the baseurl parameter with the base URL of the retrieved HTML. So I uploaded an example for you to test with storage.scrapinghub.com:8002. However, in that case I could not reproduce the bug on that server; it only happens with my VM version. Anyway, this is the test page for rendering:

http://storage.scrapinghub.com:8002/collections/645/cs/Pages/075b2960be370059076be43cfd65a11a6ea62cc3/body?apikey=&format=html

which, rendered with splash2 using curl:

curl --get --data-urlencode "url=http://storage.scrapinghub.com:8002/collections/645/cs/Pages/075b2960be370059076be43cfd65a11a6ea62cc3/body?apikey=&format=html" --data-urlencode "baseurl=http://www.icone.com/" http://33.33.33.10:8050/render.png > render.png

and adding baseurl:

curl --get --data-urlencode "url=http://33.33.33.11:8002/collections/9/cs/Pages/075b2960be370059076be43cfd65a11a6ea62cc3/body?apikey=ffffffffffffffffffffffffffffffff&format=html" --data-urlencode "baseurl=http://www.icone.com/" http://33.33.33.10:8050/render.png

Add support for pluggable proxy handling rules

I started implementing proxy support here: https://github.com/kmike/splash/tree/proxy-support. The idea was to have

  1. a config for a single proxy server;
  2. a blacklist of regexes: if a URL matches a regex from the blacklist, it goes through the default proxy server (usually none);
  3. a whitelist of regexes: if a URL matches a regex from the whitelist, it goes through the proxy, and it doesn't otherwise.

but it strikes me that this is too opinionated and project-specific (e.g. no multiple proxies to choose from, the blacklist/whitelist scheme).

Also, this requires some project-specific configuration: there should be a way to specify proxy parameters, the blacklist and the whitelist. It is hard to do via the command line; to be consistent and easy to use, this would probably require adding "settings.py" support to splash. Simple INI files are not a good fit because the blacklist/whitelist are lists.

What do you think about not adding settings.py support to splash, and instead refactoring the code to make it possible to use custom QNetworkProxyFactory subclasses? This changes the philosophy a bit - instead of just installing and using splash, the user is supposed to customize it for project needs by writing some code.
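Whatever form the pluggable factory takes, its per-request decision could reduce to something like this pure-Python sketch (`choose_proxy` is a hypothetical helper; a real version would subclass QNetworkProxyFactory and return QNetworkProxy objects):

```python
import re

def choose_proxy(url, proxy, whitelist=(), blacklist=()):
    """Return `proxy` if the request for `url` should be proxied, else None.

    Follows the rules sketched above: blacklisted URLs bypass the
    proxy; when a whitelist is given, only matching URLs are proxied.
    `proxy` can be any representation, e.g. a (host, port) tuple.
    """
    if any(re.search(pattern, url) for pattern in blacklist):
        return None
    if whitelist and not any(re.search(pattern, url) for pattern in whitelist):
        return None
    return proxy
```

A custom subclass shipped as project code would then own these rules, rather than splash trying to express them in a settings file.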

make tests work on dev.scrapinghub.com

"nosetests" fails for some reason on dev.scrapinghub.com, even though the service works fine there.

Need to investigate why; I suspect processes spawned from the tests are not being managed properly.

Make splash work as a proxy

The idea is by @nramirezuy. It would allow using HTTP methods other than GET, accessing response headers (and e.g. preserving cookies between requests), etc.

loadFinished triggered twice for some pages

This happens with amazon.com, sometimes. I haven't yet figured out exactly what causes loadFinished to be triggered twice, but it's related to the sign-in box that appears when you go to amazon.com.

Unexpected result when combining POST with gzip encoding.

When using splash as a proxy and making a POST request with a gzip Accept-Encoding header, it seems the content does not get decoded before the screenshot is taken. This does not happen when the method is GET or the encoding is deflate. The render.html method is affected too.

For example:

  • POST request with deflate encoding

    curl -x localhost:8051 -X POST -H 'Accept-Encoding: deflate'  -H 'X-Splash-render: png' -H 'X-Splash-wait: 1' http://www.facebook.com
    

    Result as expected.

  • POST request with gzip encoding

    curl -x localhost:8051 -X POST -H 'Accept-Encoding: gzip'  -H 'X-Splash-render: png' -H 'X-Splash-wait: 1' http://www.facebook.com
    

    Unexpected result.
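The missing step appears to be decompressing the body before rendering. A sketch of what that decoding could look like, assuming the raw body bytes and the response's Content-Encoding header are available (`decode_body` is a hypothetical helper; note that some servers send raw deflate streams, which would need zlib.decompress with wbits=-15 instead):

```python
import gzip
import zlib

def decode_body(body, content_encoding):
    """Decode a response body according to its Content-Encoding header.

    Both gzip and (zlib-wrapped) deflate bodies are decompressed;
    anything else is passed through unchanged.
    """
    encoding = (content_encoding or "identity").lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        return zlib.decompress(body)
    return body
```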

Some sites don't render properly

When I render bestbuy.ca and some other sites, the HTML rendered is not what I expected: elements sprawl and overlap, and it doesn't resemble the screenshot at all.

Give it a try:

render.html?url=http://www.bestbuy.ca

Is it possible to fix this somehow? What is causing such malformed HTML?

run javascript on each get request

I have a JavaScript snippet that inserts a <base> tag inside the <head>.

I want it to run on every page that I request.

For example, if there's an image with a relative URL, it won't get rendered; running this JavaScript fixes that.

Right now this requires a POST request; however, I need it to run on each GET request, e.g. when I request render.html?js=/etc/js-profile/new&url=http://a.com

HtmlProxyRenderTest.test_blacklist can fail

I can't reproduce it myself, but this failure happened:

======================================================================
FAIL: test_blacklist (splash.tests.test_proxy.HtmlProxyRenderTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/buildbot/slave/builders/splash/build/splash/tests/test_proxy.py", line 72, in test_blacklist
    self.assertProxied(frame['html'])
  File "/var/lib/buildbot/slave/builders/splash/build/splash/tests/test_proxy.py", line 94, in assertProxied
    assert 'PROXY_USED' in html
AssertionError: 
-------------------- >> begin captured logging << --------------------
requests.packages.urllib3.connectionpool: INFO: Starting new HTTP connection (1): localhost
requests.packages.urllib3.connectionpool: DEBUG: "GET /render.json?url=http%3A%2F%2Flocalhost%3A8998%2Fiframes&html=1&iframes=1&proxy=test HTTP/1.1" 200 None
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------

Add undocumented parameters to the README

Several parameters are not documented in the README; they can be seen by running the help option from the command line.

Usage: server.py [options]

Options:
  -h, --help            show this help message and exit
  -f LOGFILE, --logfile=LOGFILE
                        log file
  -m MAXRSS, --maxrss=MAXRSS
                        exit if max RSS reaches this value (in KB) (default:
                        0)
  -p PORT, --port=PORT  port to listen to (default: 8050)
  -s SLOTS, --slots=SLOTS
                        number of render slots (default: 50)
  --cache               enable local cache (active by default)
  --no-cache            disable local cache
  -c CACHE_PATH, --cache-path=CACHE_PATH
                        local cache folder
  --cache-size=CACHE_SIZE
                        maximum cache size in Kb (default: 51200)
  --proxy-profiles-path=PROXY_PROFILES_PATH
                        path to a folder with proxy profiles

Problem executing JS code

Hi,

I'm facing some problems executing JS code on a specific website:

➜  ~  curl -X POST -H 'content-type: application/javascript' -d 'showModalDimmer(); dojo.publish("showResultsForPageNumber",[{pageNumber:"2",pageSize:"12", linkId:"WC_SearchBasedNavigationResults_pagination_link_right_categoryResults"}]);' 'http://localhost:8050/render.html?url=http://www.hhgregg.com/appliances-home/washers&timeout=60&wait=0.5'

<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "/appliances-home/washers", is invalid.<p>
Reference #9.9b043bbb.1394120505.1585c2c6

</p></body></html>%

This JS code is executed when I click the "Next" button.

However, it works with a simpler JS code:

➜  ~  curl -X POST -H 'content-type: application/javascript' -d 'document.write("hello");' 'http://localhost:8050/render.html?url=http://www.hhgregg.com/appliances-home/washers&timeout=60&wait=0.5'

<html><head></head><body>hello</body></html>%

Also, splash doesn't seem to render the page properly when I view it in the browser.

Any thoughts?

JavaScript code gets wrong coordinates for page elements when the viewport is full

I have a splash call to render.json that gets a full screenshot of the page (viewport=full) and, at the same time, uses a JavaScript function to get the position of certain elements on the page.

However, when I compare the obtained coordinates with the generated screenshot, they don't match.

I think the problem is that the JavaScript code is executed before the viewport is applied, so the JavaScript gets the element positions with the default viewport while the screenshot is generated with the full viewport.

exceptions.RuntimeError warnings while running tests

Twisted==11.1.0, qt4reactor==1.0

test_whitelist (splash.tests.test_proxy.BlackWhiteProxyFactoryTest) ... ok
test_blacklist (splash.tests.test_proxy.HtmlProxyRenderTest) ... Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 84, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 69, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 586, in _doReadOrWrite
    why = selectable.doRead()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 199, in doRead
    rval = self.protocol.dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 572, in dataReceived
    return self.rawDataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 503, in rawDataReceived
    self.handleResponseEnd()
  File "/usr/lib/python2.7/dist-packages/twisted/web/proxy.py", line 88, in handleResponseEnd
    self.father.finish()
  File "/usr/lib/python2.7/dist-packages/twisted/web/http.py", line 866, in finish
    "Request.finish called on a request after its connection was lost; "
exceptions.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.
[the same Unhandled Error traceback is repeated two more times]
ok
test_insecure (splash.tests.test_proxy.HtmlProxyRenderTest) ... ok
test_no_proxy_settings (splash.tests.test_proxy.HtmlProxyRenderTest) ... ok
test_nonexisting (splash.tests.test_proxy.HtmlProxyRenderTest) ... ok
test_proxy_works (splash.tests.test_proxy.HtmlProxyRenderTest) ... Unhandled Error
[same Request.finish traceback as above]
ok
test_basic (splash.tests.test_render.IframesRenderTest) ... ok

server cannot be started with the sip 4.15.5 and PyQt 4.10.4 combination

The setup is CentOS 5 with Python 2.7.

$ rpm -qa | grep qt
qt4-devel-4.7.1-0
qt4-4.7.1-0

Executing the server with xvfb-run:

$ /usr/local/bin/xvfb-run -a -s "-screen 0 640x480x8" python -m splash.server
2014-03-18 22:00:05+0800 [-] Log opened.
2014-03-18 22:00:05+0800 [-] Open files limit: 1024000
2014-03-18 22:00:05+0800 [-] Can't bump open files limit
2014-03-18 22:00:05+0800 [-] Traceback (most recent call last):
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/runpy.py", line 162, in _run_module_as_main
2014-03-18 22:00:05+0800 [-]     "__main__", fname, loader, pkg_name)
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
2014-03-18 22:00:05+0800 [-]     exec code in run_globals
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/site-packages/splash-1.0-py2.7.egg/splash/server.py", line 233, in <module>
2014-03-18 22:00:05+0800 [-]     main()
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/site-packages/splash-1.0-py2.7.egg/splash/server.py", line 224, in main
2014-03-18 22:00:05+0800 [-]     proxy_portnum=opts.proxy_portnum)
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/site-packages/splash-1.0-py2.7.egg/splash/server.py", line 144, in default_splash_server
2014-03-18 22:00:05+0800 [-]     manager = network_manager.FilteringQNetworkAccessManager()
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/site-packages/splash-1.0-py2.7.egg/splash/network_manager.py", line 105, in __init__
2014-03-18 22:00:05+0800 [-]     super(FilteringQNetworkAccessManager, self).__init__()
2014-03-18 22:00:05+0800 [-]   File "/usr/local/lib/python2.7/site-packages/splash-1.0-py2.7.egg/splash/network_manager.py", line 48, in __init__
2014-03-18 22:00:05+0800 [-]     self.sslErrors.connect(self._sslErrors)
2014-03-18 22:00:05+0800 [-] TypeError: pyqtSignal must be bound to a QObject, not 'FilteringQNetworkAccessManager'

However, examining from within IPython:

In [8]: from PyQt4.QtNetwork import QNetworkAccessManager, QNetworkProxyQuery, QNetworkReply
In [9]: import inspect                                                                                                                                                              
In [10]: inspect.getmro(QNetworkAccessManager)                                                                                                                                      
Out[10]: 
(PyQt4.QtNetwork.QNetworkAccessManager,
 PyQt4.QtCore.QObject,
 sip.wrapper,
 sip.simplewrapper,
 object)

I have no clue what's wrong; this is the latest splash master.

no way to get the HTTP status code when working as a proxy

E.g. when running curl -x http://localhost:8051 -H "X-Splash-render: json" -H "X-Splash-html: 0" http://steinmetz-maxwald.at/materialien/, ideally splash should send the 404 status code back, but currently splash always returns 200 for the URL above.

S3 cache

Cache responses on S3 for faster retrieval.

  1. an index of all cached URLs, which allows optimal handling without needing to check S3 for whether a URL is in the cache;
  2. handling of expiration at the S3 level for storage optimization and thumbnail renewal, and consequently also at the index;
  3. allow passing splash a list of GET parameters to exclude from key generation for the index (for example, to avoid each user's apikey in the HS URL being included in the key).
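Point (3) amounts to canonicalizing the URL before hashing it into the index key. A sketch, with a hypothetical `cache_key` helper:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def cache_key(url, ignored_params=()):
    """Build a stable cache key for `url`.

    The given GET parameters (e.g. per-user apikey values) are dropped
    and the rest are sorted, so otherwise-identical URLs share one
    cache entry regardless of parameter order.
    """
    scheme, netloc, path, query, _fragment = urlsplit(url)
    ignored = set(ignored_params)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k not in ignored]
    canonical = urlunsplit((scheme, netloc, path, urlencode(sorted(kept)), ""))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```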

Add ability to change the user agent, like changing proxies

Add a setting in *.ini for user agents.

[useragents]
chrome:'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36'
firefox:'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'

http://localhost:8050/render.html?url=http://mywebsite.com/page-with-javascript.html&proxy=mywebsite&UA=chrome

An example of changing the user agent on a QNetworkAccessManager:

QNetworkAccessManager *mgr = new QNetworkAccessManager();
// ...
QNetworkRequest req;
req.setUrl(QUrl("enter url"));
req.setRawHeader("User-Agent", "Mozilla Firefox");
mgr->get(req);

The addition would have two parts: a way to parse the .INI config file, and setting the raw header for Qt.
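The parsing half could be plain configparser. A sketch, using `=` instead of the quoted values shown above (ConfigParser strips whitespace but would keep literal quotes as part of the value; `load_user_agents` and `resolve_user_agent` are hypothetical helpers):

```python
import configparser

CONFIG_TEXT = """
[useragents]
chrome = Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36
firefox = Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0
"""

def load_user_agents(text):
    """Parse a [useragents] INI section into a name -> UA string dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser.items("useragents"))

def resolve_user_agent(agents, name, default="Splash"):
    """Look up the UA= request parameter, falling back to a default."""
    return agents.get(name, default)
```

The resolved string would then be passed to setRawHeader("User-Agent", ...) on the Qt side, as in the C++ example above.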
