jamesturk / scrapelib

⛏ a library for scraping unreliable pages

Home Page: https://jamesturk.github.io/scrapelib/

License: BSD 2-Clause "Simplified" License

Languages: Python 99.64%, Dockerfile 0.36%
Topics: python, scraper, http

scrapelib's Introduction

scrapelib is a library for making requests to less-than-reliable websites.

Source: https://github.com/jamesturk/scrapelib

Documentation: https://jamesturk.github.io/scrapelib/

Issues: https://github.com/jamesturk/scrapelib/issues


Features

scrapelib originated as part of the Open States project, which scrapes the websites of all 50 state legislatures, and was therefore designed with features that are desirable when dealing with sites that have intermittent errors or require rate-limiting.

Advantages of using scrapelib over using requests as-is:

  • HTTP(S) and FTP requests via an identical API
  • support for simple caching with pluggable cache backends
  • highly-configurable request throttling
  • configurable retries for non-permanent site failures
  • all of the power of the superb requests library

Installation

scrapelib is on PyPI, and can be installed via any standard package management tool:

poetry add scrapelib

or:

pip install scrapelib

Example Usage

  import scrapelib
  s = scrapelib.Scraper(requests_per_minute=10)

  # Grab Google front page
  s.get('http://google.com')

  # Will be throttled to 10 HTTP requests per minute
  while True:
      s.get('http://example.com')
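The caching, throttling, and retry features can be combined on one Scraper. A minimal sketch, assuming scrapelib's FileCache backend and the retry_attempts / retry_wait_seconds constructor arguments (check the documentation for the exact names in your version):

  import scrapelib

  s = scrapelib.Scraper(
      requests_per_minute=60,      # throttle to 60 requests per minute
      retry_attempts=3,            # retry transient failures up to 3 times
      retry_wait_seconds=10,       # back off between retries
  )

  # cache responses on disk so repeated runs don't re-fetch unchanged pages
  s.cache_storage = scrapelib.FileCache('demo-cache')
  s.cache_write_only = False       # also read from the cache, not just write to it

  resp = s.get('https://example.com')
  print(resp.status_code)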

scrapelib's People

Contributors

dependabot-preview[bot], dependabot[bot], fgregg, hancush, jamesturk, jessemortenson, jmcarp, joegermuska, konklone, mikejs, newageairbender, poliquin, twneale


scrapelib's Issues

Throttling should only apply when not reloading from cache

I think the inheritance order for Scraper between CachingSession and ThrottledSession should be reversed, so that when a document is reloaded from the cache, there's no wait before the next request.

Thus, only requests really loaded from the network would be throttled.

Does this make sense?
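A minimal sketch of the idea (hypothetical class names, not scrapelib's actual code): with cooperative request() overrides, whichever mixin comes first in the MRO runs first, so putting the caching mixin ahead of the throttling mixin lets cache hits skip the throttle entirely:

  import requests

  class ThrottledSession(requests.Session):
      def request(self, method, url, **kwargs):
          self._wait_for_throttle()          # sleep until the next request slot is free
          return super().request(method, url, **kwargs)

      def _wait_for_throttle(self):
          pass  # placeholder for the real rate-limit bookkeeping

  class CachingSession(requests.Session):
      def request(self, method, url, **kwargs):
          cached = self._check_cache(method, url)
          if cached is not None:
              return cached                  # cache hit: no network, no throttle
          return super().request(method, url, **kwargs)

      def _check_cache(self, method, url):
          return None  # placeholder for the real cache lookup

  # CachingSession first in the MRO => the cache is consulted before throttling,
  # so only requests that actually hit the network are throttled.
  class Scraper(CachingSession, ThrottledSession):
      pass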

May want to consider requiring requests security extras

Hello, and thank you for contributing and maintaining such a helpful project.

I found out about scrapelib via the unitedstates/congress project, which I'm just beginning to explore. That project requires scrapelib>=0.1.0,<1.0.0 and my virtual environment picked up scrapelib-0.10.1. The first command I attempted to run was to pull down the fdsys sitemaps, which resulted in the following SSL error emerging out of scrapelib:

File "unitedstates-congress/.env/lib/python2.7/site-packages/scrapelib/__init__.py", line 201, in request
raise exception_raised
SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:590)

I found this StackOverflow post which recommends installing the requests package's extra security packages:

pip install requests[security]

That worked for my immediate use case. Since this wasn't an issue with unitedstates/congress, I thought to recommend that scrapelib update its requests module requirement to include the optional security packages. If my understanding of the issue is correct, this issue could crop up in other downstream projects for similar user configurations (OSX 10.12.3 Sierra w/ stock OpenSSL 0.9.8zh running Python 2.7.10 w/ requests 2.13.0).

(Separately, there seems to be a mismatch between the project README's requirement for requests > 2.0 and the requirements.txt requirement for requests >= 1.2.2)

Thank you again, and I hope this is helpful!

robotparser is wrong?

The two most recent times I've tried to use scrapelib defaulting to honoring the robots.txt files, I've gotten blocked, even though my read of the Robots files says I shouldn't.

I found this on SO: http://stackoverflow.com/questions/15344253/robotparser-doesnt-seem-to-parse-correctly and checked out reppy, and I do get conflicting answers for my specific case:

>>> url = 'https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county-equivalents'
>>> user_agent = 'scrapelib 0.9.0 python-requests/1.2.3 CPython/2.7.5 Darwin/13.0.0'
>>> from reppy.cache import RobotsCache
>>> robots = RobotsCache()
>>> robots.allowed(url,user_agent)
True

>>> import robotparser
>>> parser = robotparser.RobotFileParser()
>>> parser.set_url('http://en.wikipedia.org/robots.txt')
>>> parser.read()
>>> parser.can_fetch(user_agent,url)
False

Think it's worth switching to use reppy instead? It sounds like the robots.txt spec is just not very well articulated, but the Wikipedia robots.txt file says specifically "Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please" ...

set attribute to indicate if response was pulled from cache or not

It is sometimes useful to know whether a response was pulled from the cache, particularly if the last-modified code path is used in the sqlite cache; for example, you may only want to process new data.

It would be nice to set an attribute on responses hydrated from the cache that indicates whether they were pulled from the cache or not.
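A minimal sketch of what the caller side might look like, assuming a hypothetical fromcache attribute set by the caching layer (the attribute name is illustrative, not necessarily what scrapelib would use):

  import scrapelib

  s = scrapelib.Scraper()
  s.cache_storage = scrapelib.FileCache('demo-cache')
  s.cache_write_only = False

  resp = s.get('https://example.com')

  # 'fromcache' is a hypothetical attribute for illustration; getattr() keeps
  # this working even if the attribute isn't set on network responses.
  if getattr(resp, 'fromcache', False):
      print('served from cache, skipping reprocessing')
  else:
      print('fresh response, processing new data')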

requests incompatibility

Seems like there's an issue with the version of requests being used. Just did a pip install scrapelib on a clean virtualenv. From the console:

(scrapelibtest)Toms-MacBook-Air:scrapelibtest tomlee$ python
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import scrapelib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tomlee/Projects/scrapelibtest/lib/python2.7/site-packages/scrapelib/__init__.py", line 193, in <module>
    class FTPSession(requests.Session):
  File "/Users/tomlee/Projects/scrapelibtest/lib/python2.7/site-packages/scrapelib/__init__.py", line 195, in FTPSession
    requests.defaults.SCHEMAS.append('ftp')
AttributeError: 'module' object has no attribute 'defaults'

requests.sessions api change causes breakage

requests.sessions api changed between 1.2.1 and 1.2.2. requests.sessions.merge_kwargs is now called requests.sessions.merge_settings. Change is from this commit:

kennethreitz/requests@9811424

Someone using a fresh virtualenv install will eventually get this traceback:

File ".../site-packages/scrapelib/__init__.py", line 348, in request
    headers = requests.sessions.merge_kwargs(headers, self.headers)
AttributeError: 'module' object has no attribute 'merge_kwargs'

Suggested quick fix is to change requirements.txt to:

requests>=1.0,<=1.2.1

or, of course, change the requests API calls scrapelib uses and require requests>=1.2.2 instead.

Time for a release, maybe?

Hello,

I saw someone's request to package it for Debian repositories. I am not sure if that's still needed? Is it?

If so, could you please make a new release? The last release dates back to November 2018.

Handling of requests exceptions

If a request raises

requests.HTTPError, 
requests.ConnectionError,
requests.Timeout

and a subsequent retry cannot recover, then one of these exceptions is raised even if self.raise_errors == False.

Instead, could we build a kind of response object when we get these errors, and handle error raising the same way we handle valid http error status codes?
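A minimal sketch of one way this could look (a hypothetical helper, not scrapelib's actual code): wrap the caught exception in a bare requests.Response so the existing status-code handling path can deal with it uniformly:

  import requests

  def response_from_exception(url, exc):
      """Build a synthetic Response for a request that never completed."""
      resp = requests.Response()
      resp.status_code = 503                      # treat as a retryable server-side failure
      resp.url = url
      resp.reason = type(exc).__name__            # e.g. 'ConnectionError', 'Timeout'
      resp._content = str(exc).encode('utf-8')    # keep the original message in the body
      return resp

  # usage sketch:
  # try:
  #     resp = session.request(method, url, **kwargs)
  # except (requests.ConnectionError, requests.Timeout) as exc:
  #     resp = response_from_exception(url, exc)
  # # ...then fall through to the normal raise_errors / accept_response handling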

Recent changes break urlopen?

Using HEAD on Python 2.7.5:

  File "utils.py", line 297, in lxmlize
    entry = scraper.urlopen(url)
  File "/Users/james/.virtualenvs/scrapers_ca_app/src/scrapelib/scrapelib/__init__.py", line 307, in urlopen
    resp = self.request(method, url, data=body, retry_on_404=retry_on_404, **kwargs)
  File "/Users/james/.virtualenvs/scrapers_ca_app/src/scrapelib/scrapelib/__init__.py", line 282, in request
    headers = requests.sessions.merge_setting(headers, self.headers)
AttributeError: 'module' object has no attribute 'merge_setting'

scrapelib incorrectly caches kwargs to requests

I initialized a scraper, and fetched a URL whose server has not correctly configured their HTTPS certificates. (They're missing the intermediates, so desktop browsers will fetch them, but most client libs will not.)

scraper.get("https://www.ignet.gov/sites/default/files/files/Cloud%20Computing%20Initiative%20Report.pdf")

This hangs, as it goes through a retry cycle, and will eventually fail. If I pass the verify=False option, as documented in requests, it will work and spit out a warning:

scraper.get("https://www.ignet.gov/sites/default/files/files/Cloud%20Computing%20Initiative%20Report.pdf", verify=False)

/home/eric/.virtualenvs/inspectors/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py:734: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
  InsecureRequestWarning)

However, if I then remove the verify flag, or even if I add it back with verify=True, I can't coax scrapelib into going back into verify mode:

scraper.get("https://www.ignet.gov/sites/default/files/files/Cloud%20Computing%20Initiative%20Report.pdf", verify=True)

/home/eric/.virtualenvs/inspectors/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py:734: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
  InsecureRequestWarning)

Raw requests.get() calls do not display this behavior. So, something in scrapelib's caching behavior is caching keyword options.

To be clear, I think the correct behavior here is not to cache kwargs at all, as I wouldn't want verify to "stick" whether or not I was explicit about it later.

This is with the latest version of scrapelib:

requests==2.5.1
scrapelib==0.10.1
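For comparison, a minimal sketch of the behavior the reporter describes with a plain requests.Session, where verify is a per-call keyword and does not persist between requests (same URL as above; this needs network access to reproduce):

  import requests

  url = ('https://www.ignet.gov/sites/default/files/files/'
         'Cloud%20Computing%20Initiative%20Report.pdf')

  s = requests.Session()

  # Explicitly disabling verification affects only this call
  # (and emits an InsecureRequestWarning).
  s.get(url, verify=False)

  # A later call without verify=False goes back to full certificate
  # verification and fails again if the chain is incomplete.
  try:
      s.get(url)
  except requests.exceptions.SSLError as exc:
      print('verification re-enabled as expected:', exc)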

Allow for more control of status codes to not retry on

Hi James,

Would you be amenable to allowing the user to configure status codes not to retry on?

I think I would approach this by adjusting this code section:

  if self.accept_response(resp) or (resp.status_code == 404 and not retry_on_404):
      break

to be something like

  no_retry = (self.accept_response(resp) or
              (resp.status_code == 404 and not retry_on_404) or
              resp.status_code in self.quick_fail_status_codes)
  if no_retry:
      break
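On the caller side, the proposed quick_fail_status_codes attribute (hypothetical, not part of scrapelib's current API) might be used like this:

  import scrapelib

  s = scrapelib.Scraper(retry_attempts=3)
  # hypothetical attribute: status codes that should fail immediately
  # instead of going through the retry cycle
  s.quick_fail_status_codes = {401, 403, 410}

  resp = s.get('https://example.com/protected')  # a 403 would return at once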

Follow <meta> redirects

scrapelib doesn't seem to follow <meta> redirects. While this is a somewhat old and non-standard way to do redirection in 2013, it's still out there on a few government sites.

Here's one example:

$ curl http://www.risch.senate.gov
<html>
<head>
<meta http-equiv="Refresh"
content="0;url=http://www.risch.senate.gov/public/">
</head>

There's an example on StackOverflow for how this could be implemented.

Currently also being discussed in unitedstates/congress-legislators#85
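A minimal sketch of how a <meta> refresh could be detected and followed (a hypothetical helper using lxml, not scrapelib code):

  import lxml.html
  import requests

  def follow_meta_refresh(session, url):
      """Fetch url and follow a <meta http-equiv="refresh"> redirect, if present."""
      resp = session.get(url)
      doc = lxml.html.fromstring(resp.content)
      for meta in doc.xpath('//meta[@http-equiv and @content]'):
          if meta.get('http-equiv', '').lower() != 'refresh':
              continue
          # content looks like "0;url=http://www.risch.senate.gov/public/"
          _, _, target = meta.get('content').partition('url=')
          if target:
              return session.get(target.strip())
      return resp

  # usage sketch:
  # resp = follow_meta_refresh(requests.Session(), 'http://www.risch.senate.gov')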

adding to debian/ubuntu

Hi,

I've really become excited about the Sunlight Foundation in the past few months, and think it's a great project. I'd love to see these packages in both Debian and Ubuntu, since the goals of those projects are pretty complementary to Sunlight's, and it seems like a natural fit for Debian/Ubuntu users to want to contribute to Sunlight.

In my mind, the best way forward is to get the packages into the distro to reduce any friction developers might encounter that would prevent them from installing and playing with Sunlight.

I've already done most of the work in packaging up scrapelib, which included writing a manpage for scrapeshell, and turning on the tests during package building. Additionally, I've set up the package to be maintained collectively by the Debian Python Modules Team, so that anyone in the team can step up and help, and so that things aren't bottlenecked on me.

My plan was to find someone to help with the last few steps of actually uploading the package to the archive, but I figured I'd check in with you first. I don't think this should be a problem, but it's always nice to reach out to the upstream to coordinate a bit. :)

cheers!
/ac

Comparison with Scrapy

I couldn't find any comparison with the more popular Scrapy library.
I see that Scrapy is much older than scrapelib.
Could you please provide key advantages of scrapelib over Scrapy?

release to pypi?

Hi @jamesturk ,

Would you mind making a minor release to PyPI that includes the June changes? Thanks for your consideration!

AttributeError: 'module' object has no attribute 'Scraper'

Hello,

I've just used pip install requests and pip install scrapelib in a new virtualenv, with Python 3.4.

I got this error:

Traceback (most recent call last):
  File "C:/Users/natalie/Documents/Encore-in-Google/scrapelib.py", line 1, in <module>
    import scrapelib
  File "C:/Users/natalie/Documents/Encore-in-Google\scrapelib.py", line 2, in <module>
    s = scrapelib.Scraper(requests_per_minute=10)
AttributeError: 'module' object has no attribute 'Scraper'

Not sure what I'm doing wrong? I'm using requests 2.3.0 and Python 3.4, so I think both are compatible with scrapelib? I just copied the example usage at https://github.com/sunlightlabs/scrapelib

I've not used virtualenv before, so I'm wondering if that's the problem, though both requests and scrapelib said they'd installed.

Thanks!

remove urlopen/RequestStr

these are holdovers from a pre-requests era and should go away now.

attributes of RequestStr resp:

  • treating resp like a string (use resp.text)
  • _scraper - can go away, was 'private'
  • resp.bytes -> new.content
  • resp.response -> new
  • resp.encoding -> new.encoding
  • resp.response.requested_url -> new.history[0].url if new.history else new.url
  • resp.response.code -> new.status_code
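A minimal before/after sketch of the migration implied by the list above (variable names are illustrative):

  import scrapelib

  s = scrapelib.Scraper()
  url = 'https://example.com'

  # old-style, with urlopen() returning a string-like RequestStr:
  # resp = s.urlopen(url)
  # html = resp                      # treated like a string
  # raw = resp.bytes
  # status = resp.response.code
  # original_url = resp.response.requested_url

  # new-style, with a plain requests Response:
  resp = s.get(url)
  html = resp.text
  raw = resp.content
  status = resp.status_code
  original_url = resp.history[0].url if resp.history else resp.url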

next version of scrapelib

My current thinking is that the next version of scrapelib should support urllib3 2.0 as well as 1.0 (as recommended here: https://urllib3.readthedocs.io/en/stable/v2-migration-guide.html); that'd probably be 2.3.

I've also had the idea to decouple from requests entirely (allowing direct usage of urllib3 and/or httpx), but I would save that for a 3.0 release (or a new library).

If you're seeing this issue, I'd love to know what is/isn't useful about the current version. Why do you use it & when do you hit pain points?

Example usage uses deprecated urlopen

The changelog for 0.10.0 says deprecation of urlopen in favor of Requests’s request(), get(), post(), etc., but the README still shows urlopen as the main usage.
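A sketch of what the updated README example might look like, using get() instead of the deprecated urlopen():

  import scrapelib
  s = scrapelib.Scraper(requests_per_minute=10)

  # old: html = s.urlopen('http://google.com')
  # new: get() returns a regular requests Response
  resp = s.get('http://google.com')
  html = resp.text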
